[BEAM-12654] FileIO can produce duplicates in output files - ASF JIRA

Details

Type: Bug
Status: Open
Priority: P3
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: beam-model, io-java-files, runner-flink, runner-spark, sdk-java-core
Labels: None

Description

FileIO can produce duplicates in output files, depending on the runner.

A concrete example for Spark when executing in batch mode:

When FileIO is used with a specific number of shards, it uses the default sharding function, which is a round-robin shard assignment with a random seed. In a multistage pipeline, data between stages is held by the shuffle service until the downstream stage requests it for further computation. If the shuffle results computed with this seeded shard function are lost (e.g. the shuffle service fails because of a HW error), Spark will attempt to recover the data by computing it again from the source data. Because the sharding is randomly seeded, the recomputation will assign a different shard, and therefore a different key, to the element.
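
The scenario above can be illustrated with a toy model in plain Python. This is not Beam or Spark code. The function names, the explicit `start_offset` parameter (standing in for the random seed), the "shard 0 was already consumed" split, and the content-hash alternative are all assumptions made for illustration, not the actual FileIO implementation.

```python
import zlib
from collections import Counter

NUM_SHARDS = 3
ELEMENTS = ["a", "b", "c", "d", "e", "f"]


def round_robin_shards(elements, num_shards, start_offset):
    """Round-robin assignment; start_offset stands in for the random seed."""
    return {e: (start_offset + i) % num_shards for i, e in enumerate(elements)}


def hash_shards(elements, num_shards):
    """Hypothetical deterministic alternative: shard depends only on content."""
    return {e: zlib.crc32(e.encode("utf-8")) % num_shards for e in elements}


def written_output(first, retry, already_consumed):
    """Shards consumed before the failure use the first assignment;
    the remaining shards read the recomputed (retry) assignment."""
    out = []
    for shard in range(NUM_SHARDS):
        source = first if shard in already_consumed else retry
        out.extend(e for e, s in source.items() if s == shard)
    return out


# Seeded round robin: the recomputation starts from a different offset.
first_attempt = round_robin_shards(ELEMENTS, NUM_SHARDS, start_offset=0)
retry = round_robin_shards(ELEMENTS, NUM_SHARDS, start_offset=1)
mixed = written_output(first_attempt, retry, already_consumed={0})
duplicates = sorted(e for e, n in Counter(mixed).items() if n > 1)

# Content-based sharding: both attempts agree, so mixing them is harmless.
stable_first = hash_shards(ELEMENTS, NUM_SHARDS)
stable_retry = hash_shards(ELEMENTS, NUM_SHARDS)
stable_mixed = written_output(stable_first, stable_retry, already_consumed={0})

print("duplicates with seeded round robin:", duplicates)
print("output with content hash:", sorted(stable_mixed))
```

In this model, elements that landed in the already-consumed shard on the first attempt are reassigned to another shard on the retry and are written a second time. With an assignment that depends only on the element, both attempts agree and each element is written once.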

More details are discussed in this thread:
https://lists.apache.org/thread.html/r5e91d1996479defbf5e896dca3cf237ee2d9b59396cb3c4edf619df1%40%3Cdev.beam.apache.org%3E

Issue Links

split from: BEAM-12493 FileIO should allow to opt-in for custom sharding function (Resolved)

People

Assignee: Unassigned

Reporter: Jozef Vilcek

Votes: 0

Watchers: 4

Dates

Created: 23/Jul/21 06:47

Updated: 04/Jun/22 21:03