[BEAM-3484] HadoopInputFormatIO reads big datasets invalid - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: P3
Resolution: Fixed
Affects Version/s: 2.3.0, 2.4.0
Fix Version/s: 2.5.0
Component/s: io-java-hadoop-format
Labels: None

Description

For big datasets, HadoopInputFormat sometimes skips or duplicates database elements in the resulting PCollection, so the read result is incorrect.

I ran into this while developing HadoopInputFormatIOIT and running it on Dataflow. I wasn't able to reproduce the issue for datasets of 600 000 database rows or fewer. The bug appeared only for bigger sets, e.g. 700 000 and 1 000 000 rows.

Attachments:
- a text file with the sorted HadoopInputFormat.read() result, saved using TextIO.write().to().withoutSharding(). If you look carefully, you'll notice duplicated and missing values, which should not happen

- the same kind of text file for 600 000 records, with no duplicated or missing elements

- a link to a PR with the HadoopInputFormatIO integration test that allows reproducing this issue. At the time of writing, this code is not merged yet.
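Looking through a 13 MB sorted file by eye is error-prone, so the check described above can be automated. The following is a minimal sketch, assuming the attached files hold one record per line and that every record is expected to be unique. `SortedResultCheck`, `Report`, `scan` and `shortfall` are hypothetical names used for illustration; they are not part of Beam or of the linked PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Beam): scans already-sorted lines for duplicates. */
class SortedResultCheck {

  /** Outcome of one pass over the sorted lines. */
  static final class Report {
    final long totalLines;
    final long distinctLines;
    final List<String> duplicatedValues;

    Report(long totalLines, long distinctLines, List<String> duplicatedValues) {
      this.totalLines = totalLines;
      this.distinctLines = distinctLines;
      this.duplicatedValues = duplicatedValues;
    }

    /**
     * Number of expected rows that cannot be present in the file.
     * Assumption: every distinct line is a legitimate row of the source table.
     */
    long shortfall(long expectedRows) {
      return expectedRows - distinctLines;
    }
  }

  /**
   * Assumption: the input is sorted, so equal records are adjacent and a single
   * pass that compares each line with the previous one finds every duplicate.
   */
  static Report scan(Iterable<String> sortedLines) {
    long total = 0;
    long distinct = 0;
    List<String> duplicated = new ArrayList<>();
    String previous = null;
    boolean previousReported = false;

    for (String line : sortedLines) {
      total++;
      if (line.equals(previous)) {
        // Record each duplicated value once, however many times it repeats.
        if (!previousReported) {
          duplicated.add(line);
          previousReported = true;
        }
      } else {
        distinct++;
        previous = line;
        previousReported = false;
      }
    }
    return new Report(total, distinct, duplicated);
  }
}
```

The lines could be supplied with `java.nio.file.Files.readAllLines(path)`, which is workable for files of this size. For the 1 000 000 row run, a non-empty `duplicatedValues` list would indicate duplicated elements, and a positive `shortfall(1_000_000)` would indicate missing ones.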

Attachments

- result_sorted1000000, 16/Jan/18 14:23, 13.25 MB, Lukasz Gajowy
- result_sorted600000, 16/Jan/18 14:23, 7.90 MB, Lukasz Gajowy

Issue Links

links to

GitHub Pull Request #5166

PR with HadoopInputFormatIOIT to reproduce the issue

People

Assignee: Alexey Romanenko
Reporter: Lukasz Gajowy
Votes: 0
Watchers: 4

Dates

Created: 16/Jan/18 14:22
Updated: 16/May/20 13:22
Resolved: 19/Apr/18 20:05

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 2h 20m