[HADOOP-14919] BZip2 drops records when reading data in splits - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.8.0, 2.7.3, 3.0.0-alpha1
Fix Version/s: 2.9.0, 2.8.3, 2.7.5, 3.0.0
Component/s: None
Labels: None
Target Version/s: 2.9.0, 2.8.3, 2.7.5, 3.0.0
Hadoop Flags: Reviewed

Description

BZip2 can drop records when reading data in splits. This problem was discussed previously in HADOOP-11445 and HADOOP-13270, but a corner case remains in which data blocks are lost.

I attached a unit test that reproduces the problem.

First, this issue happens when the position of the newly created stream is equal to the start of the split. Hadoop has test cases for this situation (for example, the blockEndingInCR.txt.bz2 file used by TestLineRecordReader#testBzip2SplitStartAtBlockMarker). However, those tests do not trigger the issue reported here, because it happens only when the bytes at the start of the split contain both block marker bits and compressed data.

BZip2 block marker - 0x314159265359 (001100010100000101011001001001100101001101011001)

blockEndingInCR.txt.bz2 (Start of Split - 136504):

$ xxd -l 6 -g 1 -b -seek 136498 ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2
0021532: 00110001 01000001 01011001 00100110 01010011 01011001  1AY&SY

Test bz2 File (Start of Split - 203426):

$ xxd -l 7 -g 1 -b -seek 203419 250000.bz2
0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011  .(+$.k
0031aa1: 00101111                                               /
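
The difference between the two dumps is easier to see with a bit-level search. The following is a minimal sketch, not code from Hadoop: the function name `find_marker_bit_offsets` is hypothetical, and the byte values are copied from the two xxd outputs above. It looks for the 48-bit block marker at every bit offset of a short byte string.

```python
# Six-byte bzip2 block marker quoted in this report.
BLOCK_MARKER = 0x314159265359
MARKER_BITS = 48


def find_marker_bit_offsets(data: bytes) -> list:
    """Return every bit offset in `data` at which the 48-bit marker starts.

    Hypothetical helper for illustration only; it is not part of Hadoop.
    """
    total_bits = len(data) * 8
    value = int.from_bytes(data, "big")
    mask = (1 << MARKER_BITS) - 1
    offsets = []
    for start in range(total_bits - MARKER_BITS + 1):
        # Shift so the 48-bit window beginning at `start` lands in the low bits.
        shift = total_bits - MARKER_BITS - start
        if (value >> shift) & mask == BLOCK_MARKER:
            offsets.append(start)
    return offsets


# Bytes from the blockEndingInCR.txt.bz2 dump (file offsets 136498-136503).
aligned = bytes([0b00110001, 0b01000001, 0b01011001,
                 0b00100110, 0b01010011, 0b01011001])

# Bytes from the 250000.bz2 dump (file offsets 203419-203425).
unaligned = bytes([0b11100110, 0b00101000, 0b00101011, 0b00100100,
                   0b11001010, 0b01101011, 0b00101111])

print(find_marker_bit_offsets(aligned))    # [0]
print(find_marker_bit_offsets(unaligned))  # [3]
```

In the first file the marker is byte-aligned and ends exactly at the split start. In the test file the marker begins 3 bits into byte 203419, so it ends 3 bits into byte 203425, and the remaining five bits of that byte are already compressed data.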

Suppose a job divides this test bz2 file into two splits at position 203426.
The former split does not read the records that start at position 203426, because BZip2 reports the position of these records as 203427, which is past the end of that split. The latter split does not read them either, because BZip2CompressionInputStream starts reading at the block at position 320955.
As a result, the records between positions 203427 and 320955 are lost.
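
The gap can be restated as a toy model. This is only a sketch under simplifying assumptions, not Hadoop's actual reader logic: it assumes the first split keeps reading while the reported position is at or before the split boundary, and that the second split only sees data from the block where its stream starts. The constants come from the description above; the function names are hypothetical.

```python
SPLIT_BOUNDARY = 203426      # end of the first split, start of the second
REPORTED_BLOCK_POS = 203427  # position BZip2 reports for the affected records
NEXT_BLOCK_POS = 320955      # block at which the second split starts reading


def first_split_reads(pos: int) -> bool:
    # Assumption: a split reads records whose reported position is <= its end.
    return pos <= SPLIT_BOUNDARY


def second_split_reads(pos: int) -> bool:
    # Assumption: the second split sees nothing before its first block.
    return pos >= NEXT_BLOCK_POS


def is_lost(pos: int) -> bool:
    """True when neither split would read a record reported at `pos`."""
    return not first_split_reads(pos) and not second_split_reads(pos)


print(is_lost(REPORTED_BLOCK_POS))  # True
print(is_lost(SPLIT_BOUNDARY))      # False
print(is_lost(NEXT_BLOCK_POS))      # False
```

Under these assumptions, any record reported at a position from 203427 up to, but not including, 320955 belongs to neither split.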

Also, if we revert the changes made in HADOOP-13270, this issue no longer occurs, but the HADOOP-13270 issue returns.

Attachments

HADOOP-14919.001.patch, 04/Oct/17 22:11, 11 kB, Jason Darrell Lowe
250000.bz2, 02/Oct/17 16:31, 313 kB, Aki Tanaka
HADOOP-14919-test.patch, 02/Oct/17 16:29, 3 kB, Aki Tanaka

Issue Links

is broken by: HADOOP-13270 BZip2CompressionInputStream finds the same compression marker twice in corner case, causing duplicate data blocks (Closed)

People

Assignee: Jason Darrell Lowe
Reporter: Aki Tanaka
Votes: 0
Watchers: 14

Dates

Created: 02/Oct/17 16:28
Updated: 01/Feb/18 20:46
Resolved: 31/Oct/17 14:38
