[HIVE-23763] Query based minor compaction produces wrong files when rows with different buckets Ids are processed by the same FileSinkOperator - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 4.0.0
Fix Version/s: 4.0.0-alpha-1
Component/s: Transactions
Labels:
- pull-request-available

Description

How to reproduce:

1. Create an unbucketed ACID table.
2. Insert enough data into the table that the insert produces multiple bucket files. The files in the table should look like this:
   /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00000_0
   /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00001_0
   /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00002_0
   /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00003_0
   /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00004_0
   /warehouse/tablespace/managed/hive/bubu_acid/delta_0000001_0000001_0000/bucket_00005_0
3. Delete some rows that have different bucket Ids. The files in the delete deltas should look like this:
   /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000002_0000002_0000/bucket_00000
   /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000006_0000006_0000/bucket_00003
   /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000006_0000006_0000/bucket_00001
4. Run the query-based minor compaction.
5. After the compaction, the newly created delete delta contains only one bucket file. This file contains rows from all buckets, and the table becomes unusable:
   /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_0000001_0000007_v0000066/bucket_00000

The issue happens only if rows with different bucket Ids are processed by the same FileSinkOperator.
In the FileSinkOperator.process method, the files for the compaction table are created like this:

    if (!bDynParts && !filesCreated) {
      if (lbDirName != null) {
        if (valToPaths.get(lbDirName) == null) {
          createNewPaths(null, lbDirName);
        }
      } else {
        if (conf.isCompactionTable()) {
          int bucketProperty = getBucketProperty(row);
          bucketId = BucketCodec.determineVersion(bucketProperty).decodeWriterId(bucketProperty);
        }
        createBucketFiles(fsp);
      }
    }
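
In the `conf.isCompactionTable()` branch, `bucketId` is decoded from an integer bucket property carried by the row. The following is a minimal standalone sketch of such a decoding, not Hive's `BucketCodec` itself. It assumes a version 1 layout of 3 version bits, 1 reserved bit, 12 writer ID bits, 4 reserved bits and 12 statement ID bits; that layout is an assumption here, and the class `BucketPropertySketch` and its methods are hypothetical names used only for illustration.

```java
/** Toy re-implementation of a packed bucket property. Not Hive code. */
public class BucketPropertySketch {

    // Assumed layout (hypothetical, see lead-in):
    // bits 31-29 version, bit 28 reserved, bits 27-16 writer ID,
    // bits 15-12 reserved, bits 11-0 statement ID.
    static final int WRITER_ID_MASK = 0x0FFF0000;

    /** Packs the three fields into one int under the assumed layout. */
    static int encode(int version, int writerId, int statementId) {
        return (version << 29) | (writerId << 16) | statementId;
    }

    /** Extracts the version from the top three bits. */
    static int decodeVersion(int bucketProperty) {
        return bucketProperty >>> 29;
    }

    /** Extracts the writer (bucket) ID from bits 27-16. */
    static int decodeWriterId(int bucketProperty) {
        return (bucketProperty & WRITER_ID_MASK) >>> 16;
    }

    public static void main(String[] args) {
        // Bucket IDs taken from the delete delta file names in this report.
        for (int bucket : new int[] {0, 3, 1}) {
            int property = encode(1, bucket, 0);
            System.out.println(property + " -> writer ID " + decodeWriterId(property));
        }
    }
}
```

Under this assumption, rows that originate from different bucket files carry different bucket property values, which is why a single sink can receive rows with more than one bucket Id.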

When the first row is processed, the file is created and the filesCreated variable is set to true. For every subsequent row the outer if condition evaluates to false, so no new file is created, even when the row has a different bucket Id. The row is instead written into the file that was created for the first row.
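
The effect can be reproduced with a small toy model outside Hive. The sketch below is illustrative only: `BucketSinkSketch`, `Row`, and both write methods are hypothetical names, the "files" are in-memory lists, and the per-bucket variant is just one possible way to avoid the problem, not necessarily the change made in the linked pull request.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a sink that receives rows tagged with a bucket ID. Not Hive code. */
public class BucketSinkSketch {

    /** Hypothetical row: only the bucket ID and a payload matter here. */
    static final class Row {
        final int bucketId;
        final String payload;

        Row(int bucketId, String payload) {
            this.bucketId = bucketId;
            this.payload = payload;
        }
    }

    /** Mirrors the reported behaviour: the output file is chosen once, from the first row. */
    static Map<Integer, List<Row>> writeWithSingleFile(List<Row> rows) {
        Map<Integer, List<Row>> files = new LinkedHashMap<>();
        boolean filesCreated = false;
        int bucketId = -1;
        for (Row row : rows) {
            if (!filesCreated) {
                bucketId = row.bucketId;          // decided by the first row only
                files.put(bucketId, new ArrayList<>());
                filesCreated = true;
            }
            files.get(bucketId).add(row);         // later rows land in the same file
        }
        return files;
    }

    /** One possible remedy: look up (or create) the file for each row's own bucket ID. */
    static Map<Integer, List<Row>> writeWithFilePerBucket(List<Row> rows) {
        Map<Integer, List<Row>> files = new LinkedHashMap<>();
        for (Row row : rows) {
            files.computeIfAbsent(row.bucketId, id -> new ArrayList<>()).add(row);
        }
        return files;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(new Row(0, "a"), new Row(3, "b"), new Row(1, "c"));
        System.out.println(writeWithSingleFile(rows).keySet());    // prints [0]
        System.out.println(writeWithFilePerBucket(rows).keySet()); // prints [0, 3, 1]
    }
}
```

With the three sample rows, the first variant ends up with a single file for bucket 0 that holds all three rows, which matches the single bucket_00000 file seen after compaction.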

Issue Links

is related to: HIVE-24015 Disable query-based compaction on MR execution engine (Closed)
links to: GitHub Pull Request #1327

People

Assignee: Marta Kuczora
Reporter: Marta Kuczora
Votes: 0
Watchers: 3

Dates

Created: 25/Jun/20 14:51
Updated: 17/Nov/22 08:47
Resolved: 04/Aug/20 10:51

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

2h 10m