Details
Bug
Status: Open
Major
Resolution: Unresolved
3.1.0
None
None
Description
I recently came across an issue with compacting sorted tables.
I create two tables and populate them with test data; both are ACID (insert-only transactional), but only one is sorted:
USE priv;
DROP TABLE IF EXISTS test_data;
DROP TABLE IF EXISTS test_compact_insert_with_sorting;
DROP TABLE IF EXISTS test_compact_insert_without_sorting;
CREATE TABLE test_data AS SELECT 'foobar' col;
CREATE TABLE test_compact_insert_with_sorting (col string)
  CLUSTERED BY (col) SORTED BY (col) INTO 1 BUCKETS
  TBLPROPERTIES ('transactional' = 'true', 'transactional_properties'='insert_only');
CREATE TABLE test_compact_insert_without_sorting (col string)
  CLUSTERED BY (col) INTO 1 BUCKETS
  TBLPROPERTIES ('transactional' = 'true', 'transactional_properties'='insert_only');
INSERT OVERWRITE TABLE test_compact_insert_with_sorting SELECT col FROM test_data;
INSERT OVERWRITE TABLE test_compact_insert_without_sorting SELECT col FROM test_data;
INSERT OVERWRITE TABLE test_compact_insert_with_sorting SELECT col FROM test_data;
INSERT OVERWRITE TABLE test_compact_insert_without_sorting SELECT col FROM test_data;
As expected, after these operations each table has two base directories:
$ hdfs dfs -ls /warehouse/tablespace/managed/hive/priv.db/test_compact_insert*
Found 2 items
drwxrwx---+ - hive hadoop 0 2019-09-18 15:08 /warehouse/tablespace/managed/hive/priv.db/test_compact_insert_with_sorting/base_0000001
drwxrwx---+ - hive hadoop 0 2019-09-18 15:08 /warehouse/tablespace/managed/hive/priv.db/test_compact_insert_with_sorting/base_0000002
Found 2 items
drwxrwx---+ - hive hadoop 0 2019-09-18 15:08 /warehouse/tablespace/managed/hive/priv.db/test_compact_insert_without_sorting/base_0000001
drwxrwx---+ - hive hadoop 0 2019-09-18 15:08 /warehouse/tablespace/managed/hive/priv.db/test_compact_insert_without_sorting/base_0000002
But after running manual compaction on those tables:
USE priv;
ALTER TABLE test_compact_insert_with_sorting COMPACT 'MAJOR';
ALTER TABLE test_compact_insert_without_sorting COMPACT 'MAJOR';
Turns out only the one without sorting got compacted:
$ hdfs dfs -ls /warehouse/tablespace/managed/hive/priv.db/test_compact*
Found 2 items
drwxrwx---+ - hive hadoop 0 2019-09-18 15:08 /warehouse/tablespace/managed/hive/priv.db/test_compact_insert_with_sorting/base_0000001
drwxrwx---+ - hive hadoop 0 2019-09-18 15:08 /warehouse/tablespace/managed/hive/priv.db/test_compact_insert_with_sorting/base_0000002
Found 1 items
drwxrwx---+ - hive hadoop 0 2019-09-18 15:08 /warehouse/tablespace/managed/hive/priv.db/test_compact_insert_without_sorting/base_0000002
Inspecting the compactions returns:
$ beeline -e 'show compactions' | grep priv | grep test_compact
| 7598474 | priv | test_compact_insert_with_sorting | --- | MAJOR | succeeded | master-01.pd.my-domain.com.pl-51 | 1568812155386 | 11 | None |
| 7598475 | priv | test_compact_insert_without_sorting | --- | MAJOR | succeeded | --- | 1568812155403 | 298 | None |
Is this by design? Both compactions report the state 'succeeded', but only the one that actually reduced the number of base directories took any noticeable time (298 versus 11 in the output above). It is also odd that the compaction of the sorted table still has a worker assigned. Does that mean it is still in progress?
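For anyone reproducing this, the before/after comparison of the two listings can be done mechanically rather than by eye. The following is only a minimal sketch: count_base_dirs is a hypothetical helper name (not part of Hive or HDFS), and it assumes the `hdfs dfs -ls` output format shown above, where the path is the last field of each entry.

```shell
# Hypothetical helper: read `hdfs dfs -ls` output on stdin and print
# "<table> <number of base_N directories>" for each table seen.
count_base_dirs() {
  awk '$NF ~ /\/base_[0-9]+$/ {
         path = $NF
         sub(/\/base_[0-9]+$/, "", path)   # drop the trailing base_N component
         n = split(path, parts, "/")       # last remaining component is the table name
         count[parts[n]]++
       }
       END { for (t in count) print t, count[t] }' | LC_ALL=C sort
}

# Usage on a live cluster (path taken from the report):
# hdfs dfs -ls /warehouse/tablespace/managed/hive/priv.db/test_compact_insert* | count_base_dirs
```

Fed the listing taken after the compaction, this reports 2 for test_compact_insert_with_sorting and 1 for test_compact_insert_without_sorting, which matches the behavior described.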