IMPALA-12894: Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files


    Description

      The issue was introduced by https://issues.apache.org/jira/browse/IMPALA-11802, which implemented an optimized way to get results for count(*). However, if the table has been compacted by Spark, this optimization can give incorrect results.
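
      The optimization answers count(*) from table metadata rather than by scanning the data. A rough sketch of the kind of metadata it can rely on (illustration only, not Impala's actual code path; catalog and table names follow the repro below): the current snapshot's summary already carries total record and position-delete counts, visible from Spark through the snapshots metadata table.

      // Hedged illustration only: reads 'total-records' and 'total-position-deletes'
      // from the latest snapshot summary via Iceberg's snapshots metadata table.
      // Catalog/table names are taken from the repro below.
      spark.sql("""
        SELECT summary['total-records']          AS total_records,
               summary['total-position-deletes'] AS total_position_deletes
        FROM local.default.iceberg_testing.snapshots
        ORDER BY committed_at DESC
        LIMIT 1
      """).show(false)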

      The reason is that Spark can [skip dropping delete files|https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_position_delete_files] that point to compacted data files. As a result, after compaction there can be delete files that no longer apply to any data file.
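
      To make such dangling deletes visible after the compaction below, Iceberg's delete_files metadata table can be queried from Spark (same catalog and table names as in the repro; illustration only): it can still list position delete files even though every data file they referenced has been rewritten.

      // Lists the delete files tracked by the current snapshot. After
      // rewrite_data_files these entries can remain although the data files
      // they pointed to were compacted away.
      spark.sql("""
        SELECT content, file_path, record_count
        FROM local.default.iceberg_testing.delete_files
      """).show(false)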

      Repro:

      With Impala:

      create table default.iceberg_testing (id int, j bigint) STORED AS ICEBERG
      TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
                    'iceberg.catalog_location'='/tmp/spark_iceberg_catalog/',
                    'iceberg.table_identifier'='iceberg_testing',
                    'format-version'='2');
      insert into iceberg_testing values
          (1, 1), (2, 4), (3, 9), (4, 16), (5, 25);
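      -- On this format-version=2 table the UPDATE and DELETE below produce
      -- position delete files, which is what makes the compaction step matter.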
      update iceberg_testing set j = -100 where id = 4;
      delete from iceberg_testing where id = 4;

      count(*) returns 4 at this point.

      Run compaction in Spark:

      spark.sql(s"CALL local.system.rewrite_data_files(table => 'default.iceberg_testing', options => map('min-input-files','2') )").show() 

      Now count(*) in Impala returns 8 (might require an INVALIDATE METADATA if the table is in a HadoopCatalog). Hive returns correct results, and a SELECT * in Impala also returns correct results.
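
      As the Iceberg docs linked above suggest, dangling position deletes can be removed with the rewrite_position_delete_files procedure. This is only a workaround sketch for the repro table, not a fix for the Impala bug, but after it the optimized count should match the scan again.

      // Rewrites/drops position delete files, removing "dangling" deletes that
      // no longer reference any live data file (see the linked Iceberg docs).
      spark.sql("CALL local.system.rewrite_position_delete_files(table => 'default.iceberg_testing')").show()
      // Then in Impala (after INVALIDATE METADATA when using HadoopCatalog):
      //   select count(*) from iceberg_testing;  -- expected: 4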

          People

            Assignee: Zoltán Borók-Nagy (boroknagyz)
            Reporter: Gabor Kaszab (gaborkaszab)
