Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Affects Version: Impala 4.3.0
- Label: ghx-label-8
Description
The issue was introduced by https://issues.apache.org/jira/browse/IMPALA-11802, which implemented an optimized way to answer count(*) queries from table metadata. However, if the table was compacted by Spark, this optimization can give incorrect results.
The reason is that Spark can skip dropping delete files that point to compacted data files (see https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_position_delete_files). As a result, delete files may remain after compaction that no longer apply to any data file.
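A simplified sketch of why a metadata-only count can break (this is not Impala's actual implementation; the file names, record counts, and the exact formula are illustrative assumptions — the point is that any formula derived purely from file-level record counts goes wrong once it assumes every delete file still applies):

```python
# Simplified illustration of the failure mode: count(*) derived purely from
# file-level record counts vs. a count that checks whether each position
# delete file still references a live data file.

def metadata_count(data_files, delete_files):
    """Naive metadata-only count: assumes every delete file still applies."""
    total = sum(f["records"] for f in data_files)
    deleted = sum(f["records"] for f in delete_files)
    return total - deleted

def applied_count(data_files, delete_files):
    """Only subtract deletes whose referenced data file is still in the
    current snapshot (dangling deletes are ignored)."""
    live_paths = {f["path"] for f in data_files}
    total = sum(f["records"] for f in data_files)
    deleted = sum(f["records"] for f in delete_files
                  if f["refers_to"] in live_paths)
    return total - deleted

# Hypothetical post-compaction layout: one compacted file holding the 4 live
# rows, plus two leftover delete files pointing at data files that no longer
# exist in the table.
data_files = [{"path": "compacted-0.parquet", "records": 4}]
delete_files = [
    {"refers_to": "old-0.parquet", "records": 1},  # left over from the UPDATE
    {"refers_to": "old-1.parquet", "records": 1},  # left over from the DELETE
]

print(metadata_count(data_files, delete_files))  # 2 -- wrong
print(applied_count(data_files, delete_files))   # 4 -- matches SELECT *
```

The concrete wrong value Impala returns differs (8 in the repro below), but the root cause is the same: record counts taken from metadata stop being trustworthy once dangling delete files exist.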
Repro:
With Impala:
create table default.iceberg_testing (id int, j bigint)
STORED AS ICEBERG
TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
              'iceberg.catalog_location'='/tmp/spark_iceberg_catalog/',
              'iceberg.table_identifier'='iceberg_testing',
              'format-version'='2');

insert into iceberg_testing values (1, 1), (2, 4), (3, 9), (4, 16), (5, 25);

update iceberg_testing set j = -100 where id = 4;

delete from iceberg_testing where id = 4;
At this point, SELECT count(*) returns 4, which is correct.
Run compaction in Spark:
spark.sql(s"CALL local.system.rewrite_data_files(table => 'default.iceberg_testing', options => map('min-input-files','2') )").show()
Now count(*) in Impala returns 8 (reproducing this might require an INVALIDATE METADATA when using HadoopCatalog). Hive returns the correct result for the same query, and SELECT * in Impala also returns correct results.
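To see why the leftover delete files "no longer apply to any data file": in Iceberg v2, a position delete row names the exact data file path it deletes from, and a position delete file can only target data files whose data sequence number is not greater than its own. A compacted file is written under a new path (and, depending on rewrite options, a new sequence number), so old position deletes match nothing. A sketch of that applicability rule, with illustrative paths and sequence numbers (not taken from the repro's actual metadata):

```python
def position_delete_applies(data_file, delete_file):
    """Iceberg v2 applicability rule for position deletes: the delete must
    name the data file's exact path, and the data file's data sequence
    number must not exceed the delete file's sequence number."""
    return (data_file["sequence_number"] <= delete_file["sequence_number"]
            and delete_file["refers_to"] == data_file["path"])

# Illustrative pre-compaction layout:
old_data = {"path": "old-0.parquet", "sequence_number": 1}
delete = {"refers_to": "old-0.parquet", "sequence_number": 2}

# The compacted file replaces old-0.parquet under a new path, so the old
# delete file is left dangling -- it applies to no file in the snapshot.
compacted = {"path": "compacted-0.parquet", "sequence_number": 3}

print(position_delete_applies(old_data, delete))   # True
print(position_delete_applies(compacted, delete))  # False
```

Per the Spark procedures documentation linked above, such dangling deletes can be cleaned up with the rewrite_position_delete_files procedure; until then, a metadata-only count(*) must not assume the delete files in the snapshot are all still live.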