IMPALA-12894: Optimized count(*) for Iceberg gives wrong results after a Spark rewrite_data_files


    Description

      The issue was introduced by https://issues.apache.org/jira/browse/IMPALA-11802, which implemented an optimized way to get results for count(*). However, if the table has been compacted by Spark, this optimization can give incorrect results.
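
      The optimization answers count(*) from table metadata rather than by scanning the data. A rough sketch of the kind of metadata it can rely on (illustration only, not Impala's actual code path; catalog and table names follow the repro below): the current snapshot's summary already carries total record and position-delete counts, visible from Spark through the snapshots metadata table.

      // Hedged illustration only: reads 'total-records' and 'total-position-deletes'
      // from the latest snapshot summary via Iceberg's snapshots metadata table.
      // Catalog/table names are taken from the repro below.
      spark.sql("""
        SELECT summary['total-records']          AS total_records,
               summary['total-position-deletes'] AS total_position_deletes
        FROM local.default.iceberg_testing.snapshots
        ORDER BY committed_at DESC
        LIMIT 1
      """).show(false)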

      The reason is that Spark can [skip dropping delete files|https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_position_delete_files] that point to compacted data files. As a result, after compaction there can be delete files that no longer apply to any data file.
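
      To make such dangling deletes visible after the compaction below, Iceberg's delete_files metadata table can be queried from Spark (same catalog and table names as in the repro; illustration only): it can still list position delete files even though every data file they referenced has been rewritten.

      // Lists the delete files tracked by the current snapshot. After
      // rewrite_data_files these entries can remain although the data files
      // they pointed to were compacted away.
      spark.sql("""
        SELECT content, file_path, record_count
        FROM local.default.iceberg_testing.delete_files
      """).show(false)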

      Repro:

      With Impala:

      create table default.iceberg_testing (id int, j bigint) STORED AS ICEBERG
      TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
                    'iceberg.catalog_location'='/tmp/spark_iceberg_catalog/',
                    'iceberg.table_identifier'='iceberg_testing',
                    'format-version'='2');
      insert into iceberg_testing values
          (1, 1), (2, 4), (3, 9), (4, 16), (5, 25);
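      -- On this format-version=2 table the UPDATE and DELETE below produce
      -- position delete files, which is what makes the compaction step matter.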
      update iceberg_testing set j = -100 where id = 4;
      delete from iceberg_testing where id = 4;

      count(*) returns 4 at this point.

      Run compaction in Spark:

      spark.sql(s"CALL local.system.rewrite_data_files(table => 'default.iceberg_testing', options => map('min-input-files','2') )").show() 

      Now count(*) in Impala returns 8 (might require an INVALIDATE METADATA if the table is in a HadoopCatalog). Hive returns correct results, and a SELECT * in Impala also returns correct results.
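
      As the Iceberg docs linked above suggest, dangling position deletes can be removed with the rewrite_position_delete_files procedure. This is only a workaround sketch for the repro table, not a fix for the Impala bug, but after it the optimized count should match the scan again.

      // Rewrites/drops position delete files, removing "dangling" deletes that
      // no longer reference any live data file (see the linked Iceberg docs).
      spark.sql("CALL local.system.rewrite_position_delete_files(table => 'default.iceberg_testing')").show()
      // Then in Impala (after INVALIDATE METADATA when using HadoopCatalog):
      //   select count(*) from iceberg_testing;  -- expected: 4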

          People

            Assignee: Zoltán Borók-Nagy (boroknagyz)
            Reporter: Gabor Kaszab (gaborkaszab)
