Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
Impala 4.3.0
-
None
Description
Repro steps:
create table mixed_ice (i int, year int) partitioned by spec (year) stored as iceberg tblproperties('format-version'='2');
1) populate one partition with Impala (parquet)
insert into mixed_ice values (1, 2024), (2, 2024);
2) change the write format:
alter table mixed_ice set tblproperties ('write.format.default'='orc');
3) populate another partition with Hive (orc)
insert into mixed_ice values (1, 2025), (2, 2025), (3, 2025);
4) then query just the parquet partition:
explain select * from mixed_ice where year = 2024;
| F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1 |
| Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB thread-reservation=1 |
| PLAN-ROOT SINK |
| | output exprs: default.mixed_ice.i, default.mixed_ice.year |
| | mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB thread-reservation=0 |
| | |
| 01:EXCHANGE [UNPARTITIONED] |
| mem-estimate=16.00KB mem-reservation=0B thread-reservation=0 |
| tuple-ids=0 row-size=8B cardinality=2 |
| in pipelines: 00(GETNEXT) |
| |
| F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1 |
| Per-Host Resources: mem-estimate=64.05MB mem-reservation=32.00KB thread-reservation=2 |
| DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED] |
| | mem-estimate=48.00KB mem-reservation=0B thread-reservation=0 |
| 00:SCAN HDFS [default.mixed_ice, RANDOM] |
| HDFS partitions=1/1 files=1 size=602B |
| Iceberg snapshot id: 4964066258730898133 |
| skipped Iceberg predicates: `year` = CAST(2024 AS INT) |
| stored statistics: |
| table: rows=5 size=945B |
| columns: unavailable |
| extrapolated-rows=disabled max-scan-range-rows=5 |
| file formats: [ORC, PARQUET] |
| mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1 |
| tuple-ids=0 row-size=8B cardinality=2 |
| in pipelines: 00(GETNEXT) |
+------------------------------------------------------------------------------------------+
Note the "file formats: [ORC, PARQUET]" line: ORC is listed even though this query only reads Parquet files (files=1, from the year=2024 partition).
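A regression check for this symptom could be sketched roughly as follows. This is only an illustration in Python, not an existing Impala test: the helper name extract_file_formats is hypothetical, and the plan text is a fragment copied from the output above.

```python
import re


def extract_file_formats(explain_text):
    """Return the formats listed on the 'file formats: [...]' line of a plan.

    Hypothetical helper for illustration; returns an empty set if the
    line is not present in the given text.
    """
    match = re.search(r"file formats: \[([^\]]*)\]", explain_text)
    if match is None:
        return set()
    return {fmt.strip() for fmt in match.group(1).split(",") if fmt.strip()}


# Fragment of the EXPLAIN output from this report.
plan = (
    "| extrapolated-rows=disabled max-scan-range-rows=5 | "
    "| file formats: [ORC, PARQUET] | "
    "| mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1 |"
)

reported = extract_file_formats(plan)
expected = {"PARQUET"}  # only the year=2024 (Parquet) partition is scanned
unexpected = reported - expected
print(sorted(unexpected))  # prints ['ORC']
```

With the bug present, the difference between the reported and expected sets is non-empty; once fixed, it should be empty for this query.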
Some analysis:
When IcebergScanNode is created, it holds the correct information about file formats (Parquet).
Later on, the parent class, HdfsScanNode, also tries to populate the file formats.
It uses what getSampledOrRawPartitions() returns. In this use case 'sampledPartitions_' is null, so it returns 'partitions_'.
Apparently, this 'partitions_' member holds the partition with the ORC file, so ORC is added to fileFormats_. Unfortunately, getSampledOrRawPartitions() is called in multiple locations within HdfsScanNode, and each of those calls gets the wrong partition.
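The sequence described above can be mimicked with a small toy model. This is a Python sketch, not Impala's Java code: ToyScanNode, Partition and populate_file_formats are hypothetical names, and the only behaviour modelled is what this report states (the fallback to 'partitions_' when 'sampledPartitions_' is null, and the formats of those partitions being added to the set).

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Partition:
    name: str
    file_format: str


@dataclass
class ToyScanNode:
    # Formats of the data files the Iceberg scan actually selected.
    file_formats: Set[str]
    # Stand-in for 'partitions_'.
    partitions: List[Partition]
    # Stand-in for 'sampledPartitions_'; None models the null case.
    sampled_partitions: Optional[List[Partition]] = None

    def get_sampled_or_raw_partitions(self) -> List[Partition]:
        # Falls back to the raw partitions when no sampling was done.
        if self.sampled_partitions is not None:
            return self.sampled_partitions
        return self.partitions

    def populate_file_formats(self) -> None:
        # Parent-class step: adds the format of every returned partition,
        # whether or not any of its files is actually scanned.
        for partition in self.get_sampled_or_raw_partitions():
            self.file_formats.add(partition.file_format)


# The Iceberg-specific step already determined that only Parquet is read.
node = ToyScanNode(
    file_formats={"PARQUET"},
    partitions=[Partition("partition_with_orc_file", "ORC")],
)
node.populate_file_formats()
print(sorted(node.file_formats))  # prints ['ORC', 'PARQUET']
```

In this model the extra ORC entry comes purely from the partition list, which suggests any other caller relying on the same list would see the same mismatch.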
Next steps:
Check what other issues getSampledOrRawPartitions() can cause with tables that contain multiple file formats. Also check whether 'partitions_' can be populated properly.