[IMPALA-12861] File formats are confused when an Iceberg table has mixed formats - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Impala 4.3.0
Fix Version/s: None
Component/s: Frontend
Labels:
- impala-iceberg

Description

Repro steps:
create table mixed_ice (i int, year int) partitioned by spec (year) stored as iceberg tblproperties('format-version'='2');

1) populate one partition with Impala (parquet)
insert into mixed_ice values (1, 2024), (2, 2024);

2) change the write format:
alter table mixed_ice set tblproperties ('write.format.default'='orc');

3) populate another partition with Hive (orc)
insert into mixed_ice values (1, 2025), (2, 2025), (3, 2025);

4) then query just the parquet partition:
explain select * from mixed_ice where year = 2024;

| F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1                                    |
| Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB thread-reservation=1      |
|   PLAN-ROOT SINK                                                                         |
|   |  output exprs: default.mixed_ice.i, default.mixed_ice.year                           |
|   |  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB thread-reservation=0 |
|   |                                                                                      |
|   01:EXCHANGE [UNPARTITIONED]                                                            |
|      mem-estimate=16.00KB mem-reservation=0B thread-reservation=0                        |
|      tuple-ids=0 row-size=8B cardinality=2                                               |
|      in pipelines: 00(GETNEXT)                                                           |
|                                                                                          |
| F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1                                           |
| Per-Host Resources: mem-estimate=64.05MB mem-reservation=32.00KB thread-reservation=2    |
|   DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED]                             |
|   |  mem-estimate=48.00KB mem-reservation=0B thread-reservation=0                        |
|   00:SCAN HDFS [default.mixed_ice, RANDOM]                                               |
|      HDFS partitions=1/1 files=1 size=602B                                               |
|      Iceberg snapshot id: 4964066258730898133                                            |
|      skipped Iceberg predicates: `year` = CAST(2024 AS INT)                              |
|      stored statistics:                                                                  |
|        table: rows=5 size=945B                                                           |
|        columns: unavailable                                                              |
|      extrapolated-rows=disabled max-scan-range-rows=5                                    |
|      file formats: [ORC, PARQUET]                                                        |
|      mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1                   |
|      tuple-ids=0 row-size=8B cardinality=2                                               |
|      in pipelines: 00(GETNEXT)                                                           |
+------------------------------------------------------------------------------------------+

Note the 'file formats: [ORC, PARQUET]' line, even though this query only reads a Parquet file.

Some analysis:
When IcebergScanNode is created it holds the correct information about file formats (Parquet).

Later on, the parent class, HdfsScanNode, also tries to populate the file formats.

It uses what getSampledOrRawPartitions() returns. In this use case 'sampledPartitions_' is null, so it returns 'partitions_'.

Apparently, this 'partitions_' member holds the partition with the ORC file, so ORC is added to 'fileFormats_'. Unfortunately, getSampledOrRawPartitions() is called in multiple locations within HdfsScanNode, and in this scenario it returns the wrong partition.
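
To make the mismatch concrete, here is a minimal sketch of the behavior described above. It is a toy model, not Impala code: MixedFormatSketch, DataFile, Partition, both helper methods and the file paths are hypothetical names made up for illustration. It assumes the partition-level list still references the ORC file while the list of files selected for the query contains only the Parquet file.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Toy model only: these classes are hypothetical and are NOT Impala's real ones.
public class MixedFormatSketch {
    enum FileFormat { ORC, PARQUET }

    // Hypothetical stand-in for a data file and its format.
    static final class DataFile {
        final String path;
        final FileFormat format;
        DataFile(String path, FileFormat format) {
            this.path = path;
            this.format = format;
        }
    }

    // Hypothetical stand-in for a partition object that lists its files.
    static final class Partition {
        final List<DataFile> files;
        Partition(List<DataFile> files) { this.files = files; }
    }

    // Partition-based collection: takes the format of every file in the given
    // partitions, regardless of which files the query actually reads.
    static Set<FileFormat> formatsFromPartitions(List<Partition> partitions) {
        Set<FileFormat> formats = EnumSet.noneOf(FileFormat.class);
        for (Partition p : partitions) {
            for (DataFile f : p.files) formats.add(f.format);
        }
        return formats;
    }

    // File-based collection: takes formats only from the files left after pruning.
    static Set<FileFormat> formatsFromSelectedFiles(List<DataFile> selected) {
        Set<FileFormat> formats = EnumSet.noneOf(FileFormat.class);
        for (DataFile f : selected) formats.add(f.format);
        return formats;
    }

    public static void main(String[] args) {
        // Made-up paths, one file per Iceberg partition as in the repro.
        DataFile parquet2024 = new DataFile("year=2024/a.parquet", FileFormat.PARQUET);
        DataFile orc2025 = new DataFile("year=2025/b.orc", FileFormat.ORC);

        // Assumption: the partition list still covers both files.
        List<Partition> partitions =
            Arrays.asList(new Partition(Arrays.asList(parquet2024, orc2025)));
        // The predicate year = 2024 leaves only the Parquet file.
        List<DataFile> selected = Arrays.asList(parquet2024);

        System.out.println("from partitions:     " + formatsFromPartitions(partitions));
        System.out.println("from selected files: " + formatsFromSelectedFiles(selected));
    }
}
```

In this model the partition-based set contains both ORC and PARQUET while the file-based set contains only PARQUET, which mirrors the difference between what the plan prints and what the query reads.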

Next steps:

Check what other issues getSampledOrRawPartitions() can cause with tables that have multiple file formats. Also check whether 'partitions_' can be populated properly.
