[SPARK-45373] Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.5.0
Fix Version/s: None
Component/s: SQL
Labels: pull-request-available
Target Version/s: 3.5.1

Description

In the rule PruneFileSourcePartitions, where the CatalogFileIndex gets converted to an InMemoryFileIndex, the HMS calls can become very expensive if:
1) The translated filter string pushed down to the HMS layer is empty, which results in fetching all partitions, and the same table is referenced multiple times in the query.
2) The same table is referenced multiple times in the query with different partition filters.
In such cases the current code makes multiple calls to the HMS layer.
This can be avoided by grouping the tables by CatalogFileIndex and passing a common minimum filter (filter1 || filter2) to obtain a base PrunedInMemoryFileIndex, which can then serve as the basis for each specific table reference.
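
The grouping idea can be illustrated with a small standalone sketch. This is not Spark code and not the implementation in the PR: FakeMetastore, list_partitions and prune_with_shared_fetch are hypothetical names, and partition filters are modelled as plain Python predicates instead of Catalyst expressions. It only shows how OR-ing the filters of all references to one table allows a single metastore call, with each reference then pruned locally from the shared result.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Partition = Dict[str, str]
Predicate = Callable[[Partition], bool]


class FakeMetastore:
    """Hypothetical stand-in for the HMS layer; counts how often it is called."""

    def __init__(self, tables: Dict[str, List[Partition]]) -> None:
        self._tables = tables
        self.calls = 0

    def list_partitions(self, table: str, predicate: Predicate) -> List[Partition]:
        self.calls += 1
        return [p for p in self._tables[table] if predicate(p)]


def prune_with_shared_fetch(
    metastore: FakeMetastore,
    references: List[Tuple[str, Predicate]],
) -> List[List[Partition]]:
    """Return pruned partitions per reference, using one metastore call per table."""
    # Group references by table (standing in for grouping by CatalogFileIndex).
    by_table = defaultdict(list)
    for idx, (table, pred) in enumerate(references):
        by_table[table].append((idx, pred))

    result: List[List[Partition]] = [[] for _ in references]
    for table, entries in by_table.items():
        preds = [pred for _, pred in entries]

        # Common minimum filter: filter1 || filter2 || ...
        def combined(p: Partition, preds: List[Predicate] = preds) -> bool:
            return any(f(p) for f in preds)

        # Single call yields the shared base set of partitions for this table.
        base = metastore.list_partitions(table, combined)
        # Each reference applies its own filter to the base set, locally.
        for idx, pred in entries:
            result[idx] = [p for p in base if pred(p)]
    return result


store = FakeMetastore(
    {"sales": [{"year": "2021"}, {"year": "2022"}, {"year": "2023"}]}
)
refs = [
    ("sales", lambda p: p["year"] == "2021"),
    ("sales", lambda p: p["year"] == "2023"),
]
pruned = prune_with_shared_fetch(store, refs)
```

In this sketch the two references to the sales table cost one metastore call instead of two. A reference whose filter is empty would correspond to a predicate that always returns true, in which case the combined filter also fetches all partitions, but still only once per table.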

Opened the following PR for this ticket:
SPARK-45373-PR