[DRILL-8425] Directory pruning issue with queries including joins. - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.21.0, 1.19.0
Fix Version/s: None
Component/s: Functions - Drill
Labels: None

Description

Performance degrades with the number of files present in the directory structure, even when the same query targets only one day of data.

I'm using partitioned directories:
./product/year/month/day
./command/year/month/day
Each day directory contains one parquet file (also tested with csv).
If I query one table for one day, say select * from dfs.root.product where dir0 = 2023 and dir1 = 04 and dir2 = 12; then only the file located in ./product/year/month/day/product.parquet is accessed (as expected).
Now if I run a join query between product and command for a particular day:
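One way to confirm the pruning in the single-table case without reading the debug log is to look at the query plan. This is a minimal sketch, assuming the same workspace and directory layout as above. The exact plan text varies by Drill version and file format, so what to look for in the scan operator is an assumption rather than a guarantee.

```sql
-- Ask Drill for the plan instead of running the query.
-- When pruning works, the scan operator should list only the
-- 2023/04/12 file rather than every file under ./product.
EXPLAIN PLAN FOR
SELECT * FROM dfs.root.product
WHERE dir0 = 2023 AND dir1 = 04 AND dir2 = 12;
```

Running EXPLAIN PLAN FOR on a join query in the same way shows whether each scan is limited to the requested day.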

SELECT p.field1, p.field2, c.field2 FROM dfs.root.command as c
LEFT JOIN dfs.root.product as p
on p.field1 = c.field1
where p.dir0 = 2023
and p.dir1 = 04
and p.dir2 = 12
and c.dir0 = 2023
and c.dir1 = 04
and c.dir2 = 12;

I can see in the log (debug mode) that the entire directory structure is scanned, not just the 2 files concerned.
So the more files (years, months) you have in the DFS, the more heap memory is used and the longer it takes to get the results.
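
A possible workaround to try, untested here, is to apply the directory filters inside a subquery on each side so that each table is restricted before the join. Whether the planner actually prunes directories in this form is an assumption, not something confirmed in this report. Note also that with the product filters moved out of the WHERE clause, the LEFT JOIN keeps command rows that have no matching product, which the original query drops.

```sql
-- Hypothetical rewrite: filter each table on dir0/dir1/dir2 first,
-- then join the two already-restricted inputs.
SELECT p.field1, p.field2, c.field2
FROM (
  SELECT field1, field2 FROM dfs.root.command
  WHERE dir0 = 2023 AND dir1 = 04 AND dir2 = 12
) AS c
LEFT JOIN (
  SELECT field1, field2 FROM dfs.root.product
  WHERE dir0 = 2023 AND dir1 = 04 AND dir2 = 12
) AS p
ON p.field1 = c.field1;
```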

(Also posted in the Slack channel: https://apache-drill.slack.com/archives/CG380K519/p1681335761429099)

People

Assignee: Unassigned

Reporter: Loy2

Votes: 0

Watchers: 2

Dates

Created: 18/Apr/23 15:03

Updated: 09/Jun/23 19:41