Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
1.21.0, 1.19.0
-
None
-
None
Description
Performance degradation base on the number of files present in the directory structure when using the same query on one day of data
I'm using partitioned directories
./product/year/month/day
./command/year/month/day
each contain a particular parquet file. (tested with csv as well)
If I query a table for one day, say select * from dfs.root.product where dir0 = 2023 and dir1 = 04 and dir2 = 12; then only the file located in ./product/year/month/day/product.parquet is accessed (as expected)
Now if I do a join query between product and command for a particular day
SELECT p.field1 , p.field2, c.field2 FROM dfs.root.command as c
LEFT JOIN dfs.root.product as p
on p.field1 = c.field1
where p.dir0 = 2023
and p.dir1 = 04
and p.dir2 = 12
and c.dir0 = 2023
and c.dir1 = 04
and c.dir2 = 12;
I can see in the log (debug mode) that all the directory structures is scanned and not just the 2 concerned files
so the more file (year month) you have in the DFS the more heap memory you use and the more time it takes to get the results
(posted in slack channel (https://apache-drill.slack.com/archives/CG380K519/p1681335761429099)