Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-8425

Directory pruning issue with queries including joins.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.21.0, 1.19.0
    • None
    • Functions - Drill
    • None

    Description

      Performance degradation base on the number of files present in the directory structure when using the same query on one day of data

      I'm using partitioned directories
      ./product/year/month/day
      ./command/year/month/day
       each contain a particular  parquet file. (tested with csv as well)
      If I query a table for one day, say select * from dfs.root.product where dir0 = 2023 and dir1 = 04 and dir2 = 12; then only the file located in  ./product/year/month/day/product.parquet is accessed (as expected)
      Now if I do a join query between product and command for a particular day

      SELECT p.field1 , p.field2, c.field2 FROM dfs.root.command as c
      LEFT JOIN dfs.root.product as p
      on p.field1 = c.field1
      where p.dir0 = 2023
      and p.dir1 = 04
      and p.dir2 = 12
      and c.dir0 = 2023
      and c.dir1 = 04
      and c.dir2 = 12;

      I can see in the log (debug mode) that all the directory structures is scanned and not just the 2 concerned files
      so the more file (year month) you have in the DFS the more heap memory you use and the more time it takes to get the results

      (posted in slack channel (https://apache-drill.slack.com/archives/CG380K519/p1681335761429099)

      Attachments

        Activity

          People

            Unassigned Unassigned
            loy2 Loy2
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: