Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-12765

Balance consecutive partitions better for Iceberg tables

    XMLWordPrintableJSON

Details

    • ghx-label-8

    Description

      During scheduling Impala does the following:

      • Non-Iceberg tables
        • The scheduler processes the scan ranges in partition key order
        • The scheduler selects N replicas as candidates
        • The scheduler chooses the executor from the candidates based on minimum number of assigned bytes
        • So consecutive partitions are more likely to be assigned to different executors
      • Iceberg tables
        • The scheduler processes the scan ranges in random order
        • The scheduler selects N replicas as candidates
        • The scheduler chooses the executor from the candidates based on minimum number of assigned bytes
        • So consecutive partitions (by partition key order) are assigned randomly, i.e. there's a higher chances of clustering

      If the IcebergScanNode ordered its file descriptors based on their paths we would have a more balanced scheduling for consecutive partitions. Queries that operate on a range of partitions are quite common, so it makes sense to optimize that case.

      Attachments

        Activity

          People

            boroknagyz Zoltán Borók-Nagy
            boroknagyz Zoltán Borók-Nagy
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: