IMPALA-9875

Deduplicate build in joins with distinct semantics

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: Backend

    Description

      For left semi and anti joins with only equi-join predicates, we don't need to store duplicate build rows in the hash table, because the join only tests whether a matching build row exists, so a probe row is resolved by the first build row it matches. We could rework the build process in PhjBuilder so that it builds the hash table on the fly and skips insertion into the BufferedTupleStream when the hash table already contains a match. That is, the build process would be closer to that of GroupingAggregator.
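
      As a rough illustration of the idea (not Impala code), the sketch below deduplicates the build side while the hash table is being built. The names build_dedup, left_semi_join, left_anti_join and key_of are hypothetical, a Python dict stands in for the hash table, and a plain list stands in for the BufferedTupleStream. It assumes ordinary (non-null-aware) equi-join semantics, where a key containing NULL never matches.

```python
def has_null(key):
    """True if any join-key column is NULL (None); such keys never match."""
    return any(col is None for col in key)


def build_dedup(build_rows, key_of):
    """Build the hash table on the fly, storing only the first row per key.

    hash_table maps a join key to the index of its row in stored_rows.
    stored_rows is a stand-in for the tuple stream: a row is appended
    only when its key is not already present in the hash table.
    """
    hash_table = {}
    stored_rows = []
    for row in build_rows:
        key = key_of(row)
        if has_null(key) or key in hash_table:
            continue  # can never match, or duplicate key: do not store
        hash_table[key] = len(stored_rows)
        stored_rows.append(row)
    return hash_table, stored_rows


def left_semi_join(probe_rows, hash_table, key_of):
    """Return probe rows that have at least one matching build row."""
    return [r for r in probe_rows
            if not has_null(key_of(r)) and key_of(r) in hash_table]


def left_anti_join(probe_rows, hash_table, key_of):
    """Return probe rows that have no matching build row."""
    return [r for r in probe_rows
            if has_null(key_of(r)) or key_of(r) not in hash_table]
```

      Because both joins only ask whether a key exists on the build side, their output is the same whether or not the duplicate build rows were stored, while the stored build side shrinks to one row per distinct key.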

      An alternative to building the hash tables on the fly would be to use a Bloom filter to track which rows are already present in the hash table. This would mean some duplicates might be kept.

      Some other joins, like the one in IMPALA-1706, also have distinct semantics, so this could perhaps be applied there too to avoid exploding joins.

            People

              Assignee: Unassigned
              Reporter: Tim Armstrong (tarmstrong)