IMPALA-9226

Improve string allocations of the ORC scanner

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • Impala 4.0.0, Impala 3.4.0
    • None

    Description

      Currently the ORC scanner allocates new memory for each string value (except for fixed-size strings):

      https://github.com/apache/impala/blob/85425b81f04c856d7d5ec375242303f78ec7964e/be/src/exec/orc-column-readers.cc#L172

      Besides causing too many allocations and copies, this is also bad for memory locality.

      Since ORC-501, StringVectorBatch has a member named 'blob' that contains the strings in the batch: https://github.com/apache/orc/blob/branch-1.6/c%2B%2B/include/orc/Vector.hh#L126

      'blob' has type DataBuffer which is movable, so Impala might be able to get ownership of it. Or, at least we could copy the whole blob array instead of copying the strings one-by-one.
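
      To illustrate the difference between copying strings one by one and copying the whole blob, here is a minimal sketch. It does not use the real ORC or Impala types: MockStringBatch, StringSlot, CopyOneByOne and CopyWholeBlob are hypothetical stand-ins, and the explicit offsets array is an assumption made to keep the example self-contained.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a string batch (NOT the ORC type):
// all string bytes live back to back in `blob`; value i starts at
// offsets[i] and is lengths[i] bytes long.
struct MockStringBatch {
  std::vector<char> blob;
  std::vector<int64_t> offsets;
  std::vector<int64_t> lengths;
};

// Hypothetical stand-in for a scanner-side string slot (pointer + length).
struct StringSlot {
  const char* ptr;
  int64_t len;
};

// Per-value strategy: one copy (and possibly one allocation) per string.
std::vector<std::string> CopyOneByOne(const MockStringBatch& batch) {
  std::vector<std::string> out;
  out.reserve(batch.offsets.size());
  for (std::size_t i = 0; i < batch.offsets.size(); ++i) {
    out.emplace_back(batch.blob.data() + batch.offsets[i],
                     static_cast<std::size_t>(batch.lengths[i]));
  }
  return out;
}

// Bulk strategy: copy the blob once, then point every slot into that copy.
// The slots stay valid only as long as `*buffer` is not resized or freed.
void CopyWholeBlob(const MockStringBatch& batch, std::vector<char>* buffer,
                   std::vector<StringSlot>* slots) {
  *buffer = batch.blob;  // single bulk copy of all string bytes
  slots->clear();
  slots->reserve(batch.offsets.size());
  for (std::size_t i = 0; i < batch.offsets.size(); ++i) {
    slots->push_back({buffer->data() + batch.offsets[i], batch.lengths[i]});
  }
}
```

      In the bulk variant all values of a batch end up contiguous in one buffer, which is the memory locality benefit mentioned above. Taking ownership of the blob instead of copying it would replace the single copy with a move.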

      ORC-501 is included in ORC version 1.6, but Impala currently only uses ORC 1.5.5.

      ORC 1.6 also introduces a new string vector type, EncodedStringVectorBatch:

      https://github.com/apache/orc/blob/e40b9a7205d51995f11fe023c90769c0b7c4bb93/c%2B%2B/include/orc/Vector.hh#L153

      It uses dictionary encoding for storing the values. Impala could copy/move the dictionary as well.
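
      A rough sketch of how moving a dictionary could look, under stated assumptions: MockDictionary, MockEncodedBatch, StringSlot and MaterializeFromDictionary are hypothetical names, and the layout (one byte blob, an offsets array with one extra trailing entry, and a per-row index) is a simplification rather than the actual ORC definition.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified dictionary (NOT the ORC type): distinct strings
// stored back to back; entry k spans [offsets[k], offsets[k + 1]).
struct MockDictionary {
  std::vector<char> blob;
  std::vector<int64_t> offsets;
};

// Hypothetical dictionary-encoded batch: index[i] is the entry of row i.
struct MockEncodedBatch {
  MockDictionary dictionary;
  std::vector<int64_t> index;
};

// Hypothetical stand-in for a scanner-side string slot (pointer + length).
struct StringSlot {
  const char* ptr;
  int64_t len;
};

// Takes ownership of the dictionary by moving it (no string bytes are
// copied), then resolves every row to a slot pointing into the dictionary.
// Rows that share a dictionary entry share the same bytes.
void MaterializeFromDictionary(MockEncodedBatch&& batch, MockDictionary* owned,
                               std::vector<StringSlot>* slots) {
  *owned = std::move(batch.dictionary);
  slots->clear();
  slots->reserve(batch.index.size());
  for (int64_t entry : batch.index) {
    const int64_t begin = owned->offsets[static_cast<std::size_t>(entry)];
    const int64_t end = owned->offsets[static_cast<std::size_t>(entry) + 1];
    slots->push_back({owned->blob.data() + begin, end - begin});
  }
}
```

      With this approach the per-row work is only pointer arithmetic, and the number of string bytes held in memory depends on the number of distinct values rather than the number of rows.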


            People

              norbertluksa Norbert Luksa
              boroknagyz Zoltán Borók-Nagy