Hive / HIVE-23054

Capture Total Byte Size in Column Statistics

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: CBO, Statistics
    • Labels: None

    Description

      Store a counter in the HMS column statistics for the total number of raw bytes in each column.

      Right now, there is no good way to merge the average column length when an INSERT statement adds data to a table. The code simply takes the maximum of the two values. As a result, inserting a single record with a long value (128 bytes) into a table that holds millions of strings with an average length of 4 boosts the average length for the entire data set to 128.

      aggregateData.setAvgColLen(Math.max(aggregateData.getAvgColLen(), newData.getAvgColLen()));
      

      https://github.com/apache/hive/blob/e182d9ce6c09136d13ee889ef069b202f60052ec/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/merge/StringColumnStatsMerger.java#L34

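      Proposal: store the total raw size of all the data in each column. With the total raw size and the average length available on each side of a merge, the number of values can be derived (total size divided by average length), so the real average length of the combined data can be computed from the existing statistics and the statistics of the newly inserted data.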
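
      The arithmetic behind the proposal can be sketched as follows. This is only an illustration under stated assumptions: ColumnLengthMergeSketch, LengthStats, totalBytes, and merge are hypothetical names and are not part of the Hive metastore API. The sketch also assumes the value count is derived from total bytes and average length, rather than stored separately.

```java
// Hypothetical sketch; none of these names exist in the Hive metastore.
final class ColumnLengthMergeSketch {

    /** Minimal stand-in for per-column string statistics. */
    static final class LengthStats {
        final long totalBytes;   // proposed counter: raw bytes in the column
        final double avgColLen;  // average length, as tracked today

        LengthStats(long totalBytes, double avgColLen) {
            this.totalBytes = totalBytes;
            this.avgColLen = avgColLen;
        }

        /** Number of values implied by total size and average length. */
        double valueCount() {
            return avgColLen > 0 ? totalBytes / avgColLen : 0.0;
        }
    }

    /** Merge two stats objects using a size-weighted average instead of Math.max. */
    static LengthStats merge(LengthStats existing, LengthStats inserted) {
        long totalBytes = Math.addExact(existing.totalBytes, inserted.totalBytes);
        double count = existing.valueCount() + inserted.valueCount();
        double avg = count > 0 ? totalBytes / count : 0.0;
        return new LengthStats(totalBytes, avg);
    }

    public static void main(String[] args) {
        // One million strings averaging 4 bytes, plus one 128-byte record.
        LengthStats existing = new LengthStats(4_000_000L, 4.0);
        LengthStats inserted = new LengthStats(128L, 128.0);
        LengthStats merged = merge(existing, inserted);

        // The weighted average stays close to 4; the max-based merge would report 128.
        System.out.println("weighted avg = " + merged.avgColLen);
        System.out.println("max-based avg = " + Math.max(existing.avgColLen, inserted.avgColLen));
    }
}
```

      With the numbers from the description, the merged total is 4,000,128 bytes over 1,000,001 values, so the average remains roughly 4 instead of jumping to 128.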

          People

            Assignee: Unassigned
            Reporter: David Mollitor (belugabehr)