Details
-
Bug
-
Status: Open
-
Minor
-
Resolution: Unresolved
-
Impala 3.2.0
-
None
Description
computeMinScalarColumnMemReservation() uses the avg_size column stat to estimate the memory needed per value during scanning, but avg_size does not include the 4-byte length field that plain encoding stores with each value. This overhead can dominate for columns with very short strings. (Compression can probably negate this effect.)
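To illustrate the size of the gap, here is a minimal sketch of the per-column estimate with and without the length field. It is not the Impala planner code: the function name, its parameters, and the sample numbers are hypothetical, and compression is ignored.

```python
import math

# Parquet PLAIN encoding stores a 4-byte length before each string value.
PLAIN_LENGTH_PREFIX_BYTES = 4


def plain_string_bytes(num_values, avg_size, include_length_prefix=True):
    """Estimate uncompressed bytes for a plain-encoded string column.

    Hypothetical helper for illustration only; avg_size is the column
    stat, which counts the string payload but not the length field.
    """
    per_value = avg_size
    if include_length_prefix:
        per_value += PLAIN_LENGTH_PREFIX_BYTES
    return math.ceil(num_values * per_value)


# Assumed example: 1,000,000 values averaging 2 bytes each.
without_prefix = plain_string_bytes(1_000_000, 2.0, include_length_prefix=False)
with_prefix = plain_string_bytes(1_000_000, 2.0)
```

With these assumed inputs the length fields account for two thirds of the uncompressed data, so an estimate based on avg_size alone is off by a factor of three.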
For the dictionary decoding estimate:
- the same 4 bytes/NDV should also be added, as the dictionary itself is plain encoded
- the backend uses an extra 12 bytes/NDV for the StringValues used as indirection in the dictionary, but I am not sure whether this should be added to the reservation
- a more pessimistic estimate would use max_size instead of avg_size for dictionary entries: the majority of distinct values may be long while the short ones are much more frequent, which makes avg_size small
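The three dictionary adjustments can be sketched as one function with optional switches. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, the 4-byte and 12-byte figures are the ones quoted in this report, and whether the StringValue overhead belongs in the reservation is left as an open flag.

```python
import math

LENGTH_PREFIX_BYTES = 4   # length field per plain-encoded dictionary entry
STRING_VALUE_BYTES = 12   # per-entry StringValue indirection, as quoted in this report


def dict_bytes(ndv, avg_size, max_size=None, include_string_values=False):
    """Estimate bytes needed for a plain-encoded string dictionary.

    Hypothetical helper for illustration only.
    - ndv: number of distinct values (dictionary entries)
    - max_size: if given, used instead of avg_size (pessimistic estimate)
    - include_string_values: add the backend's per-entry indirection
    """
    entry_size = avg_size if max_size is None else max_size
    per_entry = entry_size + LENGTH_PREFIX_BYTES
    if include_string_values:
        per_entry += STRING_VALUE_BYTES
    return math.ceil(ndv * per_entry)


# Assumed example: 1,000 distinct values, avg_size 3 bytes, max_size 100 bytes.
baseline = dict_bytes(1000, 3.0)
with_indirection = dict_bytes(1000, 3.0, include_string_values=True)
pessimistic = dict_bytes(1000, 3.0, max_size=100)
```

In this assumed example, switching from avg_size to max_size changes the estimate far more than either fixed per-entry overhead, which is the skew scenario described in the last bullet.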
Another small underestimation is that NULL values are ignored. NULLs (= def levels) could be added as 1 bit/value.
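A minimal sketch of the 1 bit/value term, rounded up to whole bytes. The function name is hypothetical, and the 1 bit/value figure is the flat worst-case assumption proposed above, not a measurement of how the levels are actually encoded.

```python
def def_level_bytes(num_values):
    """Bytes for definition levels at an assumed 1 bit per value.

    Hypothetical helper for illustration only; rounds up to a whole byte.
    """
    return (num_values + 7) // 8


# Assumed example: 1,000,000 values need 125,000 bytes of def levels.
levels = def_level_bytes(1_000_000)
```

For a column of 2-byte strings this adds about 2% on top of the payload plus length fields, which matches the description of it as a small underestimation.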