CALCITE-6329: Use weighted-average calculation for the column sizes in Union operator


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.37.0
    • Fix Version/s: None
    • Component/s: core
    • Labels: None

    Description

      The method averageColumnSizes(Union rel, RelMetadataQuery mq) in class RelMdSize computes each column size of a Union operator as a simple (unweighted) average over the inputs; row_size, which is derived from those column sizes, is therefore also unweighted.

      public List<Double> averageColumnSizes(Union rel, RelMetadataQuery mq) {
        final int fieldCount = rel.getRowType().getFieldCount();
        List<List<Double>> inputColumnSizeList = new ArrayList<>();
        for (RelNode input : rel.getInputs()) {
          final List<Double> inputSizes = mq.getAverageColumnSizes(input);
          if (inputSizes != null) {
            inputColumnSizeList.add(inputSizes);
          }
        }
        switch (inputColumnSizeList.size()) {
        case 0:
          return null; // all were null
        case 1:
          return inputColumnSizeList.get(0); // all but one were null
        }
        final ImmutableNullableList.Builder<Double> sizes =
            ImmutableNullableList.builder();
        int nn = 0;
        for (int i = 0; i < fieldCount; i++) {
          double d = 0d;
          int n = 0;
          for (List<Double> inputColumnSizes : inputColumnSizeList) {
            Double d2 = inputColumnSizes.get(i);
            if (d2 != null) {
              d += d2;
              ++n;
              ++nn;
            }
          }
          sizes.add(n > 0 ? d / n : null);
        }
        if (nn == 0) {
          return null; // all columns are null
        }
        return sizes.build();
      } 
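
      In other words, the current estimate for column i is size_i = (Σ_j size_ij) / n, averaging over the n inputs that report a size for that column; the weighted version suggested below would instead compute size_i = (Σ_j row_count_j * size_ij) / (Σ_j row_count_j).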

      However, it does not take the row count of each input into account, which can produce a badly skewed estimate and mislead downstream operators. For example:

      # We have two tables A and B
      
      # Logical Plan
      ShuffleWrite
        Union
          TableScan(table=A)
          TableScan(table=B)
      
      # stats
      row_count(A) = 1E9, row_size(A) = 10
      row_count(B) = 1E5, row_size(B) = 100
      row_count(Union) = 1E9 + 1E5 = 1.0001E9
      row_size(Union) = (10 + 100) / 2 = 55                  # simple average
      row_count(ShuffleWrite) = row_count(Union) = 1.0001E9  # inherited from Union
      row_size(ShuffleWrite) = row_size(Union) = 55          # inherited from Union
      
      
      # cost estimation of ShuffleWrite, much larger than the real input bytes
      input_bytes(ShuffleWrite) = 55 * 1.0001E9 = 5.50055E10
      input_bytes(Union) = 1E9 * 10 + 1E5 * 100 = 1.001E10
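
      For comparison, a weighted average over the same numbers (the rounding to 10.01 is mine) keeps the estimate in line with the real input bytes:

      # weighted average, using input row counts as weights
      row_size(Union) = (1E9 * 10 + 1E5 * 100) / 1.0001E9 ≈ 10.01
      input_bytes(ShuffleWrite) ≈ 10.01 * 1.0001E9 ≈ 1.001E10  # matches input_bytes(Union)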

       

      So I suggest taking the row count of each input into consideration and using a weighted average to calculate every column size, and hence the final row size, instead.
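
      A minimal sketch of what the weighted calculation might look like, mirroring the method quoted above and assuming each input's estimated row count is available via mq.getRowCount(input) (an illustration, not a tested patch):

      public List<Double> averageColumnSizes(Union rel, RelMetadataQuery mq) {
        final int fieldCount = rel.getRowType().getFieldCount();
        final List<List<Double>> inputColumnSizeList = new ArrayList<>();
        final List<Double> inputWeights = new ArrayList<>();
        for (RelNode input : rel.getInputs()) {
          final List<Double> inputSizes = mq.getAverageColumnSizes(input);
          if (inputSizes != null) {
            inputColumnSizeList.add(inputSizes);
            // Weight each input by its estimated row count; fall back to a
            // weight of 1 if the row count is unknown or non-positive.
            final Double rowCount = mq.getRowCount(input);
            inputWeights.add(rowCount == null || rowCount <= 0 ? 1d : rowCount);
          }
        }
        if (inputColumnSizeList.isEmpty()) {
          return null; // all inputs had unknown column sizes
        }
        final ImmutableNullableList.Builder<Double> sizes =
            ImmutableNullableList.builder();
        int nn = 0;
        for (int i = 0; i < fieldCount; i++) {
          double weightedSum = 0d;
          double weightSum = 0d;
          for (int j = 0; j < inputColumnSizeList.size(); j++) {
            final Double d2 = inputColumnSizeList.get(j).get(i);
            if (d2 != null) {
              weightedSum += inputWeights.get(j) * d2;
              weightSum += inputWeights.get(j);
              ++nn;
            }
          }
          // Weighted average over the inputs that report a size for column i.
          sizes.add(weightSum > 0 ? weightedSum / weightSum : null);
        }
        if (nn == 0) {
          return null; // all columns are null
        }
        return sizes.build();
      }

      With the numbers from the example, this yields row_size(Union) ≈ 10.01 instead of 55, so the ShuffleWrite cost estimate tracks the actual input bytes.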

       

       


          People

            Assignee: mengdou
            Reporter: mengdou
