CALCITE-6329: Use weighted-average calculation for the column sizes in Union operator


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.37.0
    • Fix Version/s: None
    • Component/s: core
    • Labels: None

    Description

      The method averageColumnSizes(Union rel, RelMetadataQuery mq) in class RelMdSize computes each column size of a Union operator as a simple (unweighted) average over the inputs; row_size, which is derived from those column sizes, is therefore also unweighted.

      public List<Double> averageColumnSizes(Union rel, RelMetadataQuery mq) {
        final int fieldCount = rel.getRowType().getFieldCount();
        List<List<Double>> inputColumnSizeList = new ArrayList<>();
        for (RelNode input : rel.getInputs()) {
          final List<Double> inputSizes = mq.getAverageColumnSizes(input);
          if (inputSizes != null) {
            inputColumnSizeList.add(inputSizes);
          }
        }
        switch (inputColumnSizeList.size()) {
        case 0:
          return null; // all were null
        case 1:
          return inputColumnSizeList.get(0); // all but one were null
        }
        final ImmutableNullableList.Builder<Double> sizes =
            ImmutableNullableList.builder();
        int nn = 0;
        for (int i = 0; i < fieldCount; i++) {
          double d = 0d;
          int n = 0;
          for (List<Double> inputColumnSizes : inputColumnSizeList) {
            Double d2 = inputColumnSizes.get(i);
            if (d2 != null) {
              d += d2;
              ++n;
              ++nn;
            }
          }
          sizes.add(n > 0 ? d / n : null);
        }
        if (nn == 0) {
          return null; // all columns are null
        }
        return sizes.build();
      } 
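
      In other words, the current estimate for column i is size_i = (Σ_j size_ij) / n, averaging over the n inputs that report a size for that column; the weighted version suggested below would instead compute size_i = (Σ_j row_count_j * size_ij) / (Σ_j row_count_j).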

      However, it does not take the row count of each input into account, which can produce a badly skewed estimate and mislead downstream operators. For example:

      # We have two tables A and B
      
      # Logical Plan
      ShuffleWrite
        Union
          TableScan(table=A)
          TableScan(table=B)
      
      # stats
      row_count(A) = 1E9, row_size(A) = 10
      row_count(B) = 1E5, row_size(B) = 100
      row_count(Union) = 1E9 + 1E5 = 1.0001E9
      row_size(Union) = (10 + 100) / 2 = 55                  # simple average
      row_count(ShuffleWrite) = row_count(Union) = 1.0001E9  # inherited from Union
      row_size(ShuffleWrite) = row_size(Union) = 55          # inherited from Union
      
      
      # cost estimation of ShuffleWrite, much larger than the real input bytes
      input_bytes(ShuffleWrite) = 55 * 1.0001E9 = 5.50055E10
      input_bytes(Union) = 1E9 * 10 + 1E5 * 100 = 1.001E10
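
      For comparison, a weighted average over the same numbers (the rounding to 10.01 is mine) keeps the estimate in line with the real input bytes:

      # weighted average, using input row counts as weights
      row_size(Union) = (1E9 * 10 + 1E5 * 100) / 1.0001E9 ≈ 10.01
      input_bytes(ShuffleWrite) ≈ 10.01 * 1.0001E9 ≈ 1.001E10  # matches input_bytes(Union)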

       

      So I suggest taking the row count of each input into consideration and using a weighted average to calculate every column size, and hence the final row size, instead.
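
      A minimal sketch of what the weighted calculation might look like, mirroring the method quoted above and assuming each input's estimated row count is available via mq.getRowCount(input) (an illustration, not a tested patch):

      public List<Double> averageColumnSizes(Union rel, RelMetadataQuery mq) {
        final int fieldCount = rel.getRowType().getFieldCount();
        final List<List<Double>> inputColumnSizeList = new ArrayList<>();
        final List<Double> inputWeights = new ArrayList<>();
        for (RelNode input : rel.getInputs()) {
          final List<Double> inputSizes = mq.getAverageColumnSizes(input);
          if (inputSizes != null) {
            inputColumnSizeList.add(inputSizes);
            // Weight each input by its estimated row count; fall back to a
            // weight of 1 if the row count is unknown or non-positive.
            final Double rowCount = mq.getRowCount(input);
            inputWeights.add(rowCount == null || rowCount <= 0 ? 1d : rowCount);
          }
        }
        if (inputColumnSizeList.isEmpty()) {
          return null; // all inputs had unknown column sizes
        }
        final ImmutableNullableList.Builder<Double> sizes =
            ImmutableNullableList.builder();
        int nn = 0;
        for (int i = 0; i < fieldCount; i++) {
          double weightedSum = 0d;
          double weightSum = 0d;
          for (int j = 0; j < inputColumnSizeList.size(); j++) {
            final Double d2 = inputColumnSizeList.get(j).get(i);
            if (d2 != null) {
              weightedSum += inputWeights.get(j) * d2;
              weightSum += inputWeights.get(j);
              ++nn;
            }
          }
          // Weighted average over the inputs that report a size for column i.
          sizes.add(weightSum > 0 ? weightedSum / weightSum : null);
        }
        if (nn == 0) {
          return null; // all columns are null
        }
        return sizes.build();
      }

      With the numbers from the example, this yields row_size(Union) ≈ 10.01 instead of 55, so the ShuffleWrite cost estimate tracks the actual input bytes.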

       

       


          People

            Assignee: mengdou
            Reporter: mengdou
