  1. Kudu
  2. KUDU-2917

Split a tablet into primary key ranges by number of rows

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • None
    • None
    • perf, spark

    Description

      Since KUDU-2437 and KUDU-2670 were implemented, a Spark job can read the data inside a tablet in parallel. However, we found in actual use that splitting key ranges by size can give Spark long-tail tasks: some tasks read much more data than others even though the data size of each KeyRange is basically the same.

      I think this issue is caused by columnar encoding and compression. For example, suppose we store 1000 rows in columnar format. If most values in a column are the same, little storage space is required; if the values differ, more storage is needed. Key ranges of equal stored size can therefore hold very different numbers of rows. So I think splitting the primary key range by number of rows might be a better choice.
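
      The skew described above can be illustrated with a small, self-contained sketch. It does not use any Kudu or Spark API: `split_by_size`, `split_by_rows`, and the per-row encoded sizes are hypothetical and chosen only to show the effect, assuming that rows with repeated values encode to fewer bytes than rows with distinct values.

```python
def split_by_size(row_sizes, target_bytes):
    """Group consecutive rows into chunks of roughly target_bytes encoded bytes.

    Returns the number of rows in each chunk.
    """
    chunks = []
    rows_in_chunk = 0
    bytes_in_chunk = 0
    for size in row_sizes:
        rows_in_chunk += 1
        bytes_in_chunk += size
        if bytes_in_chunk >= target_bytes:
            chunks.append(rows_in_chunk)
            rows_in_chunk = 0
            bytes_in_chunk = 0
    if rows_in_chunk:
        chunks.append(rows_in_chunk)
    return chunks


def split_by_rows(row_sizes, target_rows):
    """Group consecutive rows into chunks of at most target_rows rows.

    Returns the number of rows in each chunk.
    """
    total = len(row_sizes)
    return [min(target_rows, total - start) for start in range(0, total, target_rows)]


# Hypothetical tablet of 1000 rows (illustrative numbers, not measurements):
# the first 900 rows have repeated values and encode to 1 byte each,
# the last 100 rows have distinct values and encode to 9 bytes each.
row_sizes = [1] * 900 + [9] * 100

by_size = split_by_size(row_sizes, target_bytes=900)
by_rows = split_by_rows(row_sizes, target_rows=500)

print("rows per task, split by size:", by_size)
print("rows per task, split by rows:", by_rows)
```

      In this toy setup both size-based chunks hold 900 encoded bytes, yet one task scans 900 rows and the other only 100, while the row-based split gives each task 500 rows. Whether row count is the better balance metric in a real deployment would still depend on per-row decoding cost.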

            People

              oclarms Xu Yao
              oclarms Xu Yao