Details

    • Sub-task
    • Status: Closed
    • Major
    • Resolution: Not A Problem
    • None
    • None
    • None

    Description

      It will be useful for Apache Beam if we support the following SqlHint:

      SELECT * FROM t
      GROUP BY t.key_column /* + hot_key(key1=fanout_factor, ...) */)

      The hot key strategy works on aggregation and it provides a list of hot keys with fanout factor for a column. The fanout factor says how many partition should be created for that specific key, such that we can have a per partition aggregate and then have a final aggregate. One example to explain it:

      SELECT * FROM t
      GROUP BY t.key_column /* + hot_key("value1"=2) */)

      // for the key_column, there is a "value1" which appear so many times (so it's hot), please consider split it into two partition and process separately.

      Such problem is common for big data processing, where hot key creates slowest machine which either slow down the whole pipeline or make retries. In such case, one common resolution is to split data to multiple partition and aggregate per partition, and then have a final combine.

      Usually execution engine won't know what is the hot key(s). SqlHint provides a good way to tell the engine which key is useful to deal with it.

      Attachments

        Issue Links

          Activity

            People

              amaliujia Rui Wang
              amaliujia Rui Wang
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 0.5h
                  0.5h