Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-47836

Performance problem with QuantileSummaries

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.5.1
    • None
    • SQL
    • None

    Description

      SPARK-29336 caused a severe performance regression.
      In practice a partial_aggregate with several approx_percentile calls ran less than hour and the final aggrergation after exchange would have taken over a week.
      Simple percentile ran about the same time in the first part and the final aggregate ran very quickly.

      I made a benchmark, and it reveals that the merge operation is very-very slow: https://github.com/tanelk/spark/commit/3b16f429a77b10003572295f42361fbfb2f3c63e
      From my experiments it looks like it is n^2 with the number of partitions (number of partial aggregations to merge).
      When I reverted the changes made in this PR, then the "Only insert" and "Insert & merge" were very similar.

      The cause seems to be, that compressImmut does not reduce the number samples allmost at all after merges and just keeps iterating over an evergrowing list.
      I was not able to figure out how to fix the issue without just reverting the PR.

      I also added a benchmark with KllDoublesSketch from the apache datasketches project and it worked even better than this class before this PR.
      Only downside was that it is not-deterministic.

      Attachments

        Activity

          People

            Unassigned Unassigned
            tanelk Tanel Kiis
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: