Spark / SPARK-29562

SQLAppStatusListener metrics aggregation is slow and memory hungry


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.4
    • Fix Version/s: 3.0.0
    • Component/s: SQL
    • Labels: None

    Description

      Although SQLAppStatusListener was added in 2.3, its aggregation code is very similar to what existed before it, so I'm sure this problem is even older.

      Long story short, the aggregation code (SQLAppStatusListener.aggregateMetrics) is very slow: it can take a non-trivial amount of time on large queries, and it also uses a lot of memory.

      That also causes cascading issues: since aggregateMetrics is called from an event handler, it can slow down event processing and cause events to be dropped. Listeners can then miss important events that would tell them to free up internal state (and, thus, memory).

      To give an anecdotal example, one app I looked at ran into the "events being dropped" issue, which caused the listener to accumulate state for hundreds of live stages, even though most of them had already finished. That led to a few GB of memory being wasted on finished stages that were still being tracked.
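The cascade described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not Spark's listener bus: `BoundedEventQueue`, `StageTrackingListener`, and `simulate` are hypothetical names, the queue simply drops new events when full, and "slowness" is modeled as the handler consuming fewer events per tick than the producer posts.

```python
from collections import deque


class BoundedEventQueue:
    """Hypothetical bounded queue: a new event is dropped when the queue is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.dropped = 0

    def post(self, event):
        if len(self.events) >= self.capacity:
            self.dropped += 1  # queue is full, so the event is lost
            return False
        self.events.append(event)
        return True


class StageTrackingListener:
    """Hypothetical listener that keeps state per live stage until it sees the end event."""

    def __init__(self):
        self.live_stages = set()

    def on_event(self, event):
        kind, stage_id = event
        if kind == "stage_start":
            self.live_stages.add(stage_id)
        elif kind == "stage_end":
            self.live_stages.discard(stage_id)  # frees the per-stage state


def simulate(num_stages, capacity, events_handled_per_tick):
    """Post a start and an end event per tick; the handler keeps up only if it is fast enough."""
    queue = BoundedEventQueue(capacity)
    listener = StageTrackingListener()
    for stage_id in range(num_stages):
        queue.post(("stage_start", stage_id))
        queue.post(("stage_end", stage_id))
        for _ in range(events_handled_per_tick):
            if queue.events:
                listener.on_event(queue.events.popleft())
    # Drain whatever is left once the producer has finished.
    while queue.events:
        listener.on_event(queue.events.popleft())
    return queue.dropped, len(listener.live_stages)


if __name__ == "__main__":
    print("fast handler:", simulate(100, capacity=4, events_handled_per_tick=2))
    print("slow handler:", simulate(100, capacity=4, events_handled_per_tick=1))
```

With a handler that keeps up, nothing is dropped and no stage stays tracked. Once the handler falls behind, the queue fills, end events start getting dropped, and the listener keeps state for stages that have in fact finished, which is the kind of leak described in the anecdote.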

      Here, though, I'd like to focus on SQLAppStatusListener.aggregateMetrics and on making it faster. We should look at the other issues (unblocking event processing, cleaning up stale data in listeners) separately.

      (I also remember someone trying to fix something in this area in the past, but I couldn't find a PR or an open bug.)


            People

              Assignee: Marcelo Masiero Vanzin (vanzin)
              Reporter: Marcelo Masiero Vanzin (vanzin)