[SPARK-29795] Possible 'leak' of Metrics with dropwizard metrics 4.x


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: Spark Core
    • Labels: None

    Description

      This one's a little complex to explain.

      SPARK-29674 updated dropwizard metrics to 4.x, for Spark 3 only. That appears to be fine, according to tests. We have not backported it to 2.4.x and do not intend to.

      However, I'm working with a few people who are trying to backport this change to Spark 2.4.x separately. When the update is applied, tests fail readily with OutOfMemoryError, typically around ExternalShuffleServiceSuite in core. A heap dump analysis shows that MetricRegistry objects are retaining a gigabyte or more of memory.

      The registry appears to be holding references to many large internal Spark objects, like BlockManager and Netty objects, via the closures we pass to Gauge objects. Although it looked odd, this may or may not be an issue; in normal usage, where a JVM hosts one SparkContext, this may be normal.
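
      For illustration, here is a minimal, hypothetical sketch (not Spark's actual metrics code) of how a Gauge registered via a closure keeps a large object reachable for as long as the MetricRegistry retains it:

      {code:scala}
      import com.codahale.metrics.{Gauge, MetricRegistry}

      object GaugeRetentionExample {
        // Stand-in for a large, SparkContext-scoped object such as BlockManager.
        class BlockManagerLike {
          val buffer = new Array[Byte](64 * 1024 * 1024) // ~64 MB
        }

        def main(args: Array[String]): Unit = {
          val registry = new MetricRegistry
          val blockManager = new BlockManagerLike

          // getValue closes over `blockManager`, so as long as the registry retains
          // this Gauge, the 64 MB buffer stays reachable and cannot be GC'd, even
          // after everything else that referenced `blockManager` is torn down.
          registry.register("blockManager.bufferBytes", new Gauge[Long] {
            override def getValue: Long = blockManager.buffer.length.toLong
          })
        }
      }
      {code}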

      However, in tests where contexts are started and restarted repeatedly, it seems like this might 'leak' references to old context-related objects across runs via metrics. I don't have a clear theory yet on how (is SparkEnv shared, or is some reference to it held?), beyond the empirical evidence. It's also not clear why this apparently doesn't affect Spark 3, whose tests work fine. It could be that another fix in Spark 3 happens to help here, or that Spark 3 uses less memory and never hits the issue.
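
      The pattern that seems to trigger it looks roughly like the following illustrative loop (not one of the actual suites), assuming Spark is on the classpath: each iteration starts and stops a fresh SparkContext in the same JVM, and if stopped contexts' metrics stay registered, the objects their gauges capture accumulate across iterations.

      {code:scala}
      import org.apache.spark.{SparkConf, SparkContext}

      object RepeatedContextExample {
        def main(args: Array[String]): Unit = {
          // Start and stop a fresh SparkContext repeatedly in one JVM, as many test
          // suites do. If metrics from stopped contexts are never unregistered, each
          // iteration pins another generation of context-scoped objects in memory.
          for (i <- 1 to 10) {
            val conf = new SparkConf().setMaster("local[2]").setAppName(s"run-$i")
            val sc = new SparkContext(conf)
            try {
              sc.parallelize(1 to 1000).count()
            } finally {
              sc.stop()
            }
          }
        }
      }
      {code}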

      Despite that uncertainty, I've found that simply clearing the registered metrics from MetricsSystem when it is stop()-ped seems to resolve the issue. At that point, Spark is shutting down and the sinks have stopped, so there doesn't seem to be any harm in manually releasing all registered metrics and the objects they reference. I don't think we intend to track metrics across two instantiations of a SparkContext in the same JVM, but that's an open question.

      That's the change I will propose in a PR.
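
      Roughly, the idea is to clear the underlying MetricRegistry after the sinks have been stopped in MetricsSystem.stop(). Below is a minimal sketch under that assumption, using a simplified stand-in class rather than the real MetricsSystem:

      {code:scala}
      import com.codahale.metrics.{MetricFilter, MetricRegistry}

      // Simplified stand-in for org.apache.spark.metrics.MetricsSystem; the field
      // names and method bodies are illustrative, not Spark's actual internals.
      class SimpleMetricsSystem {
        private val registry = new MetricRegistry
        private var running = false

        def start(): Unit = {
          running = true
          // ... sources and sinks would be registered against `registry` here ...
        }

        def stop(): Unit = {
          if (running) {
            // ... sinks would be stopped first, so nothing reads the registry now ...
            // Drop every registered metric: this releases the Gauge closures and the
            // BlockManager/Netty/etc. objects they capture so they can be collected.
            registry.removeMatching(MetricFilter.ALL)
            running = false
          }
        }
      }
      {code}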

      Why does this not happen with 2.4 + metrics 3.x? Unclear. We have not seen any test failures like this in 2.4, nor reports of metrics-related memory pressure. It could be a change in how 4.x behaves: how it tracks objects or manages their lifecycles.

      The difference does not seem to be Scala 2.11 vs. 2.12, by the way: 2.4 works fine on both without the 4.x update, and runs out of memory on both with the change.

      Why do this if it only affects 2.4 + metrics 4.x, and we're not moving to metrics 4.x in 2.4? It could still be a smaller issue in Spark 3 that the tests don't detect. It may help apps that, for various reasons, run multiple SparkContexts per JVM, as many other projects' test suites do. And it may simply be good hygiene to clear these resources manually at shutdown.

      Therefore I can't call this a bug per se; it's better characterized as an improvement.


            People

              Assignee: Sean R. Owen (srowen)
              Reporter: Sean R. Owen (srowen)
              Votes: 0
              Watchers: 0
