[SPARK-29795] Possible 'leak' of Metrics with dropwizard metrics 4.x


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: Spark Core
    • Labels: None

    Description

      This one's a little complex to explain.

      SPARK-29674 updated dropwizard metrics to 4.x, for Spark 3 only. That appears to be fine, according to tests. We have not backported it to 2.4.x and do not intend to.

      However, I'm working with a few people who are trying to backport this change to Spark 2.4.x separately. When the update is applied, tests fail readily with OutOfMemoryError, typically around ExternalShuffleServiceSuite in core. A heap dump analysis shows that MetricRegistry objects are retaining a gigabyte or more of memory.

      The registry appears to be holding references to many large internal Spark objects, like BlockManager and Netty objects, via the closures we pass to Gauge objects. Although it looked odd, this may or may not be an issue; in normal usage, where a JVM hosts one SparkContext, this may be normal.
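
      For illustration, here is a minimal, hypothetical sketch (not Spark's actual metrics code) of how a Gauge registered via a closure keeps a large object reachable for as long as the MetricRegistry retains it:

      {code:scala}
      import com.codahale.metrics.{Gauge, MetricRegistry}

      object GaugeRetentionExample {
        // Stand-in for a large, SparkContext-scoped object such as BlockManager.
        class BlockManagerLike {
          val buffer = new Array[Byte](64 * 1024 * 1024) // ~64 MB
        }

        def main(args: Array[String]): Unit = {
          val registry = new MetricRegistry
          val blockManager = new BlockManagerLike

          // getValue closes over `blockManager`, so as long as the registry retains
          // this Gauge, the 64 MB buffer stays reachable and cannot be GC'd, even
          // after everything else that referenced `blockManager` is torn down.
          registry.register("blockManager.bufferBytes", new Gauge[Long] {
            override def getValue: Long = blockManager.buffer.length.toLong
          })
        }
      }
      {code}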

      However, in tests where contexts are started and restarted repeatedly, it seems like this might 'leak' references to old context-related objects across runs via metrics. I don't have a clear theory yet on how (is SparkEnv shared, or is some reference to it held?), beyond the empirical evidence. It's also not clear why this apparently doesn't affect Spark 3, whose tests work fine. It could be that another fix in Spark 3 happens to help here, or that Spark 3 uses less memory and never hits the issue.
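
      The pattern that seems to trigger it looks roughly like the following illustrative loop (not one of the actual suites), assuming Spark is on the classpath: each iteration starts and stops a fresh SparkContext in the same JVM, and if stopped contexts' metrics stay registered, the objects their gauges capture accumulate across iterations.

      {code:scala}
      import org.apache.spark.{SparkConf, SparkContext}

      object RepeatedContextExample {
        def main(args: Array[String]): Unit = {
          // Start and stop a fresh SparkContext repeatedly in one JVM, as many test
          // suites do. If metrics from stopped contexts are never unregistered, each
          // iteration pins another generation of context-scoped objects in memory.
          for (i <- 1 to 10) {
            val conf = new SparkConf().setMaster("local[2]").setAppName(s"run-$i")
            val sc = new SparkContext(conf)
            try {
              sc.parallelize(1 to 1000).count()
            } finally {
              sc.stop()
            }
          }
        }
      }
      {code}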

      Despite that uncertainty, I've found that simply clearing the registered metrics from MetricsSystem when it is stop()-ped seems to resolve the issue. At that point, Spark is shutting down and the sinks have stopped, so there doesn't seem to be any harm in manually releasing all registered metrics and the objects they reference. I don't think we intend to track metrics across two instantiations of a SparkContext in the same JVM, but that's an open question.

      That's the change I will propose in a PR.
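
      Roughly, the idea is to clear the underlying MetricRegistry after the sinks have been stopped in MetricsSystem.stop(). Below is a minimal sketch under that assumption, using a simplified stand-in class rather than the real MetricsSystem:

      {code:scala}
      import com.codahale.metrics.{MetricFilter, MetricRegistry}

      // Simplified stand-in for org.apache.spark.metrics.MetricsSystem; the field
      // names and method bodies are illustrative, not Spark's actual internals.
      class SimpleMetricsSystem {
        private val registry = new MetricRegistry
        private var running = false

        def start(): Unit = {
          running = true
          // ... sources and sinks would be registered against `registry` here ...
        }

        def stop(): Unit = {
          if (running) {
            // ... sinks would be stopped first, so nothing reads the registry now ...
            // Drop every registered metric: this releases the Gauge closures and the
            // BlockManager/Netty/etc. objects they capture so they can be collected.
            registry.removeMatching(MetricFilter.ALL)
            running = false
          }
        }
      }
      {code}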

      Why does this not happen with 2.4 + metrics 3.x? Unclear. We have not seen any test failures like this in 2.4, nor reports of metrics-related memory pressure. It could be a change in how 4.x behaves: how it tracks objects or manages their lifecycles.

      The difference does not seem to be Scala 2.11 vs. 2.12, by the way: 2.4 works fine on both without the 4.x update, and runs out of memory on both with the change.

      Why do this if it only affects 2.4 + metrics 4.x, and we're not moving to metrics 4.x in 2.4? It could still be a smaller issue in Spark 3 that the tests don't detect. It may help apps that, for various reasons, run multiple SparkContexts per JVM, as many other projects' test suites do. And it may simply be good hygiene to clear these resources manually at shutdown.

      Therefore I can't call this a bug per se; it's better characterized as an improvement.


            People

              Assignee: Sean R. Owen (srowen)
              Reporter: Sean R. Owen (srowen)
              Votes: 0
              Watchers: 0
