Spark / SPARK-29321

Possible memory leak in Spark


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.3.3
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None
    • Important

    Description

      This issue is a clone of SPARK-29055. After Spark version 2.3.3, I observe that the JVM memory increases slightly over time. This behavior also affects application performance: when I run my real application in a testing environment, after a while the persisted dataframes no longer fit into the executors' memory and they spill to disk.
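      For context, a minimal, hypothetical sketch (not the reporter's application) of the persistence pattern described: DataFrame.persist() defaults to a memory-and-disk storage level, so partitions that no longer fit in executor storage memory are spilled to disk rather than kept cached.

      from pyspark import StorageLevel
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("PersistExample").getOrCreate()

      df = spark.range(100000000)  # stand-in for the real application's dataframes
      # MEMORY_AND_DISK is the default storage level for DataFrame.persist();
      # partitions that do not fit in memory are spilled to disk.
      df.persist(StorageLevel.MEMORY_AND_DISK)
      print(df.count())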

      JVM memory usage (based on the htop command):

      Time    RES    SHR    MEM%
      1min    1349   32724  1.5
      3min    1936   32724  2.2
      5min    2506   32724  2.6
      7min    2564   32724  2.7
      9min    2584   32724  2.7
      11min   2585   32724  2.7
      13min   2592   32724  2.7
      15min   2591   32724  2.7
      17min   2591   32724  2.7
      30min   2600   32724  2.7
      1h      2618   32724  2.7

       

      HOW TO REPRODUCE THIS BEHAVIOR:

      Reproduce the behavior above by running the following snippet (I prefer to run it without any sleep delay) and tracking the JVM memory with the top or htop command; a programmatic monitoring sketch follows the snippet.

      import os
      import time

      from pyspark.sql import SparkSession

      target_dir = "..."  # directory containing the CSV files

      spark = SparkSession.builder.appName("DataframeCount").getOrCreate()

      # Re-read and count every CSV file forever; JVM memory is tracked externally.
      while True:
          for f in os.listdir(target_dir):
              df = spark.read.load(os.path.join(target_dir, f), format="csv")
              print("Number of records: {0}".format(df.count()))
              time.sleep(15)
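      The RES figures above can also be sampled programmatically rather than read off htop; a minimal sketch, assuming the third-party psutil package and JVM processes named "java" (adjust the filter to match your driver/executor processes):

      import time

      import psutil  # third-party package, assumed installed (pip install psutil)

      # Periodically print the resident set size (RES in htop terms) of every JVM process.
      def sample_java_rss(interval_s=60, samples=10):
          for _ in range(samples):
              for proc in psutil.process_iter(["pid", "name", "memory_info"]):
                  if proc.info["name"] == "java":
                      rss_mb = proc.info["memory_info"].rss / (1024.0 * 1024.0)
                      print("pid={0} RES={1:.0f}M".format(proc.info["pid"], rss_mb))
              time.sleep(interval_s)

      sample_java_rss()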

       

      TESTED CASES WITH THE SAME BEHAVIOR:

      • Tested with the default settings (spark-defaults.conf)
      • Added spark.cleaner.periodicGC.interval 1min (or less); a configuration sketch follows this list
      • Set spark.cleaner.referenceTracking.blocking=false
      • Ran the application in cluster mode
      • Increased/decreased the resources of the executors and driver
      • Tested with extraJavaOptions on the driver and executors: -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12
      • Also tested with Spark 2.4.4 (the latest release at the time), with the same behavior
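      A sketch (not the reporter's exact commands) of how the cleaner settings from the list above can be applied when building the session. Note that the driver's extraJavaOptions cannot take effect when set here, because the driver JVM is already running; those flags belong in spark-submit arguments or spark-defaults.conf.

      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
               .appName("DataframeCount")
               # ask the ContextCleaner to trigger a periodic GC every minute
               .config("spark.cleaner.periodicGC.interval", "1min")
               # make cleanup tasks non-blocking, matching the tested case above
               .config("spark.cleaner.referenceTracking.blocking", "false")
               .getOrCreate())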
         

      DEPENDENCIES

      • Operating system: Ubuntu 16.04.3 LTS
      • Java: jdk1.8.0_131 (also tested with jdk1.8.0_221)
      • Python: 2.7.12

      Attachments

        1. sparkVisualGC.PNG (151 kB, George Papa)
        2. Screen Shot 2019-10-20 at 10.55.03 PM.png (260 kB, Jungtaek Lim)


      People

        Assignee: Unassigned
        Reporter: George Papa (Geopap)
        Votes: 0
        Watchers: 3
