Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.5.0
Fix Version/s: None
Description
When I create a Dataset from a pandas DataFrame and persist it (DISK_ONLY), some "byte[]" objects (totaling the size of the imported DataFrame) remain in the driver's heap memory.
Here is sample code to reproduce it:
import pandas as pd
import gc
from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .getOrCreate()

# Import the pandas DataFrame and persist the resulting Spark DataFrame to disk only
pdf = pd.read_pickle('tmp/input.pickle')
df = spark.createDataFrame(pdf)
df = df.persist(storageLevel=StorageLevel.DISK_ONLY)
df.count()

# Drop all Python references and force garbage collection on both the Python and JVM sides
del pdf
del df
gc.collect()
spark.sparkContext._jvm.System.gc()
After running this code, I perform a manual GC in VisualVM, but the driver's memory usage remains at about 550 MB (at startup it was about 50 MB).
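To check the driver heap from the script itself (instead of attaching VisualVM), a minimal sketch along these lines should work; it only assumes Py4J access to java.lang.Runtime through the running SparkContext:

# Rough driver heap usage, read directly from the JVM via Py4J
runtime = spark.sparkContext._jvm.java.lang.Runtime.getRuntime()
used_mb = (runtime.totalMemory() - runtime.freeMemory()) / (1024 * 1024)
print("Driver heap in use: %d MB" % used_mb)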
Then I tested adding "df = df.unpersist()" after the "df.count()" line, and everything was OK (memory usage after a manual GC was back to about 50 MB); the tail of the script with that workaround is shown below.
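For reference, this is how the end of the repro script looks with that workaround applied (only the unpersist call differs from the code above):

df = df.persist(storageLevel=StorageLevel.DISK_ONLY)
df.count()
df = df.unpersist()  # added line: releases the cached blocks before the references are dropped
del pdf
del df
gc.collect()
spark.sparkContext._jvm.System.gc()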
I also tried reading from a Parquet file (without the unpersist line) with this code:
import gc
from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .getOrCreate()

# Same persist/count/cleanup sequence, but the data comes from a Parquet file
# instead of a pandas DataFrame
df = spark.read.parquet('tmp/input.parquet')
df = df.persist(storageLevel=StorageLevel.DISK_ONLY)
df.count()

del df
gc.collect()
spark.sparkContext._jvm.System.gc()
Again everything was fine: memory usage was about 50 MB after a manual GC.