Spark / SPARK-6120

DecisionTree.save uses too much Java heap space for default spark shell settings


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.3.0
    • Component/s: MLlib
    • Labels: None

    Description

      When the Python DecisionTree example in the programming guide is run, model.save runs out of Java heap space (the same failure appears in the Scala shell, as in the trace below):

      scala> model.save(sc, "myModelPath")
      [Stage 12:>                                                                                                                                        (0 + 8) / 8]15/03/02 14:19:16 ERROR Executor: Exception in task 1.0 in stage 12.0 (TID 22)
      java.lang.OutOfMemoryError: Java heap space
      	at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
      	at parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57)
      	at parquet.column.values.plain.PlainValuesWriter.<init>(PlainValuesWriter.java:45)
      	at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:102)
      	at parquet.column.values.dictionary.DictionaryValuesWriter$PlainDoubleDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:471)
      	at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:111)
      	at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74)
      	at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
      	at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
      	at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
      	at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
      	at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
      	at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
      	at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
      	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
      	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
      	at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:620)
      	at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:641)
      	at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:641)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
      	at org.apache.spark.scheduler.Task.run(Task.scala:64)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
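
      For reference, a rough Scala-shell reproduction along the lines of the programming guide's decision tree example (the dataset path is the sample LibSVM file shipped with Spark; the exact training parameters do not matter for triggering the failure):

      import org.apache.spark.mllib.tree.DecisionTree
      import org.apache.spark.mllib.util.MLUtils

      // Load the sample dataset shipped with Spark (LibSVM format).
      val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

      // Train a small decision tree classifier, as in the programming guide example.
      val numClasses = 2
      val categoricalFeaturesInfo = Map[Int, Int]()
      val model = DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo,
        "gini", 5, 32)  // impurity, maxDepth, maxBins

      // This call fails with java.lang.OutOfMemoryError under the default 512m driver heap.
      model.save(sc, "myModelPath")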
      

      Saving with JSON format instead of Parquet works. The failure seems to be caused by Parquet requiring a lot of metadata to describe the schema.

      I'm labeling this a bug since it should succeed with the default spark-shell settings. Potential fixes are:

      • increasing spark-shell default heap space settings (This is probably too hard to agree on currently.)
      • not using Parquet for storage (This would be good for small examples but probably worse for large models, where Parquet would be more efficient than other formats.)
      • compressing the schema (The various values in the DecisionTree model could be flattened into a single Seq of Double, as sketched after this list. This may be the best option for now.)
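
      A rough sketch of the flattening idea. The case classes and field names below are hypothetical, loosely modeled on the node data that DecisionTreeModel.save writes; they illustrate the schema change only and are not the actual internal types:

      // Current style: one Parquet column per node field (wide schema, with
      // per-column metadata and write buffers).
      case class NodeData(treeId: Int, nodeId: Int, prediction: Double, impurity: Double,
        gain: Double, splitFeature: Double, splitThreshold: Double)

      // Flattened style: all Double-valued fields packed into a single Seq[Double],
      // so the Parquet schema has one array column instead of many scalar columns.
      case class FlatNodeData(treeId: Int, nodeId: Int, values: Seq[Double])

      def flatten(n: NodeData): FlatNodeData =
        FlatNodeData(n.treeId, n.nodeId,
          Seq(n.prediction, n.impurity, n.gain, n.splitFeature, n.splitThreshold))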

      Notes:

      • This happens in both pyspark and Scala shells.
      • Increasing driver memory to 1g (from the default of 512m) makes this succeed (see the launch command after these notes).
      • Running other examples such as NaiveBayes with the default of 512m works.
      • This is a bit strange in that the saved model is actually small (86K on disk for me).
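
      For reference, the 1g driver memory workaround noted above can be applied when launching the shell with the standard launch scripts, e.g.:

      ./bin/spark-shell --driver-memory 1g
      ./bin/pyspark --driver-memory 1g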


    People

      Assignee: Joseph K. Bradley (josephkb)
      Reporter: Joseph K. Bradley (josephkb)
