Spark / SPARK-6120

DecisionTree.save uses too much Java heap space for default spark shell settings


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.3.0
    • Component/s: MLlib
    • Labels: None

    Description

      When the Python DecisionTree example in the programming guide is run, model.save runs out of Java heap space (the same failure appears in the Scala shell, as in the trace below):

      scala> model.save(sc, "myModelPath")
      [Stage 12:>                                                                                                                                        (0 + 8) / 8]15/03/02 14:19:16 ERROR Executor: Exception in task 1.0 in stage 12.0 (TID 22)
      java.lang.OutOfMemoryError: Java heap space
      	at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
      	at parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57)
      	at parquet.column.values.plain.PlainValuesWriter.<init>(PlainValuesWriter.java:45)
      	at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:102)
      	at parquet.column.values.dictionary.DictionaryValuesWriter$PlainDoubleDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:471)
      	at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:111)
      	at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74)
      	at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
      	at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
      	at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
      	at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
      	at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
      	at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
      	at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
      	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
      	at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
      	at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:620)
      	at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:641)
      	at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:641)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
      	at org.apache.spark.scheduler.Task.run(Task.scala:64)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
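
      For reference, a rough Scala-shell reproduction along the lines of the programming guide's decision tree example (the dataset path is the sample LibSVM file shipped with Spark; the exact training parameters do not matter for triggering the failure):

      import org.apache.spark.mllib.tree.DecisionTree
      import org.apache.spark.mllib.util.MLUtils

      // Load the sample dataset shipped with Spark (LibSVM format).
      val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

      // Train a small decision tree classifier, as in the programming guide example.
      val numClasses = 2
      val categoricalFeaturesInfo = Map[Int, Int]()
      val model = DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo,
        "gini", 5, 32)  // impurity, maxDepth, maxBins

      // This call fails with java.lang.OutOfMemoryError under the default 512m driver heap.
      model.save(sc, "myModelPath")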
      

      Saving with JSON format instead of Parquet works. The failure seems to be caused by Parquet requiring a lot of metadata to describe the schema.

      I'm labeling this a bug since it should succeed with the default spark-shell settings. Potential fixes are:

      • increasing spark-shell default heap space settings (This is probably too hard to agree on currently.)
      • not using Parquet for storage (This would be good for small examples but probably worse for large models, where Parquet would be more efficient than other formats.)
      • compressing the schema (The various values in the DecisionTree model could be flattened into a single Seq of Double, as sketched after this list. This may be the best option for now.)
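
      A rough sketch of the flattening idea. The case classes and field names below are hypothetical, loosely modeled on the node data that DecisionTreeModel.save writes; they illustrate the schema change only and are not the actual internal types:

      // Current style: one Parquet column per node field (wide schema, with
      // per-column metadata and write buffers).
      case class NodeData(treeId: Int, nodeId: Int, prediction: Double, impurity: Double,
        gain: Double, splitFeature: Double, splitThreshold: Double)

      // Flattened style: all Double-valued fields packed into a single Seq[Double],
      // so the Parquet schema has one array column instead of many scalar columns.
      case class FlatNodeData(treeId: Int, nodeId: Int, values: Seq[Double])

      def flatten(n: NodeData): FlatNodeData =
        FlatNodeData(n.treeId, n.nodeId,
          Seq(n.prediction, n.impurity, n.gain, n.splitFeature, n.splitThreshold))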

      Notes:

      • This happens in both pyspark and Scala shells.
      • Increasing driver memory to 1g (from the default of 512m) makes this succeed (see the launch command after these notes).
      • Running other examples such as NaiveBayes with the default of 512m works.
      • This is a bit strange in that the saved model is actually small (86K on disk for me).
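
      For reference, the 1g driver memory workaround noted above can be applied when launching the shell with the standard launch scripts, e.g.:

      ./bin/spark-shell --driver-memory 1g
      ./bin/pyspark --driver-memory 1g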


    People

      Assignee: Joseph K. Bradley (josephkb)
      Reporter: Joseph K. Bradley (josephkb)
