Spark / SPARK-27069

Spark (2.3.2) LDA transformation memory error (java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123))


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3.2
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      I trained an LDA model (feature dimension: 100, iterations: 100 or 50, distributed version, ml) using Spark 2.3.2 (emr-5.18.0).
      After that I want to transform a new dataset with that model, but whenever I transform new data I get a memory error.
      I reduced the data size to 0.1x and then to 0.01x of the original, but I still get the memory error (java.lang.OutOfMemoryError at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)).
       
      That hugeCapacity error (overflow) happens when the requested array size exceeds Integer.MAX_VALUE - 8. But I already reduced the data to a small size, so I can't see why this error still happens.
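      For reference, here is a rough Scala paraphrase of the JDK 8 capacity check behind that stack frame (the real code is Java, in java.io.ByteArrayOutputStream; this sketch only illustrates what the error at line 123 means, if I read the JDK source correctly):

      // Rough paraphrase of JDK 8 java.io.ByteArrayOutputStream capacity handling.
      // The OutOfMemoryError at ByteArrayOutputStream.java:123 is thrown when the
      // requested capacity overflows Int, i.e. the stream was asked to hold more
      // than ~2 GB, which suggests the serialized object (the cleaned closure),
      // not the input data size, is what blows past the limit.
      val MaxArraySize: Int = Integer.MAX_VALUE - 8

      def hugeCapacity(minCapacity: Int): Int = {
        if (minCapacity < 0) throw new OutOfMemoryError() // requested capacity overflowed Int
        if (minCapacity > MaxArraySize) Integer.MAX_VALUE else minCapacity
      }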

      I also wanted to switch the serializer to KryoSerializer, but I found that
      org.apache.spark.util.ClosureCleaner$.ensureSerializable always goes through org.apache.spark.serializer.JavaSerializationStream, even though I register Kryo classes.
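      For context, this is roughly how the Kryo registration looks (a minimal sketch; the registered classes below are placeholders, not my full list). As far as I can tell it cannot help here anyway, because Spark serializes cleaned closures with its built-in JavaSerializer regardless of spark.serializer, which is why the stack trace below goes through JavaSerializationStream:

      import org.apache.spark.SparkConf
      import org.apache.spark.sql.SparkSession

      // Minimal sketch of Kryo registration (class list is a placeholder).
      // Closure cleaning still uses Spark's internal JavaSerializer, so this
      // setting does not change the serializer seen in the stack trace below.
      val conf = new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .registerKryoClasses(Array(
          classOf[org.apache.spark.ml.linalg.SparseVector],
          classOf[org.apache.spark.ml.linalg.DenseVector]
        ))
      val spark = SparkSession.builder().config(conf).getOrCreate()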
       

      Is there anything I can do?

       
      Below is the code:

       
      import org.apache.spark.ml.feature.CountVectorizerModel
      import org.apache.spark.ml.clustering.DistributedLDAModel

      val countvModel = CountVectorizerModel.load("s3://~/")
      val ldaModel = DistributedLDAModel.load("s3://~/")
      val transformeddata = countvModel.transform(inputData).select("productid", "itemid", "ptkString", "features")
      var featureldaDF = ldaModel.transform(transformeddata).select("productid", "itemid", "topicDistribution", "ptkString").toDF("productid", "itemid", "features", "ptkString")
      featureldaDF = featureldaDF.persist() // this is line 328 (App.scala:328 in the stack trace below)
      

       
       

      Other testing

      • JVM GC options: UseParallelGC, UseG1GC (both fail; a sketch of how such flags are passed follows below)
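      For reference, a sketch of how such GC flags are typically supplied to Spark (the property names are standard Spark configs, the values are just the flags listed above; on YARN they normally go on spark-submit via --conf, since the JVMs have to start with them):

      import org.apache.spark.SparkConf

      // Sketch: the GC options expressed as Spark JVM-option properties.
      // On YARN these are usually supplied at submit time (--conf), because
      // the driver/executor JVMs must be launched with the flags already set.
      val gcConf = new SparkConf()
        .set("spark.driver.extraJavaOptions", "-XX:+UseG1GC")    // or -XX:+UseParallelGC
        .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")  // or -XX:+UseParallelGC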

      Below is the log:

      19/03/05 20:59:03 ERROR ApplicationMaster: User class threw exception: java.lang.OutOfMemoryError
      java.lang.OutOfMemoryError
        at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
        at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
        at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
        at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
        at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:608)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
        at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:107)
        at org.apache.spark.sql.execution.columnar.InMemoryRelation.<init>(InMemoryRelation.scala:102)
        at org.apache.spark.sql.execution.columnar.InMemoryRelation$.apply(InMemoryRelation.scala:43)
        at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:97)
        at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:67)
        at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:91)
        at org.apache.spark.sql.Dataset.persist(Dataset.scala:2907)
        at coupang.cs.predictforxgboost.App$.main(App.scala:328)
        at coupang.cs.predictforxgboost.App.main(App.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
      

       

            People

              Assignee: Unassigned
              Reporter: TAESUK KIM (linetor)
