Spark / SPARK-25587

NPE in Dataset when reading from Parquet as Product


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      In an attempt to replicate the following issue in ADAM, a library downstream of Spark:
      https://github.com/bigdatagenomics/adam/issues/2058

      (also reported as https://issues.apache.org/jira/browse/SPARK-25588)

      the following Spark Shell script throws an NPE when attempting to read from Parquet.

      sc.setLogLevel("INFO")
      
      import spark.implicits._
      
      case class Inner(
        names: Seq[String] = Seq()
      ) 
      
      case class Outer(
        inners: Seq[Inner] = Seq()
      )
      
      val inner = Inner(Seq("name0", "name1"))
      val outer = Outer(Seq(inner))
      val dataset = sc.parallelize(Seq(outer)).toDS()
      
      val path = "outers.parquet"
      dataset.toDF().write.format("parquet").save(path)
      
      // reading back as a typed Dataset and materializing the first row triggers the NPE below
      val roundtrip = spark.read.parquet(path).as[Outer]
      roundtrip.first
      

      Stack trace

      $ spark-shell -i failure.scala
      ...
      2018-10-01 16:57:48 INFO  ParquetWriteSupport:54 - Initialized Parquet WriteSupport with Catalyst schema:
      {
        "type" : "struct",
        "fields" : [ {
          "name" : "inners",
          "type" : {
            "type" : "array",
            "elementType" : {
              "type" : "struct",
              "fields" : [ {
                "name" : "names",
                "type" : {
                  "type" : "array",
                  "elementType" : "string",
                  "containsNull" : true
                },
                "nullable" : true,
                "metadata" : { }
              } ]
            },
            "containsNull" : true
          },
          "nullable" : true,
          "metadata" : { }
        } ]
      }
      and corresponding Parquet message type:
      message spark_schema {
        optional group inners (LIST) {
          repeated group list {
            optional group element {
              optional group names (LIST) {
                repeated group list {
                  optional binary element (UTF8);
                }
              }
            }
          }
        }
      }
      
      2018-10-01 16:57:48 INFO  InternalParquetRecordWriter:165 - Flushing mem columnStore to file. allocated memory: 0
      2018-10-01 16:57:48 INFO  InternalParquetRecordWriter:165 - Flushing mem columnStore to file. allocated memory: 26
      ...
      2018-10-01 16:57:49 INFO  FileSourceStrategy:54 - Output Data Schema: struct<inners: array<struct<names:array<string>>>>
      2018-10-01 16:57:49 INFO  FileSourceScanExec:54 - Pushed Filters:
      java.lang.NullPointerException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
        at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
        at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
        at scala.Option.map(Option.scala:146)
        at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
        at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
        at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
        at org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70)
        at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
        at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
        at org.apache.spark.sql.catalyst.expressions.objects.MapObjects.doGenCode(objects.scala:796)
        at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
        at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
        at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99)
        at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:296)
        at org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$class.prepareArguments(objects.scala:98)
        at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.prepareArguments(objects.scala:431)
        at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:483)
        at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
        at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
        at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$$anonfun$3.apply(GenerateSafeProjection.scala:155)
        at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$$anonfun$3.apply(GenerateSafeProjection.scala:152)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:296)
        at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:152)
        at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:38)
        at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1193)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3383)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2545)
        at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
        at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
        at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
        at org.apache.spark.sql.Dataset.head(Dataset.scala:2552)
        at org.apache.spark.sql.Dataset.first(Dataset.scala:2559)
        ... 59 elided
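
      The trace ends in OuterScopes.getOuterScope while generating the deserializer's NewInstance code, which suggests (an assumption read off the trace, not a confirmed diagnosis) that Spark cannot resolve the enclosing REPL instance for the shell-defined case classes. A standalone sketch of that mechanism follows; the names Wrapper and OuterScopeDemo are illustrative only and not part of Spark.

      // Illustrative sketch only, not the Spark source: a case class declared inside a
      // class - roughly what spark-shell's line wrappers do - compiles to an inner class
      // whose constructor takes the enclosing instance as a hidden first argument, so
      // reflective construction needs that outer reference to be resolved first.
      class Wrapper {
        case class Inner(names: Seq[String] = Seq())
      }

      object OuterScopeDemo extends App {
        // the single public constructor takes the outer Wrapper instance first
        val ctor = classOf[Wrapper#Inner].getConstructors.head
        println(ctor.getParameterTypes.toList)

        // construction succeeds once an outer instance is supplied
        println(ctor.newInstance(new Wrapper, Seq("name0")))

        // with no resolvable outer instance there is nothing to pass here, which is the
        // situation OuterScopes.getOuterScope handles for REPL-defined classes
      }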
      

      This is a regression from Spark 2.3.1, which runs the above Spark Shell script correctly.
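
      Until the underlying issue is fixed, one possible way to sidestep the typed encoder path (a sketch only, not verified against the affected build, and assuming the failure is specific to as[Outer]) is to keep the read as a DataFrame and rebuild the case classes from Rows by hand, reusing path, Inner, and Outer from the script above:

      import org.apache.spark.sql.Row

      // stay in DataFrame land and map Rows to the case classes manually
      val df = spark.read.parquet(path)
      val outers: Array[Outer] = df.collect().map { row =>
        val inners = row.getAs[Seq[Row]]("inners").map { innerRow =>
          Inner(innerRow.getAs[Seq[String]]("names"))
        }
        Outer(inners)
      }
      outers.headOption.foreach(println)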

    People

      Assignee: Unassigned
      Reporter: Michael Heuer (heuermh)
      Votes: 0
      Watchers: 6
