Spark / SPARK-44940

Improve performance of JSON parsing when "spark.sql.json.enablePartialResults" is enabled

Details

    Description

      Follow-up on https://issues.apache.org/jira/browse/SPARK-40646.

      With this flag enabled, JSON parsing is significantly slower because exceptions are created as part of normal control flow. In addition, some fields are not parsed correctly, and in certain cases the following exception is thrown:

      Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
      	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getStruct(rows.scala:51)
      	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getStruct$(rows.scala:51)
      	at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getStruct(rows.scala:195)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
      	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
      	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:590)
      	... 39 more 
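
The control-flow problem described above can be illustrated outside of Spark. The following is a minimal sketch, not Spark's parser: `SCHEMA`, `BadField`, `convert_or_raise`, `parse_with_exceptions`, and `parse_with_checks` are hypothetical names introduced only for this example. Both functions return a partial row in which a field that does not match the schema becomes null, but the first one pays for one raised exception per bad field, while the second treats a mismatch as an ordinary branch.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
SCHEMA = {"id": int, "name": str, "tags": list}


class BadField(Exception):
    """Raised by the exception-driven variant when a field mismatches."""


def convert_or_raise(name, value, expected):
    # Missing fields (None) are allowed; wrong types signal an error.
    if value is not None and not isinstance(value, expected):
        raise BadField(f"{name}: expected {expected.__name__}")
    return value


def parse_with_exceptions(line):
    """Partial results where every bad field costs one raised exception."""
    record = json.loads(line)
    row = {}
    for name, expected in SCHEMA.items():
        try:
            row[name] = convert_or_raise(name, record.get(name), expected)
        except BadField:
            # Keep the other fields and null out only the bad one.
            row[name] = None
    return row


def parse_with_checks(line):
    """Same partial result, but a mismatch is a plain branch."""
    record = json.loads(line)
    row = {}
    for name, expected in SCHEMA.items():
        value = record.get(name)
        row[name] = value if isinstance(value, expected) else None
    return row
```

For the input `{"id": 1, "name": {"first": "a"}, "tags": ["x"]}`, both functions return `{"id": 1, "name": None, "tags": ["x"]}`. On the JVM the exception-driven variant is more expensive still, since constructing an exception normally captures a stack trace.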

          People

            ivan.sadikov Ivan Sadikov