  1. Spark
  2. SPARK-28651

Streaming file source doesn't change the schema to nullable automatically


    Details

    • Docs Text:
      Since Spark 3.0.0, all fields of a Structured Streaming file source schema are forced to be nullable. This protects users from data corruption when the specified or inferred schema is incompatible with the actual data. To restore the original behavior, set the SQL conf "spark.sql.streaming.fileSource.schema.forceNullable" to "false". This flag was added to reduce migration work when upgrading to Spark 3.0.0 and will be removed in a future release. Please update your code to work with the new behavior as soon as possible.
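
As a rough illustration of the opt-out described in the docs text, the sketch below sets the flag on an existing session. The helper name `restore_legacy_nullability` is hypothetical, and the sketch assumes `spark` is a PySpark `SparkSession` (anything exposing `conf.set(key, value)`) and that the flag can be changed at runtime.

```python
# Conf key taken verbatim from the docs text above.
FORCE_NULLABLE_CONF = "spark.sql.streaming.fileSource.schema.forceNullable"


def restore_legacy_nullability(spark):
    """Ask the streaming file source to keep the schema's declared nullability.

    Hypothetical helper: `spark` is assumed to be a SparkSession-like object
    whose `conf` attribute provides `set(key, value)`.
    """
    spark.conf.set(FORCE_NULLABLE_CONF, "false")
    return spark
```

Because the flag is described as temporary, a call like this is probably best treated as a stopgap while downstream code is updated to tolerate nullable fields.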

      Description

      Currently, a batch DataFrame always changes its file source schema to nullable automatically (see this line: https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).

      However, a streaming DataFrame's schema is read at this line: https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259, which does not change the schema to nullable.

      We should make streaming DataFrame consistent with batch.

      This inconsistency can produce corrupted Parquet files when the schema does not match the actual data.
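
To make "changes the schema to nullable" concrete, here is a minimal pure-Python sketch of the transformation. It is not Spark's implementation. It assumes the schema is given in the JSON layout that Spark's `StructType.json` is understood to produce (keys `fields`, `nullable`, `containsNull`, `valueContainsNull`), and the function name `as_nullable` is chosen only for illustration.

```python
def as_nullable(data_type):
    """Return a copy of a JSON-style schema node with all nullability flags set to True."""
    # Primitive types are plain strings such as "long" or "string".
    if not isinstance(data_type, dict):
        return data_type

    kind = data_type.get("type")
    if kind == "struct":
        # Every field becomes nullable, and nested types are relaxed recursively.
        return {
            "type": "struct",
            "fields": [
                {**field, "type": as_nullable(field["type"]), "nullable": True}
                for field in data_type["fields"]
            ],
        }
    if kind == "array":
        return {
            **data_type,
            "elementType": as_nullable(data_type["elementType"]),
            "containsNull": True,
        }
    if kind == "map":
        return {
            **data_type,
            "keyType": as_nullable(data_type["keyType"]),
            "valueType": as_nullable(data_type["valueType"]),
            "valueContainsNull": True,
        }
    # Any other node kind is returned unchanged.
    return data_type


# Example: a schema that declares both fields, and the array elements, non-nullable.
strict_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": False, "metadata": {}},
        {
            "name": "tags",
            "type": {"type": "array", "elementType": "string", "containsNull": False},
            "nullable": False,
            "metadata": {},
        },
    ],
}
relaxed_schema = as_nullable(strict_schema)
```

The function builds new dictionaries rather than mutating its input, so the strict schema stays available for comparison with the relaxed one.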


              People

              • Assignee:
                zsxwing Shixiong Zhu
                Reporter:
                tom.magdanski Tomasz Magdanski
