Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Version: 2.4.3
Description
Currently, a batch DataFrame's schema is always converted to nullable automatically (see this line: https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L399).
However, a streaming DataFrame's schema is read at this line: https://github.com/apache/spark/blob/325bc8e9c6187a96b33a033fbb0145dfca619135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L259, which does not convert the schema to nullable.
We should make streaming DataFrames consistent with batch DataFrames.
The inconsistency can produce corrupted Parquet files because the declared schema (non-nullable) does not match the schema the data is actually written with.
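To illustrate what "changing the schema to nullable" means, here is a minimal, Spark-free sketch in Python. It is not Spark's implementation: the `Atomic`, `Array`, `Field`, and `Struct` classes and the `as_nullable` function are hypothetical stand-ins for a nested schema, used only to show the idea of recursively relaxing every nullability flag, which is what the batch path is described as doing and the streaming path is not.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical, simplified schema model (NOT Spark's classes).
@dataclass(frozen=True)
class Atomic:
    name: str  # e.g. "long", "string"


@dataclass(frozen=True)
class Array:
    element: "DataType"
    contains_null: bool


@dataclass(frozen=True)
class Field:
    name: str
    dtype: "DataType"
    nullable: bool


@dataclass(frozen=True)
class Struct:
    fields: Tuple[Field, ...]


DataType = Union[Atomic, Array, Struct]


def as_nullable(dtype: DataType) -> DataType:
    """Return a copy of dtype with every nullability flag set to True."""
    if isinstance(dtype, Struct):
        # Relax each field and recurse into its type.
        return Struct(
            tuple(Field(f.name, as_nullable(f.dtype), True) for f in dtype.fields)
        )
    if isinstance(dtype, Array):
        # Arrays may now contain nulls; recurse into the element type.
        return Array(as_nullable(dtype.element), True)
    # Atomic types carry no nullability flag of their own.
    return dtype


# A user-declared schema with non-nullable fields.
declared = Struct((
    Field("id", Atomic("long"), nullable=False),
    Field("tags", Array(Atomic("string"), contains_null=False), nullable=False),
))

# What a consistent reader would use for both batch and streaming.
relaxed = as_nullable(declared)
```

Under this sketch, applying `as_nullable` on both the batch and streaming paths would give the two the same schema, which is the consistency this issue asks for.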