Details
- Improvement
- Status: Resolved
- P3
- Resolution: Fixed
- Patch
Description
Before pyarrow 0.15, it was not possible to create a pyarrow record batch from a schema.
So in apache_beam.io.parquetio._ParquetSink, record batches are created from the field names only:
rb = pa.RecordBatch.from_arrays(arrays, self._schema.names)
This raises an error saying that the schema of the Parquet table to be written (the record batch schema) differs from the specified schema (self._schema).
For example, when the specified schema marks a field as "not null", the record batch schema built from names alone does not carry that information, so the error is raised.
The fix is to pass the schema instead of the names to pa.RecordBatch.from_arrays:
rb = pa.RecordBatch.from_arrays(arrays, schema=self._schema)