Beam / BEAM-11742

ParquetSink fails for nullable fields

Details

    • Improvement
    • Status: Resolved
    • P3
    • Resolution: Fixed
    • None
    • 2.30.0
    • io-py-parquet
    • None
    • Patch

    Description

      Before pyarrow 0.15, it was not possible to create a pyarrow record batch from a schema.

      So in apache_beam.io.parquetio._ParquetSink, the pyarrow record batch is created from the field names only:

       

      rb = pa.RecordBatch.from_arrays(arrays, self._schema.names)

      An error is raised because the schema of the record batch to be written to the Parquet table differs from the specified schema (self._schema).

      For example, when a field in the specified schema is declared "not null", the record batch schema does not carry that constraint, so the two schemas do not match and the error is raised.

       

      The fix is to pass the schema instead of the names to pa.RecordBatch.from_arrays:

      rb = pa.RecordBatch.from_arrays(arrays, schema=self._schema)

       

            People

              Wenbing Bai (wenbing-bai)
              Votes: 0
              Watchers: 2

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 50m