Apache Arrow / ARROW-2587

[Python] Unable to write StructArrays with multiple children to parquet

Details

    Description

      Although I can read a StructArray from parquet, I still cannot write it back from a pa.Table to parquet.

      The write fails with "ArrowInvalid: Nested column branch had multiple children".

      Here is a quick example:

      In [2]: import pyarrow.parquet as pq
      
      In [3]: table = pq.read_table('test.parquet')
      
      In [4]: table
       Out[4]: 
       pyarrow.Table
       weight: double
       animal_type: string
       animal_interpretation: struct<is_large_animal: bool, is_mammal: bool>
         child 0, is_large_animal: bool
         child 1, is_mammal: bool
       metadata
       --------
       {'org.apache.spark.sql.parquet.row.metadata': '{"type":"struct","fields":[{"name":"weight","type":"double","nullable":true,"metadata":{}},{"name":"animal_type","type":"string","nullable":true,"metadata":{}},{"name":"animal_interpretation","type":{"type":"struct","fields":[{"name":"is_large_animal","type":"boolean","nullable":true,"metadata":{}},{"name":"is_mammal","type":"boolean","nullable":true,"metadata":{}}]},"nullable":false,"metadata":{}}]}'}
      
      In [5]: table.schema
       Out[5]: 
       weight: double
       animal_type: string
       animal_interpretation: struct<is_large_animal: bool, is_mammal: bool>
         child 0, is_large_animal: bool
         child 1, is_mammal: bool
       metadata
       --------
       {'org.apache.spark.sql.parquet.row.metadata': '{"type":"struct","fields":[{"name":"weight","type":"double","nullable":true,"metadata":{}},{"name":"animal_type","type":"string","nullable":true,"metadata":{}},{"name":"animal_interpretation","type":{"type":"struct","fields":[{"name":"is_large_animal","type":"boolean","nullable":true,"metadata":{}},{"name":"is_mammal","type":"boolean","nullable":true,"metadata":{}}]},"nullable":false,"metadata":{}}]}'}
      
      In [6]: pq.write_table(table,"test_write.parquet")
       ---------------------------------------------------------------------------
       ArrowInvalid                              Traceback (most recent call last)
       <ipython-input-6-bd9d7deee437> in <module>()
       ----> 1 pq.write_table(table,"test_write.parquet")
      
      /usr/local/lib/python2.7/dist-packages/pyarrow/parquet.pyc in write_table(table, where, row_group_size, version, use_dictionary, compression, use_deprecated_int96_timestamps, coerce_timestamps, flavor, **kwargs)
           982                 use_deprecated_int96_timestamps=use_int96,
           983                 **kwargs) as writer:
       --> 984             writer.write_table(table, row_group_size=row_group_size)
           985     except Exception:
           986         if is_path(where):
      
      /usr/local/lib/python2.7/dist-packages/pyarrow/parquet.pyc in write_table(self, table, row_group_size)
           325             table = _sanitize_table(table, self.schema, self.flavor)
           326         assert self.is_open
       --> 327         self.writer.write_table(table, row_group_size=row_group_size)
           328 
           329     def close(self):
      
      /usr/local/lib/python2.7/dist-packages/pyarrow/_parquet.so in pyarrow._parquet.ParquetWriter.write_table()
      
      /usr/local/lib/python2.7/dist-packages/pyarrow/lib.so in pyarrow.lib.check_status()
      
      ArrowInvalid: Nested column branch had multiple children
      
      

       

      I would really appreciate a fix on this.

      Best,

      Jacques

            People

              emkornfield@gmail.com Micah Kornfield
              jafournier jacques
              Votes: 6
              Watchers: 11

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 0.5h