  1. Apache Arrow
  2. ARROW-18439

Misleading message when loading parquet data with invalid null data

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 10.0.1
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      I'm saving an Arrow table to Parquet. One column is a list of structs whose elements are marked as non-nullable. The data is invalid, however, because I put a null in one of the nested fields.

      When I save this data to Parquet and try to load it back, I get a very misleading message:

       Length spanned by list offsets (2) larger than values array (length 1)

      I would rather Arrow complained when creating the table or when saving it to Parquet, since that is where the invalid null is introduced.

      Here's how to reproduce the issue:

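      import io

      import pyarrow as pa
      import pyarrow.parquet as pq

      # Struct type whose only field is declared non-nullable.
      struct = pa.struct(
          [
              pa.field("nested_string", pa.string(), nullable=False),
          ]
      )

      # List column whose struct items are also declared non-nullable.
      schema = pa.schema(
          [pa.field("list_column", pa.list_(pa.field("item", struct, nullable=False)))]
      )
      # The second struct puts a null in the non-nullable nested field.
      table = pa.table(
          {"list_column": [[{"nested_string": ""}, {"nested_string": None}]]}, schema=schema
      )
      with io.BytesIO() as file:
          pq.write_table(table, file)  # Writing succeeds without complaint
          file.seek(0)
          pq.read_table(file)  # Raises pa.ArrowInvalid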
       

       

          People

            Assignee: Unassigned
            Reporter: 0x26dres &res
            Votes: 0
            Watchers: 2
