Apache Arrow / ARROW-12124

[Rust] Parquet writer creates invalid parquet files


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Bug
    • Component: Rust

    Description

      I wrote a simple CSV to Parquet converter at https://github.com/domoritz/csv2parquet/blob/f53feb5bd995eab41dee09f2c4d722512052d7ca/src/main.rs.

      Running it (`csv2parquet test.txt test.parquet`) with a simple file such as

      ```
      a,b,c
      0,1,hello world
      0,1,hello world
      0,1,hello world
      0,1,hello world
      0,1,hello world
      0,1,hello world
      0,1,hello world
      ```

      and then trying to read it back in Python with

      ```
      import pandas as pd
      df = pd.read_parquet('test.parquet')
      df.to_csv('test2.csv')
      ```

      results in this error:

      ```
      OSError: Could not open parquet input source '<Buffer>': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
      ```
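
      The failure should be reproducible without pyarrow: if the footer really is missing, the parquet crate's own SerializedFileReader ought to reject the file as well, since it parses the footer when it is constructed. A minimal sketch, assuming the same `test.parquet` path as in the command above:

      ```
      use std::fs::File;

      use parquet::file::reader::{FileReader, SerializedFileReader};

      fn main() -> Result<(), Box<dyn std::error::Error>> {
          // Constructing the reader parses the footer, so an invalid or
          // truncated file fails right here with a corrupt-footer error.
          let reader = SerializedFileReader::new(File::open("test.parquet")?)?;
          println!("rows: {}", reader.metadata().file_metadata().num_rows());
          Ok(())
      }
      ```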

      The schema seems to be inferred correctly:

      ```
      Inferred Schema:
      {
        "fields": [
          {
            "name": "a",
            "nullable": false,
            "type": { "name": "int", "bitWidth": 64, "isSigned": true },
            "children": []
          },
          {
            "name": "b",
            "nullable": false,
            "type": { "name": "int", "bitWidth": 64, "isSigned": true },
            "children": []
          },
          {
            "name": "c",
            "nullable": false,
            "type": { "name": "utf8" },
            "children": []
          }
        ],
        "metadata": {}
      }
      ```
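
      The error reported by pyarrow says the footer's PAR1 magic bytes are missing, i.e. the file was never finalized, rather than that the data itself is malformed. With the parquet crate's ArrowWriter the footer is only written when the writer is explicitly closed, so dropping the writer without calling close() produces exactly this kind of truncated file, which would be consistent with the "Not A Bug" resolution. A minimal sketch of a complete write path, assuming current arrow and parquet crates and hard-coding the sample rows instead of reading them from the CSV:

      ```
      use std::fs::File;
      use std::sync::Arc;

      use arrow::array::{ArrayRef, Int64Array, StringArray};
      use arrow::datatypes::{DataType, Field, Schema};
      use arrow::record_batch::RecordBatch;
      use parquet::arrow::ArrowWriter;

      fn main() -> Result<(), Box<dyn std::error::Error>> {
          // Schema matching the inferred one above: a, b as Int64, c as Utf8.
          let schema = Arc::new(Schema::new(vec![
              Field::new("a", DataType::Int64, false),
              Field::new("b", DataType::Int64, false),
              Field::new("c", DataType::Utf8, false),
          ]));

          // Seven rows of "0,1,hello world", as in the sample file.
          let a: ArrayRef = Arc::new(Int64Array::from(vec![0i64; 7]));
          let b: ArrayRef = Arc::new(Int64Array::from(vec![1i64; 7]));
          let c: ArrayRef = Arc::new(StringArray::from(vec!["hello world"; 7]));
          let batch = RecordBatch::try_new(schema.clone(), vec![a, b, c])?;

          let file = File::create("test.parquet")?;
          let mut writer = ArrowWriter::try_new(file, schema, None)?;
          writer.write(&batch)?;
          // The footer (including the magic bytes the reader looks for) is
          // only written here; skipping close() leaves a truncated file.
          writer.close()?;
          Ok(())
      }
      ```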


    People

    • Assignee: Unassigned
    • Reporter: Dominik Moritz (domoritz)