  1. Apache Arrow
  2. ARROW-11257

[C++][Parquet] PyArrow Table contains different data after writing and reloading from Parquet


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 2.0.0
    • Fix Version/s: 3.0.0
    • Component/s: Python
    • Labels: None

    Description

      • I'm loading a JSON Lines file into a table using `pa.json.read_json`. It contains one column that is a nested dictionary.

      • I select a row by key and inspect its nested dictionary.
      • I write the table to Parquet.
      • I load the table again from the Parquet file.
      • I check the same key, and the nested dictionary is no longer the same.

       

      To reproduce:

       

      Find the attached JSONLines file and Jupyter Notebook.

      The JSON file contains one entry per customer, each with a generated `msisdn`, a `scoring_request_id` and a `scorecard_result` object. Each `scorecard_result` consists of a list of feature objects, all with the same value as the `msisdn`, and a score.
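
      For readers without the attachment, a file of the same shape can be generated. This is only a sketch: the inner field names (`features`, `name`, `value`, `score`), the value ranges and the helper names are assumptions, since the description above does not spell them out. Because every feature value equals the row's `msisdn`, any row whose nested values differ from its own `msisdn` after a round trip is immediately recognisable as corrupted.

```python
import json
import random


def make_record(rng, n_features=3):
    """Build one synthetic customer entry (inner field names are hypothetical)."""
    msisdn = rng.randrange(10**9, 10**10)
    return {
        "msisdn": msisdn,
        "scoring_request_id": f"req-{rng.getrandbits(32):08x}",
        "scorecard_result": {
            # Every feature carries the row's msisdn as its value.
            "features": [
                {"name": f"feature_{i}", "value": msisdn}
                for i in range(n_features)
            ],
            # The real score semantics are unknown; a random float stands in.
            "score": rng.random(),
        },
    }


def generate_jsonl(n_rows, seed=0):
    """Return n_rows records as JSON Lines text (one JSON object per line)."""
    rng = random.Random(seed)
    return "\n".join(json.dumps(make_record(rng)) for _ in range(n_rows)) + "\n"
```

      Writing the returned text to a file, for example `open("synthetic.jsonl", "w").write(generate_jsonl(100_000))`, gives an input that can stand in for `anonymised.jsonl`; whether it is large enough to trigger the problem is not guaranteed.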

      The notebook reads the file and demonstrates the issue.

       

      Attachments

        1. anonymised.jsonl
          2.61 MB
          Kari Schoonbee
        2. pyarrow_parquet_issue.ipynb
          4 kB
          Kari Schoonbee

            People

              Assignee: Unassigned
              Reporter: Kari Schoonbee (kari_s)
              Votes: 0
              Watchers: 3
