Apache Arrow / ARROW-18400

[Python] Quadratic memory usage of Table.to_pandas with nested data


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 10.0.1
    • Fix Version/s: 11.0.0
    • Component/s: Python
    • Environment: Python 3.10.8 on Fedora Linux 36, AMD Ryzen 9 5900X with 64 GB RAM

    Description

      Reading nested Parquet data and then converting it to a Pandas DataFrame shows quadratic memory usage and will eventually run out of memory for reasonably small files. I had initially thought this was a regression since 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks in at higher row counts.

      Example code to generate nested Parquet data:

      import numpy as np
      import random
      import string
      import pandas as pd
      
      _characters = string.ascii_uppercase + string.digits + string.punctuation
      
      def make_random_string(N=10):
          return ''.join(random.choice(_characters) for _ in range(N))
      
      nrows = 1_024_000
      filename = 'nested.parquet'
      
      # Each row holds a length-10 array of structs with nullable fields.
      arr_len = 10
      nested_col = []
      for i in range(nrows):
          nested_col.append(np.array(
                  [{
                      'a': None if j % 1000 == 0 else np.random.choice(10000, size=3).astype(np.int64),
                      'b': None if j % 100 == 0 else random.choice(range(100)),
                      'c': None if j % 10 == 0 else make_random_string(5)
                  } for j in range(arr_len)]  # element 0 of every array gets null fields
              ))
      df = pd.DataFrame({'c1': nested_col})
      df.to_parquet(filename)
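
      For reference, the nested schema this produces can be inspected without reading any row data. A minimal sketch (the exact nested type shown in the comment is an assumption about pandas' default Parquet conversion):

      import pyarrow.parquet as pq

      # Read only the file metadata; no row data is loaded.
      print(pq.read_schema('nested.parquet'))
      # Expected to be roughly:
      #   c1: list<item: struct<a: list<item: int64>, b: int64, c: string>>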
      

      And then read into a DataFrame with:

      import pyarrow.parquet as pq
      table = pq.read_table(filename)
      df = table.to_pandas()
      

      Reading into an Arrow table alone isn't a problem; it's the to_pandas method that exhibits the large memory usage. I haven't tested generating nested Arrow data in memory without writing Parquet from Pandas, but I assume the problem probably isn't Parquet-specific (a sketch of such a test is below).
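
      A minimal sketch of that untested path, assuming the nested_col list from the generation script above is still in scope (the explicit nested_type is an assumption about the schema Arrow would infer):

      import pyarrow as pa

      # Build the same nested column directly as an Arrow array, with no Parquet involved.
      nested_type = pa.list_(pa.struct([
          ('a', pa.list_(pa.int64())),
          ('b', pa.int64()),
          ('c', pa.string()),
      ]))
      arr = pa.array([list(row) for row in nested_col], type=nested_type)
      table = pa.table({'c1': arr})
      df = table.to_pandas()  # if memory also blows up here, Parquet is not the culprit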

      Memory usage I see when reading different-sized files on a machine with 64 GB RAM (a sketch of one way to measure this follows the table):

      Num rows    Memory used with 10.0.1 (MB)    Memory used with 7.0.0 (MB)
      32,000      362                             361
      64,000      531                             531
      128,000     1,152                           1,101
      256,000     2,888                           1,402
      512,000     10,301                          3,508
      1,024,000   38,697                          5,313
      2,048,000   OOM                             20,061
      4,096,000   -                               OOM

      With Arrow 10.0.1, memory usage approximately quadruples each time the row count doubles above 256k rows, consistent with quadratic scaling in the number of rows. With Arrow 7.0.0, memory usage grows more linearly at first but then roughly quadruples from 1,024k to 2,048k rows.
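
      One way to collect peak-memory numbers like these is a minimal per-run check; this is a sketch assuming Linux, where ru_maxrss is reported in kilobytes (the attached test_memory.py is presumably more thorough):

      import resource

      import pyarrow.parquet as pq

      table = pq.read_table('nested.parquet')
      df = table.to_pandas()

      # Peak resident set size of this process; on Linux ru_maxrss is in kB.
      peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
      print(f'peak RSS: {peak_kb / 1024:.0f} MB')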

      PyArrow 8.0.0 shows memory usage similar to 10.0.1, so it looks like something changed between 7.0.0 and 8.0.0.

      Attachments

        1. test_memory.py (5 kB, Alenka Frim)


People

    Assignee: wjones127 (Will Jones)
    Reporter: adreeve (Adam Reeve)
    Votes: 0
    Watchers: 11

Dates

    Created:
    Updated:

Time Tracking

    Original Estimate: Not Specified
    Remaining Estimate: 0h
    Time Spent: 2h 10m