Apache Arrow / ARROW-18400

[Python] Quadratic memory usage of Table.to_pandas with nested data


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 10.0.1
    • Fix Version/s: 11.0.0
    • Component/s: Python
    • Environment: Python 3.10.8 on Fedora Linux 36, AMD Ryzen 9 5900X with 64 GB RAM

    Description

      Reading nested Parquet data and then converting it to a Pandas DataFrame shows quadratic memory usage and will eventually run out of memory for reasonably small files. I had initially thought this was a regression since 7.0.0, but it looks like 7.0.0 has similar quadratic memory usage that kicks in at higher row counts.

      Example code to generate nested Parquet data:

      import numpy as np
      import random
      import string
      import pandas as pd
      
      _characters = string.ascii_uppercase + string.digits + string.punctuation
      
      def make_random_string(N=10):
          return ''.join(random.choice(_characters) for _ in range(N))
      
      nrows = 1_024_000
      filename = 'nested.parquet'
      
      # Each row holds a length-10 array of structs with nullable fields.
      arr_len = 10
      nested_col = []
      for i in range(nrows):
          nested_col.append(np.array(
                  [{
                      'a': None if j % 1000 == 0 else np.random.choice(10000, size=3).astype(np.int64),
                      'b': None if j % 100 == 0 else random.choice(range(100)),
                      'c': None if j % 10 == 0 else make_random_string(5)
                  } for j in range(arr_len)]  # element 0 of every array gets null fields
              ))
      df = pd.DataFrame({'c1': nested_col})
      df.to_parquet(filename)
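
      For reference, the nested schema this produces can be inspected without reading any row data. A minimal sketch (the exact nested type shown in the comment is an assumption about pandas' default Parquet conversion):

      import pyarrow.parquet as pq

      # Read only the file metadata; no row data is loaded.
      print(pq.read_schema('nested.parquet'))
      # Expected to be roughly:
      #   c1: list<item: struct<a: list<item: int64>, b: int64, c: string>>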
      

      And then read into a DataFrame with:

      import pyarrow.parquet as pq
      table = pq.read_table(filename)
      df = table.to_pandas()
      

      Reading into an Arrow table alone isn't a problem; it's the to_pandas method that exhibits the large memory usage. I haven't tested generating nested Arrow data in memory without writing Parquet from Pandas, but I assume the problem probably isn't Parquet-specific (a sketch of such a test is below).
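
      A minimal sketch of that untested path, assuming the nested_col list from the generation script above is still in scope (the explicit nested_type is an assumption about the schema Arrow would infer):

      import pyarrow as pa

      # Build the same nested column directly as an Arrow array, with no Parquet involved.
      nested_type = pa.list_(pa.struct([
          ('a', pa.list_(pa.int64())),
          ('b', pa.int64()),
          ('c', pa.string()),
      ]))
      arr = pa.array([list(row) for row in nested_col], type=nested_type)
      table = pa.table({'c1': arr})
      df = table.to_pandas()  # if memory also blows up here, Parquet is not the culprit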

      Memory usage I see when reading different-sized files on a machine with 64 GB RAM (a sketch of one way to measure this follows the table):

      Num rows    Memory used with 10.0.1 (MB)    Memory used with 7.0.0 (MB)
      32,000      362                             361
      64,000      531                             531
      128,000     1,152                           1,101
      256,000     2,888                           1,402
      512,000     10,301                          3,508
      1,024,000   38,697                          5,313
      2,048,000   OOM                             20,061
      4,096,000   -                               OOM

      With Arrow 10.0.1, memory usage approximately quadruples each time the row count doubles above 256k rows, consistent with quadratic scaling in the number of rows. With Arrow 7.0.0, memory usage grows more linearly at first but then roughly quadruples from 1,024k to 2,048k rows.
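
      One way to collect peak-memory numbers like these is a minimal per-run check; this is a sketch assuming Linux, where ru_maxrss is reported in kilobytes (the attached test_memory.py is presumably more thorough):

      import resource

      import pyarrow.parquet as pq

      table = pq.read_table('nested.parquet')
      df = table.to_pandas()

      # Peak resident set size of this process; on Linux ru_maxrss is in kB.
      peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
      print(f'peak RSS: {peak_kb / 1024:.0f} MB')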

      PyArrow 8.0.0 shows memory usage similar to 10.0.1, so it looks like something changed between 7.0.0 and 8.0.0.

      Attachments

        1. test_memory.py (5 kB, Alenka Frim)


People

    Assignee: wjones127 (Will Jones)
    Reporter: adreeve (Adam Reeve)
    Votes: 0
    Watchers: 11

Dates

    Created:
    Updated:

Time Tracking

    Original Estimate: Not Specified
    Remaining Estimate: 0h
    Time Spent: 2h 10m