Apache Arrow / ARROW-6481

[Python][C++] Bad performance of read_csv() with column_types


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 0.14.1
    • Fix Version: 0.15.0
    • Components: C++, Python
    • Environment: ubuntu xenial, python2.7

    Description

Case: a dataset with 20k columns. The number of rows can be 0.

pyarrow.csv.read_csv('20k_cols.csv') works reasonably well when no convert_options are provided: it takes about 150 ms.

Now I call read_csv() with a column_types mapping that marks 2,000 of these columns as string:

pyarrow.csv.read_csv('20k_cols.csv', convert_options=pyarrow.csv.ConvertOptions(column_types={'K%d' % i: pyarrow.string() for i in range(2000)}))

(K1..K19999 are the column names in the attached dataset.)
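For reference, a minimal self-contained reproduction sketch. The generated file below is only a stand-in for the attached 20k_cols.csv (column naming K0..K19999 and the single data row are assumptions), and exact timings and memory use will vary by machine:

    import time

    import pyarrow as pa
    import pyarrow.csv  # registers the pa.csv submodule

    # Build a stand-in for the attached 20k_cols.csv: 20,000 columns
    # and a single data row (assumed layout, not the original file).
    names = ['K%d' % i for i in range(20000)]
    with open('20k_cols.csv', 'w') as f:
        f.write(','.join(names) + '\n')
        f.write(','.join('1' for _ in names) + '\n')

    # Fast path: full type inference, no convert_options.
    t0 = time.time()
    pa.csv.read_csv('20k_cols.csv')
    print('inferred types: %.3f s' % (time.time() - t0))

    # Slow path: explicitly type 2,000 of the columns as string.
    opts = pa.csv.ConvertOptions(
        column_types={'K%d' % i: pa.string() for i in range(2000)})
    t0 = time.time()
    pa.csv.read_csv('20k_cols.csv', convert_options=opts)
    print('with column_types: %.3f s' % (time.time() - t0))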

My overall goal is to read everything as strings and avoid any type inference.

With the column_types mapping above, the call takes several minutes and consumes around 4 GB of memory.

That doesn't look sane at all.
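For the "everything as string" goal above, one possible workaround sketch is to read just the header row and build the column_types mapping from it, so no column names are hardcoded. The header read via the stdlib csv module is an illustration, not something from the report:

    import csv

    import pyarrow as pa
    import pyarrow.csv

    # Grab the header row only, then map every column to string so
    # read_csv never has to infer a type.
    with open('20k_cols.csv') as f:
        header = next(csv.reader(f))

    opts = pa.csv.ConvertOptions(
        column_types={name: pa.string() for name in header})
    table = pa.csv.read_csv('20k_cols.csv', convert_options=opts)
    print(table.schema)  # every field should be string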

Attachments

    1. 20k_cols.csv (126 kB, uploaded by Bogdan Klichuk)


People

    Assignee: Antoine Pitrou (apitrou)
    Reporter: Bogdan Klichuk (klichukb)
    Votes: 0
    Watchers: 5

Dates

    Created:
    Updated:
    Resolved:

Time Tracking

    Estimated: Not Specified
    Remaining: 0h
    Logged: 2h 20m