[ARROW-15045] PyArrow SIGSEGV error when using UnionDatasets - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 6.0.1
Fix Version/s: None
Component/s: Python
Labels:
- dataset
Environment:
Fedora Linux 35 (Workstation Edition), AMD Ryzen 5950X.

External issue URL:
https://github.com/apache/arrow/issues/30562
Language:
- Python

Description

The context:

I am using PyArrow to read a folder structured as exchange/symbol/date.parquet. The folder contains multiple exchanges, each with multiple symbols, and each symbol has multiple files. At the time of writing, the folder holds about 30GB across 1.85M files.

If I use a single PyArrow Dataset to read/manage the entire folder, even the simplest process that does nothing but define the dataset occupies 2.3GB of RAM. The problem is that I am instantiating this dataset in multiple processes, but since every process only needs some exchanges (typically just one), I don't need to read all folders and files in every single process.

So I tried to use a UnionDataset composed of one Dataset per exchange. This way, every process loads only the folders/files it needs as a dataset. In a simple test, each process now occupies just 868MB of RAM, -63%.
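
For reference, the per-exchange setup described above might look roughly like the following sketch. It is not the reporter's actual code: the helper names `exchange_dirs` and `build_union_dataset` are hypothetical, and it assumes every exchange folder contains only Parquet files sharing the same schema.

```python
from pathlib import Path


def exchange_dirs(root, exchanges):
    """Return the sub-folders of root for the requested exchanges that exist."""
    root = Path(root)
    return [root / name for name in exchanges if (root / name).is_dir()]


def build_union_dataset(root, exchanges):
    """Build one Dataset per exchange folder and wrap them in a UnionDataset."""
    paths = exchange_dirs(root, exchanges)
    if not paths:
        raise ValueError(f"no exchange folders found under {root}")

    # Imported here so the path helper above can be used without pyarrow.
    import pyarrow.dataset as ds

    children = [ds.dataset(str(p), format="parquet") for p in paths]
    # Passing a list of Dataset objects to ds.dataset() yields a UnionDataset.
    return ds.dataset(children)
```

A process that only needs one exchange would then call something like `build_union_dataset(root, ["some_exchange"])`, where both the root path and the exchange name are placeholders.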

The problem:

When using a single Dataset for the entire folder/files, I have no problem at all. I can read filtered data without problems and it's fast as duck.

But when I read filtered data from the UnionDataset, I always get "Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)". After checking every possible source of the problem, I noticed that if I create a dummy folder with multiple exchanges but only some symbols, in order to limit the number of files to read, I don't get that error and everything works normally. If I then copy in more symbol folders (any of them), I get the error again.

I came to the conclusion that the problem is not in my code, but is instead linked to the number of files that the UnionDataset is able to manage.
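
One way to test this hypothesis is to record how many files each exchange folder holds when the crash does and does not occur, and to capture a traceback when the process dies. The sketch below uses only the standard library; the helper names are hypothetical, and `faulthandler` reports Python-level frames only, so it shows which call triggered the crash rather than the native stack inside Arrow.

```python
import faulthandler
from pathlib import Path


def count_parquet_files(root):
    """Map each exchange folder under root to its number of .parquet files."""
    root = Path(root)
    return {
        folder.name: sum(1 for _ in folder.rglob("*.parquet"))
        for folder in sorted(root.iterdir())
        if folder.is_dir()
    }


def enable_crash_traceback(log_path):
    """Write the Python traceback of every thread to log_path on a fatal signal."""
    log_file = open(log_path, "w")  # must stay open while faulthandler is enabled
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_traceback(...)` at process start, before the filtered read, should leave a traceback in the log file when the SIGSEGV occurs; comparing `count_parquet_files(...)` between the working dummy folder and the failing one gives a rough bound on the file count involved.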

Am I correct or am I doing something wrong? Thank you all, have a nice day and good work.

People

Assignee: Unassigned

Reporter: Thomas Cercato

Votes: 0

Watchers: 4

Dates

Created: 09/Dec/21 16:27

Updated: 11/Jan/23 08:44