ARROW-16272

[C++][Python] Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`


    Description

      `pyarrow.fs.S3FileSystem.open_input_file` and `pyarrow.fs.S3FileSystem.open_input_stream` perform very poorly when used with pandas' `read_csv`.

      import pandas as pd
      import time
      from pyarrow.fs import S3FileSystem
      
      def load_parking_tickets():
          print("Running...")
          t0 = time.time()
          fs = S3FileSystem(
              anonymous=True,
              region="us-east-2",
              endpoint_override=None,
              proxy_options=None,
          )
      
          print("Time to create fs: ", time.time() - t0)
          t0 = time.time()
          # fhandler = fs.open_input_stream(
          #     "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
          # )
          fhandler = fs.open_input_file(
              "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
          )
          print("Time to create fhandler: ", time.time() - t0)
          t0 = time.time()
          year_2016_df = pd.read_csv(
              fhandler,
              nrows=100,
          )
          print("read time:", time.time() - t0)
          return year_2016_df
      
      t0 = time.time()
      load_parking_tickets()
      print("total time:", time.time() - t0)
      

      Output:

      Running...
      Time to create fs:  0.0003612041473388672
      Time to create fhandler:  0.22461509704589844
      read time: 105.76488208770752
      total time: 105.99135684967041
      

      This is with `pandas==1.4.2`.

      We see similar performance with `fs.open_input_stream` as well (commented out in the code above):

      Running...
      Time to create fs:  0.0002570152282714844
      Time to create fhandler:  0.18540692329406738
      read time: 186.8419930934906
      total time: 187.03169012069702
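
      A plausible cause (an assumption on my part, not something I have profiled) is that `pd.read_csv` issues many small sequential `read()` calls, and the unbuffered Arrow file serves each one with a separate network round trip. The pattern can be checked directly with something like:

      import time
      from pyarrow.fs import S3FileSystem
      
      path = ("bodo-example-data/nyc-parking-tickets/"
              "Parking_Violations_Issued_-_Fiscal_Year_2016.csv")
      fs = S3FileSystem(anonymous=True, region="us-east-2")
      
      # Many small sequential reads, similar to what a CSV tokenizer may do.
      f = fs.open_input_file(path)
      t0 = time.time()
      for _ in range(100):
          f.read(1024)  # 100 reads of 1 KiB each
      print("100 x 1 KiB reads:", time.time() - t0)
      
      # A single read of the same total size, for comparison.
      f = fs.open_input_file(path)
      t0 = time.time()
      f.read(100 * 1024)
      print("1 x 100 KiB read: ", time.time() - t0)

      If each small read costs a round trip, the first loop should be much slower than the single large read.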
      

      When running the same read with just pandas (which uses `s3fs` under the hood), it's much faster:

      import pandas as pd
      import time
      
      def load_parking_tickets():
          print("Running...")
          t0 = time.time()
          year_2016_df = pd.read_csv(
              "s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
              nrows=100,
          )
          print("read time:", time.time() - t0)
          return year_2016_df
      
      t0 = time.time()
      load_parking_tickets()
      print("total time:", time.time() - t0)
      

      Output:

      Running...
      read time: 1.1012001037597656
      total time: 1.101264238357544
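
      A likely reason for the difference (again an assumption): `s3fs` file objects buffer internally, fetching data in large blocks and serving small reads from memory. The explicit equivalent of the fast path above would be something like this, where `block_size` is a hint and the value here is arbitrary:

      import pandas as pd
      import s3fs
      
      fs = s3fs.S3FileSystem(anon=True)
      # The file object fetches large blocks, so the many small reads
      # issued by pd.read_csv are served from memory.
      with fs.open(
          "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
          "rb",
          block_size=5 * 1024 * 1024,
      ) as f:
          year_2016_df = pd.read_csv(f, nrows=100)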
      

      Surprisingly, when we use `fsspec`'s `ArrowFSWrapper`, it matches the `s3fs` performance:

      import pandas as pd
      import time
      from pyarrow.fs import S3FileSystem
      from fsspec.implementations.arrow import ArrowFSWrapper
      
      def load_parking_tickets():
          print("Running...")
          t0 = time.time()
          fs = ArrowFSWrapper(
              S3FileSystem(
                  anonymous=True,
                  region="us-east-2",
                  endpoint_override=None,
                  proxy_options=None,
              )
          )
      
          print("Time to create fs: ", time.time() - t0)
          t0 = time.time()
          fhandler = fs._open(
              "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
          )
          print("Time to create fhandler: ", time.time() - t0)
          t0 = time.time()
          year_2016_df = pd.read_csv(
              fhandler,
              nrows=100,
          )
          print("read time:", time.time() - t0)
          return year_2016_df
      
      t0 = time.time()
      load_parking_tickets()
      print("total time:", time.time() - t0)
      

      Output:

      Running...
      Time to create fs:  0.0002467632293701172
      Time to create fhandler:  0.1858382225036621
      read time: 0.13701486587524414
      total time: 0.3232450485229492
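
      If buffering is indeed what makes the wrapper fast, the Arrow APIs can provide it directly, which might be a usable workaround (a sketch under that assumption; the 1 MiB buffer size is arbitrary). `open_input_stream` accepts a `buffer_size` argument, and an already-open file can be wrapped in `pyarrow.BufferedInputStream`:

      import pandas as pd
      import pyarrow as pa
      from pyarrow.fs import S3FileSystem
      
      path = ("bodo-example-data/nyc-parking-tickets/"
              "Parking_Violations_Issued_-_Fiscal_Year_2016.csv")
      fs = S3FileSystem(anonymous=True, region="us-east-2")
      
      # Option 1: ask for a buffered stream up front.
      fhandler = fs.open_input_stream(path, buffer_size=1 << 20)
      year_2016_df = pd.read_csv(fhandler, nrows=100)
      
      # Option 2: wrap an existing unbuffered file after the fact.
      buffered = pa.BufferedInputStream(fs.open_input_file(path), 1 << 20)
      year_2016_df = pd.read_csv(buffered, nrows=100)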
      

      Packages:

      pyarrow : 7.0.0
      pandas  : 1.4.2
      numpy   : 1.20.3
      

      I tested with pyarrow 4.0.1 and 5.0.0 as well and saw similar results.

          People

            Assignee: Antoine Pitrou (apitrou)
            Reporter: Sahil Gupta (sahil1105)

              Time Tracking

                Time Spent: 2h 20m