Apache Arrow / ARROW-6481

[Python][C++] Bad performance of read_csv() with column_types


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 0.14.1
    • Fix Version: 0.15.0
    • Components: C++, Python
    • Environment: ubuntu xenial, python2.7

    Description

Case: a dataset with 20k columns. The number of rows can be 0.

pyarrow.csv.read_csv('20k_cols.csv') works reasonably well when no convert_options are provided: it takes about 150 ms.

Now I call read_csv() with a column_types mapping that marks 2,000 of these columns as string:

pyarrow.csv.read_csv('20k_cols.csv', convert_options=pyarrow.csv.ConvertOptions(column_types={'K%d' % i: pyarrow.string() for i in range(2000)}))

(K1..K19999 are the column names in the attached dataset.)
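For reference, a minimal self-contained reproduction sketch. The generated file below is only a stand-in for the attached 20k_cols.csv (column naming K0..K19999 and the single data row are assumptions), and exact timings and memory use will vary by machine:

    import time

    import pyarrow as pa
    import pyarrow.csv  # registers the pa.csv submodule

    # Build a stand-in for the attached 20k_cols.csv: 20,000 columns
    # and a single data row (assumed layout, not the original file).
    names = ['K%d' % i for i in range(20000)]
    with open('20k_cols.csv', 'w') as f:
        f.write(','.join(names) + '\n')
        f.write(','.join('1' for _ in names) + '\n')

    # Fast path: full type inference, no convert_options.
    t0 = time.time()
    pa.csv.read_csv('20k_cols.csv')
    print('inferred types: %.3f s' % (time.time() - t0))

    # Slow path: explicitly type 2,000 of the columns as string.
    opts = pa.csv.ConvertOptions(
        column_types={'K%d' % i: pa.string() for i in range(2000)})
    t0 = time.time()
    pa.csv.read_csv('20k_cols.csv', convert_options=opts)
    print('with column_types: %.3f s' % (time.time() - t0))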

My overall goal is to read everything as strings and avoid any type inference.

With the column_types mapping above, the call takes several minutes and consumes around 4 GB of memory.

That doesn't look sane at all.
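For the "everything as string" goal above, one possible workaround sketch is to read just the header row and build the column_types mapping from it, so no column names are hardcoded. The header read via the stdlib csv module is an illustration, not something from the report:

    import csv

    import pyarrow as pa
    import pyarrow.csv

    # Grab the header row only, then map every column to string so
    # read_csv never has to infer a type.
    with open('20k_cols.csv') as f:
        header = next(csv.reader(f))

    opts = pa.csv.ConvertOptions(
        column_types={name: pa.string() for name in header})
    table = pa.csv.read_csv('20k_cols.csv', convert_options=opts)
    print(table.schema)  # every field should be string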

Attachments

    1. 20k_cols.csv (126 kB, uploaded by Bogdan Klichuk)


People

    Assignee: Antoine Pitrou (apitrou)
    Reporter: Bogdan Klichuk (klichukb)
    Votes: 0
    Watchers: 5

Dates

    Created:
    Updated:
    Resolved:

Time Tracking

    Estimated: Not Specified
    Remaining: 0h
    Logged: 2h 20m