  Apache Arrow / ARROW-15081

[R][C++] Arrow crashes (OOM) on R client with large remote parquet files


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: R
    • Labels: None

    Description

      The following should reproduce the crash:

      library(arrow)
      library(dplyr)
      server <- arrow::s3_bucket("ebird", endpoint_override = "minio.cirrus.carlboettiger.info")
      
      path <- server$path("Oct-2021/observations")
      obs <- arrow::open_dataset(path)
      
      path$ls() # observe -- 1 parquet file
      
      obs %>% count() # CRASH
      
      obs %>% to_duckdb() # also crash

      I have attempted to split this large (~100 GB) parquet file into smaller files, which helps:

      path <- server$path("partitioned")
      obs <- arrow::open_dataset(path)
      path$ls() # observe -- multiple parquet files now
      obs %>% count() # no crash with the partitioned files
       

      (These parquet files were also created by arrow, from a single large CSV file provided by the original data provider, eBird. Unfortunately, generating the partitioned versions is cumbersome because the data is very unevenly distributed: there are few columns that avoid creating thousands of parquet partition files, and even so the bulk of the ~1 billion rows falls into the same group. All the same, I think this is a bug, since there is no indication of why arrow cannot handle a single 100 GB parquet file.)
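
      For reference, the partitioned copy was written with arrow itself, roughly along the lines of the sketch below; the CSV path and partition column are placeholders, not the actual eBird schema.

      library(arrow)
      library(dplyr)

      csv_path <- "ebird-observations.csv"   # hypothetical path to the source CSV
      out_path <- "partitioned"              # destination directory for the dataset

      # group_by() variables become the partition keys written by write_dataset()
      open_dataset(csv_path, format = "csv") %>%
        group_by(observation_year) %>%       # hypothetical partition column
        write_dataset(out_path, format = "parquet")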

       

      Let me know if I can provide more info! I'm testing in R with the latest CRAN version of arrow on a machine with 200 GB of RAM.
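
      If the exact build details help, they can be pulled with the usual diagnostics (sketch below; output omitted here):

      # Environment details from the machine where the crash occurs
      arrow::arrow_info()   # arrow version, build capabilities, and memory allocator
      sessionInfo()         # R version and platform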


            People

              Assignee: Unassigned
              Reporter: Carl Boettiger (cboettig)
