Apache Arrow / ARROW-15627

[R] Support unify_schemas for union datasets

Details

    Description

      This also came out of the discussion on https://github.com/apache/arrow/issues/12371

      You can unify schemas across different Parquet files, but it seems you can't union two (or more) datasets that have different schemas. This is odd, because we do compute the unified schema on this line, only to later assert that all the schemas are the same.

      library(arrow)
      library(dplyr)
      
      df1 <- arrow_table(x = array(c(1, 2, 3)),
                         y = array(c("a", "b", "c")))
      df2 <- arrow_table(x = array(c(4, 5)),
                         z = array(c("d", "e")))
      
      df1 %>% write_dataset("example1", format="parquet")
      df2 %>% write_dataset("example2", format="parquet")
      
      ds1 <- open_dataset("example1", format="parquet")
      ds2 <- open_dataset("example2", format="parquet")
      
      # These don't work
      ds <- open_dataset(list(ds1, ds2)) # This fails due to mismatch in schema
      ds <- c(ds1, ds2) # c() does the same thing as open_dataset(list(...)), so it fails too
      ds <- open_dataset(c("example1", "example2"), format="parquet", unify_schemas = TRUE)
      
      # This does work
      ds <- open_dataset(c("example2/part-0.parquet", "example1/part-0.parquet"), format="parquet", unify_schemas = TRUE)
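
      For reference, here is a minimal sketch of the unified schema one would presumably expect for the union of the two datasets above. It assumes the columns are stored as double and string, and builds the two schemas by hand (s1 and s2 are illustrative names, not part of the original report) instead of reading them from disk. It only calls unify_schemas() directly and does not touch the dataset union code path.

      ```r
      library(arrow)

      # Hand-written stand-ins for ds1$schema and ds2$schema (assumed column types)
      s1 <- schema(x = float64(), y = utf8())
      s2 <- schema(x = float64(), z = utf8())

      # Merge the two schemas: the shared field x is kept once,
      # and the fields found in only one schema (y, z) are both retained
      unified <- unify_schemas(s1, s2)
      print(unified)
      ```

      If the union of ds1 and ds2 behaved like the file-level unify_schemas = TRUE case, the resulting dataset would presumably carry a schema like this one, with the column missing from each source filled with nulls.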
      

            People

              wjones127 Will Jones
              wjones127 Will Jones

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 40m