Spark / SPARK-30645

"collect() support Unicode characters" test fails on Windows

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 2.4.5, 3.0.0
    • Component/s: SparkR, Tests
    • Labels: None

    Description

      As written, the test_that("collect() support Unicode characters") case seems to be system dependent and does not work properly on Windows with the CP1252 English locale:

       

      library(SparkR)
      SparkR::sparkR.session()
      Sys.info()
      #           sysname           release           version 
      #         "Windows"      "Server x64"     "build 17763" 
      #          nodename           machine             login 
      # "WIN-5BLT6Q610KH"          "x86-64"   "Administrator" 
      #              user    effective_user 
      #   "Administrator"   "Administrator" 
      
      Sys.getlocale()
      
      # [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
      
      lines <- c("{\"name\":\"안녕하세요\"}",
                 "{\"name\":\"您好\", \"age\":30}",
                 "{\"name\":\"こんにちは\", \"age\":19}",
                 "{\"name\":\"Xin chào\"}")
      
      # Create the file before inspecting it
      jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
      writeLines(lines, jsonPath)
      
      system(paste0("cat ", jsonPath))
      # {"name":"<U+C548><U+B155><U+D558><U+C138><U+C694>"}
      # {"name":"<U+60A8><U+597D>", "age":30}
      # {"name":"<U+3053><U+3093><U+306B><U+3061><U+306F>", "age":19}
      # {"name":"Xin chào"}
      # [1] 0
      
      df <- read.df(jsonPath, "json")
      
      
      printSchema(df)
      # root
      #  |-- _corrupt_record: string (nullable = true)
      #  |-- age: long (nullable = true)
      #  |-- name: string (nullable = true)
      
      head(df)
      #              _corrupt_record age                                     name
      # 1                       <NA>  NA <U+C548><U+B155><U+D558><U+C138><U+C694>
      # 2                       <NA>  30                         <U+60A8><U+597D>
      # 3                       <NA>  19 <U+3053><U+3093><U+306B><U+3061><U+306F>
      # 4 {"name":"Xin ch<U+FFFD>o"}  NA                                     <NA>
      
      

      The problem becomes visible on AppVeyor when testthat is updated to 2.x, but is somehow silenced when testthat 1.x is used.
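
One possible way to make the test data independent of the session locale is to write the file as UTF-8 bytes explicitly. The sketch below is an assumption about how that could be done in base R and is not necessarily the change that resolved this issue. It uses \u escapes for the same four names so that the strings are UTF-8 regardless of the source file encoding.

```r
# Same test data as in the report, spelled with \u escapes.
lines <- c("{\"name\":\"\uc548\ub155\ud558\uc138\uc694\"}",
           "{\"name\":\"\u60a8\u597d\", \"age\":30}",
           "{\"name\":\"\u3053\u3093\u306b\u3061\u306f\", \"age\":19}",
           "{\"name\":\"Xin ch\u00e0o\"}")

jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")

# Open in binary mode and pass useBytes = TRUE so that writeLines() emits the
# UTF-8 bytes as they are instead of re-encoding them to the native code page.
con <- file(jsonPath, open = "wb")
writeLines(enc2utf8(lines), con, useBytes = TRUE)
close(con)

# Read the file back, declaring the bytes as UTF-8, to confirm the round trip.
roundTrip <- readLines(jsonPath, encoding = "UTF-8")
print(identical(enc2utf8(roundTrip), enc2utf8(lines)))
```

With a file written this way, read.df(jsonPath, "json") receives the same bytes on every platform, so the result should no longer depend on LC_CTYPE.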

            People

              Assignee: Maciej Szymkiewicz (zero323)
              Reporter: Maciej Szymkiewicz (zero323)