  Spark / SPARK-28058

Reading csv with DROPMALFORMED sometimes doesn't drop malformed records


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4.1, 2.4.3
    • Fix Version/s: 2.4.4, 3.0.0
    • Component/s: SQL
    • Labels: None

    Description

      The Spark SQL CSV reader is not dropping malformed records as expected.

      Consider this file (fruit.csv). Notice it contains a header record, three valid records, and one malformed record.

      fruit,color,price,quantity
      apple,red,1,3
      banana,yellow,2,4
      orange,orange,3,5
      xxx
      

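      For convenience when reproducing (a sketch, not part of the original report), the same file can be created from the spark-shell session; it assumes the path fruit.csv in the current working directory is writable:

      scala> // write the example file, including the malformed "xxx" row
      scala> java.nio.file.Files.write(java.nio.file.Paths.get("fruit.csv"), "fruit,color,price,quantity\napple,red,1,3\nbanana,yellow,2,4\norange,orange,3,5\nxxx\n".getBytes)
      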
      If I read this file using the Spark SQL CSV reader as follows, everything looks good: the malformed record is dropped.

      scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
      +------+------+-----+--------+                                                  
      |fruit |color |price|quantity|
      +------+------+-----+--------+
      |apple |red   |1    |3       |
      |banana|yellow|2    |4       |
      |orange|orange|3    |5       |
      +------+------+-----+--------+
      

      However, if I select a subset of the columns, the malformed record is not dropped. The malformed data is placed in the first column, and the remaining column(s) are filled with nulls.

      scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
      +------+
      |fruit |
      +------+
      |apple |
      |banana|
      |orange|
      |xxx   |
      +------+
      
      scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
      +------+------+
      |fruit |color |
      +------+------+
      |apple |red   |
      |banana|yellow|
      |orange|orange|
      |xxx   |null  |
      +------+------+
      
      scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price).show(truncate=false)
      +------+------+-----+
      |fruit |color |price|
      +------+------+-----+
      |apple |red   |1    |
      |banana|yellow|2    |
      |orange|orange|3    |
      |xxx   |null  |null |
      +------+------+-----+
      

      And finally, if I manually select all of the columns, the malformed record is once again dropped.

      scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 'quantity).show(truncate=false)
      +------+------+-----+--------+
      |fruit |color |price|quantity|
      +------+------+-----+--------+
      |apple |red   |1    |3       |
      |banana|yellow|2    |4       |
      |orange|orange|3    |5       |
      +------+------+-----+--------+
      

      I would expect the malformed record(s) to be dropped regardless of which columns are being selected from the file.
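      A possible workaround for the affected versions (a sketch, not part of the original report): the behavior looks related to the CSV column pruning introduced in Spark 2.4, where only the columns referenced by the query are parsed. Disabling pruning via the spark.sql.csv.parser.columnPruning.enabled setting should force the parser to parse the whole row, so malformed records are detected even when only a subset of columns is selected.

      scala> // disable CSV column pruning so every column is parsed and malformed rows are detected
      scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)

      scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
      

      With pruning disabled, the query above should return only the three valid fruit values, at the cost of parsing columns the query does not need.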

      People

        Assignee: L. C. Hsieh (viirya)
        Reporter: Stuart White (stwhit)
        Votes: 1
        Watchers: 2
