Spark / SPARK-30530

CSV load followed by "is null" filter produces incorrect results


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: SQL
    • Labels: None

    Description

      Filtering with an "is null" predicate on values loaded from a CSV file has regressed recently and now produces incorrect results.

      Given a CSV file with the contents:

      floats.csv
      100.0,1.0,
      200.0,,
      300.0,3.0,
      1.0,4.0,
      ,4.0,
      500.0,,
      ,6.0,
      -500.0,50.5

      Filtering this data for the first column being null should return exactly two rows, but it is returning extraneous rows with nulls:

      scala> import org.apache.spark.sql.types._
      import org.apache.spark.sql.types._
      
      scala> val schema = StructType(Array(StructField("floats", FloatType, true), StructField("more_floats", FloatType, true)))
      schema: org.apache.spark.sql.types.StructType = StructType(StructField(floats,FloatType,true), StructField(more_floats,FloatType,true))
      
      scala> val df = spark.read.schema(schema).csv("floats.csv")
      df: org.apache.spark.sql.DataFrame = [floats: float, more_floats: float]
      
      scala> df.filter("floats is null").show
      +------+-----------+
      |floats|more_floats|
      +------+-----------+
      |  null|       null|
      |  null|       null|
      |  null|       null|
      |  null|       null|
      |  null|        4.0|
      |  null|       null|
      |  null|        6.0|
      +------+-----------+
      
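
      A possible workaround, not part of the original report and assuming the regression is caused by the CSV filter-pushdown feature introduced in Spark 3.0, is to disable pushdown via the spark.sql.csv.filterPushdown.enabled configuration so the predicate is evaluated after parsing:

      // Sketch of an assumed workaround: disable CSV filter pushdown so the
      // "is null" filter runs as an ordinary post-scan predicate instead of
      // being pushed into the CSV parser.
      scala> spark.conf.set("spark.sql.csv.filterPushdown.enabled", "false")
      
      scala> val df = spark.read.schema(schema).csv("floats.csv")
      
      // With pushdown disabled, only the rows whose first column is truly
      // null should be returned (two rows for the file above).
      scala> df.filter("floats is null").show

      This only sidesteps the bug; the proper fix landed in the resolution of this issue.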

      People

        Assignee: maxgekk Max Gekk
        Reporter: jlowe Jason Darrell Lowe