Description
Filtering with IS NULL on values loaded from a CSV file has regressed recently and now produces incorrect results.
Given a CSV file with the contents:
floats.csv:
100.0,1.0
200.0,
300.0,3.0
1.0,4.0
,4.0
500.0,
,6.0
-500.0,50.5
Filtering this data for rows where the first column is null should return exactly two rows, but instead it returns extraneous rows filled with nulls:
scala> val schema = StructType(Array(StructField("floats", FloatType, true), StructField("more_floats", FloatType, true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(floats,FloatType,true), StructField(more_floats,FloatType,true))

scala> val df = spark.read.schema(schema).csv("floats.csv")
df: org.apache.spark.sql.DataFrame = [floats: float, more_floats: float]

scala> df.filter("floats is null").show
+------+-----------+
|floats|more_floats|
+------+-----------+
|  null|       null|
|  null|       null|
|  null|       null|
|  null|       null|
|  null|        4.0|
|  null|       null|
|  null|        6.0|
+------+-----------+
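Since the regression appears to be caused by the CSV filter pushdown added in SPARK-30323, a possible workaround is to disable that pushdown and re-run the query. This is a sketch, assuming the spark.sql.csv.filterPushdown.enabled flag introduced by that change controls the affected code path (it reuses the schema and file from the session above):

scala> // Hypothetical workaround: turn off CSV filter pushdown so the
scala> // IS NULL predicate is evaluated after parsing, not inside the scan.
scala> spark.conf.set("spark.sql.csv.filterPushdown.enabled", "false")

scala> spark.read.schema(schema).csv("floats.csv").filter("floats is null").show

With pushdown disabled, the filter should yield only the two expected rows, (null, 4.0) and (null, 6.0).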
Issue Links
- is caused by: SPARK-30323 Support filters pushdown in CSV datasource (Resolved)
- links to