Spark / SPARK-29058

Reading CSV file with DROPMALFORMED shows incorrect record count


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version: 2.3.0
    • Fix Version: None
    • Components: PySpark, SQL

    Description

      The Spark SQL CSV reader drops malformed records as expected, but the record count it reports is incorrect.

      Consider this file (fruit.csv):

      apple,red,1,3
      banana,yellow,2,4.56
      orange,orange,3,5
      

      Define the schema as follows:

      schema = "Fruit string,color string,price int,quantity int"
      

      Notice that the "quantity" field is defined as integer type, but the second row in the file contains a floating-point value, so that row is malformed.

      >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
      >>> df.show()
      +------+------+-----+--------+
      | Fruit| color|price|quantity|
      +------+------+-----+--------+
      | apple|   red|    1|       3|
      |orange|orange|    3|       5|
      +------+------+-----+--------+
      
      >>> df.count()
      3
      

      The malformed record is dropped from the output as expected, but the count is wrong: df.count() returns 3 when it should return 2.
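      The discrepancy is most likely due to CSV column pruning: for a bare count() no columns are needed, so Spark skips parsing the field values entirely, malformed values are never detected, and nothing is dropped. A minimal sketch of a workaround, assuming Spark 2.4+ where the spark.sql.csv.parser.columnPruning.enabled option exists (the temp-file setup just recreates fruit.csv from the report):

      ```python
      import os
      import tempfile
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.master("local[1]").appName("drop-malformed").getOrCreate()

      # Force the CSV parser to materialize every column, so malformed rows
      # are detected (and dropped) even when only a row count is requested.
      spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")

      # Recreate fruit.csv from the report in a temp directory.
      path = os.path.join(tempfile.mkdtemp(), "fruit.csv")
      with open(path, "w") as f:
          f.write("apple,red,1,3\nbanana,yellow,2,4.56\norange,orange,3,5\n")

      schema = "Fruit string,color string,price int,quantity int"
      df = spark.read.csv(path=path, mode="DROPMALFORMED", schema=schema)
      count = df.count()
      print(count)
      ```

      Another commonly suggested workaround is to materialize the parsed rows first (e.g. df.cache() followed by an action) so that the count runs over already-parsed data rather than re-reading the file with pruning applied.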

          People

            Assignee: Unassigned
            Reporter: Suchintak Patnaik