Spark / SPARK-29621

Querying internal corrupt record column should not be allowed in filter operation


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: PySpark

    Description

      As per https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126:
      "Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"

      However, a query that references only the internal corrupt record column is still allowed when the reference occurs in a filter operation.

      from pyspark.sql import SparkSession
      from pyspark.sql.types import *
      
      spark = SparkSession.builder.getOrCreate()
      
      # Schema that explicitly includes the internal corrupt record column
      schema = StructType([
          StructField("_corrupt_record", StringType(), False),
          StructField("Name", StringType(), False),
          StructField("Colour", StringType(), True),
          StructField("Price", IntegerType(), True),
          StructField("Quantity", IntegerType(), True)])
      
      df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
      
      # Referencing only _corrupt_record inside a filter is not rejected:
      df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
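
      For contrast, the select path described in the quoted source comment is rejected, and that same comment suggests caching or saving the parsed results before issuing such a query. A minimal sketch, assuming the df built above:

      from pyspark.sql.utils import AnalysisException
      
      # A select() that references only the corrupt record column is disallowed:
      try:
          df.select("_corrupt_record").show()
      except AnalysisException as e:
          print(e)  # complains that only the internal corrupt record column is referenced
      
      # Workaround from the source comment: cache (or save) the parsed
      # results, then run the same query against the cached data.
      df.cache()
      df.select("_corrupt_record").show()  # Allowed after caching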
      

          People

            Assignee: Unassigned
            Reporter: Suchintak Patnaik
            Votes: 0
            Watchers: 1
