Spark / SPARK-29621

Querying internal corrupt record column should not be allowed in filter operation


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: PySpark

    Description

      As per https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126:
      "Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"

      However, a query that references only the internal corrupt record column is still allowed when the reference occurs in a filter operation.

      from pyspark.sql import SparkSession
      from pyspark.sql.types import *
      
      spark = SparkSession.builder.getOrCreate()
      
      # Schema that explicitly includes the internal corrupt record column
      schema = StructType([
          StructField("_corrupt_record", StringType(), False),
          StructField("Name", StringType(), False),
          StructField("Colour", StringType(), True),
          StructField("Price", IntegerType(), True),
          StructField("Quantity", IntegerType(), True)])
      
      df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
      
      # Referencing only _corrupt_record inside a filter is not rejected:
      df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
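
      For contrast, the select path described in the quoted source comment is rejected, and that same comment suggests caching or saving the parsed results before issuing such a query. A minimal sketch, assuming the df built above:

      from pyspark.sql.utils import AnalysisException
      
      # A select() that references only the corrupt record column is disallowed:
      try:
          df.select("_corrupt_record").show()
      except AnalysisException as e:
          print(e)  # complains that only the internal corrupt record column is referenced
      
      # Workaround from the source comment: cache (or save) the parsed
      # results, then run the same query against the cached data.
      df.cache()
      df.select("_corrupt_record").show()  # Allowed after caching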
      

          People

            Assignee: Unassigned
            Reporter: Suchintak Patnaik
            Votes: 0
            Watchers: 1
