Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-47230

The column pruning is not working as expected for nested struct in an array

    XMLWordPrintableJSON

Details

    Description

      See the code snippet below, when explode an array of struct and select one field in the struct,  some unexpected behaviour observed:

      • If the field in the struct is in the select clause, not in the where clause, the column pruning works as expected.
      • If the field in the struct is in the select clause and in the where clause, the column pruning not working.
      • If the field in the struct is not even selected, the column pruning not working
      from pyspark.sql import SparkSession
      from pyspark.sql.types import IntegerType, StructField, StructType, ArrayType
      import random
      
      spark = SparkSession.builder.appName("example").getOrCreate()# Create an RDD with an array of structs, each array having a random size between 5 to 10
      rdd = spark.range(1000).rdd.map(lambda x: (x.id + 3, [(x.id + i, x.id - i) for i in range(1, random.randint(5, 11))]))
      
      # Define a new schema
      
      schema = StructType([
          StructField("a", IntegerType(), True),
          StructField("b", ArrayType(StructType([
              StructField("x", IntegerType(), True),
              StructField("y", IntegerType(), True)
          ])), True)
      ])
      
      # Create a DataFrame with the new schema
      df = spark.createDataFrame(rdd, schema=schema)
      
      # Write the DataFrame to a parquet file
      df.repartition(1).write.mode('overwrite').parquet('test.parquet')
      
      # Read the parquet file back into a DataFrame
      df = spark.read.parquet('test.parquet') 
      df.createOrReplaceTempView("df_view")
      spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'true')
      
      # case 1, as expected
      sql_query = """
      SELECT a, EXPLODE(b.x) AS bb
      FROM df_view
      """
      spark.sql(sql_query).explain()
      
      # ReadSchema: struct<a:int,b:array<struct<x:int>>>
      
      # case 2, as expected
      sql_query = """
      SELECT a, bb.x 
      FROM df_view 
      lateral view explode(b) as bb
      """
      spark.sql(sql_query).explain()
      
      # ReadSchema: struct<a:int,b:array<struct<x:int>>>
      
      # case 3, bug: should only read b.x
      sql_query = """
      SELECT a, bb.x 
      FROM df_view 
      lateral view explode(b) as bb
      where bb.x is not null
      """
      spark.sql(sql_query).explain()
      
      #ReadSchema: struct<a:int,b:array<struct<x:int,y:int>>>
      
      #case 4, bug? seems no need to read both a and b
      sql_query = """
      SELECT a
      FROM df_view 
      lateral view explode(b) as bb
      """
      spark.sql(sql_query).explain()
      
      #ReadSchema: struct<a:int,b:array<struct<x:int,y:int>>>

       
       
       

      Attachments

        Activity

          People

            Unassigned Unassigned
            mathfool Ling Qin
            Votes:
            1 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: