[SPARK-47230] The column pruning is not working as expected for nested struct in an array - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.3.0, 3.4.0, 3.5.0, 3.5.1
Fix Version/s: None
Component/s: Optimizer, Spark Core, Spark Shell
Labels:
- pull-request-available

Description

See the code snippet below, when explode an array of struct and select one field in the struct, some unexpected behaviour observed:

If the field in the struct is in the select clause, not in the where clause, the column pruning works as expected.
If the field in the struct is in the select clause and in the where clause, the column pruning not working.
If the field in the struct is not even selected, the column pruning not working

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType, ArrayType
import random

spark = SparkSession.builder.appName("example").getOrCreate()# Create an RDD with an array of structs, each array having a random size between 5 to 10
rdd = spark.range(1000).rdd.map(lambda x: (x.id + 3, [(x.id + i, x.id - i) for i in range(1, random.randint(5, 11))]))

# Define a new schema

schema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", ArrayType(StructType([
        StructField("x", IntegerType(), True),
        StructField("y", IntegerType(), True)
    ])), True)
])

# Create a DataFrame with the new schema
df = spark.createDataFrame(rdd, schema=schema)

# Write the DataFrame to a parquet file
df.repartition(1).write.mode('overwrite').parquet('test.parquet')

# Read the parquet file back into a DataFrame
df = spark.read.parquet('test.parquet') 
df.createOrReplaceTempView("df_view")
spark.conf.set('spark.sql.optimizer.nestedSchemaPruning.enabled', 'true')

# case 1, as expected
sql_query = """
SELECT a, EXPLODE(b.x) AS bb
FROM df_view
"""
spark.sql(sql_query).explain()

# ReadSchema: struct<a:int,b:array<struct<x:int>>>

# case 2, as expected
sql_query = """
SELECT a, bb.x 
FROM df_view 
lateral view explode(b) as bb
"""
spark.sql(sql_query).explain()

# ReadSchema: struct<a:int,b:array<struct<x:int>>>

# case 3, bug: should only read b.x
sql_query = """
SELECT a, bb.x 
FROM df_view 
lateral view explode(b) as bb
where bb.x is not null
"""
spark.sql(sql_query).explain()

#ReadSchema: struct<a:int,b:array<struct<x:int,y:int>>>

#case 4, bug? seems no need to read both a and b
sql_query = """
SELECT a
FROM df_view 
lateral view explode(b) as bb
"""
spark.sql(sql_query).explain()

#ReadSchema: struct<a:int,b:array<struct<x:int,y:int>>>

Attachments

Issue Links

links to

GitHub Pull Request #45839

GitHub Pull Request #45953

Activity

People

Assignee:: Unassigned

Reporter:: Ling Qin

Votes:: 1 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 29/Feb/24 15:31

Updated:: 10/Apr/24 10:18