Description
It seems like a missing feature that you can't compare structs in a filter on a DataFrame.
Here's a simple demonstration of a) where this would be useful and b) how it's different from simply comparing each of the components of the structs.
import pyspark
from pyspark.sql.functions import col, struct, concat

spark = pyspark.sql.SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ('Boston', 'Bob'),
        ('Boston', 'Nick'),
        ('San Francisco', 'Bob'),
        ('San Francisco', 'Nick'),
    ],
    ['city', 'person']
)

pairs = (
    df.select(
        struct('city', 'person').alias('p1')
    )
    .crossJoin(
        df.select(
            struct('city', 'person').alias('p2')
        )
    )
)

print("Everything")
pairs.show()

print("Comparing parts separately (doesn't give me what I want)")
(pairs
    .where(col('p1.city') < col('p2.city'))
    .where(col('p1.person') < col('p2.person'))
    .show())

print("Comparing parts together with concat (gives me what I want but is hacky)")
(pairs
    .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
    .show())

print("Comparing parts together with struct (my desired solution but currently yields an error)")
(pairs
    .where(col('p1') < col('p2'))
    .show())
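For reference, the comparison I'd expect from struct `<` is lexicographic: compare `city` first, and only look at `person` on ties. Neither workaround above matches that exactly, and the concat trick can even conflate distinct pairs when field boundaries blur. A minimal pure-Python sketch of the intended semantics (plain tuples standing in for structs):

```python
# Lexicographic ordering over (city, person) pairs -- the semantics one
# would expect `p1 < p2` on structs to have.

def struct_lt(p1, p2):
    """Python tuples already compare lexicographically."""
    return p1 < p2  # city decides first; person breaks ties

# Componentwise filtering (the first workaround) is stricter: it drops
# pairs where the first field decides the order but the second disagrees.
assert struct_lt(('Boston', 'Nick'), ('San Francisco', 'Bob'))
assert not (('Boston' < 'San Francisco') and ('Nick' < 'Bob'))

# The concat trick can diverge too: distinct pairs may concatenate to the
# same string once the boundary between fields is lost.
assert ('ab', 'c') != ('a', 'bc')
assert 'ab' + 'c' == 'a' + 'bc'
```

In Spark terms the same lexicographic predicate can be spelled out by hand, e.g. `(col('p1.city') < col('p2.city')) | ((col('p1.city') == col('p2.city')) & (col('p1.person') < col('p2.person')))`, though that gets verbose quickly for wide structs.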
The last query yields the following error in Spark 2.1.1:
org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint or int or bigint or float or double or decimal or timestamp or date or string or binary) type, not struct<city:string,person:string>;;
'Filter (p1#5 < p2#8)
+- Join Cross
   :- Project [named_struct(city, city#0, person, person#1) AS p1#5]
   :  +- LogicalRDD [city#0, person#1]
   +- Project [named_struct(city, city#0, person, person#1) AS p2#8]
      +- LogicalRDD [city#0, person#1]