Description
It seems like a missing feature that you can't compare structs in a filter on a DataFrame.
Here's a simple demonstration of a) where this would be useful and b) how it's different from simply comparing each of the components of the structs.
import pyspark
from pyspark.sql.functions import col, struct, concat

spark = pyspark.sql.SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ('Boston', 'Bob'),
        ('Boston', 'Nick'),
        ('San Francisco', 'Bob'),
        ('San Francisco', 'Nick'),
    ],
    ['city', 'person']
)

pairs = (
    df.select(
        struct('city', 'person').alias('p1')
    )
    .crossJoin(
        df.select(
            struct('city', 'person').alias('p2')
        )
    )
)

print("Everything")
pairs.show()

print("Comparing parts separately (doesn't give me what I want)")
(pairs
    .where(col('p1.city') < col('p2.city'))
    .where(col('p1.person') < col('p2.person'))
    .show())

print("Comparing parts together with concat (gives me what I want but is hacky)")
(pairs
    .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
    .show())

print("Comparing parts together with struct (my desired solution but currently yields an error)")
(pairs
    .where(col('p1') < col('p2'))
    .show())
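For reference, the comparison I'd expect from struct `<` is lexicographic: compare `city` first, and only look at `person` on ties. Neither workaround above matches that exactly, and the concat trick can even conflate distinct pairs when field boundaries blur. A minimal pure-Python sketch of the intended semantics (plain tuples standing in for structs):

```python
# Lexicographic ordering over (city, person) pairs -- the semantics one
# would expect `p1 < p2` on structs to have.

def struct_lt(p1, p2):
    """Python tuples already compare lexicographically."""
    return p1 < p2  # city decides first; person breaks ties

# Componentwise filtering (the first workaround) is stricter: it drops
# pairs where the first field decides the order but the second disagrees.
assert struct_lt(('Boston', 'Nick'), ('San Francisco', 'Bob'))
assert not (('Boston' < 'San Francisco') and ('Nick' < 'Bob'))

# The concat trick can diverge too: distinct pairs may concatenate to the
# same string once the boundary between fields is lost.
assert ('ab', 'c') != ('a', 'bc')
assert 'ab' + 'c' == 'a' + 'bc'
```

In Spark terms the same lexicographic predicate can be spelled out by hand, e.g. `(col('p1.city') < col('p2.city')) | ((col('p1.city') == col('p2.city')) & (col('p1.person') < col('p2.person')))`, though that gets verbose quickly for wide structs.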
The last query yields the following error in Spark 2.1.1:
org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint or int or bigint or float or double or decimal or timestamp or date or string or binary) type, not struct<city:string,person:string>;;
'Filter (p1#5 < p2#8)
+- Join Cross
   :- Project [named_struct(city, city#0, person, person#1) AS p1#5]
   :  +- LogicalRDD [city#0, person#1]
   +- Project [named_struct(city, city#0, person, person#1) AS p2#8]
      +- LogicalRDD [city#0, person#1]