Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-21110

Structs should be usable in inequality filters

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • 2.1.1
    • 2.3.0
    • SQL
    • None

    Description

      It seems like a missing feature that you can't compare structs in a filter on a DataFrame.

      Here's a simple demonstration of a) where this would be useful and b) how it's different from simply comparing each of the components of the structs.

      import pyspark
      from pyspark.sql.functions import col, struct, concat
      
      spark = pyspark.sql.SparkSession.builder.getOrCreate()
      
      df = spark.createDataFrame(
          [
              ('Boston', 'Bob'),
              ('Boston', 'Nick'),
              ('San Francisco', 'Bob'),
              ('San Francisco', 'Nick'),
          ],
          ['city', 'person']
      )
      pairs = (
          df.select(
              struct('city', 'person').alias('p1')
          )
          .crossJoin(
              df.select(
                  struct('city', 'person').alias('p2')
              )
          )
      )
      
      print("Everything")
      pairs.show()
      
      print("Comparing parts separately (doesn't give me what I want)")
      (pairs
          .where(col('p1.city') < col('p2.city'))
          .where(col('p1.person') < col('p2.person'))
          .show())
      
      print("Comparing parts together with concat (gives me what I want but is hacky)")
      (pairs
          .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
          .show())
      
      print("Comparing parts together with struct (my desired solution but currently yields an error)")
      (pairs
          .where(col('p1') < col('p2'))
          .show())
      

      The last query yields the following error in Spark 2.1.1:

      org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint or int or bigint or float or double or decimal or timestamp or date or string or binary) type, not struct<city:string,person:string>;;
      'Filter (p1#5 < p2#8)
      +- Join Cross
         :- Project [named_struct(city, city#0, person, person#1) AS p1#5]
         :  +- LogicalRDD [city#0, person#1]
         +- Project [named_struct(city, city#0, person, person#1) AS p2#8]
            +- LogicalRDD [city#0, person#1]
      

      Attachments

        Activity

          People

            a1ray Andrew Ray
            nchammas Nicholas Chammas
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: