Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-17863

SELECT distinct does not work if there is a order by clause

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • None
    • 2.0.2, 2.1.0
    • SQL

    Description

      select distinct struct.a, struct.b
      from (
        select named_struct('a', 1, 'b', 2, 'c', 3) as struct
        union all
        select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
      order by struct.a, struct.b
      

      This query generates

      +---+---+
      |  a|  b|
      +---+---+
      |  1|  2|
      |  1|  2|
      +---+---+
      

      The plan is wrong because the analyze somehow added struct#21805 to the project list, which changes the semantic of the distinct (basically, the query is changed to select distinct struct.a, struct.b, struct from select distinct struct.a, struct.b).

      == Parsed Logical Plan ==
      'Sort ['struct.a ASC, 'struct.b ASC], true
      +- 'Distinct
         +- 'Project ['struct.a, 'struct.b]
            +- 'SubqueryAlias tmp
               +- 'Union
                  :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
                  :  +- OneRowRelation$
                  +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
                     +- OneRowRelation$
      
      == Analyzed Logical Plan ==
      a: int, b: int
      Project [a#21819, b#21820]
      +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
         +- Distinct
            +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, struct#21805]
               +- SubqueryAlias tmp
                  +- Union
                     :- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
                     :  +- OneRowRelation$
                     +- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
                        +- OneRowRelation$
      
      == Optimized Logical Plan ==
      Project [a#21819, b#21820]
      +- Sort [struct#21805.a ASC, struct#21805.b ASC], true
         +- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, struct#21805]
            +- Union
               :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
               :  +- OneRowRelation$
               +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
                  +- OneRowRelation$
      
      == Physical Plan ==
      *Project [a#21819, b#21820]
      +- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0
         +- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200)
            +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], output=[a#21819, b#21820, struct#21805])
               +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200)
                  +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], output=[a#21819, b#21820, struct#21805])
                     +- Union
                        :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
                        :  +- Scan OneRowRelation[]
                        +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
                           +- Scan OneRowRelation[]
      

      If you use the following query, you will get the correct result

      select distinct struct.a, struct.b
      from (
        select named_struct('a', 1, 'b', 2, 'c', 3) as struct
        union all
        select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
      order by a, b
      

      Attachments

        Activity

          People

            davies Davies Liu
            yhuai Yin Huai
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: