Project: Spark
Parent: SPARK-7712 Window Function Improvements
Issue: SPARK-8712

Hive's Parser does not support distinct aggregations with OVER clause

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.4.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

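      Hive's parser ignores the window specification when a distinct aggregation is used: the OVER clause is dropped without an error and the query is planned as a plain aggregate. The two plans below come from queries that differ only in the DISTINCT keyword.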

      scala> Seq((1, 2, 3)).toDF("i", "j", "k").registerTempTable("t")
      
      scala> sql("select count(distinct j) over (partition by i) from t").explain(true)
      == Parsed Logical Plan ==
      'Project [UnresolvedAlias
       CountDistinct
        UnresolvedAttribute [j]
      ]
       'UnresolvedRelation [t], None
      
      == Analyzed Logical Plan ==
      _c0: bigint
      Aggregate [COUNT(DISTINCT j#23) AS _c0#27L]
       Subquery t
        Project [_1#19 AS i#22,_2#20 AS j#23,_3#21 AS k#24]
         LocalRelation [_1#19,_2#20,_3#21], [[1,2,3]]
      
      == Optimized Logical Plan ==
      Aggregate [COUNT(DISTINCT j#23) AS _c0#27L]
       LocalRelation [j#23], [[2]]
      
      == Physical Plan ==
      GeneratedAggregate false, [CombineAndCount(partialSets#28) AS _c0#27L], false
       Exchange SinglePartition
        GeneratedAggregate true, [AddToHashSet(j#23) AS partialSets#28], false
         LocalTableScan [j#23], [[2]]
      
      Code Generation: true
      == RDD ==
      
      scala> sql("select count(j) over (partition by i) from t").explain(true)
      == Parsed Logical Plan ==
      'Project [UnresolvedAlias
       WindowExpression
        UnresolvedWindowFunction count
         UnresolvedAttribute [j]
        WindowSpecDefinition UnspecifiedFrame
         UnresolvedAttribute [i]
      ]
       'UnresolvedRelation [t], None
      
      == Analyzed Logical Plan ==
      _c0: bigint
      Project [_c0#31L]
       Project [j#23,i#22,_c0#31L,_c0#31L]
        Window [j#23,i#22], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(j#23) WindowSpecDefinition ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS _c0#31L], WindowSpecDefinition ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
         Project [j#23,i#22]
          Subquery t
           Project [_1#19 AS i#22,_2#20 AS j#23,_3#21 AS k#24]
            LocalRelation [_1#19,_2#20,_3#21], [[1,2,3]]
      
      == Optimized Logical Plan ==
      Project [_c0#31L]
       Window [j#23,i#22], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(j#23) WindowSpecDefinition ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS _c0#31L], WindowSpecDefinition ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        LocalRelation [j#23,i#22], [[2,1]]
      
      == Physical Plan ==
      Project [_c0#31L]
       Window [j#23,i#22], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(j#23) WindowSpecDefinition ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS _c0#31L], WindowSpecDefinition ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ExternalSort [i#22 ASC], false
         Exchange (HashPartitioning 200)
          LocalTableScan [j#23,i#22], [[2,1]]
      
      Code Generation: true
      == RDD ==
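
For reference, the result the first query would be expected to produce if the window specification were honored can be sketched in plain Scala, without Spark. This is only an illustration of the intended semantics, not Spark's or Hive's implementation; the `Rec` case class, the `countDistinctOver` helper, and the sample rows are hypothetical names and data introduced here. The snippet is meant to be pasted into a Scala REPL.

```scala
// Plain-Scala sketch of what
//   select count(distinct j) over (partition by i) from t
// is expected to return. Rec and countDistinctOver are hypothetical names.
final case class Rec(i: Int, j: Int, k: Int)

// Returns one value per input row, in input order.
def countDistinctOver(rows: Seq[Rec]): Seq[Long] = {
  // Number of distinct j values within each partition key i.
  val distinctPerPartition: Map[Int, Long] =
    rows.groupBy(_.i).map { case (key, group) =>
      key -> group.map(_.j).distinct.size.toLong
    }
  // A window function keeps every input row, unlike a plain aggregate,
  // which collapses the input to a single row.
  rows.map(row => distinctPerPartition(row.i))
}

// Illustrative data: partition i = 1 holds j values 2, 2, 5; partition i = 2 holds 7.
val sample = Seq(Rec(1, 2, 3), Rec(1, 2, 4), Rec(1, 5, 6), Rec(2, 7, 8))
val result = countDistinctOver(sample)
```

With a single input row, as in the table `t` above, both the windowed form and the plain aggregate yield one row, so the dropped window specification shows up only in the plan. On multi-row input the two differ in row count: the windowed form returns one value per input row.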
      

            People

              Assignee: Unassigned
              Reporter: Yin Huai (yhuai)
              Votes: 0
              Watchers: 2
