Spark / SPARK-18358

Multiple Aggregation Using 'countDistinct' and 'first' result in error


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • None
    • 2.0.2
    • None
    • None
    • Environment: Mac OS X 10.9.5
      Apache Spark 2.0.1
      Hadoop 1.4

    Description

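      Using PySpark, when I perform multiple aggregations on the same groupBy object with the functions 'first' and 'countDistinct', the call fails with a Py4JJavaError. The failure only occurs when two 'countDistinct' aggregations are combined with at least one 'first'; the other combinations shown below work.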

      from pyspark.sql import SparkSession
      import pyspark.sql.functions as sfn
      
      # Bind the session to the name `spark`, which createDataFrame uses below
      spark = SparkSession.builder.master('local').getOrCreate()
      
      df = spark.createDataFrame([
              (1, 'a', 'z'),
              (1, 'b', 'x'),
              (1, 'a', 'y'),
              (1, 'a', 'x'),
              (2, 'b', 'z'),
              (2, 'b', 'z')
          ], ['id', 'var1', 'var2'])
      
      ## Two 'first' aggregations with one 'countDistinct' work
      df.groupby('id').agg(
              sfn.first('var1'),
              sfn.first('var2'),
              sfn.countDistinct('var1')).show()

      ## One 'max' with two 'countDistinct' aggregations also works
      df.groupby('id').agg(
              sfn.max('var2'),
              sfn.countDistinct('var1'),
              sfn.countDistinct('var2')).show()

      ## But two 'countDistinct' aggregations with at least one 'first' crash
      df.groupby('id').agg(
              sfn.first('var1'),
              sfn.first('var2'),
              sfn.countDistinct('var1'),
              sfn.countDistinct('var2')).show()
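
      For reference, the result the failing query would be expected to return can be worked out in plain Python. This is only a sketch and not Spark's implementation: `expected_aggregation` is a hypothetical helper introduced here for illustration, and it assumes 'first' returns the first row in input order, which Spark does not guarantee once data is shuffled across partitions.

```python
from collections import OrderedDict

# Same rows as the DataFrame in the report: (id, var1, var2)
rows = [
    (1, 'a', 'z'),
    (1, 'b', 'x'),
    (1, 'a', 'y'),
    (1, 'a', 'x'),
    (2, 'b', 'z'),
    (2, 'b', 'z'),
]


def expected_aggregation(rows):
    """Hypothetical helper: group by id and compute, per group,
    first(var1), first(var2), countDistinct(var1), countDistinct(var2).

    Assumes 'first' means first in input order (an assumption, see above).
    """
    groups = OrderedDict()
    for id_, var1, var2 in rows:
        groups.setdefault(id_, []).append((var1, var2))

    result = []
    for id_, pairs in groups.items():
        var1s = [p[0] for p in pairs]
        var2s = [p[1] for p in pairs]
        result.append(
            (id_, var1s[0], var2s[0], len(set(var1s)), len(set(var2s)))
        )
    return result


expected = expected_aggregation(rows)
for row in expected:
    print(row)
```

      Under that ordering assumption, each printed tuple corresponds to one row of the table that the crashing `.show()` call would be expected to display.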
      

            People

              Assignee: Unassigned
              Reporter: Chris Nasrallah (nasrallah)