Description
When a WindowSpec's orderBy uses multiple columns, the ordering seems to be applied only to the first column.
A possible workaround is to sort the DataFrame first and then apply the window spec over the sorted DataFrame.
e.g.
THIS DOES NOT WORK:
window_sum = Window.partitionBy('user_unique_id').orderBy('creation_date', 'mib_id', 'day').rowsBetween(-sys.maxsize, 0)
df = df.withColumn('user_version', func.sum(df.group_counter).over(window_sum))
THIS WORKS WELL:
df = df.sort('user_unique_id', 'creation_date', 'mib_id', 'day')
window_sum = Window.partitionBy('user_unique_id').orderBy('creation_date', 'mib_id', 'day').rowsBetween(-sys.maxsize, 0)
df = df.withColumn('user_version', func.sum(df.group_counter).over(window_sum))
Also, can anybody confirm that this is a true workaround?
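For reference, here is a minimal pure-Python sketch of the result the window above is expected to produce: a running sum of group_counter per user_unique_id, with rows ordered by creation_date, then mib_id, then day. It does not use Spark at all. The function name expected_user_version and the sample rows are hypothetical and only illustrate the intended ordering. Comparing the output of either snippet against such a reference should show whether the secondary sort columns are honoured.

```python
from itertools import groupby
from operator import itemgetter


def expected_user_version(rows):
    """Running sum of group_counter per user, ordered by
    (creation_date, mib_id, day) inside each user partition."""
    # Sort by the partition key first, then by every orderBy column.
    full_key = itemgetter('user_unique_id', 'creation_date', 'mib_id', 'day')
    ordered = sorted(rows, key=full_key)

    result = []
    for _, partition in groupby(ordered, key=itemgetter('user_unique_id')):
        running = 0  # the sum restarts for every user
        for row in partition:
            running += row['group_counter']
            result.append({**row, 'user_version': running})
    return result


# Hypothetical sample rows: same creation_date, so mib_id and day decide the order.
rows = [
    {'user_unique_id': 'u1', 'creation_date': '2015-01-01', 'mib_id': 2, 'day': 1, 'group_counter': 1},
    {'user_unique_id': 'u1', 'creation_date': '2015-01-01', 'mib_id': 1, 'day': 2, 'group_counter': 1},
    {'user_unique_id': 'u1', 'creation_date': '2015-01-01', 'mib_id': 1, 'day': 1, 'group_counter': 0},
    {'user_unique_id': 'u2', 'creation_date': '2015-01-02', 'mib_id': 1, 'day': 1, 'group_counter': 1},
]

expected = expected_user_version(rows)
for row in expected:
    print(row['user_unique_id'], row['mib_id'], row['day'], row['user_version'])
```

If the window only respected creation_date, the three u1 rows could come back in any relative order, and user_version would differ from this reference.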
Issue Links
- is duplicated by SPARK-11009 RowNumber in HiveContext returns negative values in cluster mode (Resolved)