SPARK-47732: Ability for Spark to execute "embarrassingly parallel" operations


Details

    • Type: Wish
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.1
    • Fix Version/s: None
    • Component/s: PySpark

    Description

      Consider a large operation that applies over the distinct values in a column, where each distinct value is a universe unto itself, but running the whole batch at once is certain to OOM.

      This could be many geographies, item classes that are only comparable to items of their own type, or similar cases. Perhaps each subset is cartesian-joined on itself, so that the subsets are tractable but the whole dataset is not; the details depend on the use case.

      Because the whole dataset is large, a single join condition is intractable. Today, we might for-loop over the distinct values, persist each result, and union them. However, that for-loop is sequential for no good reason at all.
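
      For concreteness, a minimal sketch of that sequential workaround follows, assuming a DataFrame `df` with a key column `key`, a per-subset transformation `process`, and the paths `/data/input` and `/data/out`; all of these names are illustrative, not part of any existing API.

```python
from functools import reduce

from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()


def process(subset: DataFrame) -> DataFrame:
    # Placeholder per-subset transformation, e.g. a self cartesian join
    # that is tractable within a single key's universe but not globally.
    return subset


df = spark.read.parquet("/data/input")  # illustrative input path

# Collect the distinct key values, then handle each subset one at a time.
keys = [row["key"] for row in df.select("key").distinct().collect()]

results = []
for k in keys:  # strictly sequential: subset N+1 waits on subset N
    out = process(df.filter(F.col("key") == k))
    out.write.mode("overwrite").parquet(f"/data/out/key={k}")
    results.append(spark.read.parquet(f"/data/out/key={k}"))

combined = reduce(DataFrame.unionByName, results)
```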
      I propose an API, something like `pyspark.sql.DataFrame.applyOver`, that takes three arguments:

      1. A function that accepts a `pyspark.sql.DataFrame` (namely, this one) and returns a `pyspark.sql.DataFrame`
      2. A column or list of columns over which distinct values are found
      3. The schema of the `pyspark.sql.DataFrame` returned by the function in (1)

      This method would apply (1) to each subset of the DataFrame defined by the distinct values of (2), in parallel across the available executors, evaluate each result, and return the union of those results. Even a thoroughly naive implementation that persists each subset's result to shared storage and returns an explicit union of those files would reduce overall execution time massively: in the best case, to the runtime of the single longest subset rather than the sum of all of them.
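
      As an illustration only (neither `applyOver` nor this signature exists in PySpark today, and `process_schema` is a hypothetical schema object), usage might look like the comment at the top of the sketch below. The runnable part is a rough approximation that is already possible today by driving the per-key jobs from a thread pool on the driver; `spark`, `df`, `process`, `keys`, `F`, and the paths carry over from the earlier sketch.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical usage of the proposed API (illustrative only; neither the
# method nor `process_schema` exists today):
#
#     result = df.applyOver(
#         process,          # (1) function: DataFrame -> DataFrame
#         "key",            # (2) column(s) whose distinct values define the subsets
#         process_schema,   # (3) schema of the DataFrame returned by (1)
#     )

# Rough approximation available today: submit each subset's job from a
# driver-side thread pool so Spark can schedule the jobs concurrently,
# persist each result to shared storage, and read the partitioned output
# back as a single DataFrame.
def run_one(k):
    out = process(df.filter(F.col("key") == k))
    out.write.mode("overwrite").parquet(f"/data/out/key={k}")


with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(run_one, keys))

# Partition discovery over /data/out/key=... unions the subsets back together.
combined = spark.read.parquet("/data/out")
```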


          People

            Assignee: Unassigned
            Reporter: Philip Kahn (tigerhawkvok)
            Votes: 0
            Watchers: 1
