SPARK-47732: Ability for Spark to execute "embarrassingly parallel" operations


Details

    • Type: Wish
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.1
    • Fix Version/s: None
    • Component/s: PySpark

    Description

      Consider a large operation that applies over the distinct values in a column, where each distinct value is a universe unto itself, but running the whole batch at once is certain to OOM.

      This could be many geographies, item classes that are only comparable to items of their own type, or similar cases. Perhaps each subset is cartesian-joined on itself, so that the subsets are tractable but the whole dataset is not; the details depend on the use case.

      Because the whole dataset is large, a single join condition is intractable. Today, we might for-loop over the distinct values, persist each result, and union them. However, that for-loop is sequential for no good reason at all.
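
      For concreteness, a minimal sketch of that sequential workaround follows, assuming a DataFrame `df` with a key column `key`, a per-subset transformation `process`, and the paths `/data/input` and `/data/out`; all of these names are illustrative, not part of any existing API.

```python
from functools import reduce

from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()


def process(subset: DataFrame) -> DataFrame:
    # Placeholder per-subset transformation, e.g. a self cartesian join
    # that is tractable within a single key's universe but not globally.
    return subset


df = spark.read.parquet("/data/input")  # illustrative input path

# Collect the distinct key values, then handle each subset one at a time.
keys = [row["key"] for row in df.select("key").distinct().collect()]

results = []
for k in keys:  # strictly sequential: subset N+1 waits on subset N
    out = process(df.filter(F.col("key") == k))
    out.write.mode("overwrite").parquet(f"/data/out/key={k}")
    results.append(spark.read.parquet(f"/data/out/key={k}"))

combined = reduce(DataFrame.unionByName, results)
```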
      I propose an API, something like `pyspark.sql.DataFrame.applyOver`, that takes three arguments:

      1. A function that accepts a `pyspark.sql.DataFrame` (namely, this one) and returns a `pyspark.sql.DataFrame`
      2. A column or list of columns over which distinct values are found
      3. The schema of the `pyspark.sql.DataFrame` returned by the function in (1)

      This method would apply (1) to each subset of the DataFrame defined by the distinct values of (2), in parallel across the available executors, evaluate each result, and return the union of those results. Even a thoroughly naive implementation that persists each subset's result to shared storage and returns an explicit union of those files would reduce overall execution time massively: in the best case, to the runtime of the single longest subset rather than the sum of all of them.
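
      As an illustration only (neither `applyOver` nor this signature exists in PySpark today, and `process_schema` is a hypothetical schema object), usage might look like the comment at the top of the sketch below. The runnable part is a rough approximation that is already possible today by driving the per-key jobs from a thread pool on the driver; `spark`, `df`, `process`, `keys`, `F`, and the paths carry over from the earlier sketch.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical usage of the proposed API (illustrative only; neither the
# method nor `process_schema` exists today):
#
#     result = df.applyOver(
#         process,          # (1) function: DataFrame -> DataFrame
#         "key",            # (2) column(s) whose distinct values define the subsets
#         process_schema,   # (3) schema of the DataFrame returned by (1)
#     )

# Rough approximation available today: submit each subset's job from a
# driver-side thread pool so Spark can schedule the jobs concurrently,
# persist each result to shared storage, and read the partitioned output
# back as a single DataFrame.
def run_one(k):
    out = process(df.filter(F.col("key") == k))
    out.write.mode("overwrite").parquet(f"/data/out/key={k}")


with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(run_one, keys))

# Partition discovery over /data/out/key=... unions the subsets back together.
combined = spark.read.parquet("/data/out")
```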


          People

            Assignee: Unassigned
            Reporter: Philip Kahn (tigerhawkvok)
            Votes: 0
            Watchers: 1
