SPARK-44767

Plugin API for PySpark and SparkR workers


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.4.1
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      An API to customize Python and R workers would allow extensibility beyond what can be expressed via static configs and environment variables such as spark.pyspark.python.
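      As a sketch of what such a hook could look like in Scala: the trait name WorkerProcessPlugin and its signature below are hypothetical and purely illustrative, not an existing Spark API.

        // Hypothetical plugin interface for customizing worker launches.
        trait WorkerProcessPlugin {
          // Called before a PySpark or SparkR worker process is launched.
          // Returns the executable to run and the environment to run it with.
          def customize(
              executable: String,
              env: Map[String, String]): (String, Map[String, String])
        }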

      A use case for this is overriding PATH when using spark.archives with, say, conda-pack (as documented here). Some packages rely on binaries; to use those packages in Spark, their binaries need to be on the PATH.

      But we can't set PATH via a static config because 1) the environment and its binaries may be at a dynamic location (archives are unpacked on the driver into a directory with a random name), and 2) we may not want to override the PATH that's pre-configured on the hosts.
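      A plugin could resolve the dynamic unpack location at launch time and prepend to PATH rather than replace it. A minimal sketch, assuming the hypothetical WorkerProcessPlugin trait above and an archive added as spark.archives=...#environment (so it unpacks into a directory named "environment"):

        import java.io.File
        import org.apache.spark.SparkFiles

        // Hypothetical plugin that puts the unpacked conda-pack environment's
        // bin/ directory on the worker's PATH.
        class CondaPathPlugin extends WorkerProcessPlugin {
          override def customize(
              executable: String,
              env: Map[String, String]): (String, Map[String, String]) = {
            // SparkFiles.getRootDirectory() resolves the per-application scratch
            // directory (the randomly named location) where archives are unpacked.
            val condaBin = new File(SparkFiles.getRootDirectory(), "environment/bin")
            // Prepend, so the PATH pre-configured on the host survives.
            val path = condaBin.getAbsolutePath + File.pathSeparator +
              env.getOrElse("PATH", "")
            (executable, env.updated("PATH", path))
          }
        }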

      Other use cases unlocked by this include overriding the executable dynamically (e.g., to select a version) or forking/redirecting the worker's output stream.
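      For instance, still assuming the hypothetical trait above, selecting the interpreter per job could look like this (MY_PYSPARK_PYTHON is an illustrative variable name, not a real Spark setting):

        // Hypothetical plugin that picks the worker executable dynamically,
        // falling back to the configured default.
        class VersionedPythonPlugin extends WorkerProcessPlugin {
          override def customize(
              executable: String,
              env: Map[String, String]): (String, Map[String, String]) = {
            (env.getOrElse("MY_PYSPARK_PYTHON", executable), env)
          }
        }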


People

    • Assignee: Unassigned
    • Reporter: Willi Raschkowski (rshkv)
    • Votes: 0
    • Watchers: 2
