SPARK-42689 (sub-task of SPARK-25299: Use remote storage for persisting shuffle data)

Allow ShuffleDriverComponent to declare if shuffle data is reliably stored


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5.0
    • Fix Version/s: 3.5.0
    • Component/s: Spark Core
    • Labels: None

    Description

      Currently, if an executor node is lost, we assume the shuffle data on that node is also lost. This is not necessarily the case if a shuffle component manages the shuffle data and stores it reliably (for example, in a distributed filesystem or in a disaggregated shuffle cluster).

      Downstream projects carry patches to Apache Spark to work around this issue; Apache Celeborn is one example.
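
      To illustrate the proposal, here is a minimal, self-contained sketch of how a driver-side component could declare reliable storage and how executor-loss handling could consult that declaration. It does not depend on Spark and is not the actual patch: DriverShuffleComponent, OutputTracker, supportsReliableStorage and onExecutorLost are hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

public class ReliableShuffleSketch {

    /** Hypothetical stand-in for a driver-side shuffle component (not Spark's real interface). */
    interface DriverShuffleComponent {
        /** Returns true if shuffle data survives the loss of the executor that wrote it. */
        default boolean supportsReliableStorage() {
            return false; // conservative default: data is assumed to live on the executor's node
        }
    }

    /** Hypothetical tracker that records which executor produced each shuffle map output. */
    static final class OutputTracker {
        private final DriverShuffleComponent component;
        private final Map<Integer, String> mapOutputToExecutor = new HashMap<>();

        OutputTracker(DriverShuffleComponent component) {
            this.component = component;
        }

        void registerOutput(int mapId, String executorId) {
            mapOutputToExecutor.put(mapId, executorId);
        }

        /** Handles an executor loss and returns how many outputs were invalidated. */
        int onExecutorLost(String executorId) {
            if (component.supportsReliableStorage()) {
                return 0; // data is stored elsewhere, so nothing needs to be recomputed
            }
            int before = mapOutputToExecutor.size();
            mapOutputToExecutor.values().removeIf(executorId::equals);
            return before - mapOutputToExecutor.size();
        }

        int availableOutputs() {
            return mapOutputToExecutor.size();
        }
    }

    /** Registers three outputs, loses one executor, and reports how many outputs remain. */
    static int outputsLeftAfterLoss(boolean reliable) {
        DriverShuffleComponent component = reliable
            ? new DriverShuffleComponent() {
                @Override
                public boolean supportsReliableStorage() {
                    return true;
                }
            }
            : new DriverShuffleComponent() { };
        OutputTracker tracker = new OutputTracker(component);
        tracker.registerOutput(0, "exec-1");
        tracker.registerOutput(1, "exec-1");
        tracker.registerOutput(2, "exec-2");
        tracker.onExecutorLost("exec-1");
        return tracker.availableOutputs();
    }

    public static void main(String[] args) {
        System.out.println("local storage: " + outputsLeftAfterLoss(false) + " outputs left");
        System.out.println("reliable storage: " + outputsLeftAfterLoss(true) + " outputs left");
    }
}
```

      The default of false keeps today's behavior for components that say nothing, so only a component that explicitly opts in changes how an executor loss is handled.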

          People

            Assignee: Mridul Muralidharan (mridulm80)
            Reporter: Mridul Muralidharan (mridulm80)