Spark / SPARK-43987

Separate finalizeShuffleMerge Processing to Dedicated Thread Pools

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.2.0, 3.4.0
    • Fix Version/s: 4.0.0
    • Component/s: Shuffle
    • Labels: None

    Description

      In our production environment, finalizeShuffleMerge requests take longer to process (p90 is around 20s) than other RPC requests. This is because finalizeShuffleMerge performs IO operations such as truncate and file open/close.

      More importantly, processing finalizeShuffleMerge can block other critical lightweight messages such as authentication requests, which can cause authentication timeouts as well as fetch failures. These timeouts and fetch failures affect the stability of Spark job executions.
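
      The isolation idea in the title can be illustrated with a minimal sketch. This is not the Spark shuffle service code: the class `DedicatedPoolSketch`, the `Kind` enum, the pool sizes, and the simulated 500 ms IO delay are all hypothetical, and plain `java.util.concurrent` executors stand in for whatever threading the server actually uses. It only shows that once slow finalize work runs on its own pool, a lightweight message is no longer queued behind it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: route slow, IO-heavy requests to a dedicated pool so
// that lightweight requests (e.g. authentication) are never queued behind them.
final class DedicatedPoolSketch {
    // Illustrative message kinds; not Spark's actual message types.
    enum Kind { AUTH, FINALIZE_SHUFFLE_MERGE }

    // One thread for lightweight messages, a separate pool for finalize work.
    // The sizes are arbitrary assumptions for the demo.
    private final ExecutorService lightPool = Executors.newFixedThreadPool(1);
    private final ExecutorService finalizePool = Executors.newFixedThreadPool(2);

    // Dispatch a request to the pool that matches its kind.
    CompletableFuture<String> handle(Kind kind, long simulatedIoMillis) {
        ExecutorService pool =
            (kind == Kind.FINALIZE_SHUFFLE_MERGE) ? finalizePool : lightPool;
        return CompletableFuture.supplyAsync(() -> {
            sleepQuietly(simulatedIoMillis); // stands in for truncate / open / close
            return kind + " done";
        }, pool);
    }

    void shutdown() {
        lightPool.shutdown();
        finalizePool.shutdown();
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns true if an AUTH request submitted after a slow finalize request
    // completes while the finalize request is still running.
    static boolean authCompletesWhileFinalizing() throws Exception {
        DedicatedPoolSketch server = new DedicatedPoolSketch();
        try {
            CompletableFuture<String> slow =
                server.handle(Kind.FINALIZE_SHUFFLE_MERGE, 500);
            server.handle(Kind.AUTH, 0).get(5, TimeUnit.SECONDS);
            boolean finalizeStillRunning = !slow.isDone();
            slow.get(5, TimeUnit.SECONDS); // let the slow request finish cleanly
            return finalizeStillRunning;
        } finally {
            server.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("auth completed while finalize was in flight: "
            + authCompletesWhileFinalizing());
    }
}
```

      If both kinds of request shared the single-threaded pool instead, the AUTH request in this sketch would wait for the full simulated IO delay, which mirrors the authentication timeouts described above.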

          People

            Assignee: SHU WANG (shuwang)
            Reporter: SHU WANG (shuwang)