Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-38309

SHS has incorrect percentiles for shuffle read bytes and shuffle total blocks metrics

    XMLWordPrintableJSON

Details

    Description

      Background

      In this PR (SPARK-26260) the SHS stage metric percentiles were updated to only include successful tasks when using disk storage. It did this by making the values for each metric negative when the task is not in a successful state. This approach was chosen to avoid breaking changes to disk storage. See this comment for context.

      To get the percentiles, it reads the metric values, starting at 0, in ascending order. This filters out all tasks that are not successful because the values are less than 0. To get the percentile values it scales the percentiles to the list index of successful tasks. For example if there are 200 tasks and you want percentiles [0, 25, 50, 75, 100] the lookup indexes in the task collection are [0, 50, 100, 150, 199].

      Issue
      For metrics 1) shuffle total reads and 2) shuffle total blocks, the above PR incorrectly makes the metric indices positive. This means tasks that are not successful are included in the percentile calculations. The percentile lookup index calculation is still based on the number of successful task so the wrong task metric is returned for a given percentile. This was not caught because the unit test only verified values for one metric, executorRunTime.

      Steps to Reproduce
      SHS UI

      1. Find a spark application in the SHS that has failed tasks for a stage with shuffle read.
      2. Navigate to the stage UI.
      3. Look at the max shuffle read size in the summary metrics
      4. Sort the tasks by shuffle read size descending. You'll see it doesn't match step 3.

       

      API

      1. For the same stage in the above repro steps, make a request to the task summary endpoint (e.g. /api/v1/applications/application_1632281309592_21294517/1/stages/6/0/taskSummary?quantiles=0,0.25,0.5,0.75,1.0)
      2. Look at the shuffleReadMetrics.readBytes and shuffleReadMetrics.totalBlocksFetched. You will see -2 for at least some of the lower percentiles and the positive values will also be wrong.

      Attachments

        Activity

          People

            robreeves Rob Reeves
            robreeves Rob Reeves
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: