SPARK-47917: Accounting the impact of failures in Spark jobs


Details

    • Type: Question
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.5.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      Hello,

      In my organization, we have an accounting system for Spark jobs that uses task execution time to determine how much time a job occupies the executors, and we use that to apportion cost across jobs. We sum all task times per job and apply proportions. Our clusters follow a one-task-per-core model, and this works well.
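      For context, the per-job bookkeeping described above can be done with a SparkListener roughly along the lines of the sketch below. The class and field names are made up; the callbacks and fields it relies on (SparkListenerJobStart.stageIds, SparkListenerTaskEnd.taskInfo, TaskInfo.duration) are part of the public listener API.

      {code:scala}
      import scala.collection.mutable
      import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerTaskEnd}

      // Sums task wall-clock time per job by remembering which job submitted each stage.
      class JobTaskTimeListener extends SparkListener {
        private val stageToJob = mutable.Map.empty[Int, Int]   // stageId -> jobId
        val taskTimePerJob = mutable.Map.empty[Int, Long]      // jobId -> summed task time (ms)

        override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
          // A job announces up front the ids of the stages it will run.
          jobStart.stageIds.foreach(id => stageToJob(id) = jobStart.jobId)
        }

        override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
          // duration = finishTime - launchTime of this finished task attempt.
          val ms = taskEnd.taskInfo.duration
          stageToJob.get(taskEnd.stageId).foreach { jobId =>
            taskTimePerJob(jobId) = taskTimePerJob.getOrElse(jobId, 0L) + ms
          }
        }
      }

      // Registered on the driver, e.g.:
      // spark.sparkContext.addSparkListener(new JobTaskTimeListener())
      {code}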
       
      A job can go through several failures during its run, due to executor failures or node failures (spot interruptions), and Spark retries tasks and sometimes entire stages.
       
      We now want to account for these failures and determine what percentage of a job's total task time is due to retries. Basically, if a job with failures and retries has a total task time of X, there is an X' representing the goodput of that job, i.e. a hypothetical run of the job with zero failures and retries. Then (X - X') / X quantifies the cost of failures.
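      For example, with made-up numbers: if the sum of all task attempt durations is X = 100 executor-hours and the attempts whose results were actually used account for X' = 80 executor-hours, then (100 - 80) / 100 = 20% of the job's task time is attributable to failures and retries.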
       
      This form of accounting requires tracking the execution history of each task, i.e. identifying task attempts that compute the same logical partition of some RDD. This was quite easy with AQE disabled, as stage ids never changed, but with AQE enabled that is no longer the case.
       
      Do you have any suggestions on how I can achieve this using the Spark event system?
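      For reference, below is a minimal sketch of what I can do today, keying attempts on (stageId, partition index) taken from SparkListenerTaskEnd (class and field names are made up). It treats one successful attempt per partition as goodput and everything else as retry cost; the open question is how to key this when AQE re-plans stages under new stage ids.

      {code:scala}
      import scala.collection.mutable
      import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

      // Attributes task time per logical partition, keyed by (stageId, partition index).
      // Stage re-attempts and task re-attempts of the same key count as retry cost;
      // one successful attempt per key counts as goodput.
      class RetryCostListener extends SparkListener {
        private case class Key(stageId: Int, partition: Int)
        private val goodputMs = mutable.Map.empty[Key, Long]
        private var totalMs: Long = 0L

        override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
          val info = taskEnd.taskInfo
          val key = Key(taskEnd.stageId, info.index)
          totalMs += info.duration
          // Count only the first successful attempt of each partition as "useful" time.
          if (info.successful && !goodputMs.contains(key)) {
            goodputMs(key) = info.duration
          }
        }

        // (X - X') / X from above: the share of task time spent on failures and retries.
        def retryCostFraction: Double =
          if (totalMs == 0L) 0.0 else (totalMs - goodputMs.values.sum).toDouble / totalMs
      }
      {code}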


          People

            Assignee: Unassigned
            Reporter: Faiz Halde (haldefaiz)
