SPARK-47917: Accounting the impact of failures in Spark jobs


Details

    • Type: Question
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.5.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      Hello,

      In my organization, we have an accounting system for Spark jobs that uses task execution time to determine how much time a job occupies the executors, and we use that to apportion cost across jobs. We sum all task times per job and apply proportions. Our clusters follow a one-task-per-core model, and this works well.
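      For context, the per-job bookkeeping described above can be done with a SparkListener roughly along the lines of the sketch below. The class and field names are made up; the callbacks and fields it relies on (SparkListenerJobStart.stageIds, SparkListenerTaskEnd.taskInfo, TaskInfo.duration) are part of the public listener API.

      {code:scala}
      import scala.collection.mutable
      import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerTaskEnd}

      // Sums task wall-clock time per job by remembering which job submitted each stage.
      class JobTaskTimeListener extends SparkListener {
        private val stageToJob = mutable.Map.empty[Int, Int]   // stageId -> jobId
        val taskTimePerJob = mutable.Map.empty[Int, Long]      // jobId -> summed task time (ms)

        override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
          // A job announces up front the ids of the stages it will run.
          jobStart.stageIds.foreach(id => stageToJob(id) = jobStart.jobId)
        }

        override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
          // duration = finishTime - launchTime of this finished task attempt.
          val ms = taskEnd.taskInfo.duration
          stageToJob.get(taskEnd.stageId).foreach { jobId =>
            taskTimePerJob(jobId) = taskTimePerJob.getOrElse(jobId, 0L) + ms
          }
        }
      }

      // Registered on the driver, e.g.:
      // spark.sparkContext.addSparkListener(new JobTaskTimeListener())
      {code}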
       
      A job can go through several failures during its run, due to executor failures or node failures (spot interruptions), and Spark retries tasks and sometimes entire stages.
       
      We now want to account for these failures and determine what percentage of a job's total task time is due to retries. Basically, if a job with failures and retries has a total task time of X, there is an X' representing the goodput of that job, i.e. a hypothetical run of the job with zero failures and retries. Then (X - X') / X quantifies the cost of failures.
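      For example, with made-up numbers: if the sum of all task attempt durations is X = 100 executor-hours and the attempts whose results were actually used account for X' = 80 executor-hours, then (100 - 80) / 100 = 20% of the job's task time is attributable to failures and retries.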
       
      This form of accounting requires tracking the execution history of each task, i.e. identifying task attempts that compute the same logical partition of some RDD. This was quite easy with AQE disabled, as stage ids never changed, but with AQE enabled that is no longer the case.
       
      Do you have any suggestions on how I can achieve this using the Spark event system?
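      For reference, below is a minimal sketch of what I can do today, keying attempts on (stageId, partition index) taken from SparkListenerTaskEnd (class and field names are made up). It treats one successful attempt per partition as goodput and everything else as retry cost; the open question is how to key this when AQE re-plans stages under new stage ids.

      {code:scala}
      import scala.collection.mutable
      import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

      // Attributes task time per logical partition, keyed by (stageId, partition index).
      // Stage re-attempts and task re-attempts of the same key count as retry cost;
      // one successful attempt per key counts as goodput.
      class RetryCostListener extends SparkListener {
        private case class Key(stageId: Int, partition: Int)
        private val goodputMs = mutable.Map.empty[Key, Long]
        private var totalMs: Long = 0L

        override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
          val info = taskEnd.taskInfo
          val key = Key(taskEnd.stageId, info.index)
          totalMs += info.duration
          // Count only the first successful attempt of each partition as "useful" time.
          if (info.successful && !goodputMs.contains(key)) {
            goodputMs(key) = info.duration
          }
        }

        // (X - X') / X from above: the share of task time spent on failures and retries.
        def retryCostFraction: Double =
          if (totalMs == 0L) 0.0 else (totalMs - goodputMs.values.sum).toDouble / totalMs
      }
      {code}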


          People

            Assignee: Unassigned
            Reporter: Faiz Halde (haldefaiz)
