  1. Spark
  2. SPARK-28483

Barrier tasks do not exit when a Spark job running in barrier mode is canceled

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.3
    • Fix Version/s: 3.0.0
    • Component/s: Spark Core
    • Labels:
      None

      Description

      Code to reproduce:

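      import time
      from pyspark import BarrierTaskContext

      n = 4

      def task(x):
          context = BarrierTaskContext.get()
          this = next(x)
          # Tasks holding an even element sleep and never reach the barrier.
          if this % 2 == 0:
              time.sleep(10000)
          # The remaining tasks block here, waiting for the sleeping ones.
          context.barrier()
          return []

      # `sc` is the SparkContext provided by the pyspark shell.
      sc.setLogLevel("INFO")
      sc.parallelize(list(range(n)), n).barrier().mapPartitions(task).collect()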

      Run the above code in the pyspark shell, then press Ctrl + C to cancel the job.

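      The resulting log shows that tasks 1 and 3 keep waiting in the global sync after the executor reports them as killed: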

      19/02/05 01:07:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) has entered the global sync, current barrier epoch is 0.
      19/02/05 01:07:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) has entered the global sync, current barrier epoch is 0.
      19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled
      19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
      19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
      19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled
      19/02/05 01:07:50 WARN PythonRunner: Incomplete task 0.0 in stage 0 (TID 0) interrupted: Attempting to kill Python Worker
      19/02/05 01:07:50 INFO Executor: Executor killed task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
      19/02/05 01:07:50 WARN PythonRunner: Incomplete task 3.3 in stage 0 (TID 3) interrupted: Attempting to kill Python Worker
      19/02/05 01:07:50 INFO Executor: Executor killed task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled
      19/02/05 01:07:50 WARN PythonRunner: Incomplete task 2.2 in stage 0 (TID 2) interrupted: Attempting to kill Python Worker
      19/02/05 01:07:50 INFO Executor: Executor killed task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled
      19/02/05 01:07:50 WARN PythonRunner: Incomplete task 1.1 in stage 0 (TID 1) interrupted: Attempting to kill Python Worker
      19/02/05 01:07:50 INFO Executor: Executor killed task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
      19/02/05 01:08:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 60 seconds, current barrier epoch is 0.
      19/02/05 01:08:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 60 seconds, current barrier epoch is 0.
      19/02/05 01:09:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 120 seconds, current barrier epoch is 0.
      19/02/05 01:09:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 120 seconds, current barrier epoch is 0.
      19/02/05 01:10:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 180 seconds, current barrier epoch is 0.
      19/02/05 01:10:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 180 seconds, current barrier epoch is 0.
      19/02/05 01:11:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 240 seconds, current barrier epoch is 0.
      19/02/05 01:11:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 240 seconds, current barrier epoch is 0.
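
The log above suggests that the wait inside the barrier call does not notice the kill request. As an illustration only, here is a minimal sketch in plain Python (standard library threads, not Spark code and not the actual fix) of a wait loop that polls a cancellation flag and so exits promptly when the job is canceled. The names `wait_at_barrier`, `released`, `killed`, and `run_demo` are hypothetical.

```python
import threading
import time


def wait_at_barrier(released, killed, poll_interval=0.05, timeout=2.0):
    """Wait until the barrier is released, the task is killed, or time runs out.

    Returns "released", "killed", or "timeout". All names are hypothetical.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Check the kill flag on every iteration so that cancellation is
        # noticed within roughly one poll interval.
        if killed.is_set():
            return "killed"
        # Event.wait returns True as soon as the barrier is released.
        if released.wait(poll_interval):
            return "released"
    return "timeout"


def run_demo():
    released = threading.Event()  # never set: a peer task is stuck sleeping
    killed = threading.Event()    # stands in for the "Stage cancelled" request
    results = []

    worker = threading.Thread(
        target=lambda: results.append(wait_at_barrier(released, killed))
    )
    worker.start()
    time.sleep(0.2)  # let the worker enter the wait
    killed.set()     # cancel the "job"
    worker.join(timeout=5)
    return results[0], worker.is_alive()
```

In this sketch the waiting thread leaves the loop shortly after `killed` is set, whereas a wait that only checks `released` would keep blocking, which matches the repeated "has been waiting for ... seconds" messages in the report.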
      

            People

            • Assignee: Weichen Xu (WeichenXu123)
            • Reporter: Weichen Xu (WeichenXu123)
            • Votes: 0
            • Watchers: 3
