SPARK-46701: Spark Cluster Crashing

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.4.0
    • Fix Version/s: None
    • Component/s: Kubernetes, Spark Core
    • Labels: None
    • Environment: Apache Spark 3.4.0
    • Flags: Important

    Description

      I am getting the following errors for the Spark executors (as logged by the driver's scheduler threads):

      2024-01-12 03:26:17.887 WARN     [task-result-getter-2]:org.apache.spark.internal.Logging - Lost task 65.0 in stage 79.0 (TID 10250) (10.1.208.60 executor 119): TaskKilled (Stage cancelled: Job aborted due to stage failure: Exception while getting task result: java.io.OptionalDataException)
      2024-01-12 03:26:17.891 WARN     [task-result-getter-3]:org.apache.spark.internal.Logging - Lost task 69.0 in stage 79.0 (TID 10263) (10.1.99.211 executor 72): TaskKilled (Stage cancelled: Job aborted due to stage failure: Exception while getting task result: java.io.OptionalDataException)
      2024-01-12 03:26:17.893 WARN     [task-result-getter-0]:org.apache.spark.internal.Logging - Lost task 115.0 in stage 79.0 (TID 10202) (10.1.236.96 executor 27): TaskKilled (Stage cancelled: Job aborted due to stage failure: Exception while getting task result: java.io.OptionalDataException)
      2024-01-12 03:26:17.895 WARN     [task-result-getter-1]:org.apache.spark.internal.Logging - Lost task 4.0 in stage 79.0 (TID 10231) (10.1.165.84 executor 80): TaskKilled (Stage cancelled: Job aborted due to stage failure: Exception while getting task result: java.io.OptionalDataException)
      2024-01-12 03:26:17.897 WARN     [task-result-getter-2]:org.apache.spark.internal.Logging - Lost task 75.0 in stage 79.0 (TID 10228) (10.1.6.211 executor 18): TaskKilled (Stage cancelled: Job aborted due to stage failure: Exception while getting task result: java.io.OptionalDataException)
      2024-01-12 03:26:17.902 WARN     [task-result-getter-3]:org.apache.spark.internal.Logging - Lost task 102.0 in stage 79.0 (TID 10285) (10.1.160.108 executor 53): TaskKilled (Stage cancelled: Job aborted due to stage failure: Exception while getting task result: java.io.OptionalDataException)
      2024-01-12 03:27:13.092 ERROR    [dispatcher-CoarseGrainedScheduler]:org.apache.spark.internal.Logging - Lost executor 117 on 10.1.197.197: 
      The executor with id 117 exited with exit code 50(Uncaught exception).
       
      The API gave the following container statuses:
       
      container name: spark-kubernetes-executor
      container image: ngxp-registry.service.lab.ngxp.cci.att.com:5000/nova/midlayer/midlayer-streaming-core:1.7.1
      container state: terminated
      container started at: 2024-01-12T03:03:46Z
      container finished at: 2024-01-12T03:27:12Z
      exit code: 50
      termination reason: Error
            
      2024-01-12 03:27:13.095 WARN     [dispatcher-CoarseGrainedScheduler]:org.apache.spark.internal.Logging - Lost task 79.0 in stage 79.0 (TID 10305) (10.1.197.197 executor 117): ExecutorLostFailure (executor 117 exited caused by one of the running tasks) Reason: 
      The executor with id 117 exited with exit code 50(Uncaught exception).
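The first six log lines report java.io.OptionalDataException while the driver fetches a task result. In the JDK, ObjectInputStream.readObject() throws this exception when the next item in the stream is primitive data (or the end of a custom object's optional data) rather than an object, which usually means the writer and the reader disagree about the serialized layout. The minimal Java sketch below only illustrates what the exception means by writing an int and reading it back as an object. The class OptionalDataExceptionDemo and its provoke() method are hypothetical and not part of Spark; this is not a reproduction of this issue and does not identify its root cause.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OptionalDataException;

// Hypothetical demo class: illustrates the JDK exception only, not Spark's task-result path.
public class OptionalDataExceptionDemo {

    // Writes a primitive int, then tries to read it back as an object.
    // Returns the OptionalDataException thrown by readObject(), or null if none is thrown.
    static OptionalDataException provoke() throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeInt(42); // primitive data, written as a 4-byte data block
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            in.readObject(); // expects an object but finds primitive block data
            return null;
        } catch (OptionalDataException e) {
            return e;
        }
    }

    public static void main(String[] args) throws Exception {
        OptionalDataException e = provoke();
        // eof == false: the reader hit primitive data, not the end of an object's data.
        // length == 4: number of unread primitive bytes (one int).
        System.out.println("eof=" + e.eof + ", length=" + e.length);
    }
}
```

Running main prints eof=false, length=4: the stream held four bytes of primitive data where an object was expected.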

          People

            Assignee: Unassigned
            Hussein Ballout (husbal1)
            Mohamad Haidar
            Votes: 0
            Watchers: 1

              Time Tracking

                Original Estimate: 72h
                Remaining Estimate: 72h
                Time Spent: Not Specified