[SPARK-46310] Cannot deploy Spark application using VolcanoFeatureStep to specify podGroupTemplate file


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.4.1
    • Fix Version/s: None
    • Component/s: Kubernetes
    • Labels: None

    Description

      I'm trying to deploy a Spark application (version 3.4.1) on Kubernetes using Volcano as the scheduler. I define a VolcanoJob that represents the Spark driver: it has a single task whose pod specification includes the driver container, which invokes the spark-submit command.
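
      For illustration, a trimmed-down VolcanoJob of this shape looks roughly like the following (names, image, and service account are placeholders, not my exact manifest):

      apiVersion: batch.volcano.sh/v1alpha1
      kind: Job
      metadata:
        name: volcano-spark-driver          # placeholder name
      spec:
        schedulerName: volcano
        minAvailable: 1
        queue: some-existing-queue
        tasks:
          - name: driver
            replicas: 1
            template:
              spec:
                serviceAccountName: spark   # placeholder service account
                restartPolicy: Never
                containers:
                  - name: driver
                    image: apache/spark:3.4.1   # placeholder image
                    command: ["/bin/bash", "-c", "/opt/spark/bin/spark-submit ..."]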

      Following the official Spark documentation (the page "Using Volcano as Customized Scheduler for Spark on Kubernetes"), I set the configuration parameters needed to use Volcano as the scheduler for my Spark workload:

      /opt/spark/bin/spark-submit --name "volcano-spark-1" --deploy-mode="client" \
      --class "org.apache.spark.examples.SparkPi" \
      --conf spark.executor.instances="1" \
      --conf spark.kubernetes.driver.pod.featureSteps="org.apache.spark.deploy.k8s.features.VolcanoFeatureStep" \
      --conf spark.kubernetes.executor.pod.featureSteps="org.apache.spark.deploy.k8s.features.VolcanoFeatureStep" \
      --conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile="/var/template/podgroup.yaml" \
      file:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar
      

      In the block above, I omitted some Kubernetes configuration parameters that aren't important for this example. The parameter spark.kubernetes.scheduler.volcano.podGroupTemplateFile points to a file mounted in the driver container (see the mount sketch after the template below), whose content looks like the following example (cpu/memory values may vary):

      apiVersion: scheduling.volcano.sh/v1beta1
      kind: PodGroup
      metadata: 
        name: pod-group-test
      spec: 
        minResources: 
          cpu: "2"
          memory: "2Gi"
        queue: some-existing-queue
      
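      For reference, the template file reaches the driver container through a volume mount. A minimal sketch of that part of the task's pod spec, assuming the template is shipped in a ConfigMap (all names are placeholders):

      # Hypothetical ConfigMap-based mount of the PodGroup template
      volumes:
        - name: podgroup-template
          configMap:
            name: podgroup-template-cm      # placeholder ConfigMap holding podgroup.yaml
      containers:
        - name: driver
          volumeMounts:
            - name: podgroup-template
              mountPath: /var/template      # yields /var/template/podgroup.yaml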

      I manually verified that the file "/var/template/podgroup.yaml" exists in the container before the spark-submit command is issued. I also granted all the RBAC permissions needed for the driver pod to interact with the relevant Kubernetes objects (pods, VolcanoJobs, PodGroups, queues, etc.).
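
      The Volcano-related part of those permissions is along these lines (a sketch; the role name is a placeholder, the real setup also covers core resources such as pods and configmaps, and Queues, being cluster-scoped, need a ClusterRole instead):

      apiVersion: rbac.authorization.k8s.io/v1
      kind: Role
      metadata:
        name: spark-volcano-role            # placeholder name
      rules:
        # Access to Volcano PodGroups in the application's namespace
        - apiGroups: ["scheduling.volcano.sh"]
          resources: ["podgroups"]
          verbs: ["get", "list", "watch", "create", "update", "delete"]
        # Read access to VolcanoJobs
        - apiGroups: ["batch.volcano.sh"]
          resources: ["jobs"]
          verbs: ["get", "list", "watch"]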

      When I execute this VolcanoJob, I see only the driver pod being created, and when inspecting its logs, I see the following error:

      io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://api.<masked-environment-endpoint>/api/v1/namespaces/04522055-15b3-40d8-ba07-22b1a2a5ffcc/pods. Message: admission webhook "validatepod.volcano.sh" denied the request: failed to get PodGroup for pod <04522055-15b3-40d8-ba07-22b1a2a5ffcc/volcano-spark-1-driver-0-exec-789>: podgroups.scheduling.volcano.sh "spark-5ad570e340934d3997065fa6d504910e-podgroup" not found. Received status: Status(apiVersion=v1, code=400, details=null, kind=Status, message=admission webhook "validatepod.volcano.sh" denied the request: failed to get PodGroup for pod <04522055-15b3-40d8-ba07-22b1a2a5ffcc/volcano-spark-1-driver-0-exec-789>: podgroups.scheduling.volcano.sh "spark-5ad570e340934d3997065fa6d504910e-podgroup" not found, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, status=Failure, additionalProperties={}).
      	at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
      	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:538)
      	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:558)
      	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:349)
      	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:711)
      	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:93)
      	at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)
      	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:1113)
      	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:93)
      	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:440)
      	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
      	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:417)
      	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:370)
      	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:363)
      	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
      	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
      	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
      	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:363)
      	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:134)
      	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3$adapted(ExecutorPodsAllocator.scala:134)
      	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber$$processSnapshotsInternal(ExecutorPodsSnapshotsStoreImpl.scala:143)
      	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.processSnapshots(ExecutorPodsSnapshotsStoreImpl.scala:131)
      	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$addSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:85)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:182)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:296)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:838)
      

      The error is triggered when the driver attempts to create the executors of my Spark application. The message, produced by the Volcano admission webhook, says that the PodGroup "spark-5ad570e340934d3997065fa6d504910e-podgroup" cannot be found.
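
      As far as I understand, the VolcanoFeatureStep annotates each pod with the name of the PodGroup it is expected to belong to, and that annotation is what the webhook validates. The executor pods should therefore carry something like this (the group name is taken from the error message; the annotation key is my assumption):

      # Pod metadata presumably added by VolcanoFeatureStep
      metadata:
        annotations:
          scheduling.k8s.io/group-name: spark-5ad570e340934d3997065fa6d504910e-podgroup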

      I was expecting the driver and executors to be assigned to the same PodGroup object, created by the VolcanoFeatureStep from the template file I provided through the configuration parameter "spark.kubernetes.scheduler.volcano.podGroupTemplateFile". That would give me proper batch scheduling of my Spark application: driver and executor pods would reside in the same pod group and be scheduled together by Volcano. Instead, only the driver pod is deployed, and the error above appears in its logs.

      The documentation "Using Volcano as Customized Scheduler for Spark on Kubernetes" leads me to understand that, by providing the PodGroup template file, my Spark application (i.e., driver and executors) would be allocated to a single PodGroup object following the specification I provided. That doesn't seem to be the case: the PodGroup is apparently not created from the provided template, and the executors cannot be created either.
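
      In other words, I expected a PodGroup like the one below to exist, combining my template with the name Spark generates (a sketch based on my reading of the docs, not an object I observed in the cluster):

      # The PodGroup I expected VolcanoFeatureStep to create from my template
      apiVersion: scheduling.volcano.sh/v1beta1
      kind: PodGroup
      metadata:
        name: spark-5ad570e340934d3997065fa6d504910e-podgroup   # name generated by Spark
      spec:
        minResources:
          cpu: "2"
          memory: "2Gi"
        queue: some-existing-queue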

      Some more details about the environment I used:

      • Volcano Version: v1.8.0
      • Spark Version: 3.4.1
      • Kubernetes version: v1.26.7
      • Cloud provider: GCP

    People

    • Assignee: Unassigned
    • Reporter: Lucca Sergi (lsbx96_)