Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-31974

JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler

    XMLWordPrintableJSON

Details

    Description

      When resource quota limit is reached JobManager will throw

       

      org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.

       

      In 1.16.1 , this is handled gracefully:

      2023-04-28 22:07:24,631 WARN  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Failed requesting worker with resource spec WorkerResourceSpec \{cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes), numSlots=4}, current pending count: 0
      java.util.concurrent.CompletionException: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
              at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
              at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
              at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source) ~[?:?]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
              at java.lang.Thread.run(Unknown Source) ~[?:?]
      Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
              at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684) ~[flink-dist-1.16.1.jar:1.16.1]
              at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664) ~[flink-dist-1.16.1.jar:1.16.1]
              at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613) ~[flink-dist-1.16.1.jar:1.16.1]
              at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558) ~[flink-dist-1.16.1.jar:1.16.1]
              at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) ~[flink-dist-1.16.1.jar:1.16.1]
              at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308) ~[flink-dist-1.16.1.jar:1.16.1]
              at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644) ~[flink-dist-1.16.1.jar:1.16.1]
              at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83) ~[flink-dist-1.16.1.jar:1.16.1]
              at io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61) ~[flink-dist-1.16.1.jar:1.16.1]
              at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163) ~[flink-dist-1.16.1.jar:1.16.1]
              ... 4 more
      

      But , in Flink 1.17.0 , Job Manager crashes:

      2023-04-28 20:50:50,534 ERROR org.apache.flink.util.FatalExitExceptionHandler              [] - FATAL: Thread 'flink-akka.actor.default-dispatcher-15' produced an uncaught exception. Stopping the process...
      java.util.concurrent.CompletionException: org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
              at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
              at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
              at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source) ~[?:?]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
              at java.lang.Thread.run(Unknown Source) ~[?:?]
      Caused by: org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
              at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684) ~[flink-dist-1.17.0.jar:1.17.0]
              at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664) ~[flink-dist-1.17.0.jar:1.17.0]
              at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613) ~[flink-dist-1.17.0.jar:1.17.0]
              at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558) ~[flink-dist-1.17.0.jar:1.17.0]
              at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) ~[flink-dist-1.17.0.jar:1.17.0]
              at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308) ~[flink-dist-1.17.0.jar:1.17.0]
              at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644) ~[flink-dist-1.17.0.jar:1.17.0]
              at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83) ~[flink-dist-1.17.0.jar:1.17.0]
              at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61) ~[flink-dist-1.17.0.jar:1.17.0]
              at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163) ~[flink-dist-1.17.0.jar:1.17.0]
              ... 4 more
      

      Attachments

        Issue Links

          Activity

            People

              gyfora Gyula Fora
              sergiosp Sergio Sainz
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: