Flink / FLINK-31652

Flink should handle the delete event if the pod was deleted while pending


Details

    Description

      I found that in a Kubernetes deployment, if a TaskManager pod is deleted while it is in the 'Pending' phase, the Flink job gets stuck and keeps waiting for the pod to be scheduled. The issue can be reproduced by running 'kubectl delete pod' against the pod while it is still pending.
       
      The root cause is that the pod status is not updated in time in this case, so KubernetesResourceManagerDriver does not detect that the pod has terminated. I verified this by logging the pod status in KubernetesPod#isTerminated(); the method and the logged status are shown below.

      public boolean isTerminated() {
          // Debug logging added for this investigation only.
          log.info("pod status: " + getInternalResource().getStatus());
          if (getInternalResource().getStatus() != null) {
              // True only when the pod phase is "Failed".
              final boolean podFailed =
                      PodPhase.Failed.name().equals(getInternalResource().getStatus().getPhase());
              // True when any container reports a terminated state.
              final boolean containersFailed =
                      getInternalResource().getStatus().getContainerStatuses().stream()
                              .anyMatch(
                                      e ->
                                              e.getState() != null
                                                      && e.getState().getTerminated() != null);
              return containersFailed || podFailed;
          }
          return false;
      }

      In this case the function returns false because both `containersFailed` and `podFailed` are false: the logged status below still reports phase=Pending, and containerStatuses is empty.

      PodStatus(conditions=[PodCondition(lastProbeTime=null, lastTransitionTime=2023-03-28T12:35:10Z, reason=Unschedulable, status=False, type=PodScheduled, additionalProperties={})], containerStatuses=[], ephemeralContainerStatuses=[], hostIP=null, initContainerStatuses=[], message=null, nominatedNodeName=null, phase=Pending, podIP=null, podIPs=[], qosClass=Guaranteed, reason=null, startTime=null, additionalProperties={})
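The behaviour can be reproduced in isolation with a minimal sketch. The classes below are simplified, hypothetical stand-ins for the fabric8 pod model (they are not the real Flink or fabric8 types), and `shouldTreatAsTerminated` is only an assumed illustration of what "handling the delete event" could mean, not the actual fix.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in model; not the real Flink or fabric8 classes. */
public class PendingPodSketch {

    /** Simplified container status: only records whether a terminated state exists. */
    static final class ContainerStatus {
        final boolean terminated;

        ContainerStatus(boolean terminated) {
            this.terminated = terminated;
        }
    }

    /** Simplified pod status carrying just the two fields the check reads. */
    static final class PodStatus {
        final String phase;
        final List<ContainerStatus> containerStatuses;

        PodStatus(String phase, List<ContainerStatus> containerStatuses) {
            this.phase = phase;
            this.containerStatuses = containerStatuses;
        }
    }

    /** Mirrors the logic of the isTerminated() method quoted in the description. */
    static boolean isTerminated(PodStatus status) {
        if (status != null) {
            final boolean podFailed = "Failed".equals(status.phase);
            final boolean containersFailed =
                    status.containerStatuses.stream().anyMatch(c -> c.terminated);
            return containersFailed || podFailed;
        }
        return false;
    }

    /**
     * Assumption for illustration: a received delete event counts as termination
     * even when the last observed status still says "Pending".
     */
    static boolean shouldTreatAsTerminated(PodStatus status, boolean deleteEventReceived) {
        return deleteEventReceived || isTerminated(status);
    }

    public static void main(String[] args) {
        // Same shape as the logged status: phase=Pending, containerStatuses=[].
        PodStatus pending = new PodStatus("Pending", Collections.emptyList());

        System.out.println(isTerminated(pending));                  // false
        System.out.println(shouldTreatAsTerminated(pending, true)); // true
    }
}
```

With the status-only check, a pod deleted while pending is never reported as terminated. A check that also takes the delete event into account does not depend on the status being updated first.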

       
       

            People

              xiasun xingbe