Flink / FLINK-33728

Do not rewatch when the KubernetesResourceManagerDriver watch fails


Details

    Description

      I hit a massive production problem when Kubernetes etcd became slow to respond. After Kubernetes recovered an hour later, thousands of Flink jobs using KubernetesResourceManagerDriver re-established their watches on receiving ResourceVersionTooOld, which put great pressure on the API server and made it fail again.

       

      I am not sure whether it is necessary to call

      getResourceEventHandler().onError(throwable)

      in the PodCallbackHandlerImpl#handleError method.

       

      We could simply ignore the disconnection of the watch and re-establish it only when requestResource is next called. We could then rely on the Akka heartbeat timeout to detect TaskManager (TM) failures, just as YARN mode does.
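
      A minimal sketch of the proposed behaviour, under stated assumptions: LazyRewatchSketch, Watch, watchFactory and isWatching are hypothetical names used only for illustration and are not Flink's actual classes or the Kubernetes client API. Only handleError and requestResource echo names mentioned above. The error callback drops the broken watch without reopening it, and the next resource request is the only place where a new watch is opened.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of lazy rewatching; not the real KubernetesResourceManagerDriver. */
final class LazyRewatchSketch {

    /** Stand-in for a pod watch handle (assumption, not a real client type). */
    interface Watch extends AutoCloseable {
        @Override
        void close();
    }

    // Opens a new watch against the API server each time it is invoked.
    private final Supplier<Watch> watchFactory;

    // Null while no watch is active.
    private Watch watch;

    LazyRewatchSketch(Supplier<Watch> watchFactory) {
        this.watchFactory = watchFactory;
        this.watch = watchFactory.get(); // initial watch at startup
    }

    /** Invoked when the watch breaks, for example on ResourceVersionTooOld. */
    synchronized void handleError(Throwable cause) {
        // Deliberately neither rewatch nor escalate here, so that many jobs
        // do not hit the API server at the same moment after an outage.
        if (watch != null) {
            watch.close();
            watch = null;
        }
    }

    /** Invoked when new resources are requested; reopens a dropped watch first. */
    synchronized void requestResource() {
        if (watch == null) {
            watch = watchFactory.get();
        }
        // Pod creation would follow here in a real driver.
    }

    synchronized boolean isWatching() {
        return watch != null;
    }
}
```

      With this shape, repeated watch errors cost no API server calls. The trade-off is that pod events are missed while no watch is active, so TM failures would have to be detected through the heartbeat timeout instead.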


            People

              zhoujira86 xiaogang zhou