FLINK-33998

Flink Job Manager restarted after intermittent kube-apiserver connection loss

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.13.6
    • None
    • None
    • Environment: Kubernetes 1.24, Flink Operator 1.4, Flink 1.13.6

    Description

      We are running Flink on AWS EKS and experienced Job Manager restarts when the EKS control plane scaled up or in.

      I can reproduce this issue in my local environment too.

      Since I have no control over the EKS kube-apiserver, I built my own Kubernetes cluster with the following setup:

      • Two kube-apiserver instances, with only one running at a time;
      • Deploy multiple Flink clusters (with Flink Operator 1.4 and Flink 1.13);
      • Enable Flink Job Manager HA;
      • Configure the Job Manager leader election timeouts:
      high-availability.kubernetes.leader-election.lease-duration: "60s"
      high-availability.kubernetes.leader-election.renew-deadline: "60s"

      For each test, I switch the running kube-apiserver from one instance to the other. During the switchover, some Job Managers restart while others keep running normally.

      Here is an example. When the kube-apiserver switched over at 05:53:08, both JMs lost their connection to the kube-apiserver. Within a few seconds there were no further connection errors, so I assume the connection recovered through retries.

      However, one of the JMs (the second one in the attached screenshot) reported a "DefaultDispatcherRunner was revoked the leadership" error once the leader election timeout elapsed (at 05:54:08) and then restarted itself, while the other JM kept running normally.
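      The timing above can be checked with a little arithmetic. This is only a sketch under two assumptions of mine: that the failed JM's last successful renewal happened at about the switchover time, and that leadership is given up once renew-deadline passes without a successful renewal. The function name and the placeholder date are illustrative, not taken from Flink.

```python
from datetime import datetime, timedelta

# Value of high-availability.kubernetes.leader-election.renew-deadline ("60s").
RENEW_DEADLINE = timedelta(seconds=60)


def revocation_time(last_successful_renew: datetime, renew_deadline: timedelta) -> datetime:
    """Return the moment leadership would be given up if no renewal succeeds."""
    return last_successful_renew + renew_deadline


# Assumption: the last successful renewal roughly coincides with the switchover.
# The date is an arbitrary placeholder; only the time of day comes from the logs.
switchover = datetime(2000, 1, 1, 5, 53, 8)

print(revocation_time(switchover, RENEW_DEADLINE).time())  # 05:54:08
```

      The result matches the time at which the leadership revocation was logged, which is consistent with the failed JM not completing any renewal after the switchover.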

      According to the kube-apiserver audit logs, the normal JM was able to renew its leader lease after the interruption, but the failed JM sent no lease renewal request until it restarted.
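      A comparison like this can be automated with a small script. The sketch below assumes the audit log is written as JSON lines in the audit.k8s.io/v1 Event format, and that leader election shows up as update or patch requests against a ConfigMap (my understanding of Flink 1.13 Kubernetes HA, not something confirmed in the attached logs). The function names are hypothetical, and the object name has to be supplied by the caller.

```python
import json
from datetime import datetime


def renew_timestamps(audit_lines, object_name, resource="configmaps"):
    """Collect timestamps of completed write requests against one object.

    Assumes one JSON audit event per line (audit.k8s.io/v1 Event fields).
    """
    stamps = []
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        ref = event.get("objectRef") or {}
        if (
            event.get("stage") == "ResponseComplete"
            and event.get("verb") in ("update", "patch")
            and ref.get("resource") == resource
            and ref.get("name") == object_name
        ):
            # Audit timestamps end in "Z"; normalize for older Python versions.
            raw = event["requestReceivedTimestamp"].replace("Z", "+00:00")
            stamps.append(datetime.fromisoformat(raw))
    return sorted(stamps)


def largest_gap_seconds(stamps):
    """Return the longest pause between consecutive requests, or None."""
    if len(stamps) < 2:
        return None
    return max((b - a).total_seconds() for a, b in zip(stamps, stamps[1:]))
```

      Running renew_timestamps over each audit log with the respective leader-election object name, and comparing largest_gap_seconds against the configured 60s renew deadline, would show whether a JM stopped sending renewals for longer than the deadline.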

       

      Attachments

        1. jm-restart4.log (161 kB, Xiangyan)
        2. jm-no-restart4.log (204 kB, Xiangyan)
        3. connection timeout.png (207 kB, Xiangyan)
        4. audit-log-restart.txt (8 kB, Xiangyan)
        5. audit-log-no-restart.txt (6 kB, Xiangyan)

          People

            Assignee: Unassigned
            Reporter: Xiangyan