  1. Hadoop YARN
  2. YARN-11292

ResourceManager no longer reconnects to ZooKeeper

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.3
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels: None

    Description

      This problem occurred in our environment. The sequence of events is as follows:

      1. A network exception occurs between the ResourceManager and ZooKeeper.
      2. The ResourceManager reconnects to ZooKeeper successfully.
      3. The ZooKeeper session expires.
      4. The ResourceManager creates a new ZooKeeper client and tries to reconnect.
      5. If the reconnect to ZooKeeper fails, an RMFatalEvent is triggered.
      6. A new thread is then started to keep reconnecting and to rejoin the election. However, the variable hasAlreadyRun ensures that the runnable's body executes only once, so if the reconnect fails again, there is no further chance to reconnect:
          private class StandByTransitionRunnable implements Runnable {
            // The atomic variable to make sure multiple threads with the same runnable
            // run only once.
            private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);

            @Override
            public void run() {
              // Run this only once, even if multiple threads end up triggering
              // this simultaneously.
              if (hasAlreadyRun.getAndSet(true)) {
                return;
              }

              if (rmContext.isHAEnabled()) {
                try {
                  // Transition to standby and reinit active services
                  LOG.info("Transitioning RM to Standby mode");
                  transitionToStandby(true);
                  EmbeddedElector elector = rmContext.getLeaderElectorService();
                  if (elector != null) {
                    elector.rejoinElection();
                  }
                } catch (Exception e) {
                  LOG.error(FATAL, "Failed to transition RM to Standby mode.", e);
                  ExitUtil.terminate(1, e);
                }
              }
            }
          } 
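
The effect of the guard can be shown in isolation. The following is a minimal standalone sketch, not Hadoop code: `OneShotRunnable` and its `attempts` counter are hypothetical stand-ins for StandByTransitionRunnable and its transition/rejoin body, and it assumes the same runnable instance is triggered a second time, as the report describes.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class OneShotGuardDemo {
  // Hypothetical stand-in for StandByTransitionRunnable (not the Hadoop class).
  static class OneShotRunnable implements Runnable {
    private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);
    // Counts how many times the guarded body actually executed.
    final AtomicInteger attempts = new AtomicInteger();

    @Override
    public void run() {
      // Same guard as in the snippet above: only the first caller gets through.
      if (hasAlreadyRun.getAndSet(true)) {
        return;
      }
      // Placeholder for transitionToStandby(true) and rejoinElection().
      attempts.incrementAndGet();
    }
  }

  public static void main(String[] args) {
    OneShotRunnable r = new OneShotRunnable();
    r.run(); // first trigger: the body runs
    r.run(); // later trigger on the same instance: skipped by the guard
    System.out.println("attempts = " + r.attempts.get()); // prints "attempts = 1"
  }
}
```

The second call returns immediately, so a failed first attempt is never retried through this instance.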

      So, I think using a lock here would be better.
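
One possible reading of the lock suggestion is sketched below. It is only an illustration under assumptions, not a proposed patch: `RetryableRunnable` and `attempts` are hypothetical names, and `ReentrantLock.tryLock()` is used so that triggers arriving while an attempt is in progress are dropped, but a trigger arriving after that attempt has finished can run the body again.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class LockGuardSketch {
  // Hypothetical alternative to the one-shot flag (not the Hadoop class).
  static class RetryableRunnable implements Runnable {
    private final ReentrantLock lock = new ReentrantLock();
    // Counts how many times the guarded body actually executed.
    final AtomicInteger attempts = new AtomicInteger();

    @Override
    public void run() {
      // Skip only if another thread is inside the body right now.
      if (!lock.tryLock()) {
        return;
      }
      try {
        // Placeholder for transitionToStandby(true) and rejoinElection().
        attempts.incrementAndGet();
      } finally {
        lock.unlock(); // release so a later trigger can try again
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    RetryableRunnable r = new RetryableRunnable();
    Thread first = new Thread(r);
    first.start();
    first.join(); // first attempt has finished

    Thread second = new Thread(r);
    second.start();
    second.join(); // second trigger is allowed to run the body again

    System.out.println("attempts = " + r.attempts.get()); // prints "attempts = 2"
  }
}
```

Compared with the AtomicBoolean guard, this still collapses simultaneous triggers into a single attempt, but it does not permanently disable the runnable after the first one. Whether a retry should be unbounded or rate-limited is a separate design question that the sketch does not address.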

          People

            Assignee: Unassigned
            Reporter: hncscwc chenwencan