Hadoop YARN / YARN-11622

ResourceManager asynchronous switch from Standby to Active exception


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0-alpha4, 3.1.1, 3.3.0
    • Fix Version/s: None
    • Component/s: resourcemanager

    Description

      Two exception cases:

      The first case:

      Exception log:

      14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
      java.lang.NullPointerException
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
      at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
      at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
      at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
      at java.lang.Thread.run(Thread.java:748)

       

      • At 14:52:57, ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event.

      The two asynchronous events are referred to below as Thread_1 and Thread_2, respectively.

      • As shown in the attached figure, Thread_1 sets activeServices to null while reinitializing it during the toStandby process. At that moment, Thread_2 uses activeServices while executing the handleTransitionToStandByInNewThread method, which results in a NullPointerException and causes the ResourceManager process to exit.
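The interleaving described above can be reproduced in isolation. The following is a minimal sketch, not the actual ResourceManager code: the class ToStandbyRaceSketch, its nested ActiveServices type, and the latches that force the unlucky ordering are all assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the ResourceManager; not the real Hadoop class.
class ToStandbyRaceSketch {
    // Hypothetical stand-in for RMActiveServices.
    static class ActiveServices {
        void stop() { }
    }

    // Shared state touched by both toStandby handlers without a common lock.
    private volatile ActiveServices activeServices = new ActiveServices();

    // Runs both handlers and returns whatever the second one threw, if anything.
    static Throwable reproduce() throws InterruptedException {
        ToStandbyRaceSketch rm = new ToStandbyRaceSketch();
        CountDownLatch cleared = new CountDownLatch(1);
        CountDownLatch used = new CountDownLatch(1);
        AtomicReference<Throwable> failure = new AtomicReference<>();

        Thread thread1 = new Thread(() -> {
            rm.activeServices = null;                 // reinitialization begins
            cleared.countDown();
            try {
                used.await();                         // latch forces the bad interleaving
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            rm.activeServices = new ActiveServices(); // reinitialization ends
        });

        Thread thread2 = new Thread(() -> {
            try {
                cleared.await();
                rm.activeServices.stop();             // field is null at this point
            } catch (Throwable t) {
                failure.set(t);
            } finally {
                used.countDown();
            }
        });

        thread1.start();
        thread2.start();
        thread1.join();
        thread2.join();
        return failure.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(reproduce());
    }
}
```

In the real service the ordering is not forced by latches, so the failure appears only when the two events happen to overlap, which is consistent with both log entries carrying the same timestamp.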

      The second case:

      Exception log:

      06:17:35,913 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning of election
      org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
      at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
      at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
      at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
      at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
      at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
      Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll during transition to Active
      at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
      at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
      ... 4 more
      Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
      at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
      at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
      ... 5 more
      Caused by: java.lang.NullPointerException
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
      at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
      at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
      ... 6 more
      06:17:35,917 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
      • At 06:17:35, ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a toStandby event. The two asynchronous events are referred to below as Thread_1 and Thread_2, respectively.
      • While Thread_1 executes, CapacityScheduler.reinitialize is called to refresh the scheduler configuration. At that time, the csConfProvider property of the CapacityScheduler has not been initialized and its value is null. As a result, reinitialize uses csConfProvider while it is null, which triggers a NullPointerException and causes Thread_1's transition to active to fail.

      Solution

      Because the lock in ResourceManager's transitionToActive and transitionToStandby methods covers only a limited scope, different events triggered asynchronously outside that scope can interfere with each other, leading to unpredictable issues. The proposed solution is to encapsulate each asynchronous task as a TransitionToActiveStandbyRunner and place it in a queue to be executed in order by a SingleThreadExecutor. This resolves the concurrency problem and makes the transitions to active and standby clearer and more controllable.
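The queueing idea can be illustrated with a small sketch. It assumes nothing about the eventual patch: the class SerializedTransitionSketch, its string-valued state, and the submit and demo methods are hypothetical names. Only Executors.newSingleThreadExecutor is the real JDK facility the proposal refers to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical model of an RM whose HA transitions all go through one queue.
class SerializedTransitionSketch {
    // Touched only by the executor's single worker thread.
    private String state = "STANDBY";
    private final List<String> history = new ArrayList<>();

    // One worker thread: queued transitions run one at a time, in submission order.
    private final ExecutorService transitionExecutor = Executors.newSingleThreadExecutor();

    // Wraps a transition request as a task and enqueues it.
    Future<String> submit(String source, String targetState) {
        Callable<String> task = () -> {
            history.add(source + ":" + state + "->" + targetState);
            state = targetState;
            return state;
        };
        return transitionExecutor.submit(task);
    }

    // Submits a toActive and a toStandby request back to back and returns the log.
    static List<String> demo() throws Exception {
        SerializedTransitionSketch rm = new SerializedTransitionSketch();
        try {
            Future<String> first = rm.submit("elector", "ACTIVE");
            Future<String> second = rm.submit("fatalEvent", "STANDBY");
            first.get();
            second.get();   // Future.get() makes the worker's writes visible here
            return new ArrayList<>(rm.history);
        } finally {
            rm.transitionExecutor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

Because the second task cannot start until the first has finished, neither transition can observe state that the other has only half updated.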

      TransitionToActiveStandbyRunner and Subclasses

      TransitionToActiveStandbyRunner

       TransitionToActiveStandbyRunner is a template class where the logic for different scenarios is placed and executed within the doTransaction method.

      public abstract class TransitionToActiveStandbyRunner
          implements Callable<TransitionToActiveStandbyResult> {
          @Override
          public TransitionToActiveStandbyResult call() throws Exception {
              // ... log before the transition ...
              TransitionToActiveStandbyResult result = doTransaction();
              // ... log after the transition ...
              return result;
          }

          // Scenario-specific transition logic, implemented by each subclass.
          public abstract TransitionToActiveStandbyResult doTransaction();
      }

      Subclasses

      AdminServiceToActiveRunner

      AdminServiceToActiveRunner encapsulates the logic of the transitionToActive method in AdminService, handling the requests from clients and ActiveStandbyElector to transition to the active state.

      AdminServiceToStandbyRunner

      AdminServiceToStandbyRunner encapsulates the logic of the transitionToStandby method in AdminService, handling the requests from clients and ActiveStandbyElector to transition to the standby state.

      RmStartAndStopToStandby

      RmStartAndStopToStandby is used to transition the ResourceManager service to standby when it is starting or stopping.

       

      RMStartToActiveRunner

      RMStartToActiveRunner is used to transition the ResourceManager service to active when it is starting.

       

      RMFatalToStandbyRunner

      RMFatalToStandbyRunner is used to handle an RMFatalEvent by transitioning to standby when YARN HA mode is enabled.
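To show how a subclass might plug into the template, here is a hedged sketch. The fields of TransitionToActiveStandbyResult are not given in this issue, so the success and message fields are assumptions. SketchToStandbyRunner is a hypothetical subclass whose Runnable stands in for the real standby logic. Reporting a failure through the result rather than rethrowing it is also an assumption.

```java
import java.util.concurrent.Callable;

// Hypothetical result type: the issue does not show its fields.
class TransitionToActiveStandbyResult {
    final boolean success;
    final String message;

    TransitionToActiveStandbyResult(boolean success, String message) {
        this.success = success;
        this.message = message;
    }
}

// Template as described in the issue; logging is reduced to println here.
abstract class TransitionToActiveStandbyRunner
        implements Callable<TransitionToActiveStandbyResult> {
    @Override
    public TransitionToActiveStandbyResult call() throws Exception {
        System.out.println("start " + getClass().getSimpleName());
        TransitionToActiveStandbyResult result = doTransaction();
        System.out.println("done " + getClass().getSimpleName() + ": " + result.message);
        return result;
    }

    public abstract TransitionToActiveStandbyResult doTransaction();
}

// Hypothetical subclass: the Runnable stands in for the real standby logic.
class SketchToStandbyRunner extends TransitionToActiveStandbyRunner {
    private final Runnable standbyLogic;

    SketchToStandbyRunner(Runnable standbyLogic) {
        this.standbyLogic = standbyLogic;
    }

    @Override
    public TransitionToActiveStandbyResult doTransaction() {
        try {
            standbyLogic.run();
            return new TransitionToActiveStandbyResult(true, "standby");
        } catch (RuntimeException e) {
            // Assumption: failures are reported in the result, not rethrown.
            return new TransitionToActiveStandbyResult(false, e.getMessage());
        }
    }
}
```

Each runner described above would follow the same shape, differing only in what doTransaction does, so callers can enqueue any of them without knowing which scenario they represent.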

      Attachments

        1. rm_ha_solution.png
          124 kB
          wangzhihui
        2. yuque_diagram.jpg
          157 kB
          wangzhihui
        3. yuque_diagram (1).jpg
          217 kB
          wangzhihui


            People

              Assignee: hiwangzhihui wangzhihui
              Reporter: hiwangzhihui wangzhihui
              Votes: 0
              Watchers: 5
