  Hadoop YARN / YARN-11656

RMStateStore event queue blocked


Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.4.1
    • Fix Version/s: None
    • Component/s: yarn

    Description

      Problem statement

      I observed a YARN cluster that had both pending and available resources, yet cluster utilization was usually around ~50%. The cluster was loaded with 200 parallel Pi example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory (CPU was the bottleneck).
      Eventually I realized the RM had an IO bottleneck and needed 1-20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

      To reduce the impact of the issue:

      • create a dispatcher that can persist events on parallel threads
      • expose metrics for the RMStateStore event queue so the problem can easily be identified when it occurs on a cluster
      The issue is visible on UI2 (see the attached issue.png).

      Another way to identify the issue is to check whether too much time is spent storing application info after an app reaches the NEW_SAVING state.

      How the issue can look in the logs (see the attached log.png).

      Solution

      Created a MultiDispatcher class that implements the Dispatcher interface (a simplified, hypothetical sketch of the idea follows the list below).
      The dispatcher registers a separate metrics object, called Event metrics for "rm-state-store", where we can see:

      • how many unhandled events are currently in the event queue for a given event type
      • how many events have been handled for a given event type
      • the average handling time for a given event type
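
      For illustration only, below is a minimal, hypothetical sketch of the general approach (handle events on a fixed pool of worker threads and keep per-event-type counters backing the Current/NumOps/AvgTime values above); it is not the actual MultiDispatcher code, and all class and method names in it are made up:

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.atomic.AtomicLong;
      import java.util.function.Consumer;

      /**
       * Hypothetical sketch: events are handled on a fixed pool of worker threads
       * instead of a single dispatcher thread, and simple per-event-type metrics
       * (current, numOps, avg time) are tracked. Not the actual YARN patch.
       */
      public class MultiDispatcherSketch<T extends Enum<T>> {

        /** Per-event-type counters, analogous to *_Current / *_NumOps / *_AvgTime. */
        static final class TypeMetrics {
          final AtomicLong current = new AtomicLong();     // queued but not yet handled
          final AtomicLong numOps = new AtomicLong();      // handled so far
          final AtomicLong totalMillis = new AtomicLong(); // cumulative handling time

          double avgTimeMillis() {
            long ops = numOps.get();
            return ops == 0 ? 0.0 : (double) totalMillis.get() / ops;
          }
        }

        private final ExecutorService pool;
        private final Consumer<T> handler;
        private final Map<T, TypeMetrics> metrics = new ConcurrentHashMap<>();

        MultiDispatcherSketch(int poolSize, Consumer<T> handler) {
          this.pool = Executors.newFixedThreadPool(poolSize);
          this.handler = handler;
        }

        /** Enqueue an event; one of the pool threads will handle (persist) it. */
        void dispatch(T eventType) {
          TypeMetrics m = metrics.computeIfAbsent(eventType, k -> new TypeMetrics());
          m.current.incrementAndGet();
          pool.execute(() -> {
            long start = System.currentTimeMillis();
            try {
              handler.accept(eventType);
            } finally {
              m.current.decrementAndGet();
              m.numOps.incrementAndGet();
              m.totalMillis.addAndGet(System.currentTimeMillis() - start);
            }
          });
        }

        /** Graceful stop, mirroring the graceful-stop-seconds behaviour. */
        void stop(long gracefulStopSeconds) throws InterruptedException {
          pool.shutdown();
          pool.awaitTermination(gracefulStopSeconds, TimeUnit.SECONDS);
        }

        double avgTimeMillis(T eventType) {
          TypeMetrics m = metrics.get(eventType);
          return m == null ? 0.0 : m.avgTimeMillis();
        }
      }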

      The dispatcher has the following configs (the {} placeholder stands for the dispatcher name, for example rm-state-store); an illustrative snippet follows the table:

      Config name | Description | Default value
      yarn.dispatcher.multi-thread.{}.default-pool-size | Number of parallel threads executing events | 4
      yarn.dispatcher.multi-thread.{}.queue-size | Size of the event queue | 1,000,000
      yarn.dispatcher.multi-thread.{}.monitor-seconds | The size of the event queue is logged with this frequency, in seconds (disabled if zero) | 0
      yarn.dispatcher.multi-thread.{}.graceful-stop-seconds | After the stop signal, the dispatcher waits this many seconds to process incoming events before terminating them | 60
      yarn.dispatcher.multi-thread.{}.metrics-enabled | Whether the dispatcher publishes metrics to the metrics system | false
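
      For illustration, the snippet below shows how these settings might look for the rm-state-store dispatcher; the concrete key names are simply the table entries with the {} placeholder replaced, and in practice they would normally be set in yarn-site.xml rather than in code:

      import org.apache.hadoop.conf.Configuration;

      public class RmStateStoreDispatcherConfigExample {
        public static void main(String[] args) {
          Configuration conf = new Configuration();
          // Keys below assume the {} placeholder is replaced with the dispatcher
          // name "rm-state-store"; normally these would live in yarn-site.xml.
          conf.setInt("yarn.dispatcher.multi-thread.rm-state-store.default-pool-size", 8);
          conf.setInt("yarn.dispatcher.multi-thread.rm-state-store.queue-size", 1_000_000);
          conf.setInt("yarn.dispatcher.multi-thread.rm-state-store.monitor-seconds", 30);
          conf.setInt("yarn.dispatcher.multi-thread.rm-state-store.graceful-stop-seconds", 60);
          conf.setBoolean("yarn.dispatcher.multi-thread.rm-state-store.metrics-enabled", true);
          System.out.println("rm-state-store pool size: "
              + conf.getInt("yarn.dispatcher.multi-thread.rm-state-store.default-pool-size", 4));
        }
      }
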
      Example output from the RM JMX API:
      ...
          {
            "name": "Hadoop:service=ResourceManager,name=Event metrics for rm-state-store",
            "modelerType": "Event metrics for rm-state-store",
            "tag.Context": "yarn",
            "tag.Hostname": CENSORED
            "RMStateStoreEventType#STORE_APP_ATTEMPT_Current": 51,
            "RMStateStoreEventType#STORE_APP_ATTEMPT_NumOps": 0,
            "RMStateStoreEventType#STORE_APP_ATTEMPT_AvgTime": 0.0,
            "RMStateStoreEventType#STORE_APP_Current": 124,
            "RMStateStoreEventType#STORE_APP_NumOps": 46,
            "RMStateStoreEventType#STORE_APP_AvgTime": 3318.25,
            "RMStateStoreEventType#UPDATE_APP_Current": 31,
            "RMStateStoreEventType#UPDATE_APP_NumOps": 16,
            "RMStateStoreEventType#UPDATE_APP_AvgTime": 2629.6666666666665,
            "RMStateStoreEventType#UPDATE_APP_ATTEMPT_Current": 31,
            "RMStateStoreEventType#UPDATE_APP_ATTEMPT_NumOps": 12,
            "RMStateStoreEventType#UPDATE_APP_ATTEMPT_AvgTime": 2048.6666666666665,
            "RMStateStoreEventType#REMOVE_APP_Current": 12,
            "RMStateStoreEventType#REMOVE_APP_NumOps": 3,
            "RMStateStoreEventType#REMOVE_APP_AvgTime": 1378.0,
            "RMStateStoreEventType#REMOVE_APP_ATTEMPT_Current": 0,
            "RMStateStoreEventType#REMOVE_APP_ATTEMPT_NumOps": 0,
            "RMStateStoreEventType#REMOVE_APP_ATTEMPT_AvgTime": 0.0,
            "RMStateStoreEventType#FENCED_Current": 0,
            "RMStateStoreEventType#FENCED_NumOps": 0,
            "RMStateStoreEventType#FENCED_AvgTime": 0.0,
            "RMStateStoreEventType#STORE_MASTERKEY_Current": 0,
            "RMStateStoreEventType#STORE_MASTERKEY_NumOps": 0,
            "RMStateStoreEventType#STORE_MASTERKEY_AvgTime": 0.0,
            "RMStateStoreEventType#REMOVE_MASTERKEY_Current": 0,
            "RMStateStoreEventType#REMOVE_MASTERKEY_NumOps": 0,
            "RMStateStoreEventType#REMOVE_MASTERKEY_AvgTime": 0.0,
            "RMStateStoreEventType#STORE_DELEGATION_TOKEN_Current": 0,
            "RMStateStoreEventType#STORE_DELEGATION_TOKEN_NumOps": 0,
            "RMStateStoreEventType#STORE_DELEGATION_TOKEN_AvgTime": 0.0,
            "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_Current": 0,
            "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_NumOps": 0,
            "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_AvgTime": 0.0,
            "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_Current": 0,
            "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_NumOps": 0,
            "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_AvgTime": 0.0,
            "RMStateStoreEventType#UPDATE_AMRM_TOKEN_Current": 0,
            "RMStateStoreEventType#UPDATE_AMRM_TOKEN_NumOps": 0,
            "RMStateStoreEventType#UPDATE_AMRM_TOKEN_AvgTime": 0.0,
            "RMStateStoreEventType#STORE_RESERVATION_Current": 0,
            "RMStateStoreEventType#STORE_RESERVATION_NumOps": 0,
            "RMStateStoreEventType#STORE_RESERVATION_AvgTime": 0.0,
            "RMStateStoreEventType#REMOVE_RESERVATION_Current": 0,
            "RMStateStoreEventType#REMOVE_RESERVATION_NumOps": 0,
            "RMStateStoreEventType#REMOVE_RESERVATION_AvgTime": 0.0,
            "RMStateStoreEventType#STORE_PROXY_CA_CERT_Current": 0,
            "RMStateStoreEventType#STORE_PROXY_CA_CERT_NumOps": 0,
            "RMStateStoreEventType#STORE_PROXY_CA_CERT_AvgTime": 0.0
          },
      ...
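
      Output like the above can be pulled from the ResourceManager's JMX servlet. The sketch below assumes the default RM web port (8088), plain HTTP, and the servlet's qry filter with the bean name from the example; adjust the host, port, and security settings for your cluster:

      import java.net.URI;
      import java.net.URLEncoder;
      import java.net.http.HttpClient;
      import java.net.http.HttpRequest;
      import java.net.http.HttpResponse;
      import java.nio.charset.StandardCharsets;

      public class FetchRmStateStoreMetrics {
        public static void main(String[] args) throws Exception {
          // Assumes the RM web UI is reachable on the default port 8088 over plain HTTP.
          String bean = "Hadoop:service=ResourceManager,name=Event metrics for rm-state-store";
          String url = "http://rm-host:8088/jmx?qry="
              + URLEncoder.encode(bean, StandardCharsets.UTF_8);
          HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
          HttpResponse<String> response = HttpClient.newHttpClient()
              .send(request, HttpResponse.BodyHandlers.ofString());
          System.out.println(response.body());
        }
      }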
      

      Testing

      I deployed the MultiDispatcher-enabled build of YARN to the cluster and ran the following performance test:

      #!/bin/bash
      # Start the per-node load generator on all 50 nodes in parallel.
      for i in {1..50};
      do
          ssh root@$i-node-url 'nohup ./perf.sh 4 1>/dev/null 2>/dev/null &' &
      done
      # Let the load run for 5 minutes, then stop the generators on every node.
      sleep 300
      for i in {1..50};
      do
          ssh root@$i-node-url "pkill -9 -f perf" &
      done
      sleep 5
      echo "DONE"
      

      Each node ran the following perf script:

      #!/bin/bash
      # Keep roughly $1 jobs running at all times for the hadoop user on this node.
      while true
      do
          if [ $(ps -o pid= -u hadoop | wc -l) -le $1 ]
          then
              # Submit another Pi example job in the background.
              hadoop jar /opt/hadoop-mapreduce-examples.jar pi 20 20 1>/dev/null 2>&1 &
          fi
          sleep 1
      done
      

      This way, in 5 minutes (plus waiting for all jobs to finish), the cluster processed 332 applications.
      Running the same test with the official build, it took 5 minutes just for the first application to finish; after that, 221 applications were finished.

      I also tested it with LeveldbRMStateStore and ZKRMStateStore and did not find any problems with the implementation.

      Attachments

        1. issue.png
          342 kB
          Bence Kosztolnik
        2. log.png
          2.86 MB
          Bence Kosztolnik



            People

              Assignee: Bence Kosztolnik
              Reporter: Bence Kosztolnik
