  Hadoop YARN / YARN-11656

RMStateStore event queue blocked


Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.4.1
    • Fix Version/s: None
    • Component/s: yarn

    Description

      Problem statement

      I observed a YARN cluster that had both pending and available resources, yet cluster utilization was usually around ~50%. The cluster was loaded with 200 parallel Pi example jobs (from hadoop-mapreduce-examples), each configured with 20 map and 20 reduce containers, on a 50-node cluster where every node had 8 cores and plenty of memory (CPU was the bottleneck).
      Eventually I realized the RM had an IO bottleneck and needed 1-20 seconds to persist a single RMStateStoreEvent (using FileSystemRMStateStore).

      To reduce the impact of the issue:

      • create a dispatcher that can persist events on parallel threads
      • expose metrics for the RMStateStore event queue so the problem can easily be identified when it occurs on a cluster
      The issue is visible on UI2 (see the attached issue.png).

      Another way to identify the issue is to check whether too much time is spent storing application info after an app reaches the NEW_SAVING state.

      How the issue can look in the logs (see the attached log.png).

      Solution

      Created a MultiDispatcher class that implements the Dispatcher interface (a simplified, hypothetical sketch of the idea follows the list below).
      The dispatcher registers a separate metrics object, called Event metrics for "rm-state-store", where we can see:

      • how many unhandled events are currently in the event queue for a given event type
      • how many events have been handled for a given event type
      • the average handling time for a given event type
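
      For illustration only, below is a minimal, hypothetical sketch of the general approach (handle events on a fixed pool of worker threads and keep per-event-type counters backing the Current/NumOps/AvgTime values above); it is not the actual MultiDispatcher code, and all class and method names in it are made up:

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.atomic.AtomicLong;
      import java.util.function.Consumer;

      /**
       * Hypothetical sketch: events are handled on a fixed pool of worker threads
       * instead of a single dispatcher thread, and simple per-event-type metrics
       * (current, numOps, avg time) are tracked. Not the actual YARN patch.
       */
      public class MultiDispatcherSketch<T extends Enum<T>> {

        /** Per-event-type counters, analogous to *_Current / *_NumOps / *_AvgTime. */
        static final class TypeMetrics {
          final AtomicLong current = new AtomicLong();     // queued but not yet handled
          final AtomicLong numOps = new AtomicLong();      // handled so far
          final AtomicLong totalMillis = new AtomicLong(); // cumulative handling time

          double avgTimeMillis() {
            long ops = numOps.get();
            return ops == 0 ? 0.0 : (double) totalMillis.get() / ops;
          }
        }

        private final ExecutorService pool;
        private final Consumer<T> handler;
        private final Map<T, TypeMetrics> metrics = new ConcurrentHashMap<>();

        MultiDispatcherSketch(int poolSize, Consumer<T> handler) {
          this.pool = Executors.newFixedThreadPool(poolSize);
          this.handler = handler;
        }

        /** Enqueue an event; one of the pool threads will handle (persist) it. */
        void dispatch(T eventType) {
          TypeMetrics m = metrics.computeIfAbsent(eventType, k -> new TypeMetrics());
          m.current.incrementAndGet();
          pool.execute(() -> {
            long start = System.currentTimeMillis();
            try {
              handler.accept(eventType);
            } finally {
              m.current.decrementAndGet();
              m.numOps.incrementAndGet();
              m.totalMillis.addAndGet(System.currentTimeMillis() - start);
            }
          });
        }

        /** Graceful stop, mirroring the graceful-stop-seconds behaviour. */
        void stop(long gracefulStopSeconds) throws InterruptedException {
          pool.shutdown();
          pool.awaitTermination(gracefulStopSeconds, TimeUnit.SECONDS);
        }

        double avgTimeMillis(T eventType) {
          TypeMetrics m = metrics.get(eventType);
          return m == null ? 0.0 : m.avgTimeMillis();
        }
      }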

      The dispatcher has the following configs (the {} placeholder stands for the dispatcher name, for example rm-state-store); an illustrative snippet follows the table:

      Config name | Description | Default value
      yarn.dispatcher.multi-thread.{}.default-pool-size | Number of parallel threads executing events | 4
      yarn.dispatcher.multi-thread.{}.queue-size | Size of the event queue | 1,000,000
      yarn.dispatcher.multi-thread.{}.monitor-seconds | The size of the event queue is logged with this frequency, in seconds (disabled if zero) | 0
      yarn.dispatcher.multi-thread.{}.graceful-stop-seconds | After the stop signal, the dispatcher waits this many seconds to process incoming events before terminating them | 60
      yarn.dispatcher.multi-thread.{}.metrics-enabled | Whether the dispatcher publishes metrics to the metrics system | false
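
      For illustration, the snippet below shows how these settings might look for the rm-state-store dispatcher; the concrete key names are simply the table entries with the {} placeholder replaced, and in practice they would normally be set in yarn-site.xml rather than in code:

      import org.apache.hadoop.conf.Configuration;

      public class RmStateStoreDispatcherConfigExample {
        public static void main(String[] args) {
          Configuration conf = new Configuration();
          // Keys below assume the {} placeholder is replaced with the dispatcher
          // name "rm-state-store"; normally these would live in yarn-site.xml.
          conf.setInt("yarn.dispatcher.multi-thread.rm-state-store.default-pool-size", 8);
          conf.setInt("yarn.dispatcher.multi-thread.rm-state-store.queue-size", 1_000_000);
          conf.setInt("yarn.dispatcher.multi-thread.rm-state-store.monitor-seconds", 30);
          conf.setInt("yarn.dispatcher.multi-thread.rm-state-store.graceful-stop-seconds", 60);
          conf.setBoolean("yarn.dispatcher.multi-thread.rm-state-store.metrics-enabled", true);
          System.out.println("rm-state-store pool size: "
              + conf.getInt("yarn.dispatcher.multi-thread.rm-state-store.default-pool-size", 4));
        }
      }
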
      Example output from the RM JMX API:
      ...
          {
            "name": "Hadoop:service=ResourceManager,name=Event metrics for rm-state-store",
            "modelerType": "Event metrics for rm-state-store",
            "tag.Context": "yarn",
            "tag.Hostname": CENSORED
            "RMStateStoreEventType#STORE_APP_ATTEMPT_Current": 51,
            "RMStateStoreEventType#STORE_APP_ATTEMPT_NumOps": 0,
            "RMStateStoreEventType#STORE_APP_ATTEMPT_AvgTime": 0.0,
            "RMStateStoreEventType#STORE_APP_Current": 124,
            "RMStateStoreEventType#STORE_APP_NumOps": 46,
            "RMStateStoreEventType#STORE_APP_AvgTime": 3318.25,
            "RMStateStoreEventType#UPDATE_APP_Current": 31,
            "RMStateStoreEventType#UPDATE_APP_NumOps": 16,
            "RMStateStoreEventType#UPDATE_APP_AvgTime": 2629.6666666666665,
            "RMStateStoreEventType#UPDATE_APP_ATTEMPT_Current": 31,
            "RMStateStoreEventType#UPDATE_APP_ATTEMPT_NumOps": 12,
            "RMStateStoreEventType#UPDATE_APP_ATTEMPT_AvgTime": 2048.6666666666665,
            "RMStateStoreEventType#REMOVE_APP_Current": 12,
            "RMStateStoreEventType#REMOVE_APP_NumOps": 3,
            "RMStateStoreEventType#REMOVE_APP_AvgTime": 1378.0,
            "RMStateStoreEventType#REMOVE_APP_ATTEMPT_Current": 0,
            "RMStateStoreEventType#REMOVE_APP_ATTEMPT_NumOps": 0,
            "RMStateStoreEventType#REMOVE_APP_ATTEMPT_AvgTime": 0.0,
            "RMStateStoreEventType#FENCED_Current": 0,
            "RMStateStoreEventType#FENCED_NumOps": 0,
            "RMStateStoreEventType#FENCED_AvgTime": 0.0,
            "RMStateStoreEventType#STORE_MASTERKEY_Current": 0,
            "RMStateStoreEventType#STORE_MASTERKEY_NumOps": 0,
            "RMStateStoreEventType#STORE_MASTERKEY_AvgTime": 0.0,
            "RMStateStoreEventType#REMOVE_MASTERKEY_Current": 0,
            "RMStateStoreEventType#REMOVE_MASTERKEY_NumOps": 0,
            "RMStateStoreEventType#REMOVE_MASTERKEY_AvgTime": 0.0,
            "RMStateStoreEventType#STORE_DELEGATION_TOKEN_Current": 0,
            "RMStateStoreEventType#STORE_DELEGATION_TOKEN_NumOps": 0,
            "RMStateStoreEventType#STORE_DELEGATION_TOKEN_AvgTime": 0.0,
            "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_Current": 0,
            "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_NumOps": 0,
            "RMStateStoreEventType#REMOVE_DELEGATION_TOKEN_AvgTime": 0.0,
            "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_Current": 0,
            "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_NumOps": 0,
            "RMStateStoreEventType#UPDATE_DELEGATION_TOKEN_AvgTime": 0.0,
            "RMStateStoreEventType#UPDATE_AMRM_TOKEN_Current": 0,
            "RMStateStoreEventType#UPDATE_AMRM_TOKEN_NumOps": 0,
            "RMStateStoreEventType#UPDATE_AMRM_TOKEN_AvgTime": 0.0,
            "RMStateStoreEventType#STORE_RESERVATION_Current": 0,
            "RMStateStoreEventType#STORE_RESERVATION_NumOps": 0,
            "RMStateStoreEventType#STORE_RESERVATION_AvgTime": 0.0,
            "RMStateStoreEventType#REMOVE_RESERVATION_Current": 0,
            "RMStateStoreEventType#REMOVE_RESERVATION_NumOps": 0,
            "RMStateStoreEventType#REMOVE_RESERVATION_AvgTime": 0.0,
            "RMStateStoreEventType#STORE_PROXY_CA_CERT_Current": 0,
            "RMStateStoreEventType#STORE_PROXY_CA_CERT_NumOps": 0,
            "RMStateStoreEventType#STORE_PROXY_CA_CERT_AvgTime": 0.0
          },
      ...
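
      Output like the above can be pulled from the ResourceManager's JMX servlet. The sketch below assumes the default RM web port (8088), plain HTTP, and the servlet's qry filter with the bean name from the example; adjust the host, port, and security settings for your cluster:

      import java.net.URI;
      import java.net.URLEncoder;
      import java.net.http.HttpClient;
      import java.net.http.HttpRequest;
      import java.net.http.HttpResponse;
      import java.nio.charset.StandardCharsets;

      public class FetchRmStateStoreMetrics {
        public static void main(String[] args) throws Exception {
          // Assumes the RM web UI is reachable on the default port 8088 over plain HTTP.
          String bean = "Hadoop:service=ResourceManager,name=Event metrics for rm-state-store";
          String url = "http://rm-host:8088/jmx?qry="
              + URLEncoder.encode(bean, StandardCharsets.UTF_8);
          HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
          HttpResponse<String> response = HttpClient.newHttpClient()
              .send(request, HttpResponse.BodyHandlers.ofString());
          System.out.println(response.body());
        }
      }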
      

      Testing

      I deployed the MultiDispatcher-enabled build of YARN to the cluster and ran the following performance test:

      #!/bin/bash
      # Start the per-node load generator on all 50 nodes in parallel.
      for i in {1..50};
      do
          ssh root@$i-node-url 'nohup ./perf.sh 4 1>/dev/null 2>/dev/null &' &
      done
      # Let the load run for 5 minutes, then stop the generators on every node.
      sleep 300
      for i in {1..50};
      do
          ssh root@$i-node-url "pkill -9 -f perf" &
      done
      sleep 5
      echo "DONE"
      

      Each node ran the following perf script:

      #!/bin/bash
      # Keep roughly $1 jobs running at all times for the hadoop user on this node.
      while true
      do
          if [ $(ps -o pid= -u hadoop | wc -l) -le $1 ]
          then
              # Submit another Pi example job in the background.
              hadoop jar /opt/hadoop-mapreduce-examples.jar pi 20 20 1>/dev/null 2>&1 &
          fi
          sleep 1
      done
      

      This way, in 5 minutes (plus waiting for all jobs to finish), the cluster processed 332 applications.
      Running the same test with the official build, it took 5 minutes just for the first application to finish; after that, 221 applications were finished.

      I also tested it with LeveldbRMStateStore and ZKRMStateStore and did not find any problems with the implementation.

      Attachments

        1. issue.png
          342 kB
          Bence Kosztolnik
        2. log.png
          2.86 MB
          Bence Kosztolnik



            People

              Assignee: Bence Kosztolnik
              Reporter: Bence Kosztolnik
