  Hadoop HDFS / HDFS-17166

RBF: NoNamenodesAvailableException is thrown for a long time when failover occurs


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: rbf
    • Hadoop Flags: Reviewed

    Description

      When a nameservice (ns) fails over, the router may record that the ns has no active namenode, and then it cannot find the active NN in that ns for about 1 minute. The client reports an error after exhausting its retries, and the router is unable to serve requests for that ns for a long time.

      11:52:44 errors start being reported

      11:53:46 errors stop being reported (about one minute later)

       

      At this point the failover has already completed successfully in the ns, and a client that connects directly to the active namenode succeeds, but the client cannot access the ns through the router for up to a minute (roughly the interval at which the router refreshes its cache from the state store).

       

      There is a bug in this logic:

      • A certain ns starts to fail over.
      • For a while there is no active NN in the ns, and the router reports that state (no active NN) to the state store.
      • After a period of time, the router pulls the state store data to update its cache, and the cache now records that the ns has no active NN.
      • The failover completes successfully, at which point the ns actually has an active NN again.
      • Assume it is not yet time for the router to update its cache.
      • A client sends a request for the ns to the router, and the router accesses the first NN of that ns in its cache (which still records no active NN).
      • Unfortunately, that NN really is standby, so the request fails and enters the exception handling logic; the router finds that the cache has no active NN for the ns and throws NoNamenodesAvailableException.
      • The NoNamenodesAvailableException is wrapped in a RetriableException, which causes the client to retry. Since every retry picks the same real standby NN from the cache (it is always first in the cache and has the highest priority), a NoNamenodesAvailableException is thrown on every attempt until the router refreshes its cache from the state store (see the toy model after this list).
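      The following self-contained toy model illustrates why every retry ends in the same NoNamenodesAvailableException while the cache is stale. All names in it (StaleCacheDemo, CACHED_PRIORITY, handleRequest) are invented for illustration and are not the actual router classes; it only models the behavior described in the list above.

      import java.util.Arrays;
      import java.util.List;

      public class StaleCacheDemo {
        // What the router's cache believes about ns60 right after its last refresh:
        // no active NN, and nn6002 ordered before nn6001 (higher priority).
        static final List<String> CACHED_PRIORITY = Arrays.asList("nn6002", "nn6001");
        static final boolean CACHE_HAS_ACTIVE = false;

        // What is actually true in the cluster once the failover has completed.
        static boolean isReallyActive(String nn) {
          return "nn6001".equals(nn);
        }

        // One client attempt through the router, following the list above.
        static String handleRequest() {
          String first = CACHED_PRIORITY.get(0);        // always the same entry
          if (!isReallyActive(first) && !CACHE_HAS_ACTIVE) {
            // The tried NN is a real standby and the cache says the ns has no
            // active NN, so the router gives up instead of trying the other NN.
            return "NoNamenodesAvailableException";
          }
          return "served by " + first;
        }

        public static void main(String[] args) {
          // Every client retry repeats the same outcome until the cache refreshes.
          for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + ": " + handleRequest());
          }
        }
      }

      Running it prints the same failure for every attempt, which mirrors what the router log shows during the roughly one-minute window.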

       

      How to reproduce

      1. Suppose we have a nameservice ns60 that contains two NNs: nn6001 is active and nn6002 is standby
      2. When nn6001 and nn6002 are both in standby state, nn6002 has a higher priority than nn6001
      3. Use the default configuration
      4. Shut down the ZKFC on both NNs (hadoop-daemon.sh stop zkfc) so that failover has to be performed manually
      5. Manually switch nn6001 from active to standby: hdfs haadmin -ns ns60 -transitionToStandby --forcemanual nn6001
      6. Make sure that the NamenodeHeartbeatService reports that nn6001 is standby
      7. Manually switch nn6001 from standby back to active: hdfs haadmin -ns ns60 -transitionToActive --forcemanual nn6001
      8. Have the client access ns60 through the router (a minimal client sketch is shown after this list)
      9. After about one minute, request ns60 through the router again
      10. Both requests fail with exceptions; check the router log
      11. The router cannot respond to client requests for ns60 for about a minute
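      For steps 8 and 9, a client sketch like the one below is enough. The router address hdfs://router-host:8888 and the listed path are assumptions to adjust for the actual deployment (8888 is the usual default router RPC port).

      import java.net.URI;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class RouterAccessCheck {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Go through the router, not directly to ns60's namenodes.
          URI router = new URI("hdfs://router-host:8888");
          try (FileSystem fs = FileSystem.get(router, conf)) {
            // Any simple RPC is enough to hit the code path described above while
            // the router's cache is stale; here we just list a mounted path.
            for (FileStatus status : fs.listStatus(new Path("/"))) {
              System.out.println(status.getPath());
            }
          }
        }
      }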

       

       

      Fix the bug

      When the router's cache says an ns has no active NN but the ns actually does have one, and a client request ends up throwing a NoNamenodesAvailableException, the NN that was just tried is proven to be a real standby. The priority of that NN should be lowered so that the next request finds the real active NN, instead of constantly requesting the real standby NN and leaving the router unable to serve the ns until the next cache update.
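      A minimal sketch of that idea, reusing the invented names from the toy model above rather than the actual HDFS-17166 patch: when the tried NN has just proven to be a real standby, move it to the end of the cached priority list so that the very next attempt reaches the other NN.

      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.List;

      public class DeprioritizeDemo {
        // Mutable copy of the cached priority order for ns60 (invented names).
        static final List<String> cachedPriority =
            new ArrayList<>(Arrays.asList("nn6002", "nn6001"));

        // The failover has already completed: nn6001 is the real active NN.
        static boolean isReallyActive(String nn) {
          return "nn6001".equals(nn);
        }

        static String handleRequest() {
          String first = cachedPriority.get(0);
          if (!isReallyActive(first)) {
            // This NN just proved itself to be a real standby: lower its priority
            // so the next attempt starts with the other NN instead of failing again.
            cachedPriority.remove(first);
            cachedPriority.add(first);
            return "NoNamenodesAvailableException, " + first + " deprioritized";
          }
          return "served by " + first;
        }

        public static void main(String[] args) {
          // The first attempt still fails, but the second reaches the real active
          // NN without waiting for the next cache refresh from the state store.
          System.out.println(handleRequest());
          System.out.println(handleRequest());
        }
      }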

       

      Testing the patch

      1. Unit testing

      2. Comparison test

      • Suppose we have two clients [c1, c2], two routers [r1, r2], and one ns [ns60]; the ns has two NNs [nn6001, nn6002]
      • When both nn6001 and nn6002 are in standby state, nn6002 has a higher priority than nn6001
      • r1 runs the build that contains the fix; r2 runs the original build that has the bug
      • c1 sends requests to r1 in a loop, and c2 sends requests to r2 in a loop; all requests target ns60 (a sketch of this client loop follows below)
      • Put both nn6001 and nn6002 into standby state
      • After the routers report that the NNs are in standby state, switch nn6001 to active
        14:15:24 nn6001 is active
      • Check router r1's log: after nn6001 switches to active, NoNamenodesAvailableException is printed only once

       

      • Check router r2's log: NoNamenodesAvailableException keeps being printed for more than one minute after nn6001 switches to active

       

      • At 14:16:25, client c2, accessing the router with the bug, still could not get the data, while client c1, accessing the fixed router, got the data normally:

      c2's log: unable to access normally

      c1's log: the result is displayed correctly
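      The c1/c2 request loop can be as simple as the sketch below (same assumed router endpoint as in the earlier client sketch); each iteration logs a timestamp and the outcome, which is how the one-minute failure window on r2 shows up.

      import java.net.URI;
      import java.time.LocalTime;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class RouterRequestLoop {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Point c1 at r1 and c2 at r2; host and port here are placeholders.
          try (FileSystem fs =
                   FileSystem.get(new URI("hdfs://router-host:8888"), conf)) {
            // Issue one request per second for a couple of minutes and log outcomes.
            for (int i = 0; i < 120; i++) {
              try {
                fs.listStatus(new Path("/"));
                System.out.println(LocalTime.now() + " ok");
              } catch (Exception e) {
                // On the buggy router this keeps printing for about a minute after
                // nn6001 becomes active; on the fixed router it stops almost at once.
                System.out.println(LocalTime.now() + " failed: " + e.getMessage());
              }
              Thread.sleep(1000);
            }
          }
        }
      }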

      Attachments

        1. HDFS-17166.001.patch
          14 kB
          Jian Zhang
        2. HDFS-17166.002.patch
          9 kB
          Jian Zhang
        3. HDFS-17166.003.patch
          12 kB
          Jian Zhang
        4. HDFS-17166.004.patch
          6 kB
          Jian Zhang
        5. HDFS-17166.005.patch
          2 kB
          Jian Zhang
        6. HDFS-17166.patch
          15 kB
          Jian Zhang
        7. image-2023-08-26-11-48-22-131.png
          29 kB
          Jian Zhang
        8. image-2023-08-26-11-56-50-181.png
          66 kB
          Jian Zhang
        9. image-2023-08-26-11-59-25-153.png
          63 kB
          Jian Zhang
        10. image-2023-08-26-12-01-39-968.png
          48 kB
          Jian Zhang
        11. image-2023-08-26-12-06-01-275.png
          181 kB
          Jian Zhang
        12. image-2023-08-26-12-07-47-010.png
          67 kB
          Jian Zhang
        13. image-2023-08-26-22-45-46-814.png
          162 kB
          Jian Zhang
        14. image-2023-08-26-22-47-22-276.png
          33 kB
          Jian Zhang
        15. image-2023-08-26-22-47-41-988.png
          640 kB
          Jian Zhang
        16. image-2023-08-26-22-48-02-086.png
          65 kB
          Jian Zhang
        17. image-2023-08-26-22-48-12-352.png
          207 kB
          Jian Zhang



            People

              Assignee: Jian Zhang
              Reporter: Jian Zhang
              Votes: 0
              Watchers: 3

