Details
-
Bug
-
Status: Resolved
-
Normal
-
Resolution: Fixed
-
None
-
Correctness - Transient Incorrect Response
-
Normal
-
Normal
-
User Report
-
All
-
None
-
Description
When assassinate an ip address that is not in the gossip map, a "corrupted" entry will be inserted into the gossip map. (1)
For example, if we do "nodetool assassinate 10.1.1.1"
we will get an entry like below by running "nodetool gossipinfo":
/10.1.1.1 generation:1702006511 heartbeat:9999 STATUS:209516:LEFT,-8393921141401589197,1702265651923 STATUS_WITH_PORT:209515:LEFT,-8393921141401589197,1702265651923 TOKENS: not present
This entry in endpointStateMap will cause issue for isUpgradingFromVersionLowerThan function. Because the upgradeFromVersionSupplier supplier will always set the allHostsHaveKnownVersion flag to false so no memoized value will be returned. The "get" function will always require a lock from this line.
If application is using "fetchAll", the native-transport-requests thread will hit this line. This means all the native-transport-requests thread is serialized, also, the lock is shared by GossipStage threads. It means if a node in a cluster with the corrupted gossip map is restart, the node will run into this problem.
To fix the issue,
- Why we want to add a dummy entry for nodetool assassinate if the endpoint is not in the map anymore. Should we do nothing or throw exception if the node is not in the gossip map anymore?
- Before checking if a version is null, we should make sure the node is not a dead node. A decommissioned node, a left node should not be considered part of the cluster anymore when calculating "upgradeInProgressPossible"