Details
Improvement
Status: Open
Normal
Resolution: Unresolved
None
None
Contrail 3.0.3.3-22/Cassandra 2.1.13
Description
Partial network connectivity (e.g. an MTU mismatch that blackholes jumbo frames) can leave a node stuck permanently in UJ status (as reported by nodetool). The node can remain in this state for an extended period of time. Once the isolated node rejoins after the network is repaired, it can cause extensive data loss on the healthy nodes.
If the node were completely isolated, gc_grace_seconds would prevent it from joining once the specified period had elapsed. Other corner cases besides "DN", such as the stuck UJ state described above, should be covered if applicable.
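As a rough illustration of the kind of check being requested, the sketch below parses `nodetool status` text, flags every node whose two-letter state is anything other than UN, and tests whether a node has been out of the normal state for longer than gc_grace_seconds. It is not part of Contrail or Cassandra. The function names, the sample output, and the addresses are hypothetical, and the 864000-second value is only Cassandra's default gc_grace_seconds, which may differ per table in a real deployment.

```python
import re

# nodetool status prints one row per node, starting with a two-letter code:
# first letter U/D (Up/Down), second letter N/L/J/M (Normal/Leaving/Joining/Moving).
STATE_RE = re.compile(r"^([UD][NLJM])\s+(\S+)")

# Cassandra's default gc_grace_seconds (10 days); assumed here, tables may override it.
DEFAULT_GC_GRACE_SECONDS = 864000


def non_normal_nodes(status_text):
    """Return (state, address) pairs for every node not reported as UN."""
    flagged = []
    for line in status_text.splitlines():
        match = STATE_RE.match(line.strip())
        if match and match.group(1) != "UN":
            flagged.append((match.group(1), match.group(2)))
    return flagged


def exceeds_gc_grace(first_seen, now, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """True if a node has been in a non-normal state longer than gc_grace_seconds.

    first_seen and now are timestamps in seconds (hypothetical bookkeeping
    that the caller would have to maintain).
    """
    return (now - first_seen) > gc_grace_seconds


# Illustrative output only; not taken from the affected cluster.
SAMPLE = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load     Tokens  Owns  Host ID                               Rack
UN  10.0.0.1  1.2 GB   256     ?     11111111-1111-1111-1111-111111111111  rack1
UJ  10.0.0.2  300 MB   256     ?     22222222-2222-2222-2222-222222222222  rack1
DN  10.0.0.3  1.1 GB   256     ?     33333333-3333-3333-3333-333333333333  rack1
"""

if __name__ == "__main__":
    for state, address in non_normal_nodes(SAMPLE):
        print(state, address)
```

A monitor built along these lines would treat a long-lived UJ node the same way as a long-lived DN node: once `exceeds_gc_grace` returns true, the node would be kept from rejoining until an operator intervenes.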
Reference:
JTAC 2018-0303-0029