[SOLR-11730] Test NodeLost / NodeAdded dynamics - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: AutoScaling
Labels: None

Description

Let's consider a "flaky node" scenario.

A node goes up and down at short intervals (e.g. due to a flaky network cable). If the frequency of these events coincides with the waitFor interval in the nodeLost trigger configuration, the node may never be reported to the autoscaling framework as lost. Similarly, it may never be reported as added back if it is lost again within the waitFor period of the nodeAdded trigger.
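The timing problem can be illustrated with a small model. This is only a sketch under stated assumptions, not Solr's trigger code: `WaitForTriggerModel` and `countEvents` are hypothetical names, and the model assumes that a pending nodeLost event is cancelled as soon as the node is seen live again.

```java
/** Simplified, hypothetical model of a nodeLost-style trigger; not Solr's actual code. */
class WaitForTriggerModel {
  private final long waitForMs;
  private Long lostSinceMs = null;   // null while the node is considered live
  private boolean reported = false;  // true once the current loss has been reported

  WaitForTriggerModel(long waitForMs) {
    this.waitForMs = waitForMs;
  }

  /** Feed one observation; returns true if a nodeLost event would fire now. */
  boolean observe(boolean nodeIsLive, long nowMs) {
    if (nodeIsLive) {
      lostSinceMs = null;            // assumption: the pending event is cancelled
      return false;
    }
    if (lostSinceMs == null) {
      lostSinceMs = nowMs;           // start of a new loss episode
      reported = false;
    }
    if (!reported && nowMs - lostSinceMs >= waitForMs) {
      reported = true;               // fire at most once per loss episode
      return true;
    }
    return false;
  }

  /** Counts events for a node that alternates down/up, each phase lasting phaseMs. */
  static int countEvents(long waitForMs, long phaseMs, long totalMs, long stepMs) {
    WaitForTriggerModel trigger = new WaitForTriggerModel(waitForMs);
    int fired = 0;
    for (long now = 0; now < totalMs; now += stepMs) {
      boolean live = (now / phaseMs) % 2 == 1; // starts down, then up, and so on
      if (trigger.observe(live, now)) {
        fired++;
      }
    }
    return fired;
  }
}
```

In this model, a node that flaps every 800 ms against a waitFor of 1000 ms is never reported: `WaitForTriggerModel.countEvents(1000, 800, 10000, 100)` returns 0, while a node that stays down for 3000 ms at a time is reported once per outage.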

Other scenarios are possible here too, depending on timing:

- the node being constantly reported as lost
- the node being constantly reported as added

One possible solution for the autoscaling triggers is for the framework to keep a short-term memory (perhaps waitFor * 2 long?) of the state of each node that a trigger is tracking, in order to eliminate flaky nodes (i.e. those that transitioned between states more than once within that period).
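A minimal sketch of such a short-term memory follows. `NodeFlapTracker` is a hypothetical helper, not an existing Solr class, and the window of twice waitFor is simply the tentative value suggested above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Solr: remembers recent state transitions per node. */
class NodeFlapTracker {
  private final long windowMs;
  private final Map<String, Deque<Long>> transitions = new HashMap<>();

  NodeFlapTracker(long waitForMs) {
    this.windowMs = 2 * waitForMs; // the tentative "waitFor * 2" memory length
  }

  /** Record a live-to-lost or lost-to-live transition of the node at the given time. */
  void recordTransition(String nodeName, long nowMs) {
    transitions.computeIfAbsent(nodeName, k -> new ArrayDeque<>()).addLast(nowMs);
  }

  /** True if the node changed state more than once within the window ending at nowMs. */
  boolean isFlaky(String nodeName, long nowMs) {
    Deque<Long> times = transitions.get(nodeName);
    if (times == null) {
      return false;
    }
    // Forget transitions that have fallen out of the memory window.
    while (!times.isEmpty() && nowMs - times.peekFirst() > windowMs) {
      times.removeFirst();
    }
    return times.size() > 1;
  }
}
```

A trigger could consult `isFlaky` before emitting an event and suppress, or handle separately, events for nodes that are currently flapping. What to do with such nodes is left open here.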

A situation like this is detrimental to SolrCloud behavior regardless of autoscaling actions, so it should probably be addressed at the node level, e.g. by shutting down the Solr node once the number of disconnects within a time window reaches a certain threshold.
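The node-level check could look roughly like the sketch below. `DisconnectGuard`, its threshold and its window length are all hypothetical; how disconnects are detected and how the node actually shuts down are outside its scope.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical node-level guard, not part of Solr: counts disconnects in a sliding window. */
class DisconnectGuard {
  private final int maxDisconnects;
  private final long windowMs;
  private final Deque<Long> disconnects = new ArrayDeque<>();

  DisconnectGuard(int maxDisconnects, long windowMs) {
    this.maxDisconnects = maxDisconnects;
    this.windowMs = windowMs;
  }

  /** Records a disconnect; returns true when the threshold is reached and the node should shut down. */
  boolean onDisconnect(long nowMs) {
    disconnects.addLast(nowMs);
    // Drop disconnects older than the window; the entry just added always remains.
    while (nowMs - disconnects.peekFirst() > windowMs) {
      disconnects.removeFirst();
    }
    return disconnects.size() >= maxDisconnects;
  }
}
```

With `new DisconnectGuard(3, 1000)`, three disconnects at 0 ms, 400 ms and 800 ms reach the threshold on the third call, whereas disconnects spaced 2000 ms apart never do.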

People

Assignee: Unassigned
Reporter: Andrzej Bialecki
Votes: 0
Watchers: 2

Dates

Created: 06/Dec/17 18:48
Updated: 14/Aug/21 01:33
Resolved: 14/Aug/21 01:33