Cassandra / CASSANDRA-19330

speculative read retries may be created too eagerly and swamp the cluster


Details

    • Degradation - Performance Bug/Regression
    • Normal
    • Normal
    • Adhoc Test
    • All
    • None
      ad hoc performance test with saturation of the nodes with read workload

    Description

      This problem affects at least 4.x and newer versions, and possibly older ones as well.
      The patch proposed in CASSANDRA-19215 (https://github.com/apache/cassandra/pull/3064/files) makes it trivial to hit on trunk.
      The description below uses the `trunk` code, but the same problems occur in earlier versions, even if, e.g., some class names differ.

      In one of our tests with RF=3 and QUORUM reads, we observed that the number of speculative read retries was of the same order of magnitude as the number of client reads.

      In effect, this reduced the cluster throughput by roughly 1/3: with RF=3 and QUORUM, each read normally contacts 2 replicas, so an extra speculative read per request adds roughly 50% more replica work, leaving roughly 2/3 of the original throughput at saturation.

      The speculative read retries mechanism and `PercentileSpeculativeRetryPolicy` in particular suffer from several problems:

      1. triggering a speculative retry on the basis of a comparison of two distinct intervals

      The speculative retry is triggered on the basis of:

      • `CFS::sampleReadLatency*` field, which is sourced from the `coordinatorReadLatency` metric
      • `queryStartNanoTime` (`ReadCallback::await` called by `AbstractReadExecutor::shouldSpeculateAndMaybeWait`)

      `coordinatorReadLatency` measures, among other things, the duration of the `StorageProxy::readRegular` function.
      `queryStartNanoTime` is captured earlier; with CASSANDRA-19215 (and `cql_start_time = QUEUE`) it is captured much earlier.

      The offset between the start points of these two intervals is not accounted for in any way; the `AbstractReadExecutor::shouldSpeculateAndMaybeWait` / `ReadCallback::await` duo treats them as measurements of the same interval.

      Together, this means that saturating the server with reads, which causes a buildup of the `Native-Transport-Requests` queue, will inevitably trigger needless speculative read retries, because `sampleReadLatency` does not account for the queue wait time.
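
      To make the mismatch concrete, here is a minimal, self-contained sketch. It is illustrative only: the class and variable names and the numbers are invented, and it only mimics the shape of the `ReadCallback::await` / `shouldSpeculateAndMaybeWait` comparison rather than the actual Cassandra code.

      ```java
      import java.util.concurrent.TimeUnit;

      // Illustrative only: mimics the shape of the speculation decision,
      // not the real Cassandra implementation.
      public class SpeculationMismatchDemo
      {
          public static void main(String[] args)
          {
              // p99 of coordinatorReadLatency, which only covers StorageProxy::readRegular
              long sampleReadLatencyNanos = TimeUnit.MILLISECONDS.toNanos(20);

              // queryStartNanoTime is taken before the request waits in the
              // Native-Transport-Requests queue...
              long queueWaitNanos   = TimeUnit.MILLISECONDS.toNanos(50); // saturation-induced wait
              long replicaWaitNanos = TimeUnit.MILLISECONDS.toNanos(5);  // replicas are actually fast

              // ...so the elapsed time measured against queryStartNanoTime already
              // includes a queue wait that sampleReadLatencyNanos knows nothing about.
              long elapsedSinceQueryStart = queueWaitNanos + replicaWaitNanos;

              boolean speculate = elapsedSinceQueryStart > sampleReadLatencyNanos;
              System.out.printf("elapsed=%dms threshold=%dms -> speculate=%b%n",
                                TimeUnit.NANOSECONDS.toMillis(elapsedSinceQueryStart),
                                TimeUnit.NANOSECONDS.toMillis(sampleReadLatencyNanos),
                                speculate); // true, despite the replicas responding within 5ms
          }
      }
      ```

      Even though the replicas respond within 5ms, the comparison flags the read for speculation purely because of a queue wait that `coordinatorReadLatency` never observes.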

      2. triggering a speculative retry based on time spent coordinating

      I assume that speculative retries aim to reduce latency and prevent timeouts when some of the replicas contacted first (in `AbstractReadExecutor::executeAsync`) do not respond in a timely manner, for whatever reason.
      I understand the desire to limit the user-observable latency and thus base the retry decision on the closest approximation of the user-observable latency we have (`coordinatorReadLatency`). Still, I believe it is a flawed approach.
      If the coordination part is fast, and some replicas are late, it doesn't really matter if we include coordination time.
      If, on the other hand, for whatever reason, the coordination part gets slow, triggering additional speculative retries will only make matters worse.

      3. `coordinatorReadLatency` metric is counter-intuitive

      I understand this point is subjective. That said, I can't find a reason why the `coordinatorReadLatency`
      should not cover the longest interval feasible, i.e. from `queryStartNanoTime` till the end of execution,
      instead of a specific portion of the code (which, admittedly, does most of the read-related work).

      This is another facet of point 1: not only do we measure two different intervals, but both seem to be measured incorrectly.

      4. the metric values used to update the `sampleReadLatency*` field are sticky to histogram bucket ends

      This is more something to be aware of than a particular problem to solve. Still, I feel it is worth mentioning, because perhaps there is some improvement to be made that I'm not aware of.

      We use the `coordinatorReadLatency` metric as the source for the `sampleReadLatency*` field.
      Let's assume we're using the 99th percentile of that metric. At its heart, the metric is powered by a `DecayingEstimatedHistogramReservoir`. The actual percentile value is computed in `AbstractSnapshot::getValue`.
      The value returned by this function is always a bucket boundary (`bucketOffsets[i]`).
      For example, if we have bucket boundaries ..., 20.5ms, 24ms, ..., and we update the Timer only with measurements in the 21-23ms range, then ALL percentile values will equal 24ms.

      This is different from other systems. In Prometheus, for example, linear interpolation between bucket bounds is performed to estimate a more fine-grained value for a given quantile.

      The bucket boundaries grow by ~20% each, so at the level of 20ms a bucket covers roughly a 4ms timespan; at the level of 100ms, it covers roughly a 20ms timespan.

      Fortunately, the percentile value we get is the upper end of the bucket, making speculative retries more difficult to trigger.

      One additional note is that the stickiness of the percentiles stabilizes the behaviour of percentile-based decision makers, e.g. the `DynamicEndpointSnitch`. Implementing linear interpolation in `AbstractSnapshot::getValue` may unexpectedly influence such components.
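
      For illustration, here is a small self-contained sketch of the snapping described above. It is a simplification: the boundary generation and quantile logic only approximate what `DecayingEstimatedHistogramReservoir` / `AbstractSnapshot::getValue` do, and the exact boundary values are not guaranteed to match Cassandra's.

      ```java
      import java.util.Arrays;
      import java.util.Random;

      // Illustrative only: geometric bucket boundaries (~20% growth) and a
      // quantile computation that returns the bucket's upper boundary.
      public class BucketStickinessDemo
      {
          public static void main(String[] args)
          {
              // boundaries in microseconds: next = max(prev + 1, round(prev * 1.2))
              long[] boundaries = new long[90];
              boundaries[0] = 1;
              for (int i = 1; i < boundaries.length; i++)
                  boundaries[i] = Math.max(boundaries[i - 1] + 1, Math.round(boundaries[i - 1] * 1.2));

              long[] counts = new long[boundaries.length];
              Random rnd = new Random(42);
              int total = 100_000;
              for (int n = 0; n < total; n++)
              {
                  long sampleMicros = 21_000 + rnd.nextInt(2_001);  // uniform in the 21-23ms range
                  int bucket = Arrays.binarySearch(boundaries, sampleMicros);
                  if (bucket < 0) bucket = -bucket - 1;             // first boundary >= sample
                  counts[bucket]++;
              }

              for (double q : new double[]{ 0.50, 0.95, 0.99 })
              {
                  long rank = (long) Math.ceil(q * total);
                  long seen = 0;
                  for (int i = 0; i < counts.length; i++)
                  {
                      seen += counts[i];
                      if (seen >= rank)
                      {
                          // every quantile of the 21-23ms samples reports the same
                          // upper bucket boundary
                          System.out.printf("p%.0f = %.1f ms%n", q * 100, boundaries[i] / 1000.0);
                          break;
                      }
                  }
              }
          }
      }
      ```

      All three quantiles report the same upper bucket boundary (~24.6ms with these boundaries), matching the 20.5ms/24ms example above; a linearly interpolating implementation, as described for Prometheus, would instead report values inside the 21-23ms range.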

      I will follow up with specific improvement proposals.


          People

            Assignee: Jakub Zytka
            Reporter: Jakub Zytka