Hadoop Common / HADOOP-16823 (sub-task of HADOOP-15620 Über-jira: S3A phase VI: Hadoop 3.3 features)

Large DeleteObject requests are their own Thundering Herd


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.1
    • Fix Version/s: 3.3.0
    • Component/s: fs/s3
    • Labels: None
    • Release Note:
      The page size for bulk delete operations has been reduced from 1000 to 250 to reduce the likelihood of overloading an S3 partition, especially because the retry policy on throttling is simply to try again.

      The page size can be set via the option "fs.s3a.bulk.delete.page.size".

      There is also an option to control whether throttling retries are handled by the AWS client or exclusively in the S3A code. This option, "fs.s3a.experimental.aws.s3.throttling", is true by default. If set to false, everything is handled in the S3A client. While this means that metrics may be more accurate, throttling failures in helper threads of the AWS SDK (especially those used in copy/rename) may not be handled properly. This is experimental, and should be left at "true" except when seeking more detail about throttling rates.
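
      As an illustration of how these options might be wired up, here is a minimal sketch that sets them on a Hadoop Configuration before the filesystem is created. The class name and the chosen values are placeholders, not recommendations; the property names are those from the release note above, and the same properties can equally be set in core-site.xml:

        import org.apache.hadoop.conf.Configuration;

        public class S3AThrottlingTuning {
          public static Configuration tuned() {
            Configuration conf = new Configuration();
            // Bulk delete page size: smaller pages spread DeleteObjects load
            // across more requests. 250 is the new default per the release note;
            // the explicit value here is only illustrative.
            conf.setInt("fs.s3a.bulk.delete.page.size", 250);
            // Keep throttling retries in the AWS SDK (the default). Set to false
            // only when investigating throttle rates through the S3A metrics.
            conf.setBoolean("fs.s3a.experimental.aws.s3.throttling", true);
            return conf;
          }
        }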


    Description

      Currently AWS S3 throttling is initially handled in the AWS SDK, only reaching the S3 client code after it has given up.

      This means we don't always directly observe when throttling is taking place.

      Proposed:

      • disable throttling retries in the AWS client library
      • add a quantile for the S3 throttle events, as DDB has
      • isolate counters of S3 and DDB throttle events to classify issues better

      Because we are taking over the AWS retries, we will need to expand the initial delay on retries and the number of retries we should support before giving up.
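
      To make that concrete, here is a hedged sketch of what taking over the retries could look like: exponential backoff with jitter on throttled requests, with a larger initial delay and retry limit than the SDK default. The class, exception type, limits and intervals below are illustrative placeholders, not the values S3A actually ships with:

        import java.io.IOException;
        import java.util.Random;
        import java.util.concurrent.Callable;
        import java.util.concurrent.TimeUnit;

        /** Illustrative only: retry a throttled S3 call with exponential backoff
         *  and jitter. Limits and intervals are placeholders, not S3A defaults. */
        final class ThrottleRetrySketch {
          private static final int MAX_RETRIES = 20;          // more attempts than the SDK default
          private static final long BASE_INTERVAL_MS = 500;   // larger initial delay
          private static final Random RANDOM = new Random();

          static <T> T retryOnThrottle(Callable<T> operation) throws Exception {
            for (int attempt = 0; ; attempt++) {
              try {
                return operation.call();
              } catch (ThrottledException e) {                // stand-in for a 503/SlowDown failure
                if (attempt >= MAX_RETRIES) {
                  throw e;
                }
                // Exponential growth plus jitter, so that many threads hitting the
                // same S3 partition do not retry in lockstep (the "thundering herd"
                // this issue is named after).
                long backoff = BASE_INTERVAL_MS << Math.min(attempt, 10);
                long jitter = (long) (RANDOM.nextDouble() * BASE_INTERVAL_MS);
                TimeUnit.MILLISECONDS.sleep(backoff + jitter);
              }
            }
          }

          /** Hypothetical marker for a throttled (HTTP 503) response. */
          static class ThrottledException extends IOException {}
        }

      The jitter matters here: without it, every thread that was throttled at the same moment retries at the same moment and recreates the herd.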

      Also: should we log throttling events? It could be useful, but there is a risk of overloading the logs, especially if many threads in the same process are triggering the problem.

      Proposed: log at debug.
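
      A possible shape for that, assuming SLF4J as used elsewhere in Hadoop; the class and message below are illustrative, not the actual S3A logging code:

        import org.slf4j.Logger;
        import org.slf4j.LoggerFactory;

        // Illustrative only: record throttle events at debug level so they are
        // visible when investigating, without flooding the logs of a process
        // whose many worker threads all hit the same throttled partition.
        final class ThrottleLogSketch {
          private static final Logger LOG =
              LoggerFactory.getLogger(ThrottleLogSketch.class);

          static void noteThrottleEvent(String operation, int attempt) {
            LOG.debug("S3 request throttled during {} (attempt {})", operation, attempt);
          }
        }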

      Note: if S3 bucket logging is enabled then throttling events will be recorded as 503 responses in the logs. If the Hadoop version contains the audit logging of HADOOP-17511, this can be used to identify operations/jobs/users which are triggering problems.


People

    • Assignee: Steve Loughran (stevel@apache.org)
    • Reporter: Steve Loughran (stevel@apache.org)
    • Votes: 0
    • Watchers: 7
