Hadoop Common / HADOOP-19047

Support InMemory Tracking Of S3A Magic Commits


Description

The following operations happen within a task when it uses the S3A Magic Committer.

During stream close

1. A 0-byte file with the same name as the original file is uploaded to S3 using a PUT operation. This is done so that downstream applications like Spark can get the size of the file being written.

2. The MultiPartUpload (MPU) metadata is serialized and uploaded to S3 as a ".pending" object. Both calls are illustrated in the sketch after this list.
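As context, here is a minimal sketch of the two S3 requests issued at stream close, written against the AWS SDK for Java v2. The key layout, the metadata header name, and the serialized payload are simplified illustrations, not the actual S3A MagicCommitTracker code:

{code:java}
// Illustrative sketch only: the two S3 requests made when a magic output
// stream is closed. The header name and JSON payload are assumptions.
import java.util.Map;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class MagicStreamCloseSketch {

  void onClose(S3Client s3, String bucket, String destKey,
      String pendingCommitJson, long bytesWritten) {
    // PUT #1: a 0-byte marker under the final file name. The real length is
    // attached as object metadata so callers (e.g. Spark) can discover the
    // size of the file that is still being committed.
    s3.putObject(PutObjectRequest.builder()
            .bucket(bucket)
            .key(destKey)
            .metadata(Map.of("x-hadoop-s3a-magic-data-length",
                Long.toString(bytesWritten)))
            .build(),
        RequestBody.empty());

    // PUT #2: the multipart-upload metadata (upload id, part ETags,
    // final destination) serialized as a ".pending" object.
    s3.putObject(PutObjectRequest.builder()
            .bucket(bucket)
            .key(destKey + ".pending")
            .build(),
        RequestBody.fromString(pendingCommitJson));
  }
}
{code}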

      During TaskCommit

1. All the MPU metadata files which the task wrote to S3 (there will be 'x' metadata files in S3 if a single task writes 'x' files) are read back and rewritten to S3 as a single ".pendingset" metadata file; see the sketch after this list.
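A similarly hedged sketch of the task-commit aggregation, again with the AWS SDK for Java v2. Pagination and error handling are omitted, and the single-page listing is an assumption:

{code:java}
// Illustrative sketch only: task commit reads every ".pending" object the
// task wrote and rewrites the union as one ".pendingset" object.
import java.util.ArrayList;
import java.util.List;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Response;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.S3Object;

public class TaskCommitSketch {

  void commitTask(S3Client s3, String bucket, String taskAttemptPrefix) {
    List<String> pendingCommits = new ArrayList<>();

    // 1 LIST: find the .pending objects under the task attempt directory
    // (a real implementation would paginate).
    ListObjectsV2Response listing = s3.listObjectsV2(
        ListObjectsV2Request.builder()
            .bucket(bucket)
            .prefix(taskAttemptPrefix)
            .build());

    for (S3Object object : listing.contents()) {
      if (object.key().endsWith(".pending")) {
        // 1 GET per file: read the serialized MPU metadata back.
        pendingCommits.add(s3.getObjectAsBytes(GetObjectRequest.builder()
            .bucket(bucket)
            .key(object.key())
            .build()).asUtf8String());
      }
    }

    // 1 PUT: persist the aggregate as a single ".pendingset" object for
    // the job committer to complete later.
    s3.putObject(PutObjectRequest.builder()
            .bucket(bucket)
            .key(taskAttemptPrefix + ".pendingset")
            .build(),
        RequestBody.fromString(String.join("\n", pendingCommits)));
  }
}
{code}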

Since these operations happen within the task JVM, we can optimize the commit path and save cost by keeping this information in memory when task memory usage is not a constraint. Hence the proposal is to introduce a new magic commit tracker called "InMemoryMagicCommitTracker", which will:

1. Keep the MPU metadata in memory until the task is committed.
2. Keep the size of each file being written, so that downstream applications can obtain the file size before the file is committed/visible at the output path.

The idea is sketched below.
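A minimal sketch of the proposed tracker, assuming the commit state can be held in JVM-wide concurrent maps keyed by destination path. The actual class would hook into S3A's MagicCommitTracker; the field shapes here are simplified assumptions:

{code:java}
// Minimal sketch of the proposal: keep commit state in the task JVM instead
// of writing it to S3. Field shapes and keys are simplified assumptions.
import java.util.concurrent.ConcurrentHashMap;

public class InMemoryMagicCommitTrackerSketch {

  // 1. MPU metadata per destination path, held until taskCommit; replaces
  //    the per-file ".pending" PUT and the later LIST + GET at commit time.
  static final ConcurrentHashMap<String, Object /* SinglePendingCommit */>
      PATH_TO_PENDING_COMMIT = new ConcurrentHashMap<>();

  // 2. Bytes written per destination path; replaces the 0-byte marker PUT.
  //    A getFileStatus() on the magic path can answer from this map, so
  //    downstream applications still see the in-progress file size.
  static final ConcurrentHashMap<String, Long> PATH_TO_LENGTH =
      new ConcurrentHashMap<>();

  void onStreamClose(String destPath, Object pendingCommit, long bytes) {
    PATH_TO_PENDING_COMMIT.put(destPath, pendingCommit);
    PATH_TO_LENGTH.put(destPath, bytes);
  }
}
{code}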

This optimization saves 2 S3 PUT calls, 1 S3 LIST call, and 1 S3 GET call when a task writes a single file.
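By the same accounting, a task writing 100 files would avoid roughly 200 PUT and 100 GET requests plus the single LIST issued at task commit, since the marker and metadata uploads scale per file while the listing happens once per task.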


People

Assignee: Syed Shameerur Rahman
Reporter: Syed Shameerur Rahman
Votes: 0
Watchers: 3
