[HADOOP-18971] ABFS: Enable Footer Read Optimizations with Appropriate Footer Read Buffer Size - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.3.6
Fix Version/s: 3.4.0
Component/s: fs/azure
Labels:
- pull-request-available

Target Version/s: 3.4.0
Hadoop Flags: Reviewed

Description

Footer read optimization was introduced to Hadoop Azure in HADOOP-17347 (https://issues.apache.org/jira/browse/HADOOP-17347) and was kept disabled by default.
This PR enables footer reads by default, based on the results of the analysis described below:

In our scale workload analysis, it was found that workloads working with Parquet (or, for that matter, ORC etc.) perform a lot of footer reads. Footer reads here are the read operations a workload performs to get the metadata of the Parquet file, which it needs to locate the actual data within the file.
This whole process takes place in 3 steps:

1. Workload reads the last 8 bytes of the Parquet file to get the size, and from it the offset, of the metadata, which sits immediately before these 8 bytes.
2. Using that offset, workload reads the metadata to get the exact offset and length of the data it wants to read.
3. Workload performs the final read operation to get the data it wants to use.
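A minimal sketch of step 1, for illustration only. It relies on the Parquet file layout (a 4-byte little-endian metadata length followed by the 4-byte magic `PAR1` at the end of the file), which comes from the Parquet format rather than from this issue. The function name `locate_parquet_metadata` and the stand-in file contents are hypothetical.

```python
import io
import struct

MAGIC = b"PAR1"  # Parquet magic bytes, found at both ends of the file

def locate_parquet_metadata(f, file_size):
    """Return (metadata_offset, metadata_length) from the 8-byte file tail."""
    f.seek(file_size - 8)
    tail = f.read(8)
    # Tail layout: 4-byte little-endian metadata length, then the magic.
    metadata_length, magic = struct.unpack("<I4s", tail)
    if magic != MAGIC:
        raise ValueError("not a Parquet file")
    # The metadata sits immediately before the 8-byte tail.
    metadata_offset = file_size - 8 - metadata_length
    return metadata_offset, metadata_length

# Stand-in file (not real Parquet content): magic, data, metadata, tail.
data = b"d" * 100
metadata = b"m" * 40
blob = MAGIC + data + metadata + struct.pack("<I", len(metadata)) + MAGIC

offset, length = locate_parquet_metadata(io.BytesIO(blob), len(blob))
print(offset, length)
```

Steps 1 and 2 each cost a separate remote read when the footer optimization is off: one for the 8-byte tail and one for the metadata range it points to.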

Here the first two steps are metadata reads that can be combined into a single footer read. When a workload tries to read the last few bytes of a file (let's call this range the footer size), the driver will also read some extra bytes preceding the footer so that the next read, which is expected to follow, can be served without another request.

Q. What is the footer size of a file?
A. 16KB. Any read request for data within the last 16KB of the file qualifies for a whole footer read. This value is enough to cater to all types of files, including Parquet, ORC, etc.
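The qualification rule can be sketched as a simple predicate. This is an illustration under assumptions, not the driver's code: it treats a read as "within the last 16KB" when its start position falls in that range, and the function name `qualifies_for_footer_read` is hypothetical.

```python
FOOTER_SIZE = 16 * 1024  # 16KB, as stated in the answer above

def qualifies_for_footer_read(position, file_size, footer_size=FOOTER_SIZE):
    """True when a read starts inside the last `footer_size` bytes of the file.

    Assumption: only the start position of the read is checked. Files
    smaller than `footer_size` qualify for every read.
    """
    return position >= max(0, file_size - footer_size)

file_size = 1024 * 1024  # hypothetical 1MB file
tail_read = qualifies_for_footer_read(file_size - 8, file_size)  # 8-byte tail read
head_read = qualifies_for_footer_read(0, file_size)              # read at file start
print(tail_read, head_read)
```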

Q. What is the buffer size to read when reading the footer?
A. Let's call this the footer read buffer size. Prior to this PR, the footer read buffer size was the same as the read buffer size (default 4MB). For most workloads, the required footer size was found to be only 256KB, i.e. for almost all Parquet files the metadata was within the last 256KB of the file. Given this, it does not make sense to read a whole 4MB buffer as part of a footer read. Moreover, reading more data than required incurs additional cost in server and network latency. Based on this and extensive experimentation, a footer read buffer size of 512KB was observed to be ideal for almost all workloads running on Parquet, ORC, etc.

The following configuration was introduced to set the footer read buffer size:
fs.azure.footer.read.request.size: default 512 KB.
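To illustrate how the footer read buffer size affects what is fetched, here is a rough sketch of planning the remote range for a read. It is a simplified model, not the ABFS implementation: `plan_remote_read` and `covered` are hypothetical names, a qualifying read is assumed to fetch the last 512KB of the file, and a non-qualifying read is assumed to fetch exactly what was requested (the real driver buffers those as well).

```python
KB = 1024
FOOTER_SIZE = 16 * KB
FOOTER_READ_BUFFER_SIZE = 512 * KB  # default of fs.azure.footer.read.request.size

def plan_remote_read(position, length, file_size,
                     footer_read_buffer_size=FOOTER_READ_BUFFER_SIZE):
    """Return (start, size) of the byte range fetched from the store."""
    if position >= max(0, file_size - FOOTER_SIZE):
        # Footer read: fetch the tail of the file in one request.
        start = max(0, file_size - footer_read_buffer_size)
        return start, file_size - start
    # Simplification: any other read fetches just the requested range.
    return position, min(length, file_size - position)

def covered(fetched, position, length):
    """True when [position, position + length) lies inside the fetched range."""
    start, size = fetched
    return start <= position and position + length <= start + size

file_size = 10 * 1024 * KB  # hypothetical 10MB file
# Step 1: the 8-byte tail read triggers a 512KB footer read.
fetched = plan_remote_read(file_size - 8, 8, file_size)
# Step 2: a 200KB metadata read just before the tail needs no second request.
metadata_hit = covered(fetched, file_size - 8 - 200 * KB, 200 * KB)
print(fetched, metadata_hit)
```

In this model the two metadata reads cost one remote request as long as the metadata fits inside the footer read buffer, which is the case the 256KB observation above describes.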

Quantitative Stats: For a workload running on Parquet files, the number of read requests was reduced by 2.3M, down from 20M. That is around a 10% reduction in overall TPS (2.3M of 20M is about 11.5%).

Issue Links

links to

GitHub Pull Request #6270

People

Assignee: Anuj Modi

Reporter: Anuj Modi

Dates

Created: 13/Nov/23 15:55

Updated: 27/Jan/24 06:53

Resolved: 03/Jan/24 12:50