[SPARK-44306] Group FileStatus with few RPC calls within Yarn Client - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9.2, 2.3.0, 3.5.0
Fix Version/s: 4.0.0
Component/s: Spark Submit
Labels:
- pull-request-available

Description

Obtaining the FileStatus for each resource one at a time is inefficient. In our company, we run Spark on Hadoop YARN with HDFS, and we have observed two major drawbacks of the current behavior:

- Each getFileStatus call incurs a network round trip, so the total delay can be large and adds uncertainty to the overall Spark job runtime. We measured this overhead in our cluster: the p50 is around 10 s, the p80 is 1 min, and the p100 is up to 15 min. When HDFS is overloaded, the delays become more severe.
- Our cluster issues nearly 100 million getFileStatus calls to HDFS daily. For each user, most resources come from the same HDFS directory (see our engineering blog post for why we took this approach), so a single listStatus call on that directory can return the status of all of them at once. This would replace the nearly 100 million getFileStatus calls with about 0.1 million listStatus calls per day, roughly a 1,000-fold reduction, and would also reduce the load on the HDFS side.

In short, a more efficient way to fetch the FileStatus for each resource is needed.
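
The grouping idea described above can be illustrated with a minimal sketch. It is not the implementation from the linked pull request and does not use the Hadoop API: `FakeFileSystem`, `statuses_one_by_one`, and `statuses_grouped` are hypothetical names, and the file system is an in-memory stand-in that only counts simulated RPCs. The sketch groups resource paths by parent directory and issues one list call per directory instead of one status call per file.

```python
import posixpath
from collections import defaultdict


class FakeFileSystem:
    """Hypothetical in-memory stand-in for a remote file system; counts simulated RPCs."""

    def __init__(self, files):
        # files: mapping of absolute path -> size in bytes
        self.files = dict(files)
        self.rpc_calls = 0

    def get_file_status(self, path):
        # One simulated RPC per file.
        self.rpc_calls += 1
        return {"path": path, "length": self.files[path]}

    def list_status(self, directory):
        # One simulated RPC per directory; returns every direct child.
        self.rpc_calls += 1
        return [
            {"path": p, "length": size}
            for p, size in self.files.items()
            if posixpath.dirname(p) == directory
        ]


def statuses_one_by_one(fs, resources):
    """Baseline: one status call per resource."""
    return {p: fs.get_file_status(p) for p in resources}


def statuses_grouped(fs, resources):
    """Group resources by parent directory, then list each directory once."""
    by_dir = defaultdict(set)
    for p in resources:
        by_dir[posixpath.dirname(p)].add(p)

    result = {}
    for directory, wanted in by_dir.items():
        for status in fs.list_status(directory):
            # Keep only the entries that were actually requested.
            if status["path"] in wanted:
                result[status["path"]] = status
    return result


if __name__ == "__main__":
    files = {f"/user/alice/libs/dep{i}.jar": 1000 + i for i in range(50)}
    files["/user/alice/conf/app.conf"] = 10
    resources = list(files)

    baseline_fs = FakeFileSystem(files)
    statuses_one_by_one(baseline_fs, resources)
    print(baseline_fs.rpc_calls)  # 51: one call per resource

    grouped_fs = FakeFileSystem(files)
    statuses_grouped(grouped_fs, resources)
    print(grouped_fs.rpc_calls)  # 2: one call per distinct parent directory
```

In this sketch a requested path that does not exist is simply absent from the grouped result, so a real caller would need to decide how to handle it. Listing also returns every child of a directory, so the saving depends on resources actually sharing directories, as they do in the cluster described above.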

Issue Links

links to

GitHub Pull Request #42357 (shuwang21)

People

Assignee: SHU WANG
Reporter: SHU WANG
Votes: 0
Watchers: 3

Dates

Created: 05/Jul/23 03:59
Updated: 19/Sep/23 17:50
Resolved: 19/Sep/23 17:50