Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Fix Version(s): 1.2.1, 1.3.1
- None
Description
Spark's network layer does not implement read timeouts, which can lead to stalls during shuffle: if a remote shuffle server stalls while responding to a shuffle block fetch request but does not close the socket, the job may block until an OS-level socket timeout occurs.
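To make the failure mode concrete, here is a small stand-alone demo (plain JDK sockets, not Spark's Netty-based code): a "server" accepts a connection but never responds or closes it, and the client's read would block indefinitely; setting an explicit read timeout (`SO_TIMEOUT`) makes it fail fast instead. The class and messages are illustrative, not from Spark.

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {   // ephemeral port
            // Server thread: accept the connection, then stall without
            // writing a byte or closing the socket -- like a hung shuffle server.
            Thread stalled = new Thread(() -> {
                try {
                    Socket s = server.accept();
                    Thread.sleep(60_000);   // simulate an indefinite stall
                    s.close();
                } catch (Exception ignored) { }
            });
            stalled.setDaemon(true);
            stalled.start();

            try (Socket client = new Socket("localhost", server.getLocalPort())) {
                client.setSoTimeout(500);   // read timeout in milliseconds
                try {
                    client.getInputStream().read();   // would block forever otherwise
                } catch (SocketTimeoutException e) {
                    System.out.println("read timed out instead of hanging");
                }
            }
        }
    }
}
```

Without the `setSoTimeout` call, the `read()` hangs until the OS gives up, which on many systems can take minutes or never happen at all for an otherwise healthy connection.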
I think we can fix this using Netty's ReadTimeoutHandler (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler). The tricky parts will be figuring out the right place in the pipeline to add the handler and ensuring that we don't introduce performance issues by losing socket re-use.
Quoting from that linked StackOverflow question:
Note that the ReadTimeoutHandler is also unaware of whether you have sent a request - it only cares whether data has been read from the socket. If your connection is persistent, and you only want read timeouts to fire when a request has been sent, you'll need to build a request / response aware timeout handler.
If we want to avoid tearing down connections between shuffles then we may have to do something like this.
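One way to sketch such a request/response-aware handler, independent of Netty: track the number of outstanding requests and only arm the timeout while that count is positive, so an idle persistent connection is never torn down. All names here are mine, purely illustrative of the idea, not Spark's or Netty's API.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a timeout tracker that only fires while requests
// are in flight, leaving idle persistent connections alone.
public class RequestAwareTimeout {
    private final long timeoutNanos;
    private final AtomicInteger outstanding = new AtomicInteger();
    private volatile long lastActivityNanos = System.nanoTime();

    public RequestAwareTimeout(long timeoutMillis) {
        this.timeoutNanos = timeoutMillis * 1_000_000L;
    }

    public void requestSent()      { outstanding.incrementAndGet(); touch(); }
    public void responseReceived() { outstanding.decrementAndGet(); touch(); }
    private void touch()           { lastActivityNanos = System.nanoTime(); }

    /** Polled periodically (e.g. from a scheduled task); true = close the connection. */
    public boolean isTimedOut() {
        return outstanding.get() > 0
            && System.nanoTime() - lastActivityNanos > timeoutNanos;
    }
}
```

In a Netty pipeline this logic would live in a channel handler that observes outbound requests and inbound responses, replacing the unconditional ReadTimeoutHandler.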