Spark / SPARK-9328

Netty IO layer should implement read timeouts


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.2.1, 1.3.1
    • Fix Version/s: 1.4.0
    • Component/s: Shuffle, Spark Core
    • Labels: None

    Description

      Spark's network layer does not implement read timeouts, which can lead to stalls during shuffle: if a remote shuffle server stalls while responding to a shuffle block fetch request but does not close the socket, then the job may block until an OS-level socket timeout occurs.

      I think that we can fix this using Netty's ReadTimeoutHandler (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler). The tricky part of working on this will be figuring out the right place to add the handler and ensuring that we don't introduce performance issues by losing socket re-use.
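      The stall described above can be reproduced with plain JDK sockets: a server that accepts a connection but never writes a byte leaves the client's read blocked indefinitely unless a read timeout is armed. This is a minimal illustration of the read-timeout idea, using Socket.setSoTimeout as a stand-in for Netty's ReadTimeoutHandler; it is not Spark's actual shuffle code, and the class name is invented for the example.

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {
    public static void main(String[] args) throws Exception {
        // A "stalled shuffle server": accepts the connection but never responds.
        try (ServerSocket server = new ServerSocket(0)) {
            Thread stalled = new Thread(() -> {
                try {
                    Socket s = server.accept();
                    Thread.sleep(60_000);  // hold the socket open, send nothing
                    s.close();
                } catch (Exception ignored) { }
            });
            stalled.setDaemon(true);
            stalled.start();

            try (Socket client = new Socket("localhost", server.getLocalPort())) {
                // Without a read timeout, read() below would block until an
                // OS-level socket timeout, exactly the stall the issue describes.
                client.setSoTimeout(500);  // 500 ms read timeout
                InputStream in = client.getInputStream();
                try {
                    in.read();
                    System.out.println("read returned");
                } catch (SocketTimeoutException e) {
                    System.out.println("read timed out");
                }
            }
        }
    }
}
```

      Netty's ReadTimeoutHandler plays the same role inside a channel pipeline: it fires a ReadTimeoutException when no data has been read for the configured interval, rather than blocking a thread the way setSoTimeout does.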

      Quoting from that linked StackOverflow question:

      Note that the ReadTimeoutHandler is also unaware of whether you have sent a request - it only cares whether data has been read from the socket. If your connection is persistent, and you only want read timeouts to fire when a request has been sent, you'll need to build a request / response aware timeout handler.

      If we want to avoid tearing down connections between shuffles then we may have to do something like this.
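      One way to sketch that request/response-aware behavior with only JDK primitives: arm a timer when a request is sent and disarm it when the response arrives, so an idle but healthy persistent connection is never torn down. All names here are hypothetical, invented for illustration; this is not a proposal for Spark's actual handler.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class RequestAwareTimeout {
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();
    private final long timeoutMillis;
    private final AtomicBoolean timedOut = new AtomicBoolean(false);
    private ScheduledFuture<?> pending;

    RequestAwareTimeout(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    synchronized void onRequestSent() {
        // Arm the timeout only while a request is actually outstanding.
        pending = timer.schedule(() -> timedOut.set(true),
                                 timeoutMillis, TimeUnit.MILLISECONDS);
    }

    synchronized void onResponseReceived() {
        // Disarm: the connection stays open and idle without being torn down.
        if (pending != null) pending.cancel(false);
    }

    boolean isTimedOut() { return timedOut.get(); }

    public static void main(String[] args) throws InterruptedException {
        RequestAwareTimeout t = new RequestAwareTimeout(100);
        // Idle connection with no request outstanding: nothing fires.
        Thread.sleep(200);
        System.out.println("idle timed out: " + t.isTimedOut());
        // Outstanding request with no response: the timeout fires.
        t.onRequestSent();
        Thread.sleep(200);
        System.out.println("stalled timed out: " + t.isTimedOut());
        t.timer.shutdownNow();
    }
}
```

      In a Netty pipeline the equivalent would be a custom duplex handler that starts its timer on write and cancels it when a matching response is read, instead of the unconditional ReadTimeoutHandler.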


          People

            Assignee: Josh Rosen (joshrosen)
            Reporter: Josh Rosen (joshrosen)
            Votes: 0
            Watchers: 3
