SPARK-34726

Fix collectToPython timeouts

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.7
    • Fix Version/s: 2.4.8
    • Component/s: SQL
    • Labels: None

    Description

      One of our customers frequently encounters "serve-DataFrame" java.net.SocketTimeoutException: Accept timed out errors in PySpark because Dataset.collectToPython() in Spark 2.4 does the following:

      1. Collects the results
      2. Opens a socket server that then listens for a connection from the Python side
      3. Runs the event listeners as part of withAction on the same thread, because SPARK-25680 is not available in Spark 2.4
      4. Returns the address of the socket server to Python
      5. The Python side connects to the socket server and fetches the data

      Because the customer has a custom, long-running event listener, the time between steps 2 and 5 is frequently longer than the default connection timeout. Increasing the timeout is not a good solution, as we don't know how long the listeners can take to run.
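
The ordering problem in the steps above can be reproduced outside Spark with plain sockets. The following is only a minimal sketch, not Spark code: `serve_once`, `fetch`, `demo`, the payload, and the timing values are all hypothetical names and numbers chosen for illustration. With `listeners_first=False` the server starts listening before the slow listener runs, as in the steps above. With `listeners_first=True` the listener runs before the socket is opened, which is one possible reordering and not necessarily what the actual patch does.

```python
import socket
import threading
import time


def serve_once(payload, accept_timeout, listener_delay, listeners_first):
    """Simulate the JVM side: open a server socket and run a slow listener.

    Returns (port, accept_thread, result), where result["status"] is filled
    in by the accept thread once it finishes.
    """
    result = {}

    def run_listeners():
        # Stand-in for a custom, long-running event listener.
        time.sleep(listener_delay)

    if listeners_first:
        run_listeners()

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    server.listen(1)
    server.settimeout(accept_timeout)  # accept() gives up after this long
    port = server.getsockname()[1]

    def accept_and_send():
        try:
            conn, _ = server.accept()
            with conn:
                conn.sendall(payload)
            result["status"] = "served"
        except socket.timeout:
            result["status"] = "accept timed out"
        finally:
            server.close()

    thread = threading.Thread(target=accept_and_send)
    thread.start()

    if not listeners_first:
        # Problematic order: the accept timeout is already ticking while
        # the listener runs and the caller has not yet received the port.
        run_listeners()

    return port, thread, result


def fetch(port):
    """Simulate the Python side: connect and read until the server closes."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=1.0) as conn:
            chunks = []
            while True:
                data = conn.recv(4096)
                if not data:
                    break
                chunks.append(data)
            return b"".join(chunks)
    except OSError:
        return None  # the server has already given up and closed its socket


def demo(listeners_first):
    """Run one round trip and return (server status, data seen by client)."""
    port, thread, result = serve_once(
        payload=b"rows",
        accept_timeout=0.2,   # illustrative value, shorter than the listener
        listener_delay=0.5,
        listeners_first=listeners_first,
    )
    data = fetch(port)
    thread.join()
    return result["status"], data
```

Calling `demo(listeners_first=False)` lets the server's accept call time out before the client ever learns the port, so the client receives nothing. Calling `demo(listeners_first=True)` starts the timeout only after the slow listener has finished, so the client connects in time regardless of how long the listener took.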

          People

            Assignee: Peter Toth (petertoth)
            Reporter: Peter Toth (petertoth)