Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-26175

PySpark cannot terminate worker process if user program reads from stdin

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 3.0.0
    • Component/s: PySpark
    • Labels:
    • Target Version/s:

      Description

      PySpark worker daemon reads from stdin the worker PIDs to kill. https://github.com/apache/spark/blob/1bb60ab8392adf8b896cc04fb1d060620cf09d8a/python/pyspark/daemon.py#L127

      However, the worker process is a forked process from the worker daemon process and we didn't close stdin on the child after fork. This means the child and user program can read stdin as well, which blocks daemon from receiving the PID to kill. This can cause issues because the task reaper might detect the task was not terminated and eventually kill the JVM.

      Possible fix could be:

      • Closing stdin of the worker process right after fork.
      • Creating a new socket to receive PIDs to kill instead of using stdin.

      Steps to reproduce

      1. Paste the following code in pyspark:
        import subprocess
        def task(_):
          subprocess.check_output(["cat"])
        
        sc.parallelize(range(1), 1).mapPartitions(task).count()
        
      2. Press CTRL+C to cancel the job.
      3. The following message is displayed:
        18/11/26 17:52:51 WARN PythonRunner: Incomplete task 0.0 in stage 0 (TID 0) interrupted: Attempting to kill Python Worker
        18/11/26 17:52:52 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): TaskKilled (Stage cancelled)
        
      4. Run ps -xf to see that cat process was in fact not killed:
        19773 pts/2    Sl+    0:00  |   |   \_ python
        19803 pts/2    Sl+    0:11  |   |       \_ /usr/lib/jvm/java-8-oracle/bin/java -cp /home/ala/Repos/apache-spark-GOOD-2/conf/:/home/ala/Repos/apache-spark-GOOD-2/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --name PySparkShell pyspark-shell
        19879 pts/2    S      0:00  |   |           \_ python -m pyspark.daemon
        19895 pts/2    S      0:00  |   |               \_ python -m pyspark.daemon
        19898 pts/2    S      0:00  |   |                   \_ cat
        

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                weichenxu123 Weichen Xu
                Reporter:
                ala.luszczak Ala Luszczak
              • Votes:
                0 Vote for this issue
                Watchers:
                7 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: