SPARK-3358: PySpark worker fork()ing performance regression in m3.* / PVM instances


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.1.0, 1.2.0
    • Component/s: PySpark
    • Labels: None
    • Environment: m3.* instances on EC2

    Description

      SPARK-2764 (and some followup commits) simplified PySpark's worker process structure by removing an intermediate pool of processes forked by daemon.py. Previously, daemon.py forked a fixed-size pool of processes that shared a socket and handled worker launch requests from Java. After my patch, this intermediate pool was removed and launch requests are handled directly in daemon.py.
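
      To illustrate the structural difference (a simplified sketch, not the actual daemon.py code; worker_loop, run_worker, the socket handling, and the pool size are all placeholders):

      import os

      def worker_loop(sock):
          # Hypothetical stand-in for a pooled worker's accept-and-serve loop.
          pass

      def run_worker(request):
          # Hypothetical stand-in for handling a single launch request.
          pass

      # Before SPARK-2764 (simplified): daemon.py pre-forked a fixed-size pool
      # of workers sharing the listening socket, so launch requests were served
      # by processes that already existed.
      def preforked_pool(sock, pool_size=8):
          for _ in range(pool_size):
              if os.fork() == 0:
                  worker_loop(sock)      # child: serve launch requests until shutdown
                  os._exit(0)

      # After SPARK-2764 (simplified): daemon.py forks a fresh worker for every
      # launch request, so each task launch pays for one fork() call.
      def fork_per_request(requests):
          for request in requests:
              if os.fork() == 0:
                  run_worker(request)    # child: handle exactly one launch
                  os._exit(0)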

      Unfortunately, this seems to have increased PySpark task launch latency when running on m3.* class instances in EC2. Most of this difference can be attributed to m3 instances' more expensive fork() system calls. I tried the following microbenchmark on m3.xlarge and r3.xlarge instances:

      # Timed with the shell's time command; each child exits immediately,
      # so the loop measures little more than the cost of 1000 fork() calls.
      import os

      for x in range(1000):
        if os.fork() == 0:
          exit()
      

      On the r3.xlarge instance:

      real	0m0.761s
      user	0m0.008s
      sys	0m0.144s
      

      And on m3.xlarge:

      real    0m1.699s
      user    0m0.012s
      sys     0m1.008s
      

      I think this is due to the difference in EC2 virtualization technology: the r3.xlarge instance is HVM while the m3.xlarge here is PVM (paravirtual), and fork() appears to be significantly more expensive under paravirtualization (note the sys time above).
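
      To put those totals in per-task terms (this snippet isn't from the original benchmark; it just divides the loop time by the number of fork() calls), the gap above works out to roughly 1 ms of extra fork() overhead per worker launch:

      import os
      import time

      N = 1000
      start = time.time()
      for _ in range(N):
          pid = os.fork()
          if pid == 0:
              os._exit(0)        # child: exit immediately
          os.waitpid(pid, 0)     # parent: reap the child so zombies don't accumulate
      elapsed = time.time() - start

      print("%d forks in %.3fs (%.3f ms per fork)" % (N, elapsed, elapsed / N * 1000))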

      It may be the case that this performance difference only appears in certain microbenchmarks and is masked by other performance improvements in PySpark, such as improvements to large group-bys. I'm in the process of re-running spark-perf benchmarks on m3 instances in order to confirm whether this impacts more realistic jobs.


People

    Assignee: Unassigned
    Reporter: Josh Rosen (joshrosen)
    Votes: 0
    Watchers: 5
