Description
SPARK-2764 (and some follow-up commits) simplified PySpark's worker process structure by removing an intermediate pool of processes forked by daemon.py. Previously, daemon.py forked a fixed-size pool of processes that shared a listening socket and handled worker launch requests from Java. After my patch, that intermediate pool is gone and launch requests are handled directly in daemon.py, which forks a new worker process for each request.
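For context, here is a minimal sketch of the two process structures (hypothetical code, not the actual daemon.py; Unix-only, since it relies on os.fork()):

import os
import socket

def run_worker(conn):
    # Stand-in for the real PySpark worker loop.
    conn.sendall(b"ready\n")
    conn.close()

def prefork_pool(listener, pool_size=4):
    # Old model: a fixed pool of processes forked once up front, all
    # blocking in accept() on the same shared listening socket.
    for _ in range(pool_size):
        if os.fork() == 0:  # child: become a long-lived pool member
            while True:
                conn, _ = listener.accept()
                run_worker(conn)
    # The parent would supervise and reap the pool here.

def fork_per_request(listener):
    # New model: the daemon accepts each launch request itself and
    # pays one fork() per worker launch.
    while True:
        conn, _ = listener.accept()
        if os.fork() == 0:  # child: serve this one request, then exit
            listener.close()
            run_worker(conn)
            os._exit(0)
        conn.close()  # parent closes its copy of the connection

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(16)
# e.g. fork_per_request(listener)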
Unfortunately, this seems to have increased PySpark task launch latency when running on EC2 m3-class instances. Most of the difference can be attributed to fork() system calls being more expensive on those instances. I tried the following microbenchmark on m3.xlarge and r3.xlarge instances:
import os

for x in range(1000):
    if os.fork() == 0:
        exit()
On the r3.xlarge instance (timed via the shell's time builtin):
real    0m0.761s
user    0m0.008s
sys     0m0.144s
And on m3.xlarge:
real    0m1.699s
user    0m0.012s
sys     0m1.008s
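To make the comparison easier to reproduce, here is a self-contained variant of the same microbenchmark (a sketch, not the exact script timed above); unlike the original, it also reaps each child, so it measures fork() + exit + wait rather than fork() alone:

import os
import time

N = 1000
start = time.time()
for _ in range(N):
    pid = os.fork()
    if pid == 0:
        os._exit(0)     # child: exit immediately, skipping interpreter cleanup
    os.waitpid(pid, 0)  # parent: reap the child so zombies don't accumulate
elapsed = time.time() - start
print("%d forks in %.3fs (%.3f ms per fork)" % (N, elapsed, elapsed / N * 1000))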
I think this is because these instance types use different virtualization technologies (HVM vs. PV, i.e. paravirtual), which have different fork() costs.
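To confirm which virtualization type a given instance is using, something along these lines works (a sketch using boto3, assuming configured AWS credentials; not part of the original report):

import boto3

ec2 = boto3.client("ec2")
for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        # VirtualizationType is "hvm" or "paravirtual"
        print(instance["InstanceId"],
              instance["InstanceType"],
              instance["VirtualizationType"])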
It may be that this performance difference only shows up in certain microbenchmarks and is masked by other PySpark performance improvements, such as the improvements to large group-bys. I'm re-running the spark-perf benchmarks on m3 instances to confirm whether this impacts more realistic jobs.
Issue Links
- relates to: SPARK-3333 "Large number of partitions causes OOM" (Resolved)
- links to: