Description
SPARK-2764 (and some follow-up commits) simplified PySpark's worker process structure by removing an intermediate pool of processes forked by daemon.py. Previously, daemon.py forked a fixed-size pool of processes that shared a listening socket and handled worker launch requests from Java. After my patch, that intermediate pool is gone and launch requests are handled directly in daemon.py, which forks a new worker process for each request.
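For context, here is a minimal sketch of the two process structures (hypothetical code, not the actual daemon.py; Unix-only, since it relies on os.fork()):

import os
import socket

def run_worker(conn):
    # Stand-in for the real PySpark worker loop.
    conn.sendall(b"ready\n")
    conn.close()

def prefork_pool(listener, pool_size=4):
    # Old model: a fixed pool of processes forked once up front, all
    # blocking in accept() on the same shared listening socket.
    for _ in range(pool_size):
        if os.fork() == 0:  # child: become a long-lived pool member
            while True:
                conn, _ = listener.accept()
                run_worker(conn)
    # The parent would supervise and reap the pool here.

def fork_per_request(listener):
    # New model: the daemon accepts each launch request itself and
    # pays one fork() per worker launch.
    while True:
        conn, _ = listener.accept()
        if os.fork() == 0:  # child: serve this one request, then exit
            listener.close()
            run_worker(conn)
            os._exit(0)
        conn.close()  # parent closes its copy of the connection

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(16)
# e.g. fork_per_request(listener)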
Unfortunately, this seems to have increased PySpark task launch latency when running on EC2 m3-class instances. Most of the difference can be attributed to fork() system calls being more expensive on those instances. I tried the following microbenchmark on m3.xlarge and r3.xlarge instances:
import os

for x in range(1000):
    if os.fork() == 0:
        exit()
On the r3.xlarge instance (timed via the shell's time builtin):
real    0m0.761s
user    0m0.008s
sys     0m0.144s
And on m3.xlarge:
real    0m1.699s
user    0m0.012s
sys     0m1.008s
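To make the comparison easier to reproduce, here is a self-contained variant of the same microbenchmark (a sketch, not the exact script timed above); unlike the original, it also reaps each child, so it measures fork() + exit + wait rather than fork() alone:

import os
import time

N = 1000
start = time.time()
for _ in range(N):
    pid = os.fork()
    if pid == 0:
        os._exit(0)     # child: exit immediately, skipping interpreter cleanup
    os.waitpid(pid, 0)  # parent: reap the child so zombies don't accumulate
elapsed = time.time() - start
print("%d forks in %.3fs (%.3f ms per fork)" % (N, elapsed, elapsed / N * 1000))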
I think this is because these instance types use different virtualization technologies (HVM vs. PV, i.e. paravirtual), which have different fork() costs.
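To confirm which virtualization type a given instance is using, something along these lines works (a sketch using boto3, assuming configured AWS credentials; not part of the original report):

import boto3

ec2 = boto3.client("ec2")
for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        # VirtualizationType is "hvm" or "paravirtual"
        print(instance["InstanceId"],
              instance["InstanceType"],
              instance["VirtualizationType"])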
It may be that this performance difference only shows up in certain microbenchmarks and is masked by other PySpark performance improvements, such as the improvements to large group-bys. I'm re-running the spark-perf benchmarks on m3 instances to confirm whether this impacts more realistic jobs.
Issue Links
- relates to: SPARK-3333 "Large number of partitions causes OOM" (Resolved)
- links to: