[SPARK-9850] Adaptive execution in Spark - ASF JIRA

Attach files

Attach Screenshot

Add vote

Voters

Watch issue

Watchers

Create sub-task

Link

Clone

Update Comment Author

Replace String in Comment

Update Comment Visibility

Delete Comments

XML

Word

Printable

JSON

Details

Type: Epic
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Spark Core, SQL
Labels:
None

Description

Query planning is one of the main factors in high performance, but the current Spark engine requires the execution DAG for a job to be set in advance. Even with cost-based optimization, it is hard to know the behavior of data and user-defined functions well enough to always get great execution plans. This JIRA proposes to add adaptive query execution, so that the engine can change the plan for each query as it sees what data earlier stages produced.

We propose adding this to Spark SQL / DataFrames first, using a new API in the Spark engine that lets libraries run DAGs adaptively. In future JIRAs, the functionality could be extended to other libraries or the RDD API, but that is more difficult than adding it in SQL.

I've attached a design doc by Yin Huai and myself explaining how it would work in more detail.

Attachments

AdaptiveExecutionInSpark.pdf
11/Aug/15 23:41
195 kB
Matei Alexandru Zaharia

Issue Links

Add Link

is a parent of

SPARK-28231 Do not modify the number of partitions set by the user when using repartition

Open

Delete this link

is duplicated by

SPARK-4630 Dynamically determine optimal number of partitions

Closed

Delete this link

is related to

SPARK-23128 The basic framework for the new Adaptive Query Execution

Resolved

Delete this link

SPARK-31412 New Adaptive Query Execution in Spark SQL

Closed

Delete this link

Sub-Tasks

Create Sub-Task

1.	Support submitting map stages individually in DAGScheduler	Resolved	Matei Alexandru Zaharia	Actions
2.	Let reduce tasks fetch multiple map output partitions	Resolved	Matei Alexandru Zaharia	Actions
3.	Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.	Resolved	Yin Huai	Actions
4.	Aggregation: Determine the number of reducers at runtime	Resolved	Yin Huai	Actions
5.	Join: Determine the join strategy (broadcast join or shuffle join) at runtime	Resolved	Unassigned	Actions
6.	Join: Determine the number of reducers used by a shuffle join operator at runtime	Resolved	Yin Huai	Actions
7.	Join: Handling data skew	Resolved	Unassigned	Actions
8.	Improve the way that we plan shuffled join	Resolved	Unassigned	Actions
9.	Allow shuffle readers to request data from just one mapper	Resolved	Unassigned	Actions