[SPARK-35414] Completely fix the broadcast timeout issue in AQE - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 3.0.0, 3.0.1
Fix Version/s: None
Component/s: SQL
Labels: None
Target Version/s: 3.2.0

Description

~~SPARK-33933~~ reports an issue in AQE: when resources are limited, a broadcast timeout can occur.

#31269 gives a partial fix by reordering newStages by class type so that BroadcastQueryStageExec stages precede the others when materialize() is called. However, this only guarantees the order in which tasks are scheduled under normal circumstances. The guarantee is not strict, because the broadcast job and the shuffle map job are submitted from different threads.

So we need a complete fix that prevents this edge case from triggering a broadcast timeout.
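
For illustration only, the reordering idea behind the partial fix can be sketched as a stable sort that moves broadcast stages to the front. This is not Spark's code: the classes `QueryStage`, `BroadcastStage`, and `ShuffleStage` and the function `reorder_stages` are hypothetical stand-ins, and the sketch models only the order in which stages are handed off for materialization, not the actual scheduling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryStage:
    """Hypothetical stand-in for an AQE query stage (not a Spark class)."""
    stage_id: int


class BroadcastStage(QueryStage):
    """Hypothetical stand-in for a broadcast query stage."""


class ShuffleStage(QueryStage):
    """Hypothetical stand-in for a shuffle query stage."""


def reorder_stages(new_stages):
    """Return the stages with broadcast stages first.

    sorted() is stable, so stages of the same kind keep their
    original relative order.
    """
    return sorted(
        new_stages,
        key=lambda stage: 0 if isinstance(stage, BroadcastStage) else 1,
    )


stages = [ShuffleStage(0), BroadcastStage(1), ShuffleStage(2), BroadcastStage(3)]
ordered = reorder_stages(stages)
ordered_ids = [stage.stage_id for stage in ordered]
```

Such a sort fixes only the order in which materialization is requested. As the description notes, the broadcast job and the shuffle map job are submitted from different threads, so a shuffle map job can still reach the scheduler first, which is why the reordering alone is not a strict guarantee.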