[SPARK-46006] YarnAllocator miss clean targetNumExecutorsPerResourceProfileId after YarnSchedulerBackend call stop

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.1.3, 3.2.4, 3.3.2, 3.4.1, 3.5.0
Fix Version/s: 3.4.2, 4.0.0, 3.5.1, 3.3.4
Component/s: YARN
Labels:
- pull-request-available

Description

We hit a case where the user calls sc.stop() after all custom code has run, but the call gets stuck somewhere.

This leads to the following situation:

The user calls sc.stop().
sc.stop() gets stuck in some step, but SchedulerBackend.stop has already been called.
Since the YARN ApplicationMaster has not finished, it still calls YarnAllocator.allocateResources().
Since the driver endpoint has stopped, newly allocated executors fail to register.
This repeats until "Max number of executor failures" is triggered.
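
The loop described above can be sketched with a toy model. This is not Spark code: the function `run_allocation_rounds` and all of its parameters are hypothetical, and the model only assumes that each allocation round requests the executors still missing from the target, and that an executor launched against a stopped driver endpoint counts as one failure.

```python
def run_allocation_rounds(target, driver_endpoint_alive, max_failures, max_rounds=100):
    """Toy model (hypothetical, not Spark's API) of repeated allocation rounds.

    Returns (rounds, failures) once the target is met, the failure limit
    is exceeded, or max_rounds is reached.
    """
    failures = 0
    running = 0
    rounds = 0
    while running < target and failures <= max_failures and rounds < max_rounds:
        rounds += 1
        # Request the executors that are still missing from the target.
        for _ in range(target - running):
            if driver_endpoint_alive:
                running += 1   # executor registers with the driver
            else:
                failures += 1  # registration fails: the endpoint is gone
    return rounds, failures


# A stale target of 2 with a stopped driver endpoint keeps failing
# until the failure limit is exceeded.
rounds_stale, failures_stale = run_allocation_rounds(2, False, max_failures=5)

# A target of 0 would have ended the loop immediately.
rounds_cleared, failures_cleared = run_allocation_rounds(0, False, max_failures=5)
```

In this model the only way to stop the failures once the driver endpoint is gone is for the target to drop to 0, which is the cleanup the allocator is missing.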

Caused by

Before CoarseGrainedSchedulerBackend.stop() is called, YarnSchedulerBackend.requestTotalExecutors() is called to clear the request info.

The log confirms that CoarseGrainedSchedulerBackend.stop() was called.

When YarnAllocator handles the empty resource request, resourceTotalExecutorsWithPreferedLocalities is empty, so it fails to clean targetNumExecutorsPerResourceProfileId.
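
A minimal sketch of why the empty request is lost, assuming the allocator updates its targets only by iterating over the entries of the incoming request. `ToyAllocator` and its methods are hypothetical stand-ins, not Spark's classes, and the second method is only one possible way to handle the empty case, not necessarily the fix that was merged.

```python
class ToyAllocator:
    """Hypothetical stand-in for the allocator's per-profile target bookkeeping."""

    def __init__(self):
        # resource profile id -> target number of executors
        self.target_num_executors_per_rp_id = {}

    def request_total_executors(self, rp_to_total_execs):
        """Update targets only for profiles present in the request.

        With an empty request the loop body never runs, so stale
        targets stay in place and nothing is reported as changed.
        """
        changed = False
        for rp_id, num_execs in rp_to_total_execs.items():
            if self.target_num_executors_per_rp_id.get(rp_id) != num_execs:
                self.target_num_executors_per_rp_id[rp_id] = num_execs
                changed = True
        return changed

    def request_total_executors_with_reset(self, rp_to_total_execs):
        """Assumed handling: treat an empty request as "target 0 everywhere"."""
        if not rp_to_total_execs:
            for rp_id in list(self.target_num_executors_per_rp_id):
                self.target_num_executors_per_rp_id[rp_id] = 0
            return True
        return self.request_total_executors(rp_to_total_execs)


allocator = ToyAllocator()
allocator.request_total_executors({0: 4})   # normal request: 4 executors for profile 0
allocator.request_total_executors({})       # empty request on stop: target stays at 4
stale_targets = dict(allocator.target_num_executors_per_rp_id)

allocator.request_total_executors_with_reset({})  # empty request now zeroes the target
cleared_targets = dict(allocator.target_num_executors_per_rp_id)
```

With the first method the target for profile 0 remains 4 after the empty request, so later allocation rounds keep asking for executors. With the explicit empty-request branch the target becomes 0.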