[MESOS-6136] Duplicate framework id handling - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 0.28.1
Fix Version/s: None
Component/s: None
Labels:
Environment: DCOS 1.7 Cloud Formation scripts

Description

We have observed a situation where Mesos kills the tasks belonging to a framework when that framework times out with the Mesos master for some reason, perhaps even because of a network partition.

While we can configure a long timeout so that, for practical purposes, Mesos will not kill a framework's tasks, I'm wondering whether there is an improvement to be made here: a framework would still not be permitted to re-register under a given id (as now), but Mesos would also not kill its tasks. What I'm thinking is that an operator could then "tell" Mesos that this condition should be cleared.
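For illustration, a minimal sketch of the long-timeout workaround, assuming the timeout in question is the `failover_timeout` field of `FrameworkInfo` (in seconds) and that the framework subscribes through the v1 scheduler HTTP API. The helper name `build_subscribe_call`, the constant `ONE_WEEK_SECONDS`, and the user and framework names are hypothetical, and the snippet only builds the request body without contacting a master:

```python
import json

# One week, in seconds. An arbitrary illustrative value, not a recommendation.
ONE_WEEK_SECONDS = 7 * 24 * 60 * 60


def build_subscribe_call(user, name, failover_timeout_seconds, framework_id=None):
    """Build the body of a v1 scheduler SUBSCRIBE call as a plain dict.

    Hypothetical helper: it assumes the timeout discussed above maps to
    FrameworkInfo.failover_timeout.
    """
    framework_info = {
        "user": user,
        "name": name,
        # How long the master waits for the scheduler to come back
        # before tearing down the framework and killing its tasks.
        "failover_timeout": float(failover_timeout_seconds),
    }
    call = {"type": "SUBSCRIBE", "subscribe": {"framework_info": framework_info}}
    if framework_id is not None:
        # Re-subscribing under an existing id: set it in both places.
        framework_info["id"] = {"value": framework_id}
        call["framework_id"] = {"value": framework_id}
    return call


if __name__ == "__main__":
    body = build_subscribe_call("root", "example-framework", ONE_WEEK_SECONDS)
    print(json.dumps(body, indent=2))
```

This only postpones the teardown; once the timeout elapses, the behaviour described in this issue still applies.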

IMHO frameworks should be the only entity requesting that tasks be killed unless manually overridden by an operator.

I'm flagging this as a critical improvement because a) the focus should be on keeping tasks running in a system, and it isn't (hence critical); and b) Mesos is working as designed (hence an improvement rather than a bug).

In summary, I feel that by killing tasks Mesos is taking on a responsibility that it shouldn't have.

Issue Links

is related to: MESOS-4659 Avoid leaving orphan task after framework failure + master failover (Accepted)

People

Assignee: Unassigned

Reporter: Christopher Hunt

Votes: 1

Watchers: 5

Dates

Created: 07/Sep/16 23:32

Updated: 26/Apr/17 16:52