Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
A Mesos agent can get stuck in the DRAINING state because of pending unacknowledged status updates. When a framework becomes disconnected, the agent keeps sending task status updates for that framework's terminated tasks, and those updates are never acknowledged. Because the master transitions the agent from DRAINING to DRAINED only after all task status updates have been acknowledged, the agent stays in DRAINING indefinitely.
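The condition described above can be illustrated with a toy model. This is only a sketch and not Mesos code: the `Agent` class, its fields, and the framework IDs are hypothetical names made up for illustration, and the transition rule is reduced to the single condition stated in this description (no unacknowledged status updates remain).

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Toy model of a draining agent (hypothetical, not the Mesos implementation)."""

    # Maps framework ID -> number of terminal status updates not yet acknowledged.
    pending_updates: dict = field(default_factory=dict)
    state: str = "DRAINING"

    def acknowledge(self, framework_id: str) -> None:
        """Record one acknowledgement; only a connected framework can send it."""
        remaining = self.pending_updates.get(framework_id, 0) - 1
        if remaining > 0:
            self.pending_updates[framework_id] = remaining
        else:
            self.pending_updates.pop(framework_id, None)

    def blocking_frameworks(self) -> list:
        """Frameworks whose unacknowledged updates keep the agent in DRAINING."""
        return sorted(self.pending_updates)

    def try_finish_drain(self) -> str:
        """Move to DRAINED only when no unacknowledged updates remain."""
        if self.state == "DRAINING" and not self.pending_updates:
            self.state = "DRAINED"
        return self.state


agent = Agent(pending_updates={"fw-connected": 1, "fw-disconnected": 2})
agent.acknowledge("fw-connected")      # the connected framework acknowledges
state_while_stuck = agent.try_finish_drain()
blockers = agent.blocking_frameworks() # the disconnected framework never acknowledges
```

In this model the agent cannot leave DRAINING while `fw-disconnected` still has pending updates, which mirrors the behavior reported here. A listing like `blocking_frameworks()` is the kind of information that would help an operator see what is holding up the drain.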
This problem can be resolved by sending a "Teardown" operation for each lost framework. However, it would be much better if the master handled this situation automatically. At the very least, we should make it easier for an operator to find out what prevents the draining operation from completing.
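As a rough sketch of the manual workaround, the helper below builds one TEARDOWN call body per disconnected framework from a GET_FRAMEWORKS-style response. It makes several assumptions: that "lost" frameworks can be identified by `connected` being false, that the field names match the v1 Operator API of the Mesos version in use, and that the resulting bodies would be POSTed to the master's `/api/v1` endpoint. The function name `teardown_calls` and the sample data are hypothetical. Teardown is destructive, so the generated list should be reviewed before anything is sent.

```python
import json


def teardown_calls(get_frameworks_response: dict) -> list:
    """Build TEARDOWN call bodies for frameworks reported as disconnected.

    Assumes a GET_FRAMEWORKS-style response; verify the field names
    against the Operator API documentation for your Mesos version.
    """
    frameworks = (
        get_frameworks_response.get("get_frameworks", {}).get("frameworks", [])
    )
    calls = []
    for fw in frameworks:
        # Assumption: a missing "connected" field means the framework is connected.
        if fw.get("connected", True):
            continue
        framework_id = fw["framework_info"]["id"]["value"]
        calls.append(
            {"type": "TEARDOWN", "teardown": {"framework_id": {"value": framework_id}}}
        )
    return calls


# Made-up sample data for illustration only.
sample = {
    "type": "GET_FRAMEWORKS",
    "get_frameworks": {
        "frameworks": [
            {"framework_info": {"id": {"value": "fw-1"}}, "connected": True},
            {"framework_info": {"id": {"value": "fw-2"}}, "connected": False},
        ]
    },
}

for call in teardown_calls(sample):
    print(json.dumps(call))
```

Keeping the selection step separate from the HTTP request lets an operator inspect which frameworks would be torn down before acting.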