Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
We spent two days on a bug that turned out to be an infinite loop in an actor, which blocked that actor from processing any other events.
Currently, the only way to find out that an actor is stuck is to use gdb. We should think about a way to print error logs when an actor has been stuck for longer than a threshold.
For instance, the Linux kernel prints a warning to the kernel log if a task is stuck for more than 120 seconds. Something like this would be extremely helpful.
Another option is to expose metrics for this, such as how long each actor has been processing its current event.
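To make the two proposals concrete, here is a minimal, language-neutral sketch in Python of what such a watchdog might look like. It is not libprocess code: the names `ActorWatchdog`, `begin`, `end`, `stuck`, `longest_running_secs`, and `log_stuck_actors` are hypothetical, and the 120-second default simply mirrors the kernel example above. The assumption is that the event loop calls `begin` before an actor handles an event and `end` afterwards, while a separate thread periodically calls `log_stuck_actors`.

```python
import threading
import time


class ActorWatchdog:
    """Records when each actor began its current event (hypothetical sketch)."""

    def __init__(self, threshold_secs=120.0, clock=time.monotonic):
        self._threshold = threshold_secs
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._lock = threading.Lock()
        self._started = {}  # actor id -> (event name, start time)

    def begin(self, actor_id, event):
        """Called by the event loop just before the actor handles an event."""
        with self._lock:
            self._started[actor_id] = (event, self._clock())

    def end(self, actor_id):
        """Called by the event loop once the actor has finished the event."""
        with self._lock:
            self._started.pop(actor_id, None)

    def stuck(self):
        """Returns (actor id, event, elapsed secs) for actors over the threshold."""
        now = self._clock()
        with self._lock:
            return [
                (actor_id, event, now - start)
                for actor_id, (event, start) in self._started.items()
                if now - start > self._threshold
            ]

    def longest_running_secs(self):
        """Gauge-style metric: age of the oldest in-flight event, or 0.0 if idle."""
        now = self._clock()
        with self._lock:
            return max((now - start for _, start in self._started.values()),
                       default=0.0)


def log_stuck_actors(watchdog, log=print):
    """Logs one warning per stuck actor and returns how many were found."""
    found = watchdog.stuck()
    for actor_id, event, elapsed in found:
        log(f"WARNING: actor '{actor_id}' has been processing "
            f"'{event}' for {elapsed:.0f} seconds")
    return len(found)
```

The check has to run outside the actor threads themselves, because an actor spinning in an infinite loop can never report its own stall. The same bookkeeping serves both proposals: `log_stuck_actors` covers the error-log approach, and `longest_running_secs` is the kind of value that could be exported as a metric.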
Issue Links
- is related to MESOS-9308 URI disk profile adaptor could deadlock. (Resolved)