Description
After a master failover, if an agent attempts to reregister while it is being marked unreachable, and reregistration finishes before the MarkUnreachable operation completes, the assertion in Master::_markUnreachable() [1] that the agent is in the recovered set fails. When readmitting the agent, the master removes it from the recovered set in Master::__reregisterSlave() [2]. If __reregisterSlave() runs before _markUnreachable(), the agent is no longer in the recovered set when the assertion is evaluated, and the master aborts.
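The interleaving can be illustrated with a minimal, single-threaded sketch. This is a toy model and not the Mesos code: ToyMaster, its members, and invariantHolds() are hypothetical names, the registry operation is modeled as a queued continuation, and the sketch returns the result of the check instead of aborting as CHECK does.

```cpp
#include <functional>
#include <queue>
#include <set>
#include <string>
#include <utility>

// Toy model of the relevant master state; all names are hypothetical.
struct ToyMaster {
  std::set<std::string> recovered;            // agents not yet reregistered
  std::queue<std::function<bool()>> pending;  // continuations of registry ops

  // Starts marking the agent unreachable. The queued continuation stands in
  // for _markUnreachable() and runs only once the registry operation is done.
  void markUnreachable(const std::string& id) {
    pending.push([this, id]() {
      // The master uses a fatal CHECK here; the sketch reports the result.
      return recovered.count(id) > 0;
    });
  }

  // Stands in for __reregisterSlave(): readmission removes the agent from
  // the recovered set.
  void reregister(const std::string& id) { recovered.erase(id); }

  // Runs the oldest pending continuation and returns whether the
  // "agent is in the recovered set" invariant held at that point.
  bool completeRegistryOperation() {
    std::function<bool()> continuation = std::move(pending.front());
    pending.pop();
    return continuation();
  }
};

// Returns true if the invariant holds for the given interleaving.
bool invariantHolds(bool reregisterBeforeCompletion) {
  ToyMaster master;
  master.recovered.insert("S1");
  master.markUnreachable("S1");  // registry operation now in flight
  if (reregisterBeforeCompletion) {
    master.reregister("S1");     // agent readmitted while the op is pending
  }
  return master.completeRegistryOperation();
}
```

Calling invariantHolds(false) models the normal ordering, in which the check passes; invariantHolds(true) models the ordering described above, in which reregistration lands between the start and the completion of the registry operation and the check fails.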
Example:
I1215 02:10:02.657672 498611 master.cpp:2170] Elected as the leading master!
I1215 02:10:08.415233 498563 master.cpp:1819] Recovered ??? agents from the registry (???B); allowing 10mins for agents to reregister
I1215 02:20:08.128789 498569 master.cpp:2037] Scheduling removal of agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50); did not reregister within 10mins after master failover
I1215 02:20:16.480931 498596 master.cpp:9469] Marking agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50) unreachable: did not reregister within 10mins after master failover
I1215 02:20:16.864944 498560 master.cpp:7439] Received reregister agent message from agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478 (meta-slave-test-3-82-50)
I1215 02:20:16.865509 498560 master.cpp:7980] Re-registered agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478 (meta-slave-test-3-82-50) with cpus:64; mem:32000; disk:320000; ports:[31000-32000]
I1215 02:20:16.869235 498553 master.cpp:8370] Received update of agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478 (meta-slave-test-3-82-50) with total oversubscribed resources {}
I1215 02:20:16.869263 498553 master.cpp:8487] Ignoring update on agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478 (meta-slave-test-3-82-50) as it reports no changes
I1215 02:20:16.869755 498605 hierarchical.cpp:854] Added agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50) with cpus:64; mem:32000; disk:320000; ports:[31000-32000] (allocated: {})
I1215 02:20:22.541494 498591 master.cpp:9512] Marked agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50) unreachable: did not reregister within 10mins after master failover
F1215 02:20:22.541508 498591 master.cpp:9523] Check failed: slaves.recovered.contains(slave.id())
*** Check failure stack trace: ***
    @ 0x7fcda8a90fdd google::LogMessage::Fail()
    @ 0x7fcda8a93263 google::LogMessage::SendToLog()
    @ 0x7fcda8a90b59 google::LogMessage::Flush()
    @ 0x7fcda8a93c69 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7fcda75d05d8 mesos::internal::master::Master::_markUnreachable()
    @ 0x7fcda75d083d (unknown)
    @ 0x7fcda72b0f93 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vEEEEESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEEEEEclEOS3_
    @ 0x7fcda89f68f1 process::ProcessBase::consume()
    @ 0x7fcda8a0f09b process::ProcessManager::resume()
    @ 0x7fcda8a15986 (unknown)
    @ 0x7fcda45ce070 (unknown)
    @ 0x7fcda4c33ea5 start_thread
    @ 0x7fcda3d318dd __clone
[1] https://github.com/apache/mesos/blob/1.11.0/src/master/master.cpp#L8698
[2] https://github.com/apache/mesos/blob/1.11.0/src/master/master.cpp#L7110