Mesos / MESOS-10209

Agent reregistration and marking race

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11.0
    • Fix Version/s: 1.12.0
    • Component/s: master
    • Labels: None

    Description

      After a master failover, if an agent attempts to reregister while it is being marked unreachable, and the reregistration finishes before the MarkUnreachable registry operation completes, the assertion in Master::_markUnreachable() [1] that the agent is still in the recovered set fails. This happens because the master removes the agent from the recovered set in Master::__reregisterSlave() [2] when it readmits the agent. If __reregisterSlave() runs before _markUnreachable(), the agent is no longer in the set by the time the check executes, and the master aborts.

      Example:

      I1215 02:10:02.657672 498611 master.cpp:2170] Elected as the leading master!
      I1215 02:10:08.415233 498563 master.cpp:1819] Recovered ??? agents from the registry (???B); allowing 10mins for agents to reregister
      I1215 02:20:08.128789 498569 master.cpp:2037] Scheduling removal of agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50); did not reregister within 10mins after master failover
      I1215 02:20:16.480931 498596 master.cpp:9469] Marking agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50) unreachable: did not reregister within 10mins after master failover
      I1215 02:20:16.864944 498560 master.cpp:7439] Received reregister agent message from agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478 (meta-slave-test-3-82-50)
      I1215 02:20:16.865509 498560 master.cpp:7980] Re-registered agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478 (meta-slave-test-3-82-50) with cpus:64; mem:32000; disk:320000; ports:[31000-32000]
      I1215 02:20:16.869235 498553 master.cpp:8370] Received update of agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478 (meta-slave-test-3-82-50) with total oversubscribed resources {}
      I1215 02:20:16.869263 498553 master.cpp:8487] Ignoring update on agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 at slave(1)@10.1.2.3:31478 (meta-slave-test-3-82-50) as it reports no changes
      I1215 02:20:16.869755 498605 hierarchical.cpp:854] Added agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50) with cpus:64; mem:32000; disk:320000; ports:[31000-32000] (allocated: {})
      I1215 02:20:22.541494 498591 master.cpp:9512] Marked agent 696ccc98-6b05-48ae-a4f9-0543d04423be-S49198 (meta-slave-test-3-82-50) unreachable: did not reregister within 10mins after master failover
      F1215 02:20:22.541508 498591 master.cpp:9523] Check failed: slaves.recovered.contains(slave.id())
      *** Check failure stack trace: ***
          @     0x7fcda8a90fdd  google::LogMessage::Fail()
          @     0x7fcda8a93263  google::LogMessage::SendToLog()
          @     0x7fcda8a90b59  google::LogMessage::Flush()
          @     0x7fcda8a93c69  google::LogMessageFatal::~LogMessageFatal()
          @     0x7fcda75d05d8  mesos::internal::master::Master::_markUnreachable()
          @     0x7fcda75d083d  (unknown)
          @     0x7fcda72b0f93  _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vEEEEESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEEEEEclEOS3_
          @     0x7fcda89f68f1  process::ProcessBase::consume()
          @     0x7fcda8a0f09b  process::ProcessManager::resume()
          @     0x7fcda8a15986  (unknown)
          @     0x7fcda45ce070  (unknown)
          @     0x7fcda4c33ea5  start_thread
          @     0x7fcda3d318dd  __clone
      

      [1] https://github.com/apache/mesos/blob/1.11.0/src/master/master.cpp#L8698
      [2] https://github.com/apache/mesos/blob/1.11.0/src/master/master.cpp#L7110
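
The interleaving described above can be reproduced in a small single-threaded model. The sketch below is not Mesos code: `ToyMaster`, its members, and `raceBreaksInvariant()` are hypothetical names chosen for illustration, and the `guardReregistration` flag shows only one conceivable way to close the window (ignoring reregistration while a mark-unreachable operation is in flight). It is not necessarily the fix that was applied upstream.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <set>
#include <string>
#include <utility>

// Toy stand-in for the master's bookkeeping. All names are illustrative
// assumptions and do not mirror the real Master class.
struct ToyMaster {
  std::set<std::string> recovered;           // agents recovered from the registry
  std::set<std::string> markingUnreachable;  // agents with a registry write in flight
  std::set<std::string> registered;          // agents that have been readmitted
  std::deque<std::function<void()>> mailbox; // deferred continuations
  bool guardReregistration = false;          // hypothetical mitigation switch
  bool invariantHeld = true;                 // records what the CHECK would have seen

  // Starts the asynchronous registry write; the continuation runs later.
  void markUnreachable(const std::string& id) {
    markingUnreachable.insert(id);
    mailbox.push_back([this, id] { _markUnreachable(id); });
  }

  // Readmits the agent and drops it from the recovered set.
  void reregister(const std::string& id) {
    if (guardReregistration && markingUnreachable.count(id) > 0) {
      return; // Ignore the attempt while marking is in progress.
    }
    recovered.erase(id);
    registered.insert(id);
  }

  // Continuation of the registry write. Instead of aborting like the
  // real CHECK, it records whether the invariant held.
  void _markUnreachable(const std::string& id) {
    markingUnreachable.erase(id);
    if (recovered.count(id) == 0) {
      invariantHeld = false;
      return;
    }
    recovered.erase(id);
  }

  // Runs all pending continuations in FIFO order.
  void drain() {
    while (!mailbox.empty()) {
      std::function<void()> f = std::move(mailbox.front());
      mailbox.pop_front();
      f();
    }
  }
};

// Replays the ordering from the log: marking starts, reregistration
// completes, then the marking continuation runs.
bool raceBreaksInvariant(bool guard) {
  ToyMaster m;
  m.guardReregistration = guard;
  m.recovered.insert("S49198");
  m.markUnreachable("S49198");
  assert(m.mailbox.size() == 1); // The continuation is still pending.
  m.reregister("S49198");
  m.drain();
  return !m.invariantHeld;
}
```

Calling `raceBreaksInvariant(false)` replays the failing order and reports that the invariant was violated, while `raceBreaksInvariant(true)` shows the same sequence passing once the reregistration attempt is ignored during marking. In the guarded variant the agent would have to retry reregistration after being marked unreachable.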

          People

            ipronin Ilya
            ipronin Ilya
            Benjamin Mahler Benjamin Mahler