Mesos / MESOS-9419

Executor to framework message crashes master if framework has not re-registered.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.2.0, 1.2.1, 1.2.2, 1.2.3, 1.3.0, 1.3.1, 1.3.2, 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 1.6.0, 1.6.1, 1.7.0
    • Fix Version/s: 1.4.3, 1.5.2, 1.6.2, 1.7.1, 1.8.0
    • Component/s: master
    • Labels: None

    Description

      If the executor sends a framework message after a master failover, and the framework has not yet re-registered with the master, this will crash the master:

      W20181105 22:02:48.782819 172709 master.hpp:2304] Master attempted to send message to disconnected framework 03dc2603-acd6-491e-8717-3f03e5ee37f4-0000 (Cook-1.24.0-9299b474217db499c9d28738050b359ac8dd55bb)
      F20181105 22:02:48.782830 172709 master.hpp:2314] CHECK_SOME(pid): is NONE
      *** Check failure stack trace: ***
      *** @ 0x7f09e016b6cd google::LogMessage::Fail()
      *** @ 0x7f09e016d38d google::LogMessage::SendToLog()
      *** @ 0x7f09e016b2b3 google::LogMessage::Flush()
      *** @ 0x7f09e016de09 google::LogMessageFatal::~LogMessageFatal()
      *** @ 0x7f09df086228 _CheckFatal::~_CheckFatal()
      *** @ 0x7f09df3a403d mesos::internal::master::Framework::send<>()
      *** @ 0x7f09df2f4886 mesos::internal::master::Master::executorMessage()
      *** @ 0x7f09df3b06a4 _ZN15ProtobufProcessIN5mesos8internal6master6MasterEE8handlerNINS1_26ExecutorToFrameworkMessageEJRKNS0_7SlaveIDERKNS0_11FrameworkIDERKNS0_10ExecutorIDERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEJS9_SC_SF_SN_EEEvPS3_MS3_FvRKN7process4UPIDEDpT1_ESS_SN_DpMT_KFT0_vE
      *** @ 0x7f09df345b43 std::_Function_handler<>::_M_invoke()
      *** @ 0x7f09df36930f ProtobufProcess<>::consume()
      *** @ 0x7f09df2e0ff5 mesos::internal::master::Master::_consume()
      *** @ 0x7f09df2f5542 mesos::internal::master::Master::consume()
      *** @ 0x7f09e00d9c7a process::ProcessManager::resume()
      *** @ 0x7f09e00dd836 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
      *** @ 0x7f09dd467ac8 execute_native_thread_routine
      *** @ 0x7f09dd6f6b50 start_thread
      *** @ 0x7f09dcc7030d (unknown)
      

      This is because Framework::send logs a warning but still proceeds when the framework is disconnected. A recovered framework that has not yet re-registered has neither a pid nor an HTTP connection, so the CHECK_SOME(pid) fails:

      https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.hpp#L2590-L2610

      // Sends a message to the connected framework.
      template <typename Message>
      void Framework::send(const Message& message)
      {
        if (!connected()) {
          LOG(WARNING) << "Master attempted to send message to disconnected"
                       << " framework " << *this;
          // XXX proceeds!
        }
      
        metrics.incrementEvent(message);
      
        if (http.isSome()) {
          if (!http->send(message)) {
            LOG(WARNING) << "Unable to send event to framework " << *this << ":"
                         << " connection closed";
          }
        } else {
          CHECK_SOME(pid); // XXX Will crash.
          master->send(pid.get(), message);
        }
      }
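
One way to make the send path safe is to return early once the framework is known to be disconnected. The following is only a minimal, self-contained sketch of that idea, not the Mesos code and not necessarily the fix that landed: `RecoveredFramework`, its fields, and the `bool` return value are hypothetical stand-ins, the HTTP branch is omitted, and a null `std::unique_ptr` models a pid that is NONE.

```cpp
#include <cassert>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the master's bookkeeping of one framework.
struct RecoveredFramework {
  bool isConnected = false;          // false until the framework re-registers
  std::unique_ptr<std::string> pid;  // null models a pid that is NONE
  std::vector<std::string> sent;     // records messages that were handed off

  bool connected() const { return isConnected; }

  // Returns true if the message was handed off, false if it was dropped.
  bool send(const std::string& message) {
    if (!connected()) {
      std::cerr << "Dropping message to disconnected framework" << std::endl;
      return false;  // Return here instead of falling through to the pid check.
    }

    if (!pid) {
      // A connected framework is expected to have a pid in this sketch;
      // drop the message rather than abort the whole process.
      return false;
    }

    sent.push_back(*pid + " <- " + message);
    return true;
  }
};
```

With this shape, a message that arrives between master failover and framework re-registration is logged and dropped, and delivery resumes once the framework reconnects.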
      

      The executor to framework path does not guard against the framework being disconnected, unlike the status update path:

      https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.cpp#L6472-L6495

      vs.

      https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.cpp#L8371-L8373
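
The alternative is to guard at the call site, as the status update path does, so that the send function is never reached for a disconnected framework. The sketch below illustrates that shape under stated assumptions: `MasterSketch`, `FrameworkState`, and the `dropped` counter are hypothetical names, and frameworks are keyed by a plain string rather than a FrameworkID.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified view of a framework held by the master.
struct FrameworkState {
  bool connected = false;          // false for a recovered framework
  std::vector<std::string> inbox;  // messages forwarded to the scheduler
};

// Hypothetical stand-in for the master's executor-to-framework handler.
struct MasterSketch {
  std::map<std::string, FrameworkState> frameworks;
  int dropped = 0;  // counts messages that were not forwarded

  void executorMessage(const std::string& frameworkId,
                       const std::string& data) {
    auto it = frameworks.find(frameworkId);
    if (it == frameworks.end()) {
      ++dropped;  // Unknown framework: nothing to forward to.
      return;
    }

    if (!it->second.connected) {
      ++dropped;  // Known but not yet re-registered: drop instead of sending.
      return;
    }

    it->second.inbox.push_back(data);
  }
};
```

Guarding in the caller keeps the decision to drop next to the handler that understands the message, at the cost of every caller having to remember the check.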

      It was reported that this crash did not occur for the user on 1.2.0. However, the issue appears to be present there as well, so we will try to backport a test to confirm whether it really does not occur in 1.2.0.

            People

              chhsia0 Chun-Hung Hsiao
              bmahler Benjamin Mahler
              Benjamin Mahler Benjamin Mahler
              Votes: 0
              Watchers: 7
