Mesos / MESOS-8346

Resubscription of a resource provider will crash the agent if its HTTP connection isn't closed

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.0
    • None

    Description

      A resource provider might resubscribe while its old HTTP connection has not been properly closed. In that case the agent will crash with, e.g., the following log:

      I1219 13:33:51.937295 128610304 manager.cpp:570] Subscribing resource provider {"id":{"value":"8e71beef-796e-4bde-9257-952ed0f230a5"},"name":"test","type":"org.apache.mesos.rp.test"}
      I1219 13:33:51.937443 128610304 manager.cpp:134] Terminating resource provider 8e71beef-796e-4bde-9257-952ed0f230a5
      I1219 13:33:51.937760 128610304 manager.cpp:134] Terminating resource provider 8e71beef-796e-4bde-9257-952ed0f230a5
      E1219 13:33:51.937851 129683456 http_connection.hpp:445] End-Of-File received
      I1219 13:33:51.937865 131293184 slave.cpp:7105] Handling resource provider message 'DISCONNECT: resource provider 8e71beef-796e-4bde-9257-952ed0f230a5'
      I1219 13:33:51.937968 131293184 slave.cpp:7347] Forwarding new total resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
      F1219 13:33:51.938052 132366336 manager.cpp:606] Check failed: resourceProviders.subscribed.contains(resourceProviderId) 
      *** Check failure stack trace: ***
      E1219 13:33:51.938583 130756608 http_connection.hpp:445] End-Of-File received
      I1219 13:33:51.938987 129683456 hierarchical.cpp:669] Agent 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 (172.18.8.13) updated with total resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
          @        0x1125380ef  google::LogMessageFatal::~LogMessageFatal()
          @        0x112534ae9  google::LogMessageFatal::~LogMessageFatal()
      I1219 13:33:51.939131 129683456 hierarchical.cpp:1517] Performed allocation for 1 agents in 61830ns
      I1219 13:33:51.945793 2646795072 slave.cpp:927] Agent terminating
      I1219 13:33:51.945955 129146880 master.cpp:1305] Agent 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 (172.18.8.13) disconnected
      I1219 13:33:51.945979 129146880 master.cpp:3364] Disconnecting agent 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 (172.18.8.13)
      I1219 13:33:51.946022 129146880 master.cpp:3383] Deactivating agent 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 (172.18.8.13)
      I1219 13:33:51.946081 131293184 hierarchical.cpp:766] Agent 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 deactivated
          @        0x115f2761d  mesos::internal::ResourceProviderManagerProcess::subscribe()::$_2::operator()()
          @        0x115f2977d  _ZN5cpp176invokeIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS2_14HttpConnectionERKNS1_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSG_DpOSH_
          @        0x115f29740  _ZN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEE13invoke_expandISC_NSt3__15tupleIJSG_EEENSK_IJEEEJLm0EEEEDTclsr5cpp17E6invokeclsr3stdE7forwardIT_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_EEEEOSN_OSO_N5cpp1416integer_sequenceImJXspT2_EEEEOSP_
          @        0x115f296bb  _ZNO6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEEclIJEEEDTcl13invoke_expandclL_ZNSt3__14moveIRSC_EEONSJ_16remove_referenceIT_E4typeEOSN_EdtdefpT1fEclL_ZNSK_IRNSJ_5tupleIJSG_EEEEESQ_SR_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0EEEE_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_EEEEDpOSY_
          @        0x115f2965d  _ZN5cpp176invokeIN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS5_14HttpConnectionERKNS4_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEEEJEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSK_DpOSL_
          @        0x115f29631  _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS6_14HttpConnectionERKNS5_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEEEJEEEvOT_DpOT0_
          @        0x115f29526  _ZNO6lambda12CallableOnceIFvvEE10CallableFnINS_8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS7_14HttpConnectionERKNS6_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEEEEEEclEv
          @        0x10b6ca690  _ZNO6lambda12CallableOnceIFvvEEclEv
          @        0x10be09295  _ZZN7process8internal8DispatchIvEclIN6lambda12CallableOnceIFvvEEEEEvRKNS_4UPIDEOT_ENKUlOS7_PNS_11ProcessBaseEE_clESD_SF_
          @        0x10be09180  _ZN5cpp176invokeIZN7process8internal8DispatchIvEclIN6lambda12CallableOnceIFvvEEEEEvRKNS1_4UPIDEOT_EUlOS9_PNS1_11ProcessBaseEE_JS9_SH_EEEDTclclsr3stdE7forwardISD_Efp_Espclsr3stdE7forwardIT0_Efp0_EEESE_DpOSJ_
          @        0x10be0912b  _ZN6lambda8internal7PartialIZN7process8internal8DispatchIvEclINS_12CallableOnceIFvvEEEEEvRKNS2_4UPIDEOT_EUlOS9_PNS2_11ProcessBaseEE_JS9_NSt3__112placeholders4__phILi1EEEEE13invoke_expandISI_NSJ_5tupleIJS9_SM_EEENSP_IJOSH_EEEJLm0ELm1EEEEDTclsr5cpp17E6invokeclsr3stdE7forwardISD_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_EEEESE_OST_N5cpp1416integer_sequenceImJXspT2_EEEEOSU_
          @        0x10be0905f  _ZNO6lambda8internal7PartialIZN7process8internal8DispatchIvEclINS_12CallableOnceIFvvEEEEEvRKNS2_4UPIDEOT_EUlOS9_PNS2_11ProcessBaseEE_JS9_NSt3__112placeholders4__phILi1EEEEEclIJSH_EEEDTcl13invoke_expandclL_ZNSJ_4moveIRSI_EEONSJ_16remove_referenceISD_E4typeESE_EdtdefpT1fEclL_ZNSP_IRNSJ_5tupleIJS9_SM_EEEEESU_SE_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1EEEE_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_EEEEDpOS11_
          @        0x10be08f4d  _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8internal8DispatchIvEclINS1_12CallableOnceIFvvEEEEEvRKNS4_4UPIDEOT_EUlOSB_PNS4_11ProcessBaseEE_JSB_NSt3__112placeholders4__phILi1EEEEEEJSJ_EEEDTclclsr3stdE7forwardISF_Efp_Espclsr3stdE7forwardIT0_Efp0_EEESG_DpOSQ_
          @        0x10be08f11  _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8internal8DispatchIvEclINS_12CallableOnceIFvvEEEEEvRKNS5_4UPIDEOT_EUlOSC_PNS5_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1EEEEEEJSK_EEEvSH_DpOT0_
          @        0x10be08d36  _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchIvEclINS0_IFvvEEEEEvRKNS1_4UPIDEOT_EUlOSE_S3_E_JSE_NSt3__112placeholders4__phILi1EEEEEEEclEOS3_
          @        0x11fd64bc9  _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
          @        0x11fd64a69  process::ProcessBase::consume()
          @        0x11fe20ac4  _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
          @        0x113c77819  process::ProcessBase::serve()
          @        0x11fd5b8c9  process::ProcessManager::resume()
          @        0x11fe8260b  process::ProcessManager::init_threads()::$_1::operator()()
          @        0x11fe82190  _ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEZN7process14ProcessManager12init_threadsEvE3$_1EEEEEPvSB_
          @     0x7fff64da56c1  _pthread_body
          @     0x7fff64da556d  _pthread_start
          @     0x7fff64da4c5d  thread_start
      Abort trap: 6
      

      This is due to a race condition in resource_provider/manager.cpp when handling closed HTTP connections of resource providers. If a resource provider resubscribes while its old HTTP connection is still open, the resource provider manager will close the old connection. The code that handles closed connections does not expect this and goes on to close the new HTTP connection as well, which results in a failed CHECK.
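
The hazard described above can be illustrated with a minimal, self-contained sketch. This is not the Mesos code and not necessarily how the issue was fixed; `Connection`, `Registry`, `subscribe` and `contains` are hypothetical names invented for illustration. The sketch assumes that each connection notifies the registry when it closes, and shows one possible guard: the close handler acts only if the closed connection is still the one registered for that provider ID, so a late close event from a replaced connection is ignored.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an HTTP connection (not Mesos' HttpConnection).
struct Connection {
  bool open = true;
  std::vector<std::function<void()>> onClosed;

  void close() {
    if (!open) return;
    open = false;
    // Run the callbacks from a local copy so that a callback may safely
    // drop the last reference to this connection.
    auto callbacks = std::move(onClosed);
    for (auto& callback : callbacks) callback();
  }
};

// Hypothetical registry of subscribed providers, keyed by provider ID.
class Registry {
public:
  // Subscribes `id` on a fresh connection and closes any previous one.
  std::shared_ptr<Connection> subscribe(const std::string& id) {
    auto connection = std::make_shared<Connection>();
    std::weak_ptr<Connection> weak = connection;  // avoids an ownership cycle

    connection->onClosed.push_back([this, id, weak]() {
      auto it = subscribed.find(id);
      // Guard: ignore the event if `id` is gone or has since resubscribed
      // on a different connection. Without this check, closing the old
      // connection would also tear down the new subscription.
      if (it == subscribed.end() || it->second != weak.lock()) return;
      subscribed.erase(it);
    });

    std::shared_ptr<Connection> old;
    auto previous = subscribed.find(id);
    if (previous != subscribed.end()) old = previous->second;

    subscribed[id] = connection;  // register the new connection first
    if (old) old->close();        // then close the stale one
    return connection;
  }

  bool contains(const std::string& id) const {
    return subscribed.count(id) > 0;
  }

private:
  std::map<std::string, std::shared_ptr<Connection>> subscribed;
};
```

In this sketch, calling `subscribe("rp")` twice closes the first connection while `contains("rp")` stays true; only closing the second connection removes the entry. A handler keyed on the provider ID alone would have removed the entry on the first close, which corresponds to the state the failed CHECK in the log detects.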

          People

            nfnt Jan Schlicht
            nfnt Jan Schlicht
            Benjamin Bannier Benjamin Bannier