Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-7847

Master failover, the 'slaves.recovered' struct contain unvalid slaveID when slave reregistered.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.0.0
    • None
    • master
    • os: ubuntu 14.04 mesos-1.0.0 version
      mesos master(2) + mesos agent(1)

    • Patch, Important
    • Hide
      master1 log:
      " 6698 I0802 15:35:57.276021 20539 master.:238] 20534,_shutdown]Shutting down agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 due to health check timeout
         6699 W0802 15:35:57.276119 20539 master.:5474] 20534,shutdownSlave]Shutting down agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.1 24.157:5051 (10.175.124.157) with message 'health check timed out'
         6700 I0802 15:35:57.276173 20539 master.:6641] 20534,removeSlave]Removing agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157: 5051 (10.175.124.157): health check timed out
         6701 I0802 15:35:57.276748 20542 allocat:510] 20534,removeSlave]Removed agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0
         6702 I0802 15:35:57.277498 20539 registr:468] 20534,update]Applied 1 operations in 75998ns; attempting to update the 'registry'
         6703 I0802 15:35:57.279986 20543 registr:513] 20534,_update]Successfully updated the 'registry' in 2.433082ms
         6704 I0802 15:35:57.280320 20539 master.:6759] 20534,_removeSlave]Removed agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 (10.175.124.157): health check timed out"

      master2 log:
      I0802 15:37:39.663698 20747 registr:468] 20745,update]Applied 1 operations in 34942ns; attempting to update the 'registry'
      I0802 15:37:39.667167 20747 registr:513] 20745,_update]Successfully updated the 'registry' in 3.416368ms
      I0802 15:37:39.667223 20747 registr:399] 20745,__recover]Successfully recovered registrar
      I0802 15:37:39.667393 20747 master.:1552] 20745,_recover]Recovered 0 agents from the Registry (143B) ; allowing 10mins for agents to re-register
      ...................................................................
      I0802 15:38:01.305759 20750 master.:4925] 20745,reregisterSlave]Re-registering agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051 (10.175.124.157)
      I0802 15:38:01.306411 20750 registr:468] 20745,update]Applied 1 operations in 63487ns; attempting to update the 'registry'
      I0802 15:38:01.310890 20750 registr:513] 20745,_update]Successfully updated the 'registry' in 4.41893ms
      I0802 15:38:01.312330 20749 master.:5031] 20745,_reregisterSlave]Re-registered agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051 (10.175.124.157) with cpus(*):4; mem(*):2912; disk(*):24987; ports(*):[31000-32000]; cpu_set(*):[0-3]; core(*):[1-3]
      I0802 15:38:01.312338 20752 allocat:478] 20745,addSlave]Added agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 (10.175.124.157) with cpus(*):4; mem(*):2912; disk(*):24987; ports(*):[31000-32000]; cpu_set(*):[0-3]; core(*):[1-3] (allocated: )
      I0802 15:38:01.312372 20749 master.:5099] 20745,__reregisterSlave]Sending updated checkpointed resources to agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051 (10.175.124.157)
      I0802 15:38:01.313002 20751 master.:5161] 20745,updateSlave]Received update of agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051 (10.175.124.157) with total oversubscribed resources
      I0802 15:38:01.313367 20751 allocat:542] 20745,updateSlave]Agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 (10.175.124.157) updated with oversubscribed resources (total: cpus(*):4; mem(*):2912; disk(*):24987; ports(*):[31000-32000]; cpu_set(*):[0-3]; core(*):[1-3], allocated: )
      Show
      master1 log: " 6698 I0802 15:35:57.276021 20539 master.:238] 20534,_shutdown]Shutting down agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 due to health check timeout    6699 W0802 15:35:57.276119 20539 master.:5474] 20534,shutdownSlave]Shutting down agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.1 24.157:5051 (10.175.124.157) with message 'health check timed out'    6700 I0802 15:35:57.276173 20539 master.:6641] 20534,removeSlave]Removing agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157: 5051 (10.175.124.157): health check timed out    6701 I0802 15:35:57.276748 20542 allocat:510] 20534,removeSlave]Removed agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0    6702 I0802 15:35:57.277498 20539 registr:468] 20534,update]Applied 1 operations in 75998ns; attempting to update the 'registry'    6703 I0802 15:35:57.279986 20543 registr:513] 20534,_update]Successfully updated the 'registry' in 2.433082ms    6704 I0802 15:35:57.280320 20539 master.:6759] 20534,_removeSlave]Removed agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 (10.175.124.157): health check timed out" master2 log: I0802 15:37:39.663698 20747 registr:468] 20745,update]Applied 1 operations in 34942ns; attempting to update the 'registry' I0802 15:37:39.667167 20747 registr:513] 20745,_update]Successfully updated the 'registry' in 3.416368ms I0802 15:37:39.667223 20747 registr:399] 20745,__recover]Successfully recovered registrar I0802 15:37:39.667393 20747 master.:1552] 20745,_recover]Recovered 0 agents from the Registry (143B) ; allowing 10mins for agents to re-register ................................................................... I0802 15:38:01.305759 20750 master.:4925] 20745,reregisterSlave]Re-registering agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051 (10.175.124.157) I0802 15:38:01.306411 20750 registr:468] 20745,update]Applied 1 operations in 63487ns; attempting to update the 'registry' I0802 15:38:01.310890 20750 registr:513] 20745,_update]Successfully updated the 'registry' in 4.41893ms I0802 15:38:01.312330 20749 master.:5031] 20745,_reregisterSlave]Re-registered agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051 (10.175.124.157) with cpus(*):4; mem(*):2912; disk(*):24987; ports(*):[31000-32000]; cpu_set(*):[0-3]; core(*):[1-3] I0802 15:38:01.312338 20752 allocat:478] 20745,addSlave]Added agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 (10.175.124.157) with cpus(*):4; mem(*):2912; disk(*):24987; ports(*):[31000-32000]; cpu_set(*):[0-3]; core(*):[1-3] (allocated: ) I0802 15:38:01.312372 20749 master.:5099] 20745,__reregisterSlave]Sending updated checkpointed resources to agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051 (10.175.124.157) I0802 15:38:01.313002 20751 master.:5161] 20745,updateSlave]Received update of agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051 (10.175.124.157) with total oversubscribed resources I0802 15:38:01.313367 20751 allocat:542] 20745,updateSlave]Agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 (10.175.124.157) updated with oversubscribed resources (total: cpus(*):4; mem(*):2912; disk(*):24987; ports(*):[31000-32000]; cpu_set(*):[0-3]; core(*):[1-3], allocated: )

    Description

      we run two mesos-masters and one mesos-slave, in order to test mesos HA. We do as follow steps:
      1. startup master1(leader), master2(follower) and agent, the agent successful registerd to master1;
      2. shutdown agent, after 75s(the master and agent ping-pong is 15s*5) shutdown master1, as a result master1 remove agent from 'registry' and send 'ShutdownMessage' to agent, but agent process have terminated, so it can't receive this message;
      3. master1 is terminated and master2 is leader now, it recovered from registry with 0 agent, meanwhile the agent is restarted, the agent 'reregisteded' to master2 is success.
      So the issue is when master recovered from registry with 0 agent, the 'slaves.recovered' strcut don't contain this Slave, it should not admit reregistered success, maybe should send 'ShutdownMessage' to agent.

      Attachments

        Activity

          People

            Unassigned Unassigned
            AndyPang AndyPang
            Jie Yu Jie Yu
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:

              Time Tracking

                Estimated:
                Original Estimate - 24h
                24h
                Remaining:
                Remaining Estimate - 24h
                24h
                Logged:
                Time Spent - Not Specified
                Not Specified