Mesos / MESOS-10068

Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.3, 1.8.2, 1.9.1
    • Fix Version/s: None
    • Component/s: master
    • Labels: None

    Description

      Hello,

       

      Looking at the documentation of the master /api/v1 endpoint, the SUBSCRIBE message says that only TASK_ADDED and TASK_UPDATED are supported for this endpoint, but when a new agent joins the cluster an AGENT_ADDED event is received.

      The problem is that when this agent is stopped, the AGENT_REMOVED event is not received by clients subscribed to the master API.

       

      I tested this behavior with versions 1.7.3, 1.8.2 and 1.9.1, all using the docker image mesos/mesos-centos.

      The only time I saw an AGENT_REMOVED event was when a new agent joined the cluster but the master couldn't communicate with it. In that specific test a firewall was blocking port 5051 on the slave, so nobody was able to talk to the slave on port 5051.

       

      Here are the steps to reproduce the problem:

      • Start a new Mesos master;
      • Connect to the /api/v1 endpoint, sending a SUBSCRIBE message:
        • curl --no-buffer -Ld '{"type": "SUBSCRIBE"}' -H "Content-Type: application/json" http://MASTER_IP:5050/api/v1
      • Start a new slave and confirm the AGENT_ADDED event is delivered;
      • Stop this slave;
      • Check that /slaves?slave_id=AGENT_ID returns a JSON response with the field active=false;
      • Wait for the Mesos master to stop listing this slave, that is, until /slaves?slave_id=AGENT_ID returns an empty response.

      Even after the empty response, the event never reaches the subscriber.
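
      For anyone who wants to watch the subscription programmatically instead of reading the raw curl output, here is a minimal Python sketch. It assumes the stream is RecordIO framed (each record preceded by its length in bytes and a newline) and that the events carry the agent ID at the JSON paths used below; those paths are my reading of the v1 master API and should be checked against your Mesos version. The names decode_recordio and agent_events are made up for this example.

      ```python
      import json
      from typing import Iterable, Iterator, Tuple


      def decode_recordio(chunks: Iterable[bytes]) -> Iterator[dict]:
          """Yield JSON events from a RecordIO byte stream.

          Assumed framing: "<length in bytes>\n<record>". Chunk boundaries
          may fall anywhere, so the input is buffered until a record is whole.
          """
          buffer = b""
          for chunk in chunks:
              buffer += chunk
              while True:
                  newline = buffer.find(b"\n")
                  if newline == -1:
                      break  # length prefix is not complete yet
                  length = int(buffer[:newline])
                  start = newline + 1
                  end = start + length
                  if len(buffer) < end:
                      break  # record body is not complete yet
                  yield json.loads(buffer[start:end])
                  buffer = buffer[end:]


      def agent_events(events: Iterable[dict]) -> Iterator[Tuple[str, str]]:
          """Keep only agent lifecycle events as (type, agent_id) pairs.

          The field paths below are assumptions about the event layout.
          """
          for event in events:
              kind = event.get("type")
              if kind == "AGENT_ADDED":
                  info = event["agent_added"]["agent"]["agent_info"]
                  yield kind, info["id"]["value"]
              elif kind == "AGENT_REMOVED":
                  yield kind, event["agent_removed"]["agent_id"]["value"]
      ```

      With the third-party requests library, the chunks could come from requests.post(url, json={"type": "SUBSCRIBE"}, stream=True).iter_content(chunk_size=None). In the scenario described here, such a loop would be expected to print the AGENT_ADDED pair and then nothing for the stopped slave.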

       

      The Mesos master logs show this:

      I1213 15:03:10.338935    13 master.cpp:1297] Agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964) disconnected
      I1213 15:03:10.339089    13 master.cpp:3399] Disconnecting agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964)
      I1213 15:03:10.339207    13 master.cpp:3418] Deactivating agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964)
      

      And then:

      W1213 15:04:40.726670    15 process.cpp:1917] Failed to send 'mesos.internal.PingSlaveMessage' to '172.18.0.51:5051', connect: Failed to connect to 172.18.0.51:5051: No route to host

      And some time after this:

      I1213 15:04:37.685007     7 hierarchical.cpp:900] Removed agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1   

       

      Even after this removal, the AGENT_REMOVED event is not delivered.
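
      To line these log entries up against the events seen on the subscription, a small parser can pull the agent lifecycle steps out of the master log. This is only a sketch: the regular expressions are derived from the few lines quoted in this report, not from a full survey of Mesos log messages, and the names AgentLogEvent and parse_agent_log are invented for the example.

      ```python
      import re
      from typing import Iterable, Iterator, NamedTuple


      class AgentLogEvent(NamedTuple):
          time: str      # wall-clock time from the log prefix
          source: str    # file:line that emitted the message
          action: str    # disconnected / disconnecting / deactivating / removed
          agent_id: str


      # Log prefix as seen above: severity letter, MMDD, time, thread id, file:line]
      _PREFIX = re.compile(
          r"^[IWEF]\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ "
          r"(?P<source>[\w.]+:\d+)\] (?P<message>.*)$"
      )

      # Message patterns taken only from the lines quoted in this report.
      _ACTIONS = [
          ("disconnected", re.compile(r"^Agent (?P<id>\S+) at .* disconnected$")),
          ("disconnecting", re.compile(r"^Disconnecting agent (?P<id>\S+)")),
          ("deactivating", re.compile(r"^Deactivating agent (?P<id>\S+)")),
          ("removed", re.compile(r"^Removed agent (?P<id>\S+)")),
      ]


      def parse_agent_log(lines: Iterable[str]) -> Iterator[AgentLogEvent]:
          """Yield one AgentLogEvent per recognised agent lifecycle line."""
          for raw in lines:
              prefix = _PREFIX.match(raw.strip())
              if prefix is None:
                  continue  # not a log line in the expected format
              message = prefix.group("message")
              for action, pattern in _ACTIONS:
                  found = pattern.match(message)
                  if found:
                      yield AgentLogEvent(
                          prefix.group("time"),
                          prefix.group("source"),
                          action,
                          found.group("id"),
                      )
                      break
      ```

      Running it over the attached log and comparing the "removed" entries with the AGENT_REMOVED events received by a subscriber should show which removals were never announced.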

       

      I will attach the full master logs also.

       

      Do you think this could be a bug?

      Attachments

        1. master-full-logs.log
          69 kB
          Dalton Matos Coelho Barreto

            People

              Assignee: Unassigned
              Reporter: Dalton Matos Coelho Barreto (daltonmatos)
              Votes: 0
              Watchers: 4