Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 1.7.3, 1.8.2, 1.9.1
- None
- None
Description
Hello,
Looking at the documentation of the master /api/v1 endpoint, the SUBSCRIBE message is described as supporting only the TASK_ADDED and TASK_UPDATED events, yet when a new agent joins the cluster an AGENT_ADDED event is received.
The problem is that when this agent is stopped, the AGENT_REMOVED event is not received by clients subscribed to the master API.
I tested this behavior with versions 1.7.3, 1.8.2 and 1.9.1, all using the Docker image mesos/mesos-centos.
The only time I saw an AGENT_REMOVED event was when a new agent joined the cluster but the master couldn't communicate with it. In that specific test a firewall was blocking port 5051 on the slave, so nobody was able to talk to the slave on port 5051.
Here are the steps to reproduce the problem:
- Start a new Mesos master;
- Connect to the /api/v1 endpoint, sending a SUBSCRIBE message:
curl --no-buffer -Ld '{"type": "SUBSCRIBE"}' -H "Content-Type: application/json" http://MASTER_IP:5050/api/v1
- Start a new slave and confirm the AGENT_ADDED event is delivered;
- Stop this slave;
- Check that /slaves?slave_id=AGENT_ID returns a JSON response with the field active=false;
- Wait for the Mesos master to stop listing this slave, that is, until /slaves?slave_id=AGENT_ID returns an empty response.
Even after the empty response, the event never reaches the subscriber.
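For anyone who wants to script the subscriber side of these steps instead of watching curl output, here is a minimal Python sketch. It assumes the event stream is RecordIO-framed (each record is its length in bytes, a newline, then the JSON payload) and that the URL points directly at the leading master. The names decode_recordio, subscribe and wait_for are hypothetical helpers written for this illustration and are not part of Mesos.

```python
import json
import urllib.request


def decode_recordio(chunks):
    """Yield decoded JSON events from an iterable of byte chunks.

    Assumes each record is framed as b"<length>\n<payload>", where <length>
    is the payload size in bytes written as ASCII digits.
    """
    buf = b""
    for chunk in chunks:
        buf += chunk
        while True:
            newline = buf.find(b"\n")
            if newline < 0:
                break  # length prefix not complete yet
            length = int(buf[:newline])
            start = newline + 1
            if len(buf) < start + length:
                break  # payload not complete yet
            yield json.loads(buf[start:start + length])
            buf = buf[start + length:]


def subscribe(master_url):
    """POST a SUBSCRIBE call and yield events from the streaming response.

    Assumption: master_url is the leading master, since urllib does not
    re-send a POST body when it receives a redirect.
    """
    request = urllib.request.Request(
        master_url + "/api/v1",
        data=json.dumps({"type": "SUBSCRIBE"}).encode(),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # read1() returns whatever bytes are available, b"" at end of stream.
        yield from decode_recordio(iter(lambda: response.read1(4096), b""))


def wait_for(event_type, events):
    """Return the first event of the given type, or None if the stream ends."""
    for event in events:
        if event.get("type") == event_type:
            return event
    return None
```

Calling wait_for("AGENT_REMOVED", subscribe("http://MASTER_IP:5050")) after stopping the slave should return once the event arrives. With the behavior described in this report it would simply keep waiting.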
The Mesos master logs show this:
I1213 15:03:10.338935 13 master.cpp:1297] Agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964) disconnected
I1213 15:03:10.339089 13 master.cpp:3399] Disconnecting agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964)
I1213 15:03:10.339207 13 master.cpp:3418] Deactivating agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964)
And then:
W1213 15:04:40.726670 15 process.cpp:1917] Failed to send 'mesos.internal.PingSlaveMessage' to '172.18.0.51:5051', connect: Failed to connect to 172.18.0.51:5051: No route to host
And some time after this:
I1213 15:04:37.685007 7 hierarchical.cpp:900] Removed agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1
Even after this removal, the AGENT_REMOVED event is not delivered.
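To line up the master's view of the agent with what the subscriber received, the relevant log entries can be pulled out of the full master log with a short script. This is only a sketch: it assumes one entry per line in the layout shown in the excerpts above (level and date, time, thread id, source file and line, then the message), and agent_timeline is a hypothetical helper name.

```python
import re

# Matches entries such as:
# I1213 15:04:37.685007 7 hierarchical.cpp:900] Removed agent <agent id>
LOG_LINE = re.compile(
    r"(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[\w.]+:\d+)\] (?P<message>.*)"
)


def agent_timeline(log_text, agent_id):
    """Return (time, source, message) for every entry that names agent_id."""
    timeline = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and agent_id in match["message"]:
            timeline.append((match["time"], match["source"], match["message"]))
    return timeline
```

Entries that do not mention the agent ID, such as the failed PingSlaveMessage warning, are skipped, so they still need to be checked by hand.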
I will also attach the full master logs.
Do you think this could be a bug?
Attachments
Issue Links
- is fixed by
- MESOS-10089 AGENT_REMOVED event not sent when agents marked GONE
- Open