MESOS-9960

Agent with cgroup support may destroy containers belonging to unrelated agents on startup


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.8.1, 1.9.0, master
    • Fix Version/s: None
    • Component/s: containerization
    • Labels: None

    Description

      Let's say I have a mesos cluster with one master and one agent:

      $ mesos-master --work_dir=/tmp/mesos-master
      $ sudo mesos-agent --work_dir=/tmp/mesos-agent --master=127.0.1.1:5050 --port=5052 --isolation=docker/runtime
      

      where I'm running a simple sleep task:

      $ mesos-execute --command="sleep 10000" --master=127.0.1.1:5050 --name="sleep"
      I0904 18:40:25.020413 18321 scheduler.cpp:189] Version: 1.8.0
      I0904 18:40:25.020892 18319 scheduler.cpp:342] Using default 'basic' HTTP authenticatee
      I0904 18:40:25.021039 18323 scheduler.cpp:525] New master detected at master@127.0.1.1:5050
      Subscribed with ID 7d9f5030-cadd-49df-bf1e-daa97a4baab6-0000
      Submitted task 'sleep' to agent 'd59e934c-9e26-490d-9f4a-1e8b4ce06b4e-S1'
      Received status update TASK_STARTING for task 'sleep'
        source: SOURCE_EXECUTOR
      Received status update TASK_RUNNING for task 'sleep'
        source: SOURCE_EXECUTOR
      

      Next, I start a second agent on the same host as the first one:

      $ sudo ./src/mesos-agent --work_dir=/tmp/yyyy --master=example.org:5050 --isolation="linux/seccomp" --seccomp_config_dir=`pwd`/3rdparty/libseccomp-2.3.3
      

      During startup, this agent detects the container belonging to the other, unrelated agent and attempts to clean it up:

      I0904 18:30:44.906430 18067 task_status_update_manager.cpp:207] Recovering task status update manager
      I0904 18:30:44.906913 18071 containerizer.cpp:797] Recovering Mesos containers
      I0904 18:30:44.910077 18070 linux_launcher.cpp:286] Recovering Linux launcher
      I0904 18:30:44.910347 18070 linux_launcher.cpp:343] Recovered container 7f455ed7-6593-41e8-9b29-52ee84d7675b
      I0904 18:30:44.910409 18070 linux_launcher.cpp:437] 7f455ed7-6593-41e8-9b29-52ee84d7675b is a known orphaned container
      I0904 18:30:44.910877 18065 containerizer.cpp:1123] Recovering isolators
      I0904 18:30:44.911888 18064 containerizer.cpp:1162] Recovering provisioner
      I0904 18:30:44.913368 18068 provisioner.cpp:498] Provisioner recovery complete
      I0904 18:30:44.913630 18065 containerizer.cpp:1234] Cleaning up orphan container 7f455ed7-6593-41e8-9b29-52ee84d7675b
      I0904 18:30:44.913656 18065 containerizer.cpp:2576] Destroying container 7f455ed7-6593-41e8-9b29-52ee84d7675b in RUNNING state
      I0904 18:30:44.913666 18065 containerizer.cpp:3278] Transitioning the state of container 7f455ed7-6593-41e8-9b29-52ee84d7675b from RUNNING to DESTROYING
      I0904 18:30:44.914687 18064 linux_launcher.cpp:576] Asked to destroy container 7f455ed7-6593-41e8-9b29-52ee84d7675b
      I0904 18:30:44.914788 18064 linux_launcher.cpp:618] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/7f455ed7-6593-41e8-9b29-52ee84d7675b'
      

      killing the sleep task in the process:

      Received status update TASK_FAILED for task 'sleep'
        message: 'Executor terminated'
        source: SOURCE_AGENT
        reason: REASON_EXECUTOR_TERMINATED
      
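      The root cause appears to be that both agents scan the same cgroup hierarchy: with the agent's default `--cgroups_root` (which is `mesos`), every agent on the host treats `/sys/fs/cgroup/freezer/mesos` as its own, so the container cgroup created by the first agent shows up as an unknown, "orphaned" container during the second agent's recovery and is destroyed. A quick inspection on the host makes the collision visible (the listing below is illustrative, not captured output):

      # Illustrative only: both agents share the default cgroup root "mesos".
      $ ls /sys/fs/cgroup/freezer/mesos/
      7f455ed7-6593-41e8-9b29-52ee84d7675b  cgroup.clone_children  cgroup.procs
      freezer.parent_freezing  freezer.self_freezing  freezer.state  notify_on_release  tasks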

      After some additional testing, it seems that the value of the `--isolation` flag is actually irrelevant: the same behaviour can be observed as long as cgroup support is enabled via `--systemd_enable_support`.
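
      A possible way to run multiple agents on one host without this collision might be to give each agent its own cgroup root via the agent's `--cgroups_root` flag (default `mesos`); this is an untested sketch based on the observation above, not a verified workaround:

      # Untested sketch: point the second agent at a separate cgroup root.
      $ sudo ./src/mesos-agent --work_dir=/tmp/yyyy --master=example.org:5050 --isolation="linux/seccomp" --seccomp_config_dir=`pwd`/3rdparty/libseccomp-2.3.3 --cgroups_root=mesos_second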

    People

        Assignee: Unassigned
        Reporter: Benno Evers (bennoe)