Mesos / MESOS-9162

Unkillable pod container stuck in ISOLATING


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.6.0, 1.7.0
    • Fix Version/s: None
    • Component/s: containerization
    • Sprint: Mesosphere Sprint 2018-27
    • Story Points: 5

    Description

      We have a simple test that launches a pod with two containers (one writes to a file and the other reads it). The test is flaky because one of the containers sometimes fails to start.
      Marathon app definition:

      {
        "id": "/simple-pod",
        "scaling": {
          "kind": "fixed",
          "instances": 1
        },
        "environment": {
          "PING": "PONG"
        },
        "containers": [
          {
            "name": "ct1",
            "resources": {
              "cpus": 0.1,
              "mem": 32
            },
            "image": {
              "kind": "DOCKER",
              "id": "busybox"
            },
            "exec": {
              "command": {
                "shell": "while true; do echo the current time is $(date) > ./test-v1/clock; sleep 1; done"
              }
            },
            "volumeMounts": [
              {
                "name": "v1",
                "mountPath": "test-v1"
              }
            ]
          },
          {
            "name": "ct2",
            "resources": {
              "cpus": 0.1,
              "mem": 32
            },
            "exec": {
              "command": {
                "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep 1; done"
              }
            },
            "volumeMounts": [
              {
                "name": "v1",
                "mountPath": "etc"
              },
              {
                "name": "v2",
                "mountPath": "docker"
              }
            ]
          }
        ],
        "networks": [
          {
            "mode": "host"
          }
        ],
        "volumes": [
          {
            "name": "v1"
          },
          {
            "name": "v2",
            "host": "/var/lib/docker"
          }
        ]
      }
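
      For reference, a minimal sketch of how such a pod definition is deployed, assuming Marathon on its default port 8080 (the host name below is hypothetical):

      # Post the pod definition (saved as simple-pod.json) to Marathon's pods endpoint.
      curl -X POST -H 'Content-Type: application/json' \
        http://marathon.mesos:8080/v2/pods -d @simple-pod.json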
      

      During the test, Marathon tries to launch the pod but never receives a TASK_RUNNING for the first container, so after 2 minutes it decides to kill the pod, which also fails.

      The agent sandbox (attached to this ticket minus the Docker layers, since they're too big to attach) shows that one of the containers was never started properly - the last line for it in the agent log says:

      Transitioning the state of container ff4f4fdc-9327-42fb-be40-29e919e15aee.e9b05652-e779-46f8-9b76-b2e1ce7e5940 from PREPARING to ISOLATING
      

      Up to that point the log looks quite unspectacular.
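
      The stuck container's state can also be inspected directly on the agent. A minimal sketch, assuming a Mesos version where /containers supports the show_nested parameter (1.6+), the default agent port 5051, and jq being available:

      # List containers, including nested pod containers, and dump their state;
      # the stuck container should show up with its ID from the log line above.
      curl -s 'http://10.10.0.222:5051/containers?show_nested=true' | jq .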

      Afterwards, Marathon repeatedly tries to kill the container but doesn't succeed - the executor receives the kill requests but never sends anything back:

      I0816 22:52:53.111995     4 default_executor.cpp:204] Received SUBSCRIBED event
      I0816 22:52:53.112520     4 default_executor.cpp:208] Subscribed executor on 10.10.0.222
      I0816 22:52:53.112783     4 default_executor.cpp:204] Received LAUNCH_GROUP event
      I0816 22:52:53.116516    11 default_executor.cpp:428] Setting 'MESOS_CONTAINER_IP' to: 10.10.0.222
      I0816 22:52:53.169596     4 default_executor.cpp:204] Received ACKNOWLEDGED event
      I0816 22:52:53.194416    10 default_executor.cpp:204] Received ACKNOWLEDGED event
      I0816 22:54:50.559470     8 default_executor.cpp:204] Received KILL event
      I0816 22:54:50.559496     8 default_executor.cpp:1251] Received kill for task 'simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct1'
      I0816 22:54:50.559737     4 default_executor.cpp:204] Received KILL event
      I0816 22:54:50.559751     4 default_executor.cpp:1251] Received kill for task 'simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct2'
      ...
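
      For completeness, one could also try to kill the stuck nested container directly via the agent's v1 operator API (KILL_NESTED_CONTAINER). A sketch, assuming the default agent port 5051; presumably this hangs for the same reason the executor-initiated kill does:

      # Ask the agent to kill the nested container; the parent and child
      # container IDs are taken from the ISOLATING log line above.
      curl -X POST http://10.10.0.222:5051/api/v1 \
        -H 'Content-Type: application/json' \
        -d '{
              "type": "KILL_NESTED_CONTAINER",
              "kill_nested_container": {
                "container_id": {
                  "parent": {"value": "ff4f4fdc-9327-42fb-be40-29e919e15aee"},
                  "value": "e9b05652-e779-46f8-9b76-b2e1ce7e5940"
                }
              }
            }'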
      

      Relevant IDs for grepping the logs (see the grep sketch after the list):

      Marathon app id: /simple-pod-bcc8f180b611494aa972520b8b650ca9
      Mesos task ID: simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct1
      Mesos container id: e9b05652-e779-46f8-9b76-b2e1ce7e5940
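
      A minimal sketch of pulling the relevant lines out of the attached logs with these IDs (file names as in the attachment list below):

      # Grep the attached agent/master/Marathon logs for the task and container IDs.
      TASK='simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2'
      CONTAINER='e9b05652-e779-46f8-9b76-b2e1ce7e5940'
      zgrep -e "$TASK" -e "$CONTAINER" dcos-mesos-slave.service.gz dcos-mesos-master.service.gz
      grep -e "$TASK" dcos-marathon.service.log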
      

      Attachments

        1. dcos-marathon.service.log (3.39 MB, A. Dukhovniy)
        2. dcos-mesos-master.service.gz (667 kB, A. Dukhovniy)
        3. dcos-mesos-slave.service.gz (522 kB, A. Dukhovniy)
        4. diagnostics.zip (7.56 MB, A. Dukhovniy)
        5. sandbox_10_10_0_222_var_lib.tar.gz (332 kB, A. Dukhovniy)


            People

              Assignee: Gilbert Song
              Reporter: A. Dukhovniy
              Shepherd: Qian Zhang
              Votes: 0
              Watchers: 3
