Mesos / MESOS-10107

containeriser: failed to remove cgroup - EBUSY


Details

    • Patch, Important

    Description

      We've been seeing sporadic errors on our cluster where the container cgroup isn't properly destroyed after the OOM killer kicks in when the memory limit is exceeded - see the analysis and patch below:

      Agent log:

      I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
      I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: 10272MB Maximum Used: 10518532KB
      
      MEMORY STATISTICS: 
      cache 0
      rss 10502754304
      rss_huge 4001366016
      shmem 0
      mapped_file 270336
      dirty 0
      writeback 0
      swap 0
      pgpgin 1684617
      pgpgout 95480
      pgfault 1670328
      pgmajfault 957
      inactive_anon 0
      active_anon 10501189632
      inactive_file 4096
      active_file 0
      unevictable 0
      hierarchical_memory_limit 10770972672
      hierarchical_memsw_limit 10770972672
      total_cache 0
      total_rss 10502754304
      total_rss_huge 4001366016
      total_shmem 0
      total_mapped_file 270336
      total_dirty 0
      total_writeback 0
      total_swap 0
      total_pgpgin 1684617
      total_pgpgout 95480
      total_pgfault 1670328
      total_pgmajfault 957
      total_inactive_anon 0
      total_active_anon 10501070848
      total_inactive_file 4096
      total_active_file 0
      total_unevictable 0
      I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be terminated
      I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state
      I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING after 4.285078272secs
      I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
      I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c'
      I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
      I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 102.27072ms
      I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
      I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 242944ns
      I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for executor(1)@127.0.1.1:46357
      I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited
      E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework 0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all processes in the container: Failed to remove cgroup 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device or resource busy
      

      Initially I thought it was a race condition in the cgroup destruction code, but an strace confirmed that the cgroup directory was only deleted once all tasks had exited (edited and commented strace below from a different instance of the same problem):

      # get the list of processes
      3431  23:01:32.293608 openat(AT_FDCWD,
      "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs",
      O_RDONLY <unfinished ...>
      3431  23:01:32.293669 <... openat resumed> ) = 18 <0.000036>
      3431  23:01:32.294220 read(18,  <unfinished ...>
      3431  23:01:32.294268 <... read resumed> "5878\n6036\n6210\n", 8192) =
      15 <0.000033>
      3431  23:01:32.294306 read(18, "", 4096) = 0 <0.000013>
      3431  23:01:32.294346 close(18 <unfinished ...>
      3431  23:01:32.294402 <... close resumed> ) = 0 <0.000045>
      # kill them
      3431  23:01:32.296266 kill(5878, SIGKILL) = 0 <0.000019>
      3431  23:01:32.296384 kill(6036, SIGKILL <unfinished ...>
      3431  23:01:32.296474 kill(6210, SIGKILL) = 0 <0.000013>
      6210  23:01:32.299159 +++ killed by SIGKILL +++
      5878  23:01:32.327551 +++ killed by SIGKILL +++
      3431  23:01:32.327601 --- SIGCHLD {si_signo=SIGCHLD,
      si_code=CLD_KILLED, si_pid=5878, si_uid=1000, si_status=SIGKILL,
      si_utime=18, si_stime=4} ---
      6036  23:01:32.327923 +++ killed by SIGKILL +++
      # wait for them
      3429  23:01:32.338912 wait4(5878,  <unfinished ...>
      3429  23:01:32.339078 <... wait4 resumed> [{WIFSIGNALED(s) &&
      WTERMSIG(s) == SIGKILL}], WNOHANG, NULL) = 5878 <0.000157>
      3429  23:01:32.339944 wait4(6036, 0x7fcfb28ee5e0, WNOHANG, NULL) = -1
      ECHILD (No child process) <0.000005>
      3429  23:01:32.340938 wait4(6210,  <unfinished ...>
      3429  23:01:32.341165 <... wait4 resumed> 0x7fcfb28ee5e0, WNOHANG,
      NULL) = -1 ECHILD (No child process) <0.000218>
      # check cgroup empty
      3426  23:01:32.341490 openat(AT_FDCWD,
      "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs",
      O_RDONLY <unfinished ...>
      3426  23:01:32.342474 read(47,  <unfinished ...>
      3426  23:01:32.342842 <... read resumed> "", 8192) = 0 <0.000359>
      3426  23:01:32.342940 close(47 <unfinished ...>
      3426  23:01:32.344579 <... close resumed> ) = 0 <0.001631>
      # cgroup empty - attempt to delete
      3428  23:01:32.345596
      rmdir("/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb"
      <unfinished ...>
      3428  23:01:32.345679 <... rmdir resumed> ) = -1 EBUSY (Device or resource busy) <0.000068>
      

      So the cgroup destruction code correctly waits until all tasks have exited before destroying the cgroup, and the cgroup is indeed empty at that point, as shown by the read from cgroup.procs returning nothing right before the rmdir.
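      For illustration, the sequence the strace shows could be sketched as follows. This is a simplified Python sketch, not Mesos' actual C++ implementation, and it assumes a cgroup v1 freezer hierarchy where the kernel removes exited pids from cgroup.procs:

```python
import os
import signal
import time


def pids_in(procs_path):
    """Return the pids currently listed in a cgroup.procs file."""
    with open(procs_path) as f:
        return [int(line) for line in f if line.strip()]


def destroy_cgroup(cgroup_dir):
    procs = os.path.join(cgroup_dir, "cgroup.procs")

    # Kill every task in the cgroup, as the strace shows.
    for pid in pids_in(procs):
        os.kill(pid, signal.SIGKILL)

    # Wait until the kernel no longer lists any task in the cgroup.
    while pids_in(procs):
        time.sleep(0.01)

    # The cgroup is now empty, yet this rmdir can still transiently
    # fail with EBUSY because of the kernel race.
    os.rmdir(cgroup_dir)
```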

      It turns out that it's actually a kernel bug recently reported and fixed: https://lkml.org/lkml/2020/1/15/1349

      See the attached reproducer - it repeatedly starts a task consuming several GB of memory, with a memory limit set slightly below that amount in order to trigger the OOM killer.
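      A reproducer along those lines might look like the sketch below. It is hypothetical and simplified - the attached reproduce-cgroup-rmdir-race.py is the authoritative version - and it assumes a cgroup v1 memory hierarchy mounted at /sys/fs/cgroup/memory, running as root:

```python
import errno
import os

# Assumed mount point for the cgroup v1 memory controller.
CGROUP_ROOT = "/sys/fs/cgroup/memory"


def hog(nbytes):
    """Child body: commit nbytes of anonymous memory to trigger the OOM killer."""
    data = bytearray(nbytes)
    for i in range(0, nbytes, 4096):  # fault every page in
        data[i] = 1


def run_once(name, limit, alloc):
    cg = os.path.join(CGROUP_ROOT, name)
    os.mkdir(cg)
    with open(os.path.join(cg, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit))

    pid = os.fork()
    if pid == 0:
        # Move the child into the cgroup, then exceed the limit.
        with open(os.path.join(cg, "cgroup.procs"), "w") as f:
            f.write(str(os.getpid()))
        hog(alloc)
        os._exit(0)

    os.waitpid(pid, 0)  # the child should be OOM-killed
    try:
        os.rmdir(cg)  # may transiently fail even though the cgroup is empty
    except OSError as e:
        if e.errno == errno.EBUSY:
            print("reproduced: rmdir ->", e)
        raise


if __name__ == "__main__":
    for i in range(1000):
        run_once("oom-test-%d" % i, limit=256 << 20, alloc=512 << 20)
```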

      It's a kernel bug, but since it affects virtually every kernel/Mesos combination currently in use, I think it would make sense to add a workaround - simply retrying the cgroup removal after a short delay fixes it for me; see the attached patch.
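      The attached patch implements the workaround in Mesos' C++ code; as an illustration only, the retry-on-EBUSY idea looks roughly like this (hypothetical Python sketch - the retry count and delay are made-up values, not taken from the patch):

```python
import errno
import os
import time


def remove_cgroup(path, retries=5, delay=0.1):
    """Remove a cgroup directory, retrying when rmdir transiently fails
    with EBUSY (the kernel race fixed by https://lkml.org/lkml/2020/1/15/1349)."""
    for attempt in range(retries):
        try:
            os.rmdir(path)
            return
        except OSError as e:
            # Only retry the spurious EBUSY; any other error is real.
            if e.errno != errno.EBUSY or attempt == retries - 1:
                raise
            time.sleep(delay)
```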

      qianzhang abudnik

      Attachments

        1. reproduce-cgroup-rmdir-race.py
          3 kB
          Charles N
        2. mesos-remove-cgroup-race.diff
          1 kB
          Charles N


          People

            cf.natali Charles Natali
            Charle Charles N
            Andrei Budnik Andrei Budnik
            Votes: 0
            Watchers: 2
