Mesos / MESOS-10107

containeriser: failed to remove cgroup - EBUSY


Details

    • Patch, Important

    Description

      We've been seeing sporadic errors on our cluster where the container cgroup isn't properly destroyed after the OOM killer kicks in when the memory limit is exceeded - see the analysis and patch below:

      Agent log:

      I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
      I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: 10272MB Maximum Used: 10518532KB
      
      MEMORY STATISTICS: 
      cache 0
      rss 10502754304
      rss_huge 4001366016
      shmem 0
      mapped_file 270336
      dirty 0
      writeback 0
      swap 0
      pgpgin 1684617
      pgpgout 95480
      pgfault 1670328
      pgmajfault 957
      inactive_anon 0
      active_anon 10501189632
      inactive_file 4096
      active_file 0
      unevictable 0
      hierarchical_memory_limit 10770972672
      hierarchical_memsw_limit 10770972672
      total_cache 0
      total_rss 10502754304
      total_rss_huge 4001366016
      total_shmem 0
      total_mapped_file 270336
      total_dirty 0
      total_writeback 0
      total_swap 0
      total_pgpgin 1684617
      total_pgpgout 95480
      total_pgfault 1670328
      total_pgmajfault 957
      total_inactive_anon 0
      total_active_anon 10501070848
      total_inactive_file 4096
      total_active_file 0
      total_unevictable 0
      I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be terminated
      I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state
      I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING after 4.285078272secs
      I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
      I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c'
      I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
      I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 102.27072ms
      I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
      I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 242944ns
      I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for executor(1)@127.0.1.1:46357
      I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited
      E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework 0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all processes in the container: Failed to remove cgroup 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device or resource busy
      

      Initially I thought it was a race condition in the cgroup destruction code, but an strace confirmed that the cgroup directory was only deleted once all tasks had exited (edited and commented strace below from a different instance of the same problem):

      # get the list of processes
      3431  23:01:32.293608 openat(AT_FDCWD,
      "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs",
      O_RDONLY <unfinished ...>
      3431  23:01:32.293669 <... openat resumed> ) = 18 <0.000036>
      3431  23:01:32.294220 read(18,  <unfinished ...>
      3431  23:01:32.294268 <... read resumed> "5878\n6036\n6210\n", 8192) =
      15 <0.000033>
      3431  23:01:32.294306 read(18, "", 4096) = 0 <0.000013>
      3431  23:01:32.294346 close(18 <unfinished ...>
      3431  23:01:32.294402 <... close resumed> ) = 0 <0.000045>
      # kill them
      3431  23:01:32.296266 kill(5878, SIGKILL) = 0 <0.000019>
      3431  23:01:32.296384 kill(6036, SIGKILL <unfinished ...>
      3431  23:01:32.296474 kill(6210, SIGKILL) = 0 <0.000013>
      6210  23:01:32.299159 +++ killed by SIGKILL +++
      5878  23:01:32.327551 +++ killed by SIGKILL +++
      3431  23:01:32.327601 --- SIGCHLD {si_signo=SIGCHLD,
      si_code=CLD_KILLED, si_pid=5878, si_uid=1000, si_status=SIGKILL,
      si_utime=18, si_stime=4} ---
      6036  23:01:32.327923 +++ killed by SIGKILL +++
      # wait for them
      3429  23:01:32.338912 wait4(5878,  <unfinished ...>
      3429  23:01:32.339078 <... wait4 resumed> [{WIFSIGNALED(s) &&
      WTERMSIG(s) == SIGKILL}], WNOHANG, NULL) = 5878 <0.000157>
      3429  23:01:32.339944 wait4(6036, 0x7fcfb28ee5e0, WNOHANG, NULL) = -1
      ECHILD (No child process) <0.000005>
      3429  23:01:32.340938 wait4(6210,  <unfinished ...>
      3429  23:01:32.341165 <... wait4 resumed> 0x7fcfb28ee5e0, WNOHANG,
      NULL) = -1 ECHILD (No child process) <0.000218>
      # check cgroup empty
      3426  23:01:32.341490 openat(AT_FDCWD,
      "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs",
      O_RDONLY <unfinished ...>
      3426  23:01:32.342474 read(47,  <unfinished ...>
      3426  23:01:32.342842 <... read resumed> "", 8192) = 0 <0.000359>
      3426  23:01:32.342940 close(47 <unfinished ...>
      3426  23:01:32.344579 <... close resumed> ) = 0 <0.001631>
      # cgroup empty - attempt to delete
      3428  23:01:32.345596
      rmdir("/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb"
      <unfinished ...>
      3428  23:01:32.345679 <... rmdir resumed> ) = -1 EBUSY (Device or resource busy) <0.000068>
      

      So the cgroup destruction code correctly waits until all tasks have exited before destroying the cgroup, and the cgroup is indeed empty at that point, as shown by the read from cgroup.procs returning nothing right before the rmdir.
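      For illustration, the sequence the strace shows could be sketched as follows. This is a simplified Python sketch, not Mesos' actual C++ implementation, and it assumes a cgroup v1 freezer hierarchy where the kernel removes exited pids from cgroup.procs:

```python
import os
import signal
import time


def pids_in(procs_path):
    """Return the pids currently listed in a cgroup.procs file."""
    with open(procs_path) as f:
        return [int(line) for line in f if line.strip()]


def destroy_cgroup(cgroup_dir):
    procs = os.path.join(cgroup_dir, "cgroup.procs")

    # Kill every task in the cgroup, as the strace shows.
    for pid in pids_in(procs):
        os.kill(pid, signal.SIGKILL)

    # Wait until the kernel no longer lists any task in the cgroup.
    while pids_in(procs):
        time.sleep(0.01)

    # The cgroup is now empty, yet this rmdir can still transiently
    # fail with EBUSY because of the kernel race.
    os.rmdir(cgroup_dir)
```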

      It turns out that it's actually a kernel bug recently reported and fixed: https://lkml.org/lkml/2020/1/15/1349

      See the attached reproducer - it repeatedly starts a task consuming several GB of memory, with a memory limit set slightly below that amount in order to trigger the OOM killer.
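      A reproducer along those lines might look like the sketch below. It is hypothetical and simplified - the attached reproduce-cgroup-rmdir-race.py is the authoritative version - and it assumes a cgroup v1 memory hierarchy mounted at /sys/fs/cgroup/memory, running as root:

```python
import errno
import os

# Assumed mount point for the cgroup v1 memory controller.
CGROUP_ROOT = "/sys/fs/cgroup/memory"


def hog(nbytes):
    """Child body: commit nbytes of anonymous memory to trigger the OOM killer."""
    data = bytearray(nbytes)
    for i in range(0, nbytes, 4096):  # fault every page in
        data[i] = 1


def run_once(name, limit, alloc):
    cg = os.path.join(CGROUP_ROOT, name)
    os.mkdir(cg)
    with open(os.path.join(cg, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit))

    pid = os.fork()
    if pid == 0:
        # Move the child into the cgroup, then exceed the limit.
        with open(os.path.join(cg, "cgroup.procs"), "w") as f:
            f.write(str(os.getpid()))
        hog(alloc)
        os._exit(0)

    os.waitpid(pid, 0)  # the child should be OOM-killed
    try:
        os.rmdir(cg)  # may transiently fail even though the cgroup is empty
    except OSError as e:
        if e.errno == errno.EBUSY:
            print("reproduced: rmdir ->", e)
        raise


if __name__ == "__main__":
    for i in range(1000):
        run_once("oom-test-%d" % i, limit=256 << 20, alloc=512 << 20)
```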

      It's a kernel bug, but since it affects virtually every kernel/Mesos combination currently in use, I think it would make sense to add a workaround - simply retrying the cgroup removal after a short delay fixes it for me; see the attached patch.
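      The attached patch implements the workaround in Mesos' C++ code; as an illustration only, the retry-on-EBUSY idea looks roughly like this (hypothetical Python sketch - the retry count and delay are made-up values, not taken from the patch):

```python
import errno
import os
import time


def remove_cgroup(path, retries=5, delay=0.1):
    """Remove a cgroup directory, retrying when rmdir transiently fails
    with EBUSY (the kernel race fixed by https://lkml.org/lkml/2020/1/15/1349)."""
    for attempt in range(retries):
        try:
            os.rmdir(path)
            return
        except OSError as e:
            # Only retry the spurious EBUSY; any other error is real.
            if e.errno != errno.EBUSY or attempt == retries - 1:
                raise
            time.sleep(delay)
```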

      qianzhang abudnik

      Attachments

        1. reproduce-cgroup-rmdir-race.py
          3 kB
          Charles N
        2. mesos-remove-cgroup-race.diff
          1 kB
          Charles N


          People

            cf.natali Charles Natali
            Charle Charles N
            Andrei Budnik Andrei Budnik
            Votes: 0
            Watchers: 2
