Mesos / MESOS-8830

Agent gc on old slave sandboxes could empty persistent volume data


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.3.1, 1.4.1, 1.5.0, 1.6.0
    • Fix Version/s: 1.4.2, 1.5.2, 1.6.1, 1.7.0
    • Component/s: None
    • Labels: None

    Description

      We had an issue in which custom Cassandra executors (which do not use any container image and thus run on the host filesystem) had their persistent volume data wiped out.

      Upon revisiting the logs, we found the following suspicious lines:

      I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max allowed age: 4.764742265646493days
      I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining removal time 2.23508429704593days
      I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining removal time 2.23508429587852days
      I0424 02:06:11.717183 10994 gc.cpp:133] Deleting /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
      I0424 02:06:11.727033 10994 gc.cpp:146] Deleted '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
      I0424 02:06:11.727094 10994 gc.cpp:133] Deleting /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
      I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
      E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume: Device or resource busy
      E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149: Directory not empty
      E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs: Directory not empty
      E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a: Directory not empty
      E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors: Directory not empty
      E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004: Directory not empty
      E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks: Directory not empty
      E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44: Directory not empty
      

      (I can try to provide more logs, depending on how much the local archive retains after rotation.)

      This happened on a 1.3.1 agent, although I suspect the issue is not specific to that version.

      The path /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume is a bind mount to a persistent volume. The fact that agent gc touched that path makes me believe this is what triggered the data loss.
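
      To make the failure mode concrete: a naive recursive removal does not distinguish a bind mount from an ordinary directory, so it descends into the mounted volume directory and unlinks the persistent volume's files; only the final rmdir(2) on the mount point itself fails with EBUSY ("Device or resource busy"), which matches the log above even though the data is already gone. The following is a minimal, standalone sketch of that behavior (it is not the Mesos rmdir.hpp implementation; remove_tree is a hypothetical helper):

      // Hypothetical sketch (not Mesos source): a naive recursive removal
      // that does not check whether a directory is a mount point.
      #include <dirent.h>
      #include <sys/stat.h>
      #include <unistd.h>

      #include <cerrno>
      #include <cstdio>
      #include <cstring>
      #include <string>

      static int remove_tree(const std::string& path)
      {
        struct stat s;
        if (lstat(path.c_str(), &s) != 0) {
          return -1;
        }

        if (!S_ISDIR(s.st_mode)) {
          return unlink(path.c_str());  // Regular file or symlink.
        }

        DIR* dir = opendir(path.c_str());
        if (dir == nullptr) {
          return -1;
        }

        struct dirent* entry;
        while ((entry = readdir(dir)) != nullptr) {
          std::string name = entry->d_name;
          if (name == "." || name == "..") {
            continue;
          }
          // NOTE: no check for whether this child is a mount point, so the
          // traversal recurses *into* a bind-mounted persistent volume and
          // unlinks its contents. Errors are ignored and traversal continues.
          remove_tree(path + "/" + name);
        }
        closedir(dir);

        // For a bind mount this rmdir(2) fails with EBUSY, but by now the
        // volume's files have already been deleted.
        if (rmdir(path.c_str()) != 0) {
          std::fprintf(stderr, "Failed to delete directory %s: %s\n",
                       path.c_str(), std::strerror(errno));
          return -1;
        }
        return 0;
      }

      int main(int argc, char** argv)
      {
        return argc > 1 ? remove_tree(argv[1]) : 0;
      }

      Because errors are ignored while recursing, every parent directory later fails with "Directory not empty", which is the same cascade of errors seen in the log.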

      We had some misconfigurations on our fleet, and I do not yet know whether the previous slave (ID 70279b87-553a-4213-a85b-46fdc191849d-S4) was shut down cleanly.

      My questions/suggestions:

      1) If an executor was asked to shut down by a new agent (with a new ID), how much of the persistent volume cleanup code would be executed (especially if the new agent does not know that this executor should still be running)?
      2) Should we figure out a way to better protect bind mounts to persistent volumes in slave/gc.cpp (for instance, by skipping any path that looks dangerous; see the sketch below), to prevent similar problems?
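
      On suggestion 2, one conservative approach (illustrative only, not the fix that ultimately landed) is to consult the mount table before removing a sandbox path and skip any path that is, or contains, a mount point, since a persistent volume bind-mounted into the sandbox shows up there. Reading /proc/self/mounts is one way to enumerate mount points on Linux; the helper names mount_targets and contains_mount are hypothetical:

      // Illustrative sketch: refuse to garbage collect a path if any mount
      // point (e.g. a bind-mounted persistent volume) lives at or below it.
      #include <fstream>
      #include <iostream>
      #include <sstream>
      #include <string>
      #include <vector>

      // Collect the target paths of all current mounts (second field of
      // each line in /proc/self/mounts). Octal escapes such as "\040" for
      // spaces are left unexpanded for brevity.
      static std::vector<std::string> mount_targets()
      {
        std::vector<std::string> targets;
        std::ifstream mounts("/proc/self/mounts");
        std::string line;
        while (std::getline(mounts, line)) {
          std::istringstream fields(line);
          std::string device, target;
          if (fields >> device >> target) {
            targets.push_back(target);
          }
        }
        return targets;
      }

      // True if 'path' is a mount point or has a mount point beneath it,
      // in which case a GC-style recursive removal should skip it.
      static bool contains_mount(const std::string& path)
      {
        for (const std::string& target : mount_targets()) {
          if (target == path ||
              target.compare(0, path.size() + 1, path + "/") == 0) {
            return true;
          }
        }
        return false;
      }

      int main(int argc, char** argv)
      {
        if (argc > 1 && contains_mount(argv[1])) {
          std::cout << "Skipping " << argv[1]
                    << ": contains a mount point (possible persistent volume)\n";
          return 1;
        }
        std::cout << "Safe to garbage collect\n";
        return 0;
      }

      Comparing device IDs (st_dev) of a directory and its parent is not sufficient here, because a bind mount from the same filesystem has the same device ID; hence the mount-table lookup in the sketch.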

    People

      Assignee: Zhitao Li
      Reporter: Zhitao Li
