  1. Mesos
  2. MESOS-9507

Agent could not recover due to empty docker volume checkpointed files.

Details

    • Containerization RI10 Spr 39, Containerization RI11 Sp 40
    • 5

    Description

      The agent could not recover because the checkpointed docker volume files were empty. See the logs:

      Nov 12 17:12:00 guppy mesos-agent[38960]: E1112 17:12:00.978682 38969 slave.cpp:6279] EXIT with status 1: Failed to perform recovery: Collect failed: Collect failed: Failed to recover docker volumes for orphan container e1b04051-1e4a-47a9-b866-1d625cda1d22: JSON parse failed: syntax error at line 1 near:
      Nov 12 17:12:00 guppy mesos-agent[38960]: To remedy this do as follows: 
      Nov 12 17:12:00 guppy mesos-agent[38960]: Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
      Nov 12 17:12:00 guppy mesos-agent[38960]: This ensures agent doesn't recover old live executors.
      Nov 12 17:12:00 guppy mesos-agent[38960]: Step 2: Restart the agent. 
      Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service: main process exited, code=exited, status=1/FAILURE
      Nov 12 17:12:00 guppy systemd[1]: Unit dcos-mesos-slave.service entered failed state.
      Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service failed.
      

      This can happen after a hard reboot. The Docker volume isolator uses the `state::checkpoint()` function, which creates a temporary file, writes the data to it, and then renames the temporary file to the destination file. This function is atomic and supports an optional `fsync` of the data. However, the Docker volume isolator does not enable the `fsync` option for performance reasons, so the data can be lost if the page cache is not flushed to disk before the reboot.
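
To make the failure mode concrete, here is a minimal sketch of the write-to-temporary-then-rename pattern described above, using plain POSIX calls. It is not the Mesos implementation of `state::checkpoint()`; the function name `checkpoint`, its `sync` parameter, and the `.tmp` suffix are assumptions chosen for illustration.

```cpp
#include <cerrno>
#include <cstdio>
#include <string>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper (not the Mesos API): writes `data` to a temporary
// file next to `path`, optionally fsyncs it, then renames it into place.
bool checkpoint(const std::string& path, const std::string& data, bool sync)
{
  const std::string temp = path + ".tmp";  // Assumed naming scheme.

  int fd = ::open(temp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) {
    return false;
  }

  // Write the whole buffer, retrying on short writes and EINTR.
  size_t offset = 0;
  while (offset < data.size()) {
    ssize_t n = ::write(fd, data.data() + offset, data.size() - offset);
    if (n < 0) {
      if (errno == EINTR) {
        continue;
      }
      ::close(fd);
      ::unlink(temp.c_str());
      return false;
    }
    offset += static_cast<size_t>(n);
  }

  // If this step is skipped, the data may still sit only in the page cache
  // when the rename below happens. After a hard reboot the destination
  // file can then exist but be empty.
  if (sync && ::fsync(fd) != 0) {
    ::close(fd);
    ::unlink(temp.c_str());
    return false;
  }

  if (::close(fd) != 0) {
    ::unlink(temp.c_str());
    return false;
  }

  // rename() replaces the destination atomically with respect to readers,
  // but by itself it says nothing about durability of the file contents.
  return ::rename(temp.c_str(), path.c_str()) == 0;
}
```

Calling `checkpoint(path, data, false)` corresponds to the behavior described for the isolator: readers never observe a partially written file, but a crash shortly after the call can leave an empty one.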

      In this case the docker volume has not been mounted yet, so the docker volume isolator should skip recovering this volume.
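
The proposed behavior can be sketched as a small check performed before JSON parsing. This is only an illustration under stated assumptions, not the actual isolator code: the `RecoverAction` enum and the `inspectCheckpoint` function are hypothetical names.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical outcome of looking at one checkpoint file during recovery.
enum class RecoverAction
{
  SKIP,   // File exists but is empty: treat the volume as never mounted.
  PARSE,  // File has contents: hand them to the JSON parser.
  FAIL    // File is missing or unreadable.
};

// Hypothetical helper: reads the checkpoint at `path` into `contents` and
// decides how recovery should proceed for this container.
RecoverAction inspectCheckpoint(const std::string& path, std::string* contents)
{
  std::ifstream in(path, std::ios::binary);
  if (!in) {
    return RecoverAction::FAIL;
  }

  std::ostringstream buffer;
  buffer << in.rdbuf();
  *contents = buffer.str();

  // An empty file is what an unsynced checkpoint can look like after a
  // hard reboot, so skip it instead of failing with a JSON parse error.
  if (contents->empty()) {
    return RecoverAction::SKIP;
  }

  return RecoverAction::PARSE;
}
```

With a check like this, an empty checkpoint leads to skipping that container's volumes rather than aborting agent recovery as shown in the log above.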

          People

            qianzhang Qian Zhang
            gilbert Gilbert Song
            Gilbert Song Gilbert Song