Mesos / MESOS-9719

Test `AgentFailoverHTTPExecutorUsingResourceProviderResources` is flaky.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Fix Version/s: 1.8.0, 1.9.0
    • Sprint: Storage: RI-12 Sprint 43
    • Story Points: 2

    Description

      The test is flaky because:

      1. It assumes the mock resource provider (RP) never reregisters, which is not guaranteed.
      2. It does not wait for the task and executor to be reaped. This leaves a race between containerizer destroy and test teardown, which can cause cgroups cleanup to fail.
      3. It fast-forwards the clock, which can lead to containerizer destroy failures. In the log below, the destroy fails with "Timed out after 1mins" within the same wall-clock second in which it was requested.
      4. It assumes that the framework receives exactly two status updates, which is not guaranteed.
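
To illustrate the fourth point, here is a minimal standalone sketch, not the actual test code or its fix. The `State` enum, the `Update` struct, and both helper functions are hypothetical stand-ins and do not use the Mesos or gmock APIs. It contrasts a check that assumes exactly two status updates with one that looks for the first terminal update of a given task and tolerates extra updates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified task states (not the Mesos TaskState enum).
enum class State { STARTING, RUNNING, FINISHED, FAILED, KILLED, LOST };

// Hypothetical status update as seen by a framework.
struct Update {
  std::string taskId;
  State state;
};

// A state is terminal if the task will not run again.
bool isTerminal(State state) {
  return state == State::FINISHED || state == State::FAILED ||
         state == State::KILLED || state == State::LOST;
}

// Fragile: passes only if exactly two updates arrived and the
// second one is terminal. Any extra update makes it fail.
bool fragileCheck(const std::vector<Update>& updates) {
  return updates.size() == 2 && isTerminal(updates[1].state);
}

// More tolerant: returns the index of the first terminal update for
// `taskId`, or -1 if there is none. Extra non-terminal updates and
// updates for other tasks are ignored.
int firstTerminalIndex(
    const std::vector<Update>& updates,
    const std::string& taskId) {
  for (std::size_t i = 0; i < updates.size(); ++i) {
    if (updates[i].taskId == taskId && isTerminal(updates[i].state)) {
      return static_cast<int>(i);
    }
  }
  return -1;
}
```

With a stream such as STARTING, RUNNING, RUNNING, LOST for one task, `fragileCheck` fails while `firstTerminalIndex` still finds the terminal update. In a real test the same idea would be expressed through the test framework's expectations rather than a vector scan.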

      Example failure log:

      E0410 00:18:23.526867  1251 slave.cpp:3118] Failed to update resources for container f941cb68-9f13-418c-be1b-702e5927b1eb of executor 'default' of framework ca96f624-9590-4776-9e83-39714cebd25f-0000, destroying container: Collect failed: Failed to publish resources 'disk(allocated: foo)[RAW]:200' for container f941cb68-9f13-418c-be1b-702e5927b1eb: Resource provider 616834b9-4dbb-45a7-b762-831ce5e8534a is not subscribed
      I0410 00:18:23.526957  1251 containerizer.cpp:2576] Destroying container f941cb68-9f13-418c-be1b-702e5927b1eb in RUNNING state
      I0410 00:18:23.526979  1251 containerizer.cpp:3278] Transitioning the state of container f941cb68-9f13-418c-be1b-702e5927b1eb from RUNNING to DESTROYING
      I0410 00:18:23.526989  1251 containerizer.cpp:2576] Destroying container f941cb68-9f13-418c-be1b-702e5927b1eb.523acde5-8c21-4f3f-af71-7cb84b54803e in RUNNING state
      I0410 00:18:23.526996  1251 containerizer.cpp:3278] Transitioning the state of container f941cb68-9f13-418c-be1b-702e5927b1eb.523acde5-8c21-4f3f-af71-7cb84b54803e from RUNNING to DESTROYING
      I0410 00:18:23.527102  1251 linux_launcher.cpp:576] Asked to destroy container f941cb68-9f13-418c-be1b-702e5927b1eb.523acde5-8c21-4f3f-af71-7cb84b54803e
      ...
      E0410 00:18:23.535424  1246 slave.cpp:6572] Termination of executor 'default' of framework ca96f624-9590-4776-9e83-39714cebd25f-0000 failed: Failed to destroy nested containers: Failed to kill all processes in the container: Timed out after 1mins
      ...
      I0410 00:18:23.535817  1252 master.cpp:8983] Executor 'default' of framework ca96f624-9590-4776-9e83-39714cebd25f-0000 on agent ca96f624-9590-4776-9e83-39714cebd25f-S0 at slave(699)@172.16.10.211:33823 (ip-172-16-10-211.ec2.internal): wait status -1
      ...
      ../../src/tests/mesos.cpp:926: Failure
      (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_b1965800-016c-494b-8d6d-c70437c9405f/f941cb68-9f13-418c-be1b-702e5927b1eb': Device or resource busy
      

          People

            Assignee: Chun-Hung Hsiao (chhsia0)
            Reporter: Chun-Hung Hsiao (chhsia0)
            Shepherd: Benjamin Bannier