Hadoop YARN / YARN-9436

Flaky test testApplicationLifetimeMonitor

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: scheduler, test
    • Labels: None

    Description

      In our test environment, we occasionally encounter this failure:

      2019-04-03 12:49:32 [INFO] Running org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
      2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 215.535 s <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
      2019-04-03 12:53:08 [ERROR] testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor)  Time elapsed: 34.244 s  <<< FAILURE!
      2019-04-03 12:53:08 java.lang.AssertionError: Application killed before lifetime value
      2019-04-03 12:53:08 	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218)
      2019-04-03 12:53:08 
      

      The root cause is the condition here:

              Assert.assertTrue("Application killed before lifetime value",
                  totalTimeRun > maxLifetime);
      

      However, there are two problems with this condition:
      1. It is logically incorrect. Since the app should be killed after 30 seconds, one would expect totalTimeRun to equal maxLifetime. Because of asynchronicity and rounding, totalTimeRun usually ends up being 31.

      2. Sometimes the application is killed quickly enough that totalTimeRun is 30. This is correct behavior, because setUpCSQueue sets the queue lifetime:

          csConf.setMaximumLifetimePerQueue(
              CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime);
          csConf.setDefaultLifetimePerQueue(
              CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime);
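
      The effect described in the two points above can be sketched in isolation. This is a minimal illustration, assuming the elapsed time is converted from milliseconds to whole seconds by integer division; the class name, the helper toWholeSeconds and the timestamps are hypothetical and are not taken from the test.

```java
class ElapsedSecondsSketch {
    // Hypothetical helper (not from the test): converts a millisecond
    // interval to whole seconds. Integer division drops the fraction.
    static long toWholeSeconds(long startMillis, long finishMillis) {
        return (finishMillis - startMillis) / 1000L;
    }

    public static void main(String[] args) {
        long maxLifetime = 30L; // seconds, as in the description

        // Illustrative timestamps, not measured values.
        long fastKill = toWholeSeconds(0L, 30_400L); // 30
        long slowKill = toWholeSeconds(0L, 31_050L); // 31

        System.out.println(fastKill > maxLifetime);  // false: the strict check fails
        System.out.println(fastKill >= maxLifetime); // true: a non-strict check passes
        System.out.println(slowKill > maxLifetime);  // true: the strict check passes
    }
}
```

      Under that assumption, a kill that lands within the same second as the deadline yields exactly 30, which is the case the strict comparison rejects.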
      

      A more appropriate condition is:

              Assert.assertTrue("Application killed before lifetime value",
                  totalTimeRun >= maxLifetime);
      

      The assertion message in the next line is also misleading:

              Assert.assertTrue(
                  "Application killed before lifetime value " + totalTimeRun,
                  totalTimeRun < maxLifetime + 10L);
      

      If it is false, the application ran for at least 40 seconds, so it was killed too late with respect to both the app's lifetime (40s) and the queue's (30s), not too early. A more accurate message would be:

              Assert.assertTrue(
                  "Application killed after queue/app lifetime value: " + totalTimeRun,
                  totalTimeRun < maxLifetime + 10L);
      

      We can even be stricter, since we expect a kill almost immediately after 30 seconds:

              Assert.assertTrue(
                  "Application killed too late: " + totalTimeRun,
                  totalTimeRun < maxLifetime + 2L);
      

      where we allow a 2 second tolerance.
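
      Taken together, the proposed lower and upper bounds form a window check. The following is a rough sketch in plain Java without JUnit, so that it runs on its own; the class and method names are hypothetical, and the real test would keep using Assert.assertTrue.

```java
class LifetimeWindowSketch {
    // Hypothetical helper: passes only if
    // maxLifetime <= totalTimeRun < maxLifetime + tolerance (all in seconds).
    static void assertKilledWithinWindow(long totalTimeRun, long maxLifetime,
                                         long tolerance) {
        if (totalTimeRun < maxLifetime) {
            throw new AssertionError(
                "Application killed before lifetime value: " + totalTimeRun);
        }
        if (totalTimeRun >= maxLifetime + tolerance) {
            throw new AssertionError(
                "Application killed too late: " + totalTimeRun);
        }
    }

    public static void main(String[] args) {
        long maxLifetime = 30L; // queue lifetime from the description
        long tolerance = 2L;    // the 2 second tolerance proposed above

        assertKilledWithinWindow(30L, maxLifetime, tolerance); // passes
        assertKilledWithinWindow(31L, maxLifetime, tolerance); // passes
        try {
            assertKilledWithinWindow(32L, maxLifetime, tolerance);
        } catch (AssertionError e) {
            System.out.println(e.getMessage()); // Application killed too late: 32
        }
    }
}
```

      Whether 2 seconds is enough slack on a loaded build machine is a judgment call; a tolerance that is too tight could make the test flaky in the other direction.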

            People

              pbacsko Peter Bacsko