Hadoop YARN / YARN-11490

JMX QueueMetrics breaks after mutable config validation in CS

Details

    • Reviewed

    Description

      Reproduction steps:

      1. Submit a long running job

      $ hadoop-3.4.0-SNAPSHOT/bin/yarn jar hadoop-3.4.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.4.0-SNAPSHOT-tests.jar sleep -m 1 -r 1 -rt 1200000 -mt 20
      

      2. Verify that there is one running app

      $ curl http://localhost:8088/ws/v1/cluster/metrics | jq
      

      3. Verify that the JMX endpoint reports 1 running app as well

      $ curl http://localhost:8088/jmx | jq
      

      4. Validate the configuration twice (send the request below two times)

      $ curl -X POST -H 'Content-Type: application/json' -d @defaultqueue.json localhost:8088/ws/v1/cluster/scheduler-conf/validate
      
      $ cat defaultqueue.json
      {"update-queue":{"queue-name":"root.default","params":{"entry":{"key":"maximum-applications","value":"100"}}},"subClusterId":"","global":null,"global-updates":null}
      

      5. Repeat steps 2 and 3. The cluster metrics should still be correct, but the JMX endpoint will show 0 running apps. That is the bug.

      The bug is caused by YARN-11211. Reverting that patch (or only removing the QueueMetrics.clearQueueMetrics(); line) fixes the issue, but I think that would reintroduce the memory leak.

      It looks like the QUEUE_METRICS hash map is "add-only": prior to YARN-11211, clearQueueMetrics() was only called from the ResourceManager.reinitialize() method (transitionToActive/transitionToStandby). Constantly adding and removing queues with unique names would cause a leak as well, because entries are never removed from QUEUE_METRICS. So the problem is not limited to the validation API.
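
      The two effects described above can be illustrated with a minimal, self-contained sketch. This is not the Hadoop code: the class QueueMetricsSketch, the Metrics type, and the forQueue() helper are hypothetical stand-ins, and the model assumes only that the scheduler keeps a reference to its metrics object while readers look the object up again through a static, add-only map. The real path through the JMX registration may differ.

```java
import java.util.HashMap;
import java.util.Map;

public class QueueMetricsSketch {
    // Hypothetical stand-in for a per-queue metrics object.
    static final class Metrics {
        int appsRunning;
    }

    // Stand-in for a static, add-only cache keyed by queue name.
    static final Map<String, Metrics> QUEUE_METRICS = new HashMap<>();

    // Look up the metrics for a queue, creating them on first use.
    static Metrics forQueue(String name) {
        return QUEUE_METRICS.computeIfAbsent(name, k -> new Metrics());
    }

    // Drops every cached entry; objects already handed out stay alive.
    static void clearQueueMetrics() {
        QUEUE_METRICS.clear();
    }

    // Returns {value held by the live object, value seen after clear + lookup}.
    static int[] staleAfterClear() {
        clearQueueMetrics();
        Metrics live = forQueue("root.default");
        live.appsRunning++;                        // one app is running
        clearQueueMetrics();                       // e.g. triggered by a validation call
        Metrics fresh = forQueue("root.default");  // a new, empty object is created
        return new int[] { live.appsRunning, fresh.appsRunning };
    }

    // Simulates queues with unique names being added and later dropped by the
    // scheduler; nothing ever removes their entries from the map.
    static int sizeAfterChurn(int queues) {
        clearQueueMetrics();
        for (int i = 0; i < queues; i++) {
            forQueue("root.tmp" + i);
        }
        return QUEUE_METRICS.size();
    }

    public static void main(String[] args) {
        int[] r = staleAfterClear();
        System.out.println("live=" + r[0] + " fresh=" + r[1]);
        System.out.println("entries after churn=" + sizeAfterChurn(1000));
    }
}
```

      In this model, clearing the map fixes the unbounded growth but disconnects readers from the object that still holds the running-app count. A per-queue removal when a queue is deleted would avoid both effects in the sketch; whether that fits the actual QueueMetrics code is an open question here.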

      Attachments

        1. addqueue.xml
          0.6 kB
          Tamas Domok
        2. defaultqueue.json
          0.2 kB
          Tamas Domok
        3. hadoop-tdomok-resourcemanager-tdomok-MBP16.log
          171 kB
          Tamas Domok
        4. removequeue.xml
          0.4 kB
          Tamas Domok
        5. stopqueue.json
          0.1 kB
          Tamas Domok

            People

              tdomok Tamas Domok