Hadoop YARN
YARN-10530

CapacityScheduler ResourceLimits doesn't handle node partition well


Details

    Description

      This is a serious bug that may impact all releases. I need to do further checking, but I want to log the JIRA so we will not forget:

      ResourceLimits objects serve two purposes:

      1) When the cluster resource changes, for example when a new node is added or the scheduler config is reinitialized, we pass a ResourceLimits down to the queues via updateClusterResource.

      2) When allocating a container, we pass the parent's available resource down to the child to make sure the child's allocation won't violate the parent's max resource. For example:

      queue         used  max
      --------------------------------------
      root          10    20
      root.a        8     10
      root.a.a1     2     10
      root.a.a2     6     10
      

      Even though root.a.a1 has 8 resources of headroom (a1.max - a1.used), we can allocate at most 2 resources to a1, because root.a's limit is hit first. This information is passed down from parent queue to child queue during the assignContainers call via ResourceLimits.
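The arithmetic above can be sketched as follows. This is a hypothetical helper using plain ints in place of Resource objects, not actual CapacityScheduler code:

```java
// Hypothetical sketch of the headroom calculation described above;
// ints stand in for Resource objects, and only one ancestor level is shown.
public class HeadroomSketch {
    // A child's effective headroom is limited both by its own
    // (max - used) and by its parent's remaining capacity.
    static int effectiveHeadroom(int childMax, int childUsed,
                                 int parentMax, int parentUsed) {
        int childRoom = childMax - childUsed;    // a1: 10 - 2 = 8
        int parentRoom = parentMax - parentUsed; // a:  10 - 8 = 2
        return Math.min(childRoom, parentRoom);  // parent's limit hits first
    }

    public static void main(String[] args) {
        // Values from the table: root.a used=8 max=10, root.a.a1 used=2 max=10
        System.out.println(effectiveHeadroom(10, 2, 10, 8)); // prints 2
    }
}
```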

      However, we only pass one ResourceLimits from the top. For queue initialization, we pass in:

          root.updateClusterResource(clusterResource, new ResourceLimits(
              clusterResource));
      

      And when we update the cluster resource, we only consider the default partition:

            // Update all children
            for (CSQueue childQueue : childQueues) {
              // Get ResourceLimits of child queue before assign containers
              ResourceLimits childLimits = getResourceLimitsOfChild(childQueue,
                  clusterResource, resourceLimits,
                  RMNodeLabelsManager.NO_LABEL, false);
              childQueue.updateClusterResource(clusterResource, childLimits);
            }
      

      The same applies to the allocation logic, where we pass in (I actually found a TODO item I added 5 years ago):

          // Try to use NON_EXCLUSIVE
          assignment = getRootQueue().assignContainers(getClusterResource(),
              candidates,
              // TODO, now we only consider limits for parent for non-labeled
              // resources, should consider labeled resources as well.
              new ResourceLimits(labelManager
                  .getResourceByLabel(RMNodeLabelsManager.NO_LABEL,
                      getClusterResource())),
              SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY);
      

      The good thing is that in the assignContainers call we calculate the child limit based on the partition:

          ResourceLimits childLimits =
              getResourceLimitsOfChild(childQueue, cluster, limits,
                  candidates.getPartition(), true);
      

      So I think the problem now is: when a named partition has more resource than the default partition, the effective min/max resource of each queue could be wrong.
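A minimal sketch of that symptom, assuming ints stand in for Resource objects and a simple percentage-based max capacity (illustrative only, not the actual effective-resource computation in CapacityScheduler):

```java
// Illustrative sketch: if a queue's limit is derived from the default
// (NO_LABEL) partition only, its effective max on a larger named
// partition is clamped too low. Names and numbers here are made up.
public class PartitionLimitSketch {
    // Effective max = partition resource * queue max-capacity fraction.
    static int effectiveMax(int partitionResource, double queueMaxCapacity) {
        return (int) (partitionResource * queueMaxCapacity);
    }

    public static void main(String[] args) {
        int defaultPartition = 20; // NO_LABEL resource
        int gpuPartition = 40;     // named partition with more resource
        double maxCap = 0.5;       // queue max-capacity = 50%

        // Correct per-partition limit on the named partition:
        int correct = effectiveMax(gpuPartition, maxCap);   // 20
        // What happens when only the default partition is passed down:
        int clamped = effectiveMax(defaultPartition, maxCap); // 10
        System.out.println(correct + " vs " + clamped);
    }
}
```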

          People

            Assignee: Unassigned
            Reporter: Wangda Tan (wangda)
            Votes: 0
            Watchers: 7
