  1. Hadoop YARN
  2. YARN-10903

Too many "Failed to accept allocation proposal" because of wrong Headroom check for DRF

Details

    Description

      The headroom checks in `ParentQueue.canAssign` and `RegularContainerAllocator#checkHeadroom` do not account for the DRF (Dominant Resource Fairness) case.

      This causes many "Failed to accept allocation proposal" messages when a queue is nearly fully used.
      In the log:
      Headroom: memory:256, vCores:729
      Request: memory:56320, vCores:5
      clusterResource: memory:673966080, vCores:110494
      If DRF is in use, then

      Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(
          currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved),
          required); 

      will be true, but in fact the request cannot be allocated because of the max limit (not enough memory).
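
The mismatch can be reproduced with the numbers from the log. The following is a minimal sketch, not the Hadoop implementation: `HeadroomCheckSketch`, `Res`, `dominantShare`, `drfGreaterThanOrEqual` and `fitsIn` are hypothetical names introduced here for illustration, the comparison is reduced to dominant shares only, and `resourceCouldBeUnReserved` is assumed to be zero. It contrasts a DRF-style comparison with a per-dimension check (similar in spirit to a component-wise "fits in" test).

```java
public class HeadroomCheckSketch {

    /** Simplified two-dimensional resource (hypothetical stand-in for YARN's Resource). */
    static final class Res {
        final long memory;
        final long vCores;

        Res(long memory, long vCores) {
            this.memory = memory;
            this.vCores = vCores;
        }
    }

    /** Largest fraction of the cluster that r occupies in any single dimension. */
    static double dominantShare(Res cluster, Res r) {
        return Math.max((double) r.memory / cluster.memory,
                        (double) r.vCores / cluster.vCores);
    }

    /** DRF-style comparison (simplified): only the dominant shares are compared. */
    static boolean drfGreaterThanOrEqual(Res cluster, Res lhs, Res rhs) {
        return dominantShare(cluster, lhs) >= dominantShare(cluster, rhs);
    }

    /** Component-wise check: every dimension of the request must fit in what is available. */
    static boolean fitsIn(Res request, Res available) {
        return request.memory <= available.memory
            && request.vCores <= available.vCores;
    }

    public static void main(String[] args) {
        // Values taken from the log in this issue.
        Res cluster = new Res(673966080L, 110494L);
        Res headroom = new Res(256L, 729L);
        Res request = new Res(56320L, 5L);

        // The headroom's dominant dimension is vCores; the request's is memory.
        System.out.println("headroom dominant share: " + dominantShare(cluster, headroom));
        System.out.println("request dominant share:  " + dominantShare(cluster, request));

        // The DRF-style check passes because the vCores share of the headroom
        // exceeds the memory share of the request.
        System.out.println("DRF check passes:   " + drfGreaterThanOrEqual(cluster, headroom, request));

        // The per-dimension check fails: 56320 MB requested, 256 MB of headroom.
        System.out.println("component-wise fit: " + fitsIn(request, headroom));
    }
}
```

Under these assumptions the DRF-style check passes while the component-wise check fails, which matches the behaviour described above: the proposal is created and then rejected at commit time.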

      2021-07-21 23:49:39,012 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: showRequests: application=application_1626747977559_95859 headRoom=<memory:256, vCores:729> currentConsumption=0
      2021-07-21 23:49:39,012 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:  Request={AllocationRequestId: -1, Priority: 1, Capability: <memory:56320, vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution Type Request: null, Node Label Expression: prod-best-effort-node}
      .....
      2021-07-21 23:49:39,013 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Try to commit allocation proposal=New org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
               ALLOCATED=[(Application=appattempt_1626747977559_95859_000001; Node=xxxx:8041; Resource=<memory:56320, vCores:5>)]
      2021-07-21 23:49:39,013 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager: userLimit is fetched. userLimit=<memory:7077376, vCores:1277>, userSpecificUserLimit=<memory:7077376, vCores:1277>, schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
      2021-07-21 23:49:39,013 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Headroom calculation for user xxxxx:  userLimit=<memory:7077376, vCores:1277> queueMaxAvailRes=<memory:0, vCores:0> consumed=<memory:0, vCores:0> partition=prod-best-effort-node
      2021-07-21 23:49:39,013 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue: Used resource=<memory:7077120, vCores:548> exceeded maxResourceLimit of the queue =<memory:7089920, vCores:1278>
      2021-07-21 23:49:39,013 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Failed to accept allocation proposal
       

            People

              jackwangcs jackwangcs