Hadoop YARN / YARN-484

Fair Scheduler preemption fails if the other queue has a mapreduce job with some tasks in excess of cluster capacity


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: scheduler
    • Environment: Mac OS X; CDH4.1.2; CDH4.2.0

    Description

      This is reliably reproduced while running CDH4.1.2 or CDH4.2.0 on a single Mac OS X machine.

      1. Two queues are configured: cjmQ and slotsQ, both with tiny minResources. The intention is for the task(s) of the job in cjmQ to be able to preempt tasks of the job in slotsQ.
      2. yarn.nodemanager.resource.memory-mb = 24576
      3. First, a long-running 6-map-task (0 reducers) mapreduce job is started in slotsQ with mapreduce.map.memory.mb=4096. Because the MRAppMaster's container consumes some memory, only 5 of its 6 map tasks are able to start; the 6th is pending but will never run (see the back-of-the-envelope check after this list).
      4. Then, a short-running 1-map-task (0 reducers) mapreduce job is submitted via cjmQ with mapreduce.map.memory.mb=2048.
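
      For reference, here is a back-of-the-envelope check of why the 6th map task ends up pending. This is my own arithmetic sketch, not scheduler output; it assumes the default yarn.app.mapreduce.am.resource.mb of 1536 MB gets rounded up to 2048 MB by the scheduler's 1024 MB allocation increment.

        // Rough single-node capacity check (my own arithmetic, not FairScheduler code).
        public class CapacityCheck {
          public static void main(String[] args) {
            int nodeCapacityMb = 24576;  // yarn.nodemanager.resource.memory-mb
            int amContainerMb  = 2048;   // assumed MRAppMaster container size after rounding
            int mapContainerMb = 4096;   // mapreduce.map.memory.mb for the slotsQ job

            int availableForMapsMb = nodeCapacityMb - amContainerMb;      // 22528 MB
            int mapsThatFit        = availableForMapsMb / mapContainerMb; // 5, so the 6th map pends
            System.out.println(availableForMapsMb + " MB left for maps; " + mapsThatFit + " maps fit");
          }
        }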

      Expected behavior:
      At this point, because cjmQ's minimum share had not been met, I expected the Fair Scheduler to preempt one of the running map tasks of the slotsQ job to make room for the single map task of the cjmQ job. However, the Fair Scheduler did not preempt any of the slotsQ job's running map tasks; instead, the cjmQ job was starved perpetually. Since slotsQ had far more than its minimum share allocated and already running, while cjmQ was far below its minimum share (0, in fact), the Fair Scheduler should have started preempting, regardless of the one slotsQ container (the 6th map) that had not been allocated.
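
      For context, this is my reading of the min-share preemption rule that I expected to kick in. It is a rough sketch with hypothetical names, not the actual FairScheduler source:

        // Sketch of the documented min-share preemption rule; the names here are hypothetical.
        class QueueState {
          long usedMemoryMb;                 // memory currently allocated to the queue's apps
          long minResourcesMb;               // minResources from the allocations file
          long lastTimeAtMinShareMs;         // last time the queue was at or above its min share
          long minSharePreemptionTimeoutMs;  // minSharePreemptionTimeout, converted to milliseconds
        }

        class MinSharePreemptionRule {
          // The starved queue (cjmQ here) should trigger preemption of containers from
          // queues running above their share once it has waited out its timeout.
          static boolean shouldPreemptFor(QueueState q, long nowMs) {
            boolean belowMinShare = q.usedMemoryMb < q.minResourcesMb;
            return belowMinShare && (nowMs - q.lastTimeAtMinShareMs) > q.minSharePreemptionTimeoutMs;
          }
        }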

      Additional useful info:

      1. If I submit a second 1-map-task mapreduce job via cjmQ, the first cjmQ mapreduce job gets scheduled and its state changes to RUNNING; once that first job completes, the second job submitted via cjmQ is starved until a third job is submitted into cjmQ, and so on. This happens regardless of the values of maxRunningApps in the queue configurations.
      2. If, instead of requesting 6 map tasks for the slotsQ job, I request only 5 so that everything fits within yarn.nodemanager.resource.memory-mb (i.e. without the 6th pending-but-never-running task), then preemption works as I expected. However, I cannot rely on this arrangement: in a production cluster running at full capacity, if a machine dies, the slotsQ mapreduce job will request new containers for the failed tasks, and because the cluster was already at capacity those containers will end up pending and never run, recreating the original scenario of the starved cjmQ job.
      3. I initially wrote this up on https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/0zv62pkN5lM, so it would be good to update that group with the resolution.

      Configuration:

      In yarn-site.xml:

        <property>
          <description>Scheduler plug-in class to use instead of the default scheduler.</description>
          <name>yarn.resourcemanager.scheduler.class</name>
          <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
        </property>
      

      fair-scheduler.xml:

      <configuration>
      
      <!-- Site specific FairScheduler configuration properties -->
        <property>
          <description>Absolute path to allocation file. An allocation file is an XML
          manifest describing queues and their properties, in addition to certain
          policy defaults. This file must be in XML format as described in
          http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
          </description>
          <name>yarn.scheduler.fair.allocation.file</name>
          <value>[obfuscated]/current/conf/site/default/hadoop/fair-scheduler-allocations.xml</value>
        </property>
      
        <property>
          <description>Whether to use preemption. Note that preemption is experimental
          in the current version. Defaults to false.</description>
          <name>yarn.scheduler.fair.preemption</name>
          <value>true</value>
        </property>
      
        <property>
          <description>Whether to allow multiple container assignments in one
          heartbeat. Defaults to false.</description>
          <name>yarn.scheduler.fair.assignmultiple</name>
          <value>true</value>
        </property>
        
      </configuration>
      

      My fair-scheduler-allocations.xml:

      <allocations>
      
        <queue name="cjmQ">
          <!-- minimum amount of aggregate memory; TODO which units??? -->
          <minResources>2048</minResources>
      
          <!-- limit the number of apps from the queue to run at once -->
          <maxRunningApps>1</maxRunningApps>
          
          <!-- either "fifo" or "fair" depending on the in-queue scheduling policy
          desired -->
          <schedulingMode>fifo</schedulingMode>
      
          <!-- Number of seconds after which the pool can preempt other pools'
            tasks to achieve its min share. Requires preemption to be enabled in
            mapred-site.xml by setting mapred.fairscheduler.preemption to true.
            Defaults to infinity (no preemption). -->
          <minSharePreemptionTimeout>5</minSharePreemptionTimeout>
      
          <!-- Pool's weight in fair sharing calculations. Default is 1.0. -->
          <weight>1.0</weight>
        </queue>
      
        <queue name="slotsQ">
          <!-- minimum amount of aggregate memory; TODO which units??? -->
          <minResources>1</minResources>
      
          <!-- limit the number of apps from the queue to run at once -->
          <maxRunningApps>1</maxRunningApps>
      
          <!-- Number of seconds after which the pool can preempt other pools'
            tasks to achieve its min share. Requires preemption to be enabled in
            mapred-site.xml by setting mapred.fairscheduler.preemption to true.
            Defaults to infinity (no preemption). -->
          <minSharePreemptionTimeout>5</minSharePreemptionTimeout>
      
          <!-- Pool's weight in fair sharing calculations. Default is 1.0. -->
          <weight>1.0</weight>
        </queue>
        
        <!-- number of seconds a queue is under its fair share before it will try to
        preempt containers to take resources from other queues. -->
        <fairSharePreemptionTimeout>5</fairSharePreemptionTimeout>
      
      </allocations>
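
      For completeness, this is roughly how the jobs were pointed at the two queues. It is a minimal driver sketch for the cjmQ job; the identity Mapper and the command-line input/output paths stand in for my actual job:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class CjmJobDriver {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("mapreduce.job.queuename", "cjmQ");   // route the job to the cjmQ queue
            conf.setInt("mapreduce.map.memory.mb", 2048);  // per-map container size from the repro

            Job job = Job.getInstance(conf, "cjm-job");
            job.setJarByClass(CjmJobDriver.class);
            job.setMapperClass(Mapper.class);              // identity mapper stands in for my real map logic
            job.setNumReduceTasks(0);                      // map-only, as in the repro
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }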
      
      


          People

            Assignee: Sandy Ryza (sandyr)
            Reporter: Vitaly Kruglikov (vkruglikov)
            Votes: 0
            Watchers: 5
