Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-11534

Incorrect exception handling during container recovery

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.4.0
    • 3.4.0
    • yarn
    • Reviewed

    Description

      When NM is restarted during a container recovery, it can happen that it interrupts the container reaquisition during the LinuxContainerExecutor's signalContainer method. In this case we will get the following exception:

      java.io.InterruptedIOException: java.lang.InterruptedException
          at org.apache.hadoop.util.Shell.runCommand(Shell.java:1011)
          at org.apache.hadoop.util.Shell.run(Shell.java:901)
          at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:177)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:184)
          at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:735)
          at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerAlive(LinuxContainerExecutor.java:887)
          at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:291)
          at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:708)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: java.lang.InterruptedException
          at java.base/java.lang.Object.wait(Native Method)
          at java.base/java.lang.Object.wait(Object.java:328)
          at java.base/java.lang.ProcessImpl.waitFor(ProcessImpl.java:495)
          at org.apache.hadoop.util.Shell.runCommand(Shell.java:1001)
          ... 15 more

      Later this InterruptedIOException get caught and wrapped inside a PrivilegedOperationException and a ContainerExecutionException. In LinuxContainerExecutor's signalContainer method we catch this exception again, and throw an IOException from it, indicating this error message in the stack trace:

      IOException from it, causing the following stack trace:
      org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Signal container failed
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:183)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:184)
          at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:735)
          at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerAlive(LinuxContainerExecutor.java:887)
          at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:291)
          at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:708)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:834)
      2023-06-20 18:24:31,777 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_e03_1687266197584_0033_01_000001
      java.io.IOException: Problem signalling container 256974 with NULL; output: null and exitCode: -1
          at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:746)
          at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.isContainerAlive(LinuxContainerExecutor.java:887)
          at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:291)
          at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:708)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Signal container failed
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:183)
          at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:184)
          at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:735)
          ... 9 more

       

      Since YARN-2846, we are using a "nonInterrupted" flag in RecoveredContainerLaunch's call method.

      We indicate interruption when we catch InterruptedException and InterruptedIOException, see this code part. By default every container has a 154 (LOST) error code, and if the recovery is interrupted, we will keep this value. But when the flag indicates that an interruption happened, we won't persist this in the NM state store. But since in LinuxContainerExecutor we throw an IOException in the above case, we won't treat it as interrupted. The default "LOST" state will be persisted for the container, and after an NM restart the RM will kill it.

       

      The goal of this ticket is to improve the exception handling here, and indicate somehow the interruption if signalContainer method cannot be run successfully.

      Attachments

        Issue Links

          Activity

            People

              pszucs Peter Szucs
              pszucs Peter Szucs
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: