Uploaded image for project: 'Hadoop Map/Reduce'
  1. Hadoop Map/Reduce
  2. MAPREDUCE-7160

Job end notification not sent and client service not stopped after YarnRuntimeException in shutDownJob in MRAppMaster.java despite finally block

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.0.3
    • None
    • applicationmaster
    • None

    Description

      If a YarnRuntimeException occurs during shutDownJob (in this case, at the line 'MRAppMaster.this.stop();') in MRAppMaster.java, the try/catch block will catch it but the 'finally' block will not get executed. Additionally, the client service should be stopped if an exception occurs and is not currently.

      For example, if this error occurs:

      2018-11-20 12:38:46,173 INFO [Thread-171] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: In stop, writing event TASK_FINISHED 
      2018-11-20 12:38:46,173 ERROR [Thread-171] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Error writing History Event: org.apache.hadoop.mapreduce.jobhistory.TaskFinishedEvent@3b1a7bb5 java.io.IOException: All datanodes [DatanodeInfoWithStorage[10.11.1.227:9866,DS-3539e7b8-5d87-45e5-880a-1897f11577d2,DISK]] are bad. Aborting... 
      at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1561) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1495) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667) 2018-11-20 12:38:46,174 INFO [Thread-171] org.apache.hadoop.service.AbstractService: Service JobHistoryEventHandler failed in state STOPPED
      
      

      Then the line 

      catch (Throwable t) { 
         LOG.warn("Graceful stop failed. Exiting.. ", t); 
         exitMRAppMaster(1, t); 
      }
      
      

      causes the MRAppMaster to exit before the following finally block can execute. This means that the job end notification is not sent despite https://jira.apache.org/jira/browse/MAPREDUCE-6895?attachmentOrder=desc

      Additionally, this finally block should also call 'clientService.stop();', as otherwise the following errors occur on the client side:

      2018-11-07 19:50:59,602 INFO mapreduce.Job: map 100% reduce 98% 
      2018-11-07 19:51:03,617 INFO mapreduce.Job: map 100% reduce 99% 
      2018-11-07 19:51:07,648 INFO mapreduce.Job: map 100% reduce 100% 
      2018-11-07 19:51:07,656 INFO mapreduce.Job: Job job_1541647829228_0001 completed successfully 
      2018-11-07 19:51:08,162 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server 
      2018-11-07 19:51:09,169 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2018-11-07 19:51:10,170 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2018-11-07 19:51:11,171 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
      ......
      2018-11-07 19:51:38,407 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) java.io.IOException: java.net.ConnectException: Your endpoint configuration is wrong; For more details see: http://wiki.apache.org/hadoop/UnsetHostnameOrPort at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:344) at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:382) at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:859) at org.apache.hadoop.mapreduce.Job$8.run(Job.java:820) at org.apache.hadoop.mapreduce.Job$8.run(Job.java:817) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1686) at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:817) at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1663) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1591) at org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:334) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:343) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.RunJar.run(RunJar.java:244) at org.apache.hadoop.util.RunJar.main(RunJar.java:158)
      

       

      Removing the finally block and adding calls to send the job end notification and stop the client service prior to 'exitMRAppMaster' within the catch block, such as something similar to below, fixes the issue:

      catch (Throwable t) {
         LOG.warn("Graceful stop failed. Exiting.. ", t);
         if (isLastAMRetry && notifier != null) {
            sendJobEndNotify(notifier);
         }
         try {
            Thread.sleep(5000);
         } catch (InterruptedException e) {
            e.printStackTrace();
         }
      
         clientService.stop();
         exitMRAppMaster(1, t);
      }
      

       

      Attachments

        Activity

          People

            Unassigned Unassigned
            johnthomason John Thomason
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: