Uploaded image for project: 'Hadoop Map/Reduce'
  1. Hadoop Map/Reduce
  2. MAPREDUCE-7026

Shuffle Fetcher does not log the actual error message thrown by ShuffleHandler

Add voteVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      A job is failing with reduce tasks failed to fetch map output and the NodeManager ShuffleHandler failed to serve the map outputs with some IOException like below. ShuffleHandler sends the actual error message in response inside sendError() but the Fetcher does not log this message.

      Logs from NodeManager ShuffleHandler:

      2017-12-18 10:10:30,728 ERROR mapred.ShuffleHandler (ShuffleHandler.java:messageReceived(962)) - Shuffle error in populating headers :
      java.io.IOException: Error Reading IndexFile
              at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.populateHeaders(ShuffleHandler.java:1089)
              at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.messageReceived(ShuffleHandler.java:958)
              at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
              at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
              at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
              at org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:142)
              at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
              at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
              at org.jboss.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:148)
              at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
              at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
              at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
              at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
              at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459)
              at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536)
              at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
              at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
              at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
              at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:555)
              at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
              at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
              at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
              at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107)
              at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
              at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
              at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
              at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
              at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: Owner 'hbase' for path /grid/7/hadoop/yarn/local/usercache/bde/appcache/application_1512457770852_9447/output/attempt_1512457770852_9447_1_01_000007_0_10004/file.out.index did not match expected owner 'bde'
              at org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:285)
              at org.apache.hadoop.io.SecureIOUtils.forceSecureOpenFSDataInputStream(SecureIOUtils.java:174)
              at org.apache.hadoop.io.SecureIOUtils.openFSDataInputStream(SecureIOUtils.java:158)
              at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:70)
              at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:62)
              at org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:119)
      

      Fetcher Logs below without the actual error message:

      2018-04-19 12:24:39,521 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#3
              at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
              at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
              at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
              at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
      Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
              at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:366)
              at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:288)
              at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:369)
              at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:200)
      

      Attachments

        1. MAPREDUCE-7026.1.patch
          2 kB
          Prabhu Joseph
        2. MAPREDUCE-7026.2.patch
          2 kB
          Prabhu Joseph
        3. MAPREDUCE-7026.3.patch
          2 kB
          Prabhu Joseph

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            groot Ashutosh Gupta
            prabhujoseph Prabhu Joseph

            Dates

              Created:
              Updated:

              Slack

                Issue deployment