Uploaded image for project: 'REEF (Retired)'
  1. REEF (Retired)
  2. REEF-1291

Driver gets FailedEvaluator message even if evaluator shuts down properly

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 0.15
    • REEF.NET
    • None
    • C#

    Description

      It seems that there is some sort of race condition when evaluator is completed....The evaluator says that it is done and exits while driver gets the same message as if RM has taken the container away.

      To be more concrete, end of evaluator log and relevant driver log are shown below. One can clearly see that shut down appeared to be clean but driver got failed evaluator message.The driver log shows that stream on which it was reading message from evaluator was closed probably because evaluator closed it. A simple fix would be that if driver expects this evaluator to shutdown and it gets this message then it assumes that shutdown is complete.

      End of Evaluator Log
      INFO: Handle Evaluator control message
      Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
      INFO: Send task control message to ContextManager
      Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextManager Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
      INFO: RemoveContext with id DataLoading-container_e12_1458013582010_0048_01_000750
      Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextRuntime Warning: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
      WARNING: Shutting down an task because the underlying context is being closed.
      Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
      INFO: Trying to close Task IMRUMap-Id297-Version0
      Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Warning: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
      WARNING: Trying to close an task that is in Done state. Ignored.
      Org.Apache.REEF.Common.Context.Defaults.DefaultContextStopHandler Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
      INFO: DefaultContextStopHandler received for context: DataLoading-container_e12_1458013582010_0048_01_000750
      Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
      INFO: Context stack is empty, done
      Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
      INFO: Evaluator state : DONE
      Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0085
      INFO: Triggered a heartbeat: EvaluatorHeartbeatProto: task_id=[], task_status=[], task_message=[], evaluator_status=[DONE], context_status=[], timestamp=[1459201619420], recoveryFlag =[False].
      Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4207980-07:00 0001
      INFO: Runtime stop
      Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2016-03-28T14:46:59.4364238-07:00 0001
      INFO: EvaluatorRuntime shutdown complete
      Org.Apache.REEF.Evaluator.Evaluator Information: 0 : 2016-03-28T14:46:59.4364238-07:00 0001
      INFO: Evaluator is returned from Run()
      End of LogType:evaluator.stdout

      Relevant driver log
      INFO: Received a JobStatus message that can't be sent:
      identifier: "BroadcastReduce"
      state: RUNNING
      message: "Evaluator container_e12_1458013582010_0048_01_000750 failed with message: Evaluator [container_e12_1458013582010_0048_01_000750] is assumed to be in state [RUNNING]. But the resource manager reports it to be in state [DONE]. This means that the Evaluator failed but wasn\'t able to send an error message back to the driver."
      Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler exceptionCaught
      INFO: Unexpected exception from downstream. channel: [id: 0x51bdfbfb, /10.200.144.38:64266 => /10.200.144.168:8959] local: /10.200.144.168:8959 remote: /10.200.144.38:64266
      Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener exceptionCaught
      WARNING: ExceptionEvent: local: /10.200.144.168:8959 remote: /10.200.144.247:53406 :: java.io.IOException: An existing connection was forcibly closed by the remote host
      Mar 28, 2016 2:47:00 PM org.apache.reef.javabridge.generic.JobDriver$CompletedEvaluatorHandler onNext
      INFO: Handling the event of completed evaluator in CLR bridge.
      Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler exceptionCaught
      INFO: Unexpected exception from downstream. channel: [id: 0xdc25e349, /10.200.145.206:64394 => /10.200.144.168:8959] local: /10.200.144.168:8959 remote: /10.200.145.206:64394
      Mar 28, 2016 2:47:00 PM org.apache.reef.wake.remote.transport.netty.NettyChannelHandler exceptionCaught
      WARNING: Unexpected exception from downstream.
      java.io.IOException: An existing connection was forcibly closed by the remote host
      at sun.nio.ch.SocketDispatcher.read0(Native Method)
      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
      at sun.nio.ch.IOUtil.read(IOUtil.java:192)
      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
      at io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:446)
      at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:879)
      at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
      at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:114)
      at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
      at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
      at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
      at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
      at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
      at java.lang.Thread.run(Thread.java:745)

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              dkm2110 Dhruv Mahajan
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: