Uploaded image for project: 'Apache Tez'
  1. Apache Tez
  2. TEZ-4277

AsyncDispatcher can hang on serviceStop if the eventhandling thread is in BLOCKED state

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 0.10.1
    • None
    • None

    Description

      While further testing TEZ-4236, I discovered a strange hang in AsyncDispatcher. The symptom was a BLOCKED state event-handling thread. This block is most likely caused by the dispatcher itself, so instead of making yet another sync hack in the dispatcher, or even DrainDispatcher (however they have a common codebase for serviceStop at the moment), I'm about to put a safety check, which solved my issue.

      jstack of a hang (it hangs here)

      "Thread-1212" #1581 prio=5 os_prio=31 tid=0x00007fdb4c628800 nid=0x1b84f in Object.wait() [0x000070000997b000]
      java.lang.Thread.State: WAITING (on object monitor)
      at java.lang.Object.wait(Native Method)
      at java.lang.Thread.join(Thread.java:1252)
      - locked <0x00000007ae967340> (a java.lang.Thread)
      at java.lang.Thread.join(Thread.java:1326)
      at org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:157)
      at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
      - locked <0x00000007ae642cc8> (a java.lang.Object)
      at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54)
      at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102)
      at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:67)
      at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1933)
      at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2155)
      - locked <0x00000007ae53d3c0> (a org.apache.tez.dag.app.LocalDAGAppMaster)
      at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
      - locked <0x00000007ae53d5f0> (a java.lang.Object)
      at org.apache.tez.client.LocalClient.close(LocalClient.java:135)
      at org.apache.tez.client.TezClient.stop(TezClient.java:775)
      - locked <0x00000007ade485a0> (a org.apache.tez.client.TezClient)
      at org.apache.tez.test.TestLocalMode.testMultipleClientsWithSession(TestLocalMode.java:147)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
      at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
      at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
      at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
      at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
      

      during the investigation I added some log messages, that lead to this solution:

      1. there is not a single event in the queue
      2. event handling thread is in BLOCKED state
      3. AM is already shut down, but test hangs as LocalClient.stop waits to AM services to be stopped

      2021-01-29 16:16:44,350 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to stop, ignoring any new events.
      ...
      2021-01-29 16:16:45,354 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
      2021-01-29 16:16:46,358 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
      2021-01-29 16:16:47,362 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
      ...
      2021-01-29 16:16:48,365 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
      2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster (DAGAppMaster.java:run(1002)) - Calling stop for all the services
      2021-01-29 16:16:49,349 INFO  [AMShutdownThread] app.DAGAppMaster (DAGAppMaster.java:run(1013)) - Exiting DAGAppMaster..GoodBye!
      2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(158)) - Waiting for AsyncDispatcher to drain...current events count: 0, handler thread state: BLOCKED
      2021-01-29 16:16:49,370 WARN  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(153)) - Cannot wait for drain events anymore, logging current events if any...
      2021-01-29 16:16:49,370 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:logCurrentEvents(186)) - Current events in the queue: 0
      2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(168)) - interrupting event handling thread, state: BLOCKED
      2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(170)) - interrupted event handling thread, state: BLOCKED
      2021-01-29 16:16:49,371 INFO  [Thread-2198] common.AsyncDispatcher (AsyncDispatcher.java:serviceStop(172)) - joining event handling thread, state: BLOCKED
      

      under normal circumstances the event handling thread is in WAITING state (which I think is because of queue.take)

      + I also included the timeout logic of YARN-3999 here

      Attachments

        Issue Links

          Activity

            People

              abstractdog László Bodor
              abstractdog László Bodor
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 50m
                  50m