Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Given the following scenario:
1. A DAG is assigned to an AM.
2. The AM is killed (e.g. OOMKilled by k8s). HS2 keeps polling for the DAG status and runs into network errors:
hiveserver2 <14>1 2024-02-26T15:59:56.538Z hiveserver2-0 hiveserver2 1 dedef3f4-339f-4ba3-a6ae-300751d3561d [mdc@18060 class="client.DAGClientImpl" dagId="dag_1708961199044_0003_1" level="INFO" operationLogLevel="EXECUTION" queryId="hive_20240226155836_6b1e9eb9-efd7-42fd-8872-f4189c5dda3a" sessionId="9e4cb344-ad7f-4344-9b24-aedaf0e73bf4" thread="HiveServer2-Background-Pool: Thread-129"] Cannot retrieve DAG Status due to IOException: DestHost:destPort query-coordinator-0-0.query-coordinator-0-service.compute-1708603165-qlg5.svc.cluster.local:22222 , LocalHost:localPort hiveserver2-0/100.100.83.80:0. Failed on local exception: java.io.IOException: java.io.IOException: Connection reset by peer
At this point, HS2 cannot tell whether the AM is lost for good or whether this is a recoverable, intermittent network issue.
3. The AM restarts quickly and the DAGClient in HS2 tries to fetch the DAG status (getDAGStatus call) from the restarted coordinator. HS2 does not realize that it is now talking to a new AM and keeps asking for the DAG status.
4. In the AM, the exception below is thrown on every such call, and the DAGClient does not handle it:
<14>1 2024-02-05T02:06:58.065Z query-coordinator-0-4 query-coordinator 1 10757dcc-1e4c-4dd2-ba76-8a2411ab1bdf [mdc@18060 class="ipc.Server" level="INFO" thread="IPC Server handler 0 on 22222"] IPC Server handler 0 on 22222, call Call#15312255 Retry#0 org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from 127.0.0.6:56221 org.apache.tez.dag.api.TezException: No running dag at present at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:99) at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:181) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:102) at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:8513) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917) at java.base/java.security.AccessController.doPrivileged(Native Method) at java.base/javax.security.auth.Subject.doAs(Subject.java:423) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
The AM should return a specialized exception that the client can recognize and handle.
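The scenario above can be sketched as a small polling loop. This is only an illustration under stated assumptions, not the actual Tez or Hive code: `NoCurrentDagException`, `StatusSource`, `Outcome`, and `poll` are hypothetical names introduced here, and a real client would also sleep or back off between attempts. The point is that a dedicated exception type lets the client tell "the AM is reachable but no longer knows this DAG" apart from a possibly transient `IOException`.

```java
import java.io.IOException;

class DagStatusPollingSketch {

    /** Hypothetical stand-in for a dedicated "no DAG on this AM" error; not the real Tez class. */
    static class NoCurrentDagException extends Exception {
        NoCurrentDagException(String dagId) {
            super("No running DAG at present: " + dagId);
        }
    }

    /** Hypothetical minimal view of the AM's status RPC. */
    interface StatusSource {
        String getDagStatus(String dagId) throws IOException, NoCurrentDagException;
    }

    enum Outcome { FINISHED, DAG_LOST, AM_UNREACHABLE }

    /**
     * Polls until the DAG leaves the RUNNING state, the AM reports that it has
     * no such DAG, or maxIoFailures consecutive network errors have occurred.
     */
    static Outcome poll(StatusSource am, String dagId, int maxIoFailures) {
        int ioFailures = 0;
        while (true) {
            try {
                String status = am.getDagStatus(dagId);
                ioFailures = 0; // the AM answered, so the network is fine
                if (!"RUNNING".equals(status)) {
                    return Outcome.FINISHED;
                }
            } catch (NoCurrentDagException e) {
                // The AM answered but does not know the DAG (e.g. it restarted):
                // polling again cannot help, so hand the decision to the caller.
                return Outcome.DAG_LOST;
            } catch (IOException e) {
                // Possibly transient: retry, but only a bounded number of times.
                if (++ioFailures >= maxIoFailures) {
                    return Outcome.AM_UNREACHABLE;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated AM: two connection resets, then a restarted AM with no DAG.
        StatusSource restartedAm = id -> {
            if (calls[0]++ < 2) {
                throw new IOException("Connection reset by peer");
            }
            throw new NoCurrentDagException(id);
        };
        System.out.println(poll(restartedAm, "dag_1708961199044_0003_1", 5));
    }
}
```

With a distinct `DAG_LOST` outcome, the caller can decide to resubmit the DAG or fail the query instead of polling indefinitely.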
Issue Links
- causes: TEZ-4559 Fix Retry logic in case of Recovery (Resolved)
- relates to: HIVE-28093 Re-execute DAG in case of NoCurrentDAGException (Closed)
- links to