Details
- Bug
- Status: Resolved
- Minor
- Resolution: Duplicate
- Impala 2.2, Impala 2.3.0
- impala version 2.3.0-cdh5.5.1 RELEASE
  Linux 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64 x86_64 x86_64 GNU/Linux (VMware)
Description
I tested Impala 2.3 on a 5-host cluster, with 3 of the hosts running impalad. Occasionally, when I shut down 2 of the impalad hosts, a query hangs. This happens rarely. By checking the impalad log and the TCP connection information (through lsof), I found that when I shut down the 2 remote impalad hosts, the local impalad (i.e. the impalad that accepted the query request) closed its TCP connection to one of the 2 remote impalads but still held a TCP connection to the other, and the query hung. Every time the query hangs, the execution state is 'STARTED', the last event is 'Ready to start remote fragments', and I cannot cancel the query.
BTW, I modified the default TCP keepalive parameters, setting net.ipv4.tcp_keepalive_time=30, net.ipv4.tcp_keepalive_probes=3 and net.ipv4.tcp_keepalive_intvl=10. This should mean that if the TCP server is unreachable, the keepalive settings make the TCP client actively close the connection after 30+3*10=60 seconds, but that does not seem to happen.
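As a rough illustration of the keepalive arithmetic above, here is a minimal Python sketch. The helper names `keepalive_detection_seconds` and `enable_keepalive` are hypothetical and are not part of Impala or Thrift. One caveat that may be relevant: on Linux the `net.ipv4.tcp_keepalive_*` sysctls only affect sockets that have `SO_KEEPALIVE` enabled, and this report does not say whether the impalad connections in question enable it.

```python
import socket


def keepalive_detection_seconds(idle, probes, interval):
    """Approximate time before an idle connection to a silent peer is dropped.

    idle     -> net.ipv4.tcp_keepalive_time
    probes   -> net.ipv4.tcp_keepalive_probes
    interval -> net.ipv4.tcp_keepalive_intvl
    """
    return idle + probes * interval


def enable_keepalive(sock, idle=30, probes=3, interval=10):
    """Turn on keepalive for one socket (hypothetical helper, not Impala code)."""
    # Without SO_KEEPALIVE the kernel never sends probes on this socket,
    # regardless of the system-wide sysctl values.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the sysctl defaults; these constants are
    # Linux-specific, so guard for other platforms.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)


if __name__ == "__main__":
    # Values from the report: 30 + 3 * 10
    print(keepalive_detection_seconds(30, 3, 10))
```

With the reported values the helper returns 60, matching the figure in the description. If a connection stays open well past that point, one thing to check is whether keepalive is actually enabled on that particular socket.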
The following is the log related to the hung query.
I1223 19:15:36.448956 23603 coordinator.cc:315] Exec() query_id=1542be5811b01f41:4624e416aa592b8c
I1223 19:15:36.449033 23603 plan-fragment-executor.cc:85] Prepare(): query_id=1542be5811b01f41:4624e416aa592b8c instance_id=1542be5811b01f41:4624e416aa592b8d
I1223 19:15:36.449177 23603 plan-fragment-executor.cc:193] descriptor table for fragment=1542be5811b01f41:4624e416aa592b8d tuples: Tuple(id=0 size=24 slots=[Slot(id=0 type=STRING col_path=[3] offset=8 null=(offset=0 mask=2) slot_idx=1 field_idx=-1), Slot(id=1 type=INT col_path=[1] offset=4 null=(offset=0 mask=1) slot_idx=0 field_idx=-1)] tuple_path=[])
I1223 19:15:36.449282 23603 coordinator.cc:391] starting 3 backends for query 1542be5811b01f41:4624e416aa592b8c
I1223 19:15:36.450311 24554 fragment-mgr.cc:36] ExecPlanFragment() instance_id=1542be5811b01f41:4624e416aa592b8f coord=vm3:22000 backend#=1
I1223 19:15:36.450402 24554 plan-fragment-executor.cc:85] Prepare(): query_id=1542be5811b01f41:4624e416aa592b8c instance_id=1542be5811b01f41:4624e416aa592b8f
I1223 19:15:36.450562 24554 plan-fragment-executor.cc:193] descriptor table for fragment=1542be5811b01f41:4624e416aa592b8f tuples: Tuple(id=0 size=24 slots=[Slot(id=0 type=STRING col_path=[3] offset=8 null=(offset=0 mask=2) slot_idx=1 field_idx=-1), Slot(id=1 type=INT col_path=[1] offset=4 null=(offset=0 mask=1) slot_idx=0 field_idx=-1)] tuple_path=[])
I1223 19:15:36.700852 21520 plan-fragment-executor.cc:303] Open(): instance_id=1542be5811b01f41:4624e416aa592b8f
I1223 19:16:15.860250 20878 thrift-util.cc:109] TSocket::read() recv() <Host: ::ffff:192.168.7.115 Port: 45152>Connection reset by peer
I1223 19:16:15.860384 20878 thrift-util.cc:109] TThreadedServer client died: ECONNRESET
I1223 19:16:16.463649 20879 thrift-util.cc:109] TSocket::read() recv() <Host: ::ffff:192.168.7.114 Port: 49091>Connection reset by peer
I1223 19:16:16.463825 20879 thrift-util.cc:109] TThreadedServer client died: ECONNRESET
I1223 19:19:35.522938 22979 status.cc:112] Cancelled from Impala's debug web interface
    @ 0x788a33 impala::Status::Status()
    @ 0x9e34ea impala::ImpalaServer::CancelQueryUrlCallback()
    @ 0xae1bd1 impala::Webserver::RenderUrlWithTemplate()
    @ 0xae2a61 impala::Webserver::BeginRequestCallback()
    @ 0xaf2e03 handle_request
    @ 0xaf45a7 process_new_connection
    @ 0xaf4dd8 worker_thread
    @ 0x31aa6079d1 (unknown)
    @ 0x31a9ee8b6d (unknown)
I1223 19:19:35.522980 22979 impala-server.cc:862] UnregisterQuery(): query_id=1542be5811b01f41:4624e416aa592b8c
I1223 19:19:35.523000 22979 impala-server.cc:943] Cancel(): query_id=1542be5811b01f41:4624e416aa592b8c
I1223 19:19:35.575162 22979 status.cc:112] Query not yet running
    @ 0x788a33 impala::Status::Status()
    @ 0x9ba69f impala::ImpalaServer::CancelInternal()
    @ 0x9c2a17 impala::ImpalaServer::UnregisterQuery()
    @ 0x9e3510 impala::ImpalaServer::CancelQueryUrlCallback()
    @ 0xae1bd1 impala::Webserver::RenderUrlWithTemplate()
    @ 0xae2a61 impala::Webserver::BeginRequestCallback()
    @ 0xaf2e03 handle_request
    @ 0xaf45a7 process_new_connection
    @ 0xaf4dd8 worker_thread
    @ 0x31aa6079d1 (unknown)
    @ 0x31a9ee8b6d (unknown)
Issue Links
- is related to
  - IMPALA-4038 RPC delays for single query can lead to ImpalaServer not making progress on any queries (Resolved)