[IMPALA-2799] Query hang up if remote impalad hosts shut down - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: Impala 2.2, Impala 2.3.0
Fix Version/s: Impala 2.8.0
Component/s: Distributed Exec
Labels:
- hang
Environment:
impala version 2.3.0-cdh5.5.1 RELEASE
Linux 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64 x86_64 x86_64 GNU/Linux (VMware)

Target Version: Impala 2.8.0

Description

I tested Impala 2.3 on a 5-host cluster, with impalad running on 3 of the hosts. Occasionally, when I shut down 2 of the impalad hosts, a query hangs. This happens rarely. By checking the impalad log and the TCP connection information (through lsof), I found that after I shut down the 2 remote impalad hosts, the local impalad (the one that accepted the query request) had closed its TCP connection to one of the 2 remote impalads but still held a TCP connection to the other, and the query hung. Every time the query hangs, its execution state is 'STARTED', the last event is 'Ready to start remote fragments', and I cannot cancel the query.

BTW, I modified the default TCP keepalive parameters, setting net.ipv4.tcp_keepalive_time=30, net.ipv4.tcp_keepalive_probes=3 and net.ipv4.tcp_keepalive_intvl=10. With these settings, if the TCP server is unreachable, the TCP client should actively close the connection after 30+3*10=60 seconds, but that does not seem to happen.
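
For reference, here is a minimal Python sketch of the arithmetic above. It also covers one point that may be relevant: on Linux the net.ipv4.tcp_keepalive_* sysctls only affect sockets that have SO_KEEPALIVE enabled. The function names keepalive_detection_seconds and enable_keepalive are hypothetical helpers made up for this illustration. They are not part of Impala or Thrift, and the sketch makes no claim about whether impalad enables keepalive on its connections.

```python
import socket


def keepalive_detection_seconds(idle, probes, interval):
    """Worst-case seconds before an idle connection to a dead peer is dropped.

    idle     -> net.ipv4.tcp_keepalive_time
    probes   -> net.ipv4.tcp_keepalive_probes
    interval -> net.ipv4.tcp_keepalive_intvl
    """
    return idle + probes * interval


def enable_keepalive(sock, idle=30, probes=3, interval=10):
    """Enable keepalive on one socket (hypothetical helper for illustration).

    Without SO_KEEPALIVE the kernel sends no probes on this socket,
    whatever the sysctl values are.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket TCP_KEEP* options are platform-specific, so each one
    # is applied only if this Python build exposes it.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPCNT", probes),
        ("TCP_KEEPINTVL", interval),
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


if __name__ == "__main__":
    # Values from this report: 30 + 3 * 10
    print(keepalive_detection_seconds(30, 3, 10))  # prints 60
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

If the connection that stays open was created without SO_KEEPALIVE, the 60-second expectation would not apply to it. That is only one possible explanation and has not been verified against the impalad code.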

The following is the log related to the hung query.

I1223 19:15:36.448956 23603 coordinator.cc:315] Exec() query_id=1542be5811b01f41:4624e416aa592b8c
I1223 19:15:36.449033 23603 plan-fragment-executor.cc:85] Prepare(): query_id=1542be5811b01f41:4624e416aa592b8c instance_id=1542be5811b01f41:4624e416aa592b8d
I1223 19:15:36.449177 23603 plan-fragment-executor.cc:193] descriptor table for fragment=1542be5811b01f41:4624e416aa592b8d
tuples:
Tuple(id=0 size=24 slots=[Slot(id=0 type=STRING col_path=[3] offset=8 null=(offset=0 mask=2) slot_idx=1 field_idx=-1), Slot(id=1 type=INT col_path=[1] offset=4 null=(offset=0 mask=1) slot_idx=0 field_idx=-1)] tuple_path=[])
I1223 19:15:36.449282 23603 coordinator.cc:391] starting 3 backends for query 1542be5811b01f41:4624e416aa592b8c
I1223 19:15:36.450311 24554 fragment-mgr.cc:36] ExecPlanFragment() instance_id=1542be5811b01f41:4624e416aa592b8f coord=vm3:22000 backend#=1
I1223 19:15:36.450402 24554 plan-fragment-executor.cc:85] Prepare(): query_id=1542be5811b01f41:4624e416aa592b8c instance_id=1542be5811b01f41:4624e416aa592b8f
I1223 19:15:36.450562 24554 plan-fragment-executor.cc:193] descriptor table for fragment=1542be5811b01f41:4624e416aa592b8f
tuples:
Tuple(id=0 size=24 slots=[Slot(id=0 type=STRING col_path=[3] offset=8 null=(offset=0 mask=2) slot_idx=1 field_idx=-1), Slot(id=1 type=INT col_path=[1] offset=4 null=(offset=0 mask=1) slot_idx=0 field_idx=-1)] tuple_path=[])
I1223 19:15:36.700852 21520 plan-fragment-executor.cc:303] Open(): instance_id=1542be5811b01f41:4624e416aa592b8f
I1223 19:16:15.860250 20878 thrift-util.cc:109] TSocket::read() recv() <Host: ::ffff:192.168.7.115 Port: 45152>Connection reset by peer
I1223 19:16:15.860384 20878 thrift-util.cc:109] TThreadedServer client died: ECONNRESET
I1223 19:16:16.463649 20879 thrift-util.cc:109] TSocket::read() recv() <Host: ::ffff:192.168.7.114 Port: 49091>Connection reset by peer
I1223 19:16:16.463825 20879 thrift-util.cc:109] TThreadedServer client died: ECONNRESET
I1223 19:19:35.522938 22979 status.cc:112] Cancelled from Impala's debug web interface
    @           0x788a33  impala::Status::Status()
    @           0x9e34ea  impala::ImpalaServer::CancelQueryUrlCallback()
    @           0xae1bd1  impala::Webserver::RenderUrlWithTemplate()
    @           0xae2a61  impala::Webserver::BeginRequestCallback()
    @           0xaf2e03  handle_request
    @           0xaf45a7  process_new_connection
    @           0xaf4dd8  worker_thread
    @       0x31aa6079d1  (unknown)
    @       0x31a9ee8b6d  (unknown)
I1223 19:19:35.522980 22979 impala-server.cc:862] UnregisterQuery(): query_id=1542be5811b01f41:4624e416aa592b8c
I1223 19:19:35.523000 22979 impala-server.cc:943] Cancel(): query_id=1542be5811b01f41:4624e416aa592b8c
I1223 19:19:35.575162 22979 status.cc:112] Query not yet running
    @           0x788a33  impala::Status::Status()
    @           0x9ba69f  impala::ImpalaServer::CancelInternal()
    @           0x9c2a17  impala::ImpalaServer::UnregisterQuery()
    @           0x9e3510  impala::ImpalaServer::CancelQueryUrlCallback()
    @           0xae1bd1  impala::Webserver::RenderUrlWithTemplate()
    @           0xae2a61  impala::Webserver::BeginRequestCallback()
    @           0xaf2e03  handle_request
    @           0xaf45a7  process_new_connection
    @           0xaf4dd8  worker_thread
    @       0x31aa6079d1  (unknown)
    @       0x31a9ee8b6d  (unknown)

Issue Links

is related to

IMPALA-4038 RPC delays for single query can lead to ImpalaServer not making progress on any queries

Resolved
