Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: Impala 2.5.0
- Fix Version/s: None
Description
We see occasional very long pauses on a bare-metal stress cluster, sometimes up to 15 minutes long. Here's dmesg:
Call Trace:
 [<ffffffff81529afe>] ? thread_return+0x4e/0x7d0
 [<ffffffff810a3c05>] ? __hrtimer_start_range_ns+0x1a5/0x460
 [<ffffffff810a3f3d>] ? hrtimer_try_to_cancel+0x3d/0xd0
 [<ffffffff8152cc25>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8152cd83>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81298f43>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff8152c282>] ? down_write+0x32/0x40
 [<ffffffff81155838>] sys_munmap+0x48/0x80
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
INFO: task impalad:92504 blocked for more than 120 seconds.
      Not tainted 2.6.32-504.30.3.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
impalad       D 000000000000001d     0 92504  78335 0x00000000
 ffff8818579f1e48 0000000000000086 ffff8818579f1dc8 ffff8806947f0d18
 ffff8818579f1e68 ffff8818579f1e38 ffff8818699ea400 ffff8818579f1e18
 ffff880cad45d3a0 ffff8818579f1e38 ffff8818535a9068 ffff8818579f1fd8
Call Trace:
 [<ffffffff8152cc25>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8152cd83>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81298f43>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff8152c282>] ? down_write+0x32/0x40
 [<ffffffff81155838>] sys_munmap+0x48/0x80
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
INFO: task impalad:92512 blocked for more than 120 seconds.
      Not tainted 2.6.32-504.30.3.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
impalad       D 0000000000000008     0 92512  78335 0x00000000
 ffff8810710fde08 0000000000000086 ffff8810710fdd68 0000000000000246
 ffff8810710fdd68 ffffffff8152d18b ffff8810710fdd98 ffffffff8144bea5
 ffff8811c12f4940 ffff8805e4b7c080 ffff8812067b7ad8 ffff8810710fdfd8
Call Trace:
 [<ffffffff8152d18b>] ? _spin_unlock_bh+0x1b/0x20
 [<ffffffff8144bea5>] ? release_sock+0xe5/0x110
 [<ffffffff8144bf7c>] ? lock_sock_nested+0xac/0xc0
 [<ffffffff8152cc25>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814484a3>] ? move_addr_to_user+0x93/0xb0
 [<ffffffff8152cd83>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81298f43>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff8152c282>] ? down_write+0x32/0x40
 [<ffffffff8114571c>] sys_mmap_pgoff+0x5c/0x2d0
 [<ffffffff8100c6f5>] ? math_state_restore+0x45/0x60
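Both hung tasks are stuck in rwsem_down_write_failed underneath sys_munmap and sys_mmap_pgoff, i.e. waiting for the write side of the process's mmap semaphore, which suggests some other thread (or the kernel itself) is holding it for a long time. A rough way to catch the culprit in the act, while the stall is happening, is to list D-state (uninterruptible) tasks and dump their kernel stacks; this is a generic diagnostic sketch, not something from the original report, and it assumes root and a kernel that exposes /proc/<pid>/stack:

```shell
#!/bin/sh
# List tasks in uninterruptible sleep (state D) and print each one's
# kernel stack, so we can see what they are blocked on.
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    echo "== pid $pid =="
    cat /proc/$pid/stack 2>/dev/null
done
```

If /proc/<pid>/stack is unavailable, `echo w > /proc/sysrq-trigger` (with sysrq enabled) dumps all blocked tasks to dmesg, which is where the traces above came from.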
The OS release:
[root@vc0714 henry]# lsb_release -a
LSB Version:    :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: CentOS
Description:    CentOS release 6.6 (Final)
Release:        6.6
Codename:       Final
One thought is that THP might be causing this:
[root@vc0714 henry]# cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
[always] madvise never
[root@vc0714 henry]# cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
always madvise [never]
[root@vc0714 henry]#
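If THP turns out to be the culprit, one mitigation to try is switching it from `always` to `madvise` (or `never`) so the kernel stops transparently promoting impalad's heap to huge pages. This is a sketch assuming the RHEL 6 sysfs paths shown above (upstream kernels use /sys/kernel/mm/transparent_hugepage instead); it requires root and does not persist across reboots:

```shell
# Stop transparently backing anonymous memory with huge pages
# except where the application opts in via madvise().
echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/enabled
# Keep synchronous defrag/compaction off as well (already [never] above).
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
```

To make the change persistent, the same echo lines are typically added to /etc/rc.local or a boot-time init script.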