Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: Impala 2.5.0
- Fix Version/s: None
Description
We see occasional very long pauses on a bare-metal stress cluster, sometimes up to 15 minutes long. Here's dmesg:
Call Trace:
 [<ffffffff81529afe>] ? thread_return+0x4e/0x7d0
 [<ffffffff810a3c05>] ? __hrtimer_start_range_ns+0x1a5/0x460
 [<ffffffff810a3f3d>] ? hrtimer_try_to_cancel+0x3d/0xd0
 [<ffffffff8152cc25>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8152cd83>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81298f43>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff8152c282>] ? down_write+0x32/0x40
 [<ffffffff81155838>] sys_munmap+0x48/0x80
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
INFO: task impalad:92504 blocked for more than 120 seconds.
      Not tainted 2.6.32-504.30.3.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
impalad       D 000000000000001d     0 92504  78335 0x00000000
 ffff8818579f1e48 0000000000000086 ffff8818579f1dc8 ffff8806947f0d18
 ffff8818579f1e68 ffff8818579f1e38 ffff8818699ea400 ffff8818579f1e18
 ffff880cad45d3a0 ffff8818579f1e38 ffff8818535a9068 ffff8818579f1fd8
Call Trace:
 [<ffffffff8152cc25>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8152cd83>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81298f43>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff8152c282>] ? down_write+0x32/0x40
 [<ffffffff81155838>] sys_munmap+0x48/0x80
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
INFO: task impalad:92512 blocked for more than 120 seconds.
      Not tainted 2.6.32-504.30.3.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
impalad       D 0000000000000008     0 92512  78335 0x00000000
 ffff8810710fde08 0000000000000086 ffff8810710fdd68 0000000000000246
 ffff8810710fdd68 ffffffff8152d18b ffff8810710fdd98 ffffffff8144bea5
 ffff8811c12f4940 ffff8805e4b7c080 ffff8812067b7ad8 ffff8810710fdfd8
Call Trace:
 [<ffffffff8152d18b>] ? _spin_unlock_bh+0x1b/0x20
 [<ffffffff8144bea5>] ? release_sock+0xe5/0x110
 [<ffffffff8144bf7c>] ? lock_sock_nested+0xac/0xc0
 [<ffffffff8152cc25>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814484a3>] ? move_addr_to_user+0x93/0xb0
 [<ffffffff8152cd83>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81298f43>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff8152c282>] ? down_write+0x32/0x40
 [<ffffffff8114571c>] sys_mmap_pgoff+0x5c/0x2d0
 [<ffffffff8100c6f5>] ? math_state_restore+0x45/0x60
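Both hung tasks are stuck in rwsem_down_write_failed underneath sys_munmap and sys_mmap_pgoff, i.e. waiting for the write side of the process's mmap semaphore, which suggests some other thread (or the kernel itself) is holding it for a long time. A rough way to catch the culprit in the act, while the stall is happening, is to list D-state (uninterruptible) tasks and dump their kernel stacks; this is a generic diagnostic sketch, not something from the original report, and it assumes root and a kernel that exposes /proc/<pid>/stack:

```shell
#!/bin/sh
# List tasks in uninterruptible sleep (state D) and print each one's
# kernel stack, so we can see what they are blocked on.
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    echo "== pid $pid =="
    cat /proc/$pid/stack 2>/dev/null
done
```

If /proc/<pid>/stack is unavailable, `echo w > /proc/sysrq-trigger` (with sysrq enabled) dumps all blocked tasks to dmesg, which is where the traces above came from.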
The OS release:
[root@vc0714 henry]# lsb_release -a
LSB Version:    :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: CentOS
Description:    CentOS release 6.6 (Final)
Release:        6.6
Codename:       Final
One thought is that THP might be causing this:
[root@vc0714 henry]# cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
[always] madvise never
[root@vc0714 henry]# cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
always madvise [never]
[root@vc0714 henry]#
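If THP turns out to be the culprit, one mitigation to try is switching it from `always` to `madvise` (or `never`) so the kernel stops transparently promoting impalad's heap to huge pages. This is a sketch assuming the RHEL 6 sysfs paths shown above (upstream kernels use /sys/kernel/mm/transparent_hugepage instead); it requires root and does not persist across reboots:

```shell
# Stop transparently backing anonymous memory with huge pages
# except where the application opts in via madvise().
echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/enabled
# Keep synchronous defrag/compaction off as well (already [never] above).
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
```

To make the change persistent, the same echo lines are typically added to /etc/rc.local or a boot-time init script.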