IMPALA-3107

Occasional very long pause in kernel

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.5.0
    • Fix Version/s: Impala 2.5.0
    • Component/s: Backend
    • Labels: None

    Description

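      We see occasional very long pauses on a bare-metal stress cluster, sometimes lasting up to 15 minutes. dmesg shows impalad threads blocked for more than 120 seconds in munmap and mmap, each waiting to write-lock a read-write semaphore: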

      Call Trace:
       [<ffffffff81529afe>] ? thread_return+0x4e/0x7d0
       [<ffffffff810a3c05>] ? __hrtimer_start_range_ns+0x1a5/0x460
       [<ffffffff810a3f3d>] ? hrtimer_try_to_cancel+0x3d/0xd0
       [<ffffffff8152cc25>] rwsem_down_failed_common+0x95/0x1d0
       [<ffffffff8152cd83>] rwsem_down_write_failed+0x23/0x30
       [<ffffffff81298f43>] call_rwsem_down_write_failed+0x13/0x20
       [<ffffffff8152c282>] ? down_write+0x32/0x40
       [<ffffffff81155838>] sys_munmap+0x48/0x80
       [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
      INFO: task impalad:92504 blocked for more than 120 seconds.
            Not tainted 2.6.32-504.30.3.el6.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      impalad       D 000000000000001d     0 92504  78335 0x00000000
       ffff8818579f1e48 0000000000000086 ffff8818579f1dc8 ffff8806947f0d18
       ffff8818579f1e68 ffff8818579f1e38 ffff8818699ea400 ffff8818579f1e18
       ffff880cad45d3a0 ffff8818579f1e38 ffff8818535a9068 ffff8818579f1fd8
      Call Trace:
       [<ffffffff8152cc25>] rwsem_down_failed_common+0x95/0x1d0
       [<ffffffff8152cd83>] rwsem_down_write_failed+0x23/0x30
       [<ffffffff81298f43>] call_rwsem_down_write_failed+0x13/0x20
       [<ffffffff8152c282>] ? down_write+0x32/0x40
       [<ffffffff81155838>] sys_munmap+0x48/0x80
       [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
      INFO: task impalad:92512 blocked for more than 120 seconds.
            Not tainted 2.6.32-504.30.3.el6.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      impalad       D 0000000000000008     0 92512  78335 0x00000000
       ffff8810710fde08 0000000000000086 ffff8810710fdd68 0000000000000246
       ffff8810710fdd68 ffffffff8152d18b ffff8810710fdd98 ffffffff8144bea5
       ffff8811c12f4940 ffff8805e4b7c080 ffff8812067b7ad8 ffff8810710fdfd8
      Call Trace:
       [<ffffffff8152d18b>] ? _spin_unlock_bh+0x1b/0x20
       [<ffffffff8144bea5>] ? release_sock+0xe5/0x110
       [<ffffffff8144bf7c>] ? lock_sock_nested+0xac/0xc0
       [<ffffffff8152cc25>] rwsem_down_failed_common+0x95/0x1d0
       [<ffffffff814484a3>] ? move_addr_to_user+0x93/0xb0
       [<ffffffff8152cd83>] rwsem_down_write_failed+0x23/0x30
       [<ffffffff81298f43>] call_rwsem_down_write_failed+0x13/0x20
       [<ffffffff8152c282>] ? down_write+0x32/0x40
       [<ffffffff8114571c>] sys_mmap_pgoff+0x5c/0x2d0
       [<ffffffff8100c6f5>] ? math_state_restore+0x45/0x60
      
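To see at a glance which system calls the hung threads are stuck in, the hung-task reports can be summarized with a short script. This is a minimal sketch, assuming the dmesg output has been saved as text; the function name `blocked_syscalls` and both regular expressions are illustrative and not part of Impala or of any kernel tool. Frames marked with `?` are skipped because the kernel flags them as unreliable.

```python
import re
from collections import Counter

# Hung-task header, e.g.
# "INFO: task impalad:92504 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than \d+ seconds\."
)

# Stack frame, e.g. "[<ffffffff81155838>] sys_munmap+0x48/0x80".
# An optional "?" marks a frame the kernel considers unreliable.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?(?P<sym>[\w.]+)\+0x"
)


def blocked_syscalls(dmesg_text):
    """Count hung-task reports per (command, syscall entry point)."""
    counts = Counter()
    current = None  # command name of the report being read, if any
    for line in dmesg_text.splitlines():
        header = HUNG_TASK.search(line)
        if header:
            current = header.group("comm")
            continue
        frame = FRAME.search(line)
        if (
            current
            and frame
            and not frame.group("unreliable")
            and frame.group("sym").startswith("sys_")
        ):
            counts[(current, frame.group("sym"))] += 1
            current = None  # only the first syscall frame per report counts
    return counts


if __name__ == "__main__":
    import sys

    for (comm, sym), n in blocked_syscalls(sys.stdin.read()).most_common():
        print(f"{comm}\t{sym}\t{n}")
```

Run as `dmesg | python summarize_hung_tasks.py` (the script name is arbitrary). A trace whose header was cut off, like the first one above, is ignored because there is no task name to attribute it to.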

      The distribution (the kernel, per the dmesg output above, is 2.6.32-504.30.3.el6.x86_64):

      [root@vc0714 henry]# lsb_release -a
      LSB Version:    :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
      Distributor ID: CentOS
      Description:    CentOS release 6.6 (Final)
      Release:        6.6
      Codename:       Final
      

      One hypothesis is that transparent huge pages (THP) are causing this. THP is enabled system-wide, with defrag set to never:

      [root@vc0714 henry]# cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
      [always] madvise never
      [root@vc0714 henry]# cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
      always madvise [never]
      [root@vc0714 henry]#
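The same check can be scripted for running across every node in a cluster. The following is a rough sketch rather than an Impala utility: the helper names `active_option` and `thp_settings` are made up for illustration. It reads the Red Hat specific sysfs directory shown above and falls back to `/sys/kernel/mm/transparent_hugepage`, the location used by mainline kernels.

```python
from pathlib import Path

# The first path is the RHEL/CentOS 6 location used in this report;
# the second is the location on mainline kernels.
THP_DIRS = [
    Path("/sys/kernel/mm/redhat_transparent_hugepage"),
    Path("/sys/kernel/mm/transparent_hugepage"),
]


def active_option(text):
    """Return the bracketed choice, e.g. 'always' for '[always] madvise never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None  # no bracketed token found


def thp_settings(base_dirs=THP_DIRS):
    """Return the active 'enabled' and 'defrag' values from the first directory found."""
    for base in base_dirs:
        if base.is_dir():
            return {
                name: active_option((base / name).read_text())
                for name in ("enabled", "defrag")
                if (base / name).is_file()
            }
    return {}  # THP sysfs directory not present on this host


if __name__ == "__main__":
    for name, value in thp_settings().items():
        print(f"{name}: {value}")
```

Parsing the bracketed token rather than comparing whole strings keeps the helper usable on kernels that offer a different set of choices in these files.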
      

      Attachments

        1. gdb-bt.txt.gz
          842 kB
          Daniel Hecht

          People

            Assignee: Matthew Jacobs (mjacobs)
            Reporter: Henry Robinson (henryr)
            Votes: 0
            Watchers: 15