Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-19147 All branch-2 unit tests pass
  3. HBASE-20038

TestLockProcedure.testTimeout is flakey

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 2.0.0-beta-2, 2.0.0
    • proc-v2, test
    • None

    Description

      The test is simple so it is easy to find out the problem.

      2018-02-21 04:53:32,230 INFO  [Time-limited test] hbase.ResourceChecker(148): before: master.locking.TestLockProcedure#testTimeout Thread=218, OpenFileDescriptor=853, MaxFileDescriptor=60000, SystemLoadAverage=5075, ProcessCount=312, AvailableMemoryMB=5373
      2018-02-21 04:53:32,234 WARN  [Time-limited test] procedure2.ProcedureTestingUtility(146): Set Kill before store update to: false
      2018-02-21 04:53:32,278 DEBUG [Time-limited test] procedure2.ProcedureExecutor(866): Stored pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE
      2018-02-21 04:53:32,285 DEBUG [Time-limited test] locking.LockProcedure(195): Heartbeat pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE
      2018-02-21 04:53:32,286 DEBUG [PEWorker-1] locking.LockProcedure(312): LOCKED pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE
      2018-02-21 04:53:32,303 DEBUG [Time-limited test] locking.TestLockProcedure(204): Proc id 14 acquired lock.
      2018-02-21 04:53:32,298 INFO  [PEWorker-1] procedure2.TimeoutExecutorThread(82): ADDED pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE; timeout=2000, timestamp=1519188814298
      2018-02-21 04:53:33,303 DEBUG [Time-limited test] locking.LockProcedure(195): Heartbeat pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE
      2018-02-21 04:53:33,304 DEBUG [Time-limited test] locking.TestLockProcedure(225): Proc id 14 : LOCKED.
      2018-02-21 04:53:34,299 DEBUG [ProcExecTimeout] locking.LockProcedure(207): Timeout failure ProcedureEvent for pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE, ready=false, [pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE]
      2018-02-21 04:53:34,299 DEBUG [ProcExecTimeout] locking.LockProcedure(210): Calling wake on ProcedureEvent for pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE, ready=false, [pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE]
      2018-02-21 04:53:34,299 INFO  [PEWorker-1] procedure2.TimeoutExecutorThread(82): ADDED pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE; timeout=2000, timestamp=1519188816299
      2018-02-21 04:53:34,306 DEBUG [Time-limited test] locking.LockProcedure(195): Heartbeat pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE
      2018-02-21 04:53:34,306 DEBUG [Time-limited test] locking.TestLockProcedure(225): Proc id 14 : LOCKED.
      2018-02-21 04:53:34,766 WARN  [HBase-Metrics2-1] impl.MetricsConfig(125): Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
      2018-02-21 04:53:36,299 DEBUG [ProcExecTimeout] locking.LockProcedure(207): Timeout failure ProcedureEvent for pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE, ready=false, [pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE]
      2018-02-21 04:53:36,299 DEBUG [ProcExecTimeout] locking.LockProcedure(210): Calling wake on ProcedureEvent for pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE, ready=false, [pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE]
      2018-02-21 04:53:36,299 INFO  [PEWorker-1] procedure2.TimeoutExecutorThread(82): ADDED pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE; timeout=2000, timestamp=1519188818299
      2018-02-21 04:53:38,299 DEBUG [ProcExecTimeout] locking.LockProcedure(207): Timeout failure ProcedureEvent for pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE, ready=false, [pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE]
      2018-02-21 04:53:38,299 DEBUG [ProcExecTimeout] locking.LockProcedure(210): Calling wake on ProcedureEvent for pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE, ready=false, [pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE]
      2018-02-21 04:53:38,310 DEBUG [Time-limited test] locking.LockProcedure(195): Heartbeat pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE
      2018-02-21 04:53:38,310 INFO  [PEWorker-1] procedure2.TimeoutExecutorThread(82): ADDED pid=14, state=WAITING_TIMEOUT; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE; timeout=2000, timestamp=1519188820310
      2018-02-21 04:53:38,311 WARN  [Time-limited test] procedure2.ProcedureTestingUtility(146): Set Kill before store update to: false
      2018-02-21 04:53:38,342 DEBUG [PEWorker-1] locking.LockProcedure(242): UNLOCKED pid=14, state=RUNNABLE; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE
      2018-02-21 04:53:38,366 INFO  [PEWorker-1] procedure2.ProcedureExecutor(1247): Finished pid=14, state=SUCCESS; org.apache.hadoop.hbase.master.locking.LockProcedure, namespace=namespace, type=EXCLUSIVE in 6.0920sec
      2018-02-21 04:53:38,628 INFO  [Time-limited test] hbase.ResourceChecker(172): after: master.locking.TestLockProcedure#testTimeout Thread=209 (was 218), OpenFileDescriptor=851 (was 853), MaxFileDescriptor=60000 (was 60000), SystemLoadAverage=5045 (was 5075), ProcessCount=314 (was 312) - ProcessCount LEAK? -, AvailableMemoryMB=5071 (was 5373)
      

      The problem is that, sleep 2 * HEARTBEAT_TIMEOUT is not safe. It is possible that after the first TIMEOUT the procedure has not been timeout yet and then the check for unlocked and the actual timeout processing will arrive at almost the same time. You can see the above log, we are going to mark the procedure as timeout at 04:53:38,299 and the heartbeat for checking unlock arrives at 04:53:38,310.

      I think increase the wait time can solve the problem.

      Attachments

        1. HBASE-20038.patch
          1 kB
          Duo Zhang
        2. HBASE-20038.patch
          1 kB
          Michael Stack
        3. HBASE-20038.patch
          1 kB
          Michael Stack

        Activity

          People

            zhangduo Duo Zhang
            zhangduo Duo Zhang
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: