HBase / HBASE-19163

"Maximum lock count exceeded" from region server's batch processing


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0-alpha-1, 2.0.0-alpha-3, 1.2.7
    • Fix Version/s: 1.4.1, 1.3.3, 2.0.0
    • Component/s: regionserver
    • Labels: None
    • Release Note:
      When a batch contains many mutations against the same row, each mutation acquires a shared row lock, so the batch can exceed the maximum shared lock count that Java's ReentrantReadWriteLock supports (65535, about 64k). As part of this fix, along with other optimizations, a batch may be divided into multiple minibatches, and a new configuration property limits the maximum number of mutations in a minibatch:

         <property>
          <name>hbase.regionserver.minibatch.size</name>
          <value>20000</value>
         </property>
      The default value is 20000.
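
      The chunking described in this release note can be illustrated with a small, self-contained sketch. It is not HBase's actual code: the class MiniBatchSketch and the method toMiniBatches are hypothetical names, and mutations are modeled as plain list elements. It only shows how capping each minibatch at 20000 mutations bounds the number of shared row locks that one minibatch can take.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not taken from the HBase patch.
final class MiniBatchSketch {
    // Mirrors the default of hbase.regionserver.minibatch.size from the release note.
    static final int DEFAULT_MINIBATCH_SIZE = 20000;

    /** Splits a batch into consecutive chunks of at most miniBatchSize elements. */
    static <T> List<List<T>> toMiniBatches(List<T> batch, int miniBatchSize) {
        if (miniBatchSize <= 0) {
            throw new IllegalArgumentException("miniBatchSize must be positive");
        }
        List<List<T>> miniBatches = new ArrayList<>();
        for (int start = 0; start < batch.size(); start += miniBatchSize) {
            int end = Math.min(start + miniBatchSize, batch.size());
            // subList is a view, so no mutation is copied.
            miniBatches.add(batch.subList(start, end));
        }
        return miniBatches;
    }

    public static void main(String[] args) {
        // 262836 is the failed operation count reported in the log in the description.
        List<String> batch = Collections.nCopies(262836, "mutation");
        List<List<String>> miniBatches = toMiniBatches(batch, DEFAULT_MINIBATCH_SIZE);
        // Prints: 14 minibatches, last one has 2836 mutations
        System.out.println(miniBatches.size() + " minibatches, last one has "
            + miniBatches.get(miniBatches.size() - 1).size() + " mutations");
    }
}
```

      Even if every mutation in a minibatch targeted the same row, at most 20000 shared locks would be held at once under this scheme, which stays below the lock's limit.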

    Description

      In one of our use cases, we hit the following exception, and replication got stuck.

      2017-10-25 19:41:17,199 WARN  [hconnection-0x28db294f-shared--pool4-t936] client.AsyncProcess: #3, table=foo, attempt=5/5 failed=262836ops, last exception: java.io.IOException: java.io.IOException: Maximum lock count exceeded
              at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2215)
              at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109)
              at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:185)
              at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:165)
      Caused by: java.lang.Error: Maximum lock count exceeded
              at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.fullTryAcquireShared(ReentrantReadWriteLock.java:528)
              at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:488)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1327)
              at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.tryLock(ReentrantReadWriteLock.java:871)
              at org.apache.hadoop.hbase.regionserver.HRegion.getRowLock(HRegion.java:5163)
              at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3018)
              at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2877)
              at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2819)
              at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:753)
              at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:715)
              at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2148)
              at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33656)
              at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)
              ... 3 more
      
      

      While we are still examining the data pattern, it is clear that the batch contains too many mutations against the same row. This exceeds the maximum shared lock count (64k), so the lock throws an error and the whole batch fails.
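
      The limit itself can be reproduced outside HBase with plain JDK classes. The following is a minimal sketch, assuming a single thread that re-acquires the read lock without releasing it; the class name ReadLockLimitDemo and the method maxReadHolds are made up for this illustration. It counts how many shared holds succeed before the JDK throws the same "Maximum lock count exceeded" error seen in the stack trace.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; only JDK APIs are used.
final class ReadLockLimitDemo {
    /** Returns the number of read-lock holds that succeed before the JDK refuses another. */
    static int maxReadHolds() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        int held = 0;
        try {
            while (true) {
                lock.readLock().lock(); // throws java.lang.Error once the limit is reached
                held++;
            }
        } catch (Error e) {
            // Only swallow the specific error under study; rethrow anything else.
            if (!"Maximum lock count exceeded".equals(e.getMessage())) {
                throw e;
            }
        } finally {
            // Release every hold that was actually acquired.
            for (int i = 0; i < held; i++) {
                lock.readLock().unlock();
            }
        }
        return held;
    }

    public static void main(String[] args) {
        // Prints: 65535
        System.out.println(maxReadHolds());
    }
}
```

      Note that the failure is a java.lang.Error rather than an exception, which is why it surfaces as a wrapped IOException on the client side in the log above.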

      There are two possible approaches to solving this issue.

      1). If the batch contains multiple mutations against the same row, acquire the row lock once per row instead of once per mutation.
      2). Catch the error, process the mutations whose locks were already acquired, and then loop back for the rest.

      With HBASE-17924 in place, approach 1 now seems easy to implement.
      I am creating this JIRA now and will post updates and a patch as the investigation moves forward.
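
      Approach 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the attached patch: rows are modeled as String keys, and the class PerRowLockSketch, its rowLocks map, and the method applyBatch are hypothetical. The point is only that the number of shared lock acquisitions depends on the number of distinct rows, not on the number of mutations.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration; not HBase's HRegion implementation.
final class PerRowLockSketch {
    // One read-write lock per row key (assumption: rows are identified by String).
    private final Map<String, ReentrantReadWriteLock> rowLocks = new HashMap<>();

    /**
     * Processes one mutation per entry in rowKeys, taking the shared row lock
     * only the first time each row appears. Returns the number of lock acquisitions.
     */
    int applyBatch(List<String> rowKeys) {
        Map<String, Lock> held = new LinkedHashMap<>();
        try {
            for (String row : rowKeys) {
                if (!held.containsKey(row)) {
                    Lock shared = rowLocks
                        .computeIfAbsent(row, k -> new ReentrantReadWriteLock())
                        .readLock();
                    shared.lock();
                    held.put(row, shared);
                }
                // The mutation for this row would be applied here.
            }
            return held.size();
        } finally {
            // Release each row lock exactly once.
            for (Lock shared : held.values()) {
                shared.unlock();
            }
        }
    }

    public static void main(String[] args) {
        // 100000 mutations against one row would exceed the limit with per-mutation locking.
        List<String> batch = Collections.nCopies(100000, "row-1");
        // Prints: 1
        System.out.println(new PerRowLockSketch().applyBatch(batch));
    }
}
```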

      Attachments

        1. unittest-case.diff (1 kB, Hua Xiang)
        2. HBASE-19163-master-v001.patch (5 kB, Hua Xiang)
        3. HBASE-19163.master.001.patch (5 kB, Hua Xiang)
        4. HBASE-19163.master.002.patch (5 kB, Hua Xiang)
        5. HBASE-19163.master.004.patch (5 kB, Hua Xiang)
        6. HBASE-19163.master.005.patch (13 kB, Hua Xiang)
        7. HBASE-19163.master.006.patch (17 kB, Hua Xiang)
        8. HBASE-19163.master.007.patch (17 kB, Hua Xiang)
        9. HBASE-19163.master.008.patch (18 kB, Hua Xiang)
        10. HBASE-19163.master.009.patch (19 kB, Hua Xiang)
        11. HBASE-19163.master.009.patch (19 kB, Hua Xiang)
        12. HBASE-19163.master.010.patch (13 kB, Hua Xiang)
        13. HBASE-19163-branch-1-v001.patch (8 kB, Hua Xiang)
        14. HBASE-19163-branch-1-v001.patch (8 kB, Hua Xiang)

            People

              Assignee: Hua Xiang (huaxiang)
              Reporter: Hua Xiang (huaxiang)