Description
I was running the LongClean randomwalk test with agitation. I came back to find three tables "stuck" in DELETING on the monitor and a generally idle system. Upon investigation, multiple FATE txns appear to be deadlocked, in addition to the DeleteTable txns:
txid: 7ca950aa8de76a17 status: IN_PROGRESS op: DeleteTable locked: [W:2dc] locking: [] top: CleanUp
txid: 1071086efdbed442 status: IN_PROGRESS op: BulkImport locked: [R:2cr] locking: [] top: LoadFiles
txid: 32b86cfe06c2ed5d status: IN_PROGRESS op: DeleteTable locked: [W:2d9] locking: [] top: CleanUp
txid: 358c065b6cb0516b status: IN_PROGRESS op: DeleteTable locked: [W:2dw] locking: [] top: CleanUp
txid: 26b738ee0b044a96 status: IN_PROGRESS op: BulkImport locked: [R:2cr] locking: [] top: CopyFailed
txid: 16edd31b3723dc5b status: IN_PROGRESS op: BulkImport locked: [R:2cr] locking: [] top: CopyFailed
txid: 63c587eb3df6c1b2 status: IN_PROGRESS op: CompactRange locked: [R:2cr] locking: [] top: CompactionDriver
txid: 722d8e5488531735 status: IN_PROGRESS op: BulkImport locked: [R:2cr] locking: [] top: CopyFailed
I started digging into the DeleteTable ops. Each txn still appears to be active and to hold the table_lock for its respective table in ZK, yet the /tables/id/ node and all of its children (state, conf, name, etc.) still exist.
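For completeness, this is roughly how the ZK state can be inspected. A minimal sketch using the plain ZooKeeper client; the connect string and the table path are placeholders, not values from this run:

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class CheckTableNode {
  public static void main(String[] args) throws Exception {
    // Placeholder connect string and table path; substitute the real values.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});
    String tablePath = "/accumulo/<instance-id>/tables/<table-id>";

    // After DeleteTable's CleanUp step, this node and its children (state,
    // conf, name, ...) should be gone; here they still exist.
    if (zk.exists(tablePath, false) != null) {
      List<String> children = zk.getChildren(tablePath, false);
      System.out.println(tablePath + " still exists with children: " + children);
    }
    zk.close();
  }
}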
Looking at some thread dumps, I have the default (4) repo runner threads; three of them are blocked on bulk imports:
"Repo runner 2" daemon prio=10 tid=0x000000000262b800 nid=0x1ae7 waiting on condition [0x00007f25168e7000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0000000705a05eb8> (a java.util.concurrent.FutureTask) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:425) at java.util.concurrent.FutureTask.get(FutureTask.java:187) at org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:561) at org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:449) at org.apache.accumulo.server.master.tableOps.TraceRepo.call(TraceRepo.java:65) at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:64) at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:34) at java.lang.Thread.run(Thread.java:744)
The 4th repo runner is stuck trying to reserve a new txn (though I'm not sure why it's blocked like this):
"Repo runner 1" daemon prio=10 tid=0x0000000002627800 nid=0x1ae6 in Object.wait() [0x00007f25169e8000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:503) at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1313) - locked <0x00000007014d9928> (a org.apache.zookeeper.ClientCnxn$Packet) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1149) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1180) at org.apache.accumulo.fate.zookeeper.ZooReader.getData(ZooReader.java:44) at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.accumulo.server.zookeeper.ZooReaderWriter$1.invoke(ZooReaderWriter.java:67) at com.sun.proxy.$Proxy11.getData(Unknown Source) at org.apache.accumulo.fate.ZooStore.reserve(ZooStore.java:160) at org.apache.accumulo.fate.AgeOffStore.reserve(AgeOffStore.java:156) at org.apache.accumulo.fate.Fate$TransactionRunner.run(Fate.java:52) at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:34) at java.lang.Thread.run(Thread.java:744)
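This looks like the normal wait inside ZooKeeper's synchronous read path rather than a lock per se: a sync getData parks the calling thread inside ClientCnxn.submitRequest() until the server responds. A minimal sketch of the two read styles (connect string and path are placeholders):

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SyncVsAsyncRead {
  public static void main(String[] args) throws Exception {
    // Placeholder connect string and path; substitute real values.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});
    String path = "/some/fate/txn/node";

    if (zk.exists(path, false) != null) {
      // Synchronous read: the caller waits on the request packet until the
      // response arrives. That wait is the Object.wait() frame in the dump.
      byte[] data = zk.getData(path, false, new Stat());
      System.out.println("read " + data.length + " bytes synchronously");
    }

    // Asynchronous read: the request is queued and the callback runs on the
    // client's event thread, so the caller is never parked.
    zk.getData(path, false,
        (rc, p, ctx, d, s) -> System.out.println("async rc=" + rc),
        null);

    Thread.sleep(1000); // give the async callback a moment before closing
    zk.close();
  }
}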
There were no obvious errors on the monitor, and the master is still in this state.