[ZOOKEEPER-3531] Synchronization on ACLCache cause cluster to hang when network/disk issues happen during datatree serialization - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 3.5.2, 3.5.3, 3.5.4, 3.5.5
Fix Version/s: 3.6.0
Component/s: None
Labels:
- pull-request-available

Description

During our ZooKeeper fault injection testing, we observed that sometimes the ZK cluster could hang (requests time out, node status shows ok). After inspecting the issue, we believe this is caused by I/O (serializing ACLCache) inside a critical section. The bug is essentially similar to what is described in ZooKeeper-2201.

org.apache.zookeeper.server.DataTree#serialize calls the aclCache.serialize when doing dataree serialization, however, org.apache.zookeeper.server.ReferenceCountedACLCache#serialize could get stuck at OutputArchieve.writeInt due to potential network/disk issues. This can cause the system experiences hanging issues similar to ZooKeeper-2201 (any attempt to create/delete/modify the DataNode will cause the leader to hang at the beginning of the request processor chain). The root cause is the lock contention between:

org.apache.zookeeper.server.DataTree#serialize -> org.apache.zookeeper.server.ReferenceCountedACLCache#serialize
PrepRequestProcessor#getRecordForPath -> org.apache.zookeeper.server.DataTree#getACL(org.apache.zookeeper.server.DataNode) -> org.apache.zookeeper.server.ReferenceCountedACLCache#convertLong

When the snapshot gets stuck in acl serialization, it would prevent all other operations to ReferenceCountedACLCache. Since getRecordForPath calls ReferenceCountedACLCache#convertLong, any op triggering getRecordForPath will cause the leader to hang at the beginning of the request processor chain:

org.apache.zookeeper.server.ReferenceCountedACLCache.convertLong(ReferenceCountedACLCache.java:87)
org.apache.zookeeper.server.DataTree.getACL(DataTree.java:734)
   - locked org.apache.zookeeper.server.DataNode@4a062b7d
org.apache.zookeeper.server.ZKDatabase.aclForNode(ZKDatabase.java:371)
org.apache.zookeeper.server.PrepRequestProcessor.getRecordForPath(PrepRequestProcessor.java:170)
   - locked java.util.ArrayDeque@3f7394f7
org.apache.zookeeper.server.PrepRequestProcessor.pRequest2Txn(PrepRequestProcessor.java:417)
org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:757)
org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:145)

Similar to ZooKeeper-2201, the leader can still send out heartbeats so the cluster will not recover until the network/disk issue resolves.

Steps to reproduce this bug:

start a cluster with 1 leader and n followers
manually create some ACLs, to enlarge the window of dumping acls so it would be more likely to hang at serializing ACLCache when delay happens. (we wrote a script to generate such workloads, see attachments)
inject long network/disk write delays and run some benchmarks to trigger snapshots
once stuck, you should observe new requests to the cluster would fail.

Essentially the core problem is the OutputArchive write should not be kept inside this synchronization block. So a straightforward solution is to move writes out of sync block: do a copy inside the sync block and perform vulnerable network writes afterwards. The patch for this solution is attached and verified. Another more systematic fix is perhaps replacing all synchronized methods in the ReferenceCountedACLCache with ConcurrentHashMap.

We double checked that the issue remains in the latest version of master branch (68c21988d55c57e483370d3ee223c22da2d1bbcf).

Attachments are 1) patch for fix and regression test 2) scripts to generate workloads to fill ACL cache

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

generator.py
02/Sep/19 21:01
0.3 kB
Chang Lou
fix.patch
02/Sep/19 21:00
4 kB
Chang Lou

Issue Links

links to

GitHub Pull Request #1077

Activity

People

Assignee:: Chang Lou

Reporter:: Chang Lou

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 02/Sep/19 21:02

Updated:: 09/Oct/19 19:28

Resolved:: 09/Oct/19 14:39

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

2h 50m