Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: trunk
Fix Version/s: None
Component/s: None
Description
Seeing two types of ConnectionLoss exceptions via Curator when running Oozie under high load (specifically, workflows with ~80 forked actions).
[1] (znode transaction type: delete)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
[2]
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /oozie/locks/0037706-140704041907-oozie-oozi-W
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
Correlating the Oozie logs with the ZK trace logs (which report NoNode KeeperExceptions) for a particular job, I found that after hitting the ZooKeeper exceptions on 'delete' of the job's lock znode, that job never succeeds in acquiring the lock and proceeding.
I'm not that familiar with when Oozie, via Curator, tries to delete znodes; OOZIE-1906 will introduce the Reaper.
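For context, a minimal sketch (illustrative, not part of Oozie) of how the contention on the lock znode could be inspected with a plain Curator client while the test runs. The connection string is an assumption; the lock path is the one from stacktrace [2] above. With the standard Curator lock recipes, each waiter holds one ephemeral-sequential child under the lock znode, so the child count approximates the number of contending actions.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.util.List;

public class LockInspect {
    public static void main(String[] args) throws Exception {
        // Connection string is illustrative; point it at the ensemble Oozie uses.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            // Lock path taken from stacktrace [2]; each contending action shows up
            // as one ephemeral-sequential child of this znode.
            List<String> children = client.getChildren()
                    .forPath("/oozie/locks/0037706-140704041907-oozie-oozi-W");
            System.out.println("lock waiters: " + children.size());
        } finally {
            client.close();
        }
    }
}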
The exception stacktrace points into this Curator code (org.apache.curator.ConnectionState, simplified):
ZooKeeper getZooKeeper() throws Exception
{
    ...
    boolean localIsConnected = isConnected.get();
    if ( !localIsConnected )
    {
        // throws CuratorConnectionLossException once the connection timeout has elapsed
        checkTimeouts();
    }
    ...
}
isConnected is FALSE here, so the exception is thrown from checkTimeouts(). I wasn't able to find any good docs or benchmarks explaining the timeout issues Curator would face under high load. My suspicion is that Curator may have limitations in how many concurrent requests for the same lock it can handle: in this particular stress test, 85 forked actions all contend for the same job lock. Hence we should implement some fallback mechanism in Oozie when invoking Curator APIs.
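As one possible shape for such a fallback, here is a hypothetical sketch of a retry wrapper around Curator's InterProcessMutex: it retries acquisition with backoff on connection-loss errors and signals failure to the caller, who can then fall back to another mechanism (for example, in-memory locks). The class name, constants, and policy are illustrative assumptions, not the actual Oozie change.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.zookeeper.KeeperException;

import java.util.concurrent.TimeUnit;

public class ZKLockWithFallback {
    private static final int MAX_ATTEMPTS = 3;      // illustrative values
    private static final long WAIT_SECONDS = 10;

    /**
     * Runs the task under the ZK lock, retrying acquisition on connection loss.
     * Returns false if the lock could never be acquired, so the caller can fall
     * back (e.g. to an in-memory lock) instead of failing the forked action.
     */
    public static boolean runWithLock(CuratorFramework client, String lockPath, Runnable task)
            throws Exception {
        InterProcessMutex lock = new InterProcessMutex(client, lockPath);
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                if (lock.acquire(WAIT_SECONDS, TimeUnit.SECONDS)) {
                    try {
                        task.run();
                        return true;
                    } finally {
                        lock.release();
                    }
                }
            } catch (KeeperException.ConnectionLossException e) {
                // Covers CuratorConnectionLossException too (it is a subclass).
                // Back off and retry instead of surfacing the error to the workflow.
                Thread.sleep(1000L * attempt);
            }
        }
        return false;
    }
}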