In ZooKeeper 3.3.5, the file src/java/main/org/apache/zookeeper/server/quorum/Learner.java:325 contains Thread.sleep(1000);
This sleep is always hit after a leader failure or restart. ZooKeeper elects a new leader and all followers try to connect to it, but the first attempt always fails with "Connection refused":
2012-07-23 18:55:48,159 - WARN [QuorumPeer:/0.0.0.0:2181:Learner@229] - Unexpected exception, tries=0, connecting to web329.local/192.168.1.74:2888
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
I propose replacing this line with the following code:
This way the first reconnect attempt is made immediately, and later attempts wait for one tick time (I think this is a good semantic change).
The result of this change: leader re-election time dropped from >1500 ms to 300-400 ms with a 50 ms tick time. This is quite important for our production environment and should not break any existing installations.
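For illustration only, here is a minimal standalone sketch of the retry behaviour described above. It is not the actual Learner.java code or the proposed patch; the class LearnerRetrySketch, the Connector interface and the method names are all hypothetical, and tickTimeMs stands in for the configured tick time.

```java
import java.io.IOException;

class LearnerRetrySketch {
    /** Hypothetical stand-in for the socket connect call to the leader. */
    interface Connector {
        void connect() throws IOException;
    }

    /**
     * Delay to apply after a failed attempt. 'tries' is the 0-based index of
     * the attempt that just failed: no delay after the first failure, one
     * tick after every later failure.
     */
    static long delayAfterFailureMs(int tries, long tickTimeMs) {
        return tries > 0 ? tickTimeMs : 0L;
    }

    /** Returns the number of attempts made before the connect succeeded. */
    static int connectWithRetry(Connector connector, int maxTries, long tickTimeMs)
            throws IOException, InterruptedException {
        for (int tries = 0; tries < maxTries; tries++) {
            try {
                connector.connect();
                return tries + 1;
            } catch (IOException e) {
                if (tries == maxTries - 1) {
                    throw e; // out of attempts, report the last failure
                }
                long delay = delayAfterFailureMs(tries, tickTimeMs);
                if (delay > 0) {
                    Thread.sleep(delay);
                }
            }
        }
        throw new IllegalArgumentException("maxTries must be positive");
    }
}
```

With a 50 ms tick time, a connect that is refused once and then succeeds completes on the second attempt with no sleep in between, which is the case seen right after a new leader is elected.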