ZOOKEEPER-3591

Inconsistent resync with dynamic reconfig


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.5
    • Fix Version/s: None
    • Component/s: server
    • Labels: None

    Description

      We've run into a problem where one of the zookeeper instances lost most of its data after its zk process was restarted. We suspect an interaction between dynamic reconfiguration and the snapshot-based resync of that instance. Details and some amateurish analysis are below. We can also upload transaction logs, if need be.

      We have a 6-instance ensemble running version 3.5.5 with 3 quorum members and 3 observers. One of the observers (Instance 6) saw its db shrink from 3162 znodes down to 10 after that instance restarted:

      > 2019-10-13T16:44:19.060+0000 [.zk-monitor-0] Monitor command mntr: zk_version 3.5.5-afd10a8846b22a34c5a818034bb22e99dd44587b, built on 09/16/2019 18:31 GMT
      > zk_znode_count 3162
      > –
      > 2019-10-13T16:48:32.713+0000 [.zk-monitor-0] Monitor command mntr: zk_version 3.5.5-afd10a8846b22a34c5a818034bb22e99dd44587b, built on 09/16/2019 18:31 GMT
      > zk_znode_count 10

      Contrast it with Instance 1 that was the leader at the time, and whose znode_count remained stable around 3000:

      > 2019-10-13T16:44:48.625+0000 [.zk-monitor-0] Monitor command mntr: zk_version 3.5.5-afd10a8846b22a34c5a818034bb22e99dd44587b, built on 09/16/2019 18:31 GMT
      > zk_znode_count 3178
      > –
      > ...
      > –
      > 2019-10-13T16:48:48.731+0000 [.zk-monitor-0] Monitor command mntr: zk_version 3.5.5-afd10a8846b22a34c5a818034bb22e99dd44587b, built on 09/16/2019 18:31 GMT
      > zk_znode_count 3223

      It appears that the problem actually happened 30 minutes earlier, when Instance 6 was resynced from the leader via the Snap method yet never wrote that snapshot to disk. The end result was that the in-memory state was fine, but the only snapshot on disk was the primordial snapshot.0, and the transaction logs only started after the missing snapshot:

      $ ls -l version-2
      > total 1766
      > -rw-r--r-- 1 daautomation daautomation       1 Oct 13 09:14 acceptedEpoch
      > -rw-r--r-- 1 daautomation daautomation       1 Oct 13 10:12 currentEpoch
      > -rw-r--r-- 1 daautomation daautomation 2097168 Oct 13 09:44 log.6000002e0
      > -rw-r--r-- 1 daautomation daautomation 1048592 Oct 13 10:09 log.600001f1b
      > -rw-r--r-- 1 daautomation daautomation 4194320 Oct 13 12:16 log.600003310
      > -rw-r--r-- 1 daautomation daautomation     770 Oct 13 09:14 snapshot.0

      So on reboot zk rebuilt its database from snapshot.0 plus those logs, which wiped out most of the state: everything the instance had received through the in-memory snapshot transfer was in no snapshot and no local log.
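      To make the failure mode concrete, below is a minimal, self-contained sketch of how we understand the SNAP-based resync. The class, field, and method names are ours, not ZooKeeper's actual Learner code: the leader's snapshot is applied to the in-memory database as soon as the SNAP packet is read, but it is only persisted to disk later, when the NEWLEADER packet is handled. Anything that breaks out of the loop in between, such as the reconfig-triggered abort discussed further below, leaves a populated in-memory db with no matching snapshot on disk.

      import java.util.ArrayDeque;
      import java.util.Deque;

      public class SnapSyncSketch {
          enum PacketType { SNAP, INFORMANDACTIVATE, NEWLEADER }

          boolean inMemoryDbLoaded = false; // set when the leader's snapshot is deserialized
          boolean snapshotOnDisk = false;   // set when the learner writes its own snapshot

          void syncWithLeader(Deque<PacketType> fromLeader) throws Exception {
              while (!fromLeader.isEmpty()) {
                  switch (fromLeader.poll()) {
                  case SNAP:
                      inMemoryDbLoaded = true;  // in-memory only; nothing is written to disk yet
                      break;
                  case INFORMANDACTIVATE:
                      // a reconfig that changes this peer's role aborts the sync
                      throw new Exception("changes proposed in reconfig");
                  case NEWLEADER:
                      snapshotOnDisk = true;    // the on-disk snapshot is taken only here
                      break;
                  }
              }
          }

          public static void main(String[] args) {
              SnapSyncSketch peer = new SnapSyncSketch();
              Deque<PacketType> packets = new ArrayDeque<>();
              packets.add(PacketType.SNAP);
              packets.add(PacketType.NEWLEADER);
              try {
                  peer.syncWithLeader(packets);
              } catch (Exception e) {
                  System.out.println("sync aborted: " + e.getMessage());
              }
              System.out.println("in-memory db loaded: " + peer.inMemoryDbLoaded
                      + ", snapshot on disk: " + peer.snapshotOnDisk);
          }
      }

      In the normal order (SNAP followed by NEWLEADER) both flags end up true; the interesting case is when something else arrives in between, which is what we suspect happened here.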

      Dynamic reconfig might be relevant here. Instance 6 started as an observer, got removed, and immediately re-added as a participant. Instance 2 went the other way, from participant to observer:

      > 2019-10-13T16:14:19.323+0000 ZK reconfig: removing node 6
      > 2019-10-13T16:14:19.359+0000 ZK reconfig: adding server.6=10.80.209.138:2888:3888:participant;0.0.0.0:2181
      > 2019-10-13T16:14:19.399+0000 ZK reconfig: adding server.2=10.80.209.131:2888:3888:observer;0.0.0.0:2181
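      For reference, the sequence above came from our orchestration tooling. A hypothetical, roughly equivalent way to issue it through the 3.5.x client admin API would be the sketch below; the connect string, session timeout, and class name are made up for illustration.

      import org.apache.zookeeper.admin.ZooKeeperAdmin;

      public class ReconfigSketch {
          public static void main(String[] args) throws Exception {
              ZooKeeperAdmin admin = new ZooKeeperAdmin("10.80.209.131:2181", 30000, event -> { });
              // remove instance 6 (still an observer at this point)...
              admin.reconfigure(null, "6", null, -1, null);
              // ...immediately re-add it as a participant...
              admin.reconfigure("server.6=10.80.209.138:2888:3888:participant;0.0.0.0:2181",
                      null, null, -1, null);
              // ...and flip instance 2 from participant to observer
              admin.reconfigure("server.2=10.80.209.131:2888:3888:observer;0.0.0.0:2181",
                      null, null, -1, null);
              admin.close();
          }
      }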

      Looking at the logs, Instance 6 started and received a resync snapshot from the leader right before the dynamic reconfig:

      > 2019-10-13T16:14:19.284+0000 [.QuorumPeer[myid=6](plain=/0.0.0.0:2181)(secure=disabled)] Getting a snapshot from leader 0x6000002dd
      > ...
      > 2019-10-13T16:14:19.401+0000 [.QuorumPeer[myid=6](plain=/0.0.0.0:2181)(secure=disabled)] Got zxid 0x6000002de expected 0x1

      Had it processed the NEWLEADER packet afterwards, it would've persisted the snapshot locally. But there's no NEWLEADER message in the Instance 6 log. Instead, there's a "changes proposed in reconfig" exception, likely a result of the instance getting dynamically removed and re-added as a participant:

      > 2019-10-13T16:14:19.467+0000 [.QuorumPeer[myid=6](plain=/0.0.0.0:2181)(secure=disabled)] Becoming a non-voting participant
      > 2019-10-13T16:14:19.467+0000 [.QuorumPeer[myid=6](plain=/0.0.0.0:2181)(secure=disabled)] Exception when observing the leader
      > java.lang.Exception: changes proposed in reconfig
      >     at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:506)
      >     at org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:74)
      >     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1258)

      Perhaps the NEWLEADER packet was still in the socket, but sitting behind INFORMANDACTIVATE, whose exception prevented NEWLEADER from ever being processed?
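      In terms of the toy loop sketched earlier, the suspected sequence on Instance 6 would look like the snippet below. This is only a hypothesis about the packet ordering, not a trace of the actual code path:

      Deque<PacketType> packets = new ArrayDeque<>();
      packets.add(PacketType.SNAP);               // applied to the in-memory db (leader snapshot at 0x6000002dd)
      packets.add(PacketType.INFORMANDACTIVATE);  // reconfig re-adding us as a participant -> throws
      packets.add(PacketType.NEWLEADER);          // still queued behind it, never processed
      try {
          peer.syncWithLeader(packets);
      } catch (Exception e) {
          // "changes proposed in reconfig" - the loop never reaches NEWLEADER
      }
      // peer.inMemoryDbLoaded == true, peer.snapshotOnDisk == false:
      // exactly the on-disk state the version-2 listing above shows.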

      Also, it may or may not be related, but this area got changed recently as part of https://issues.apache.org/jira/browse/ZOOKEEPER-3104.

      Attachments

        1. instance6.log.gz (14 kB, Alex Mirgorodskiy)
        2. instance1.log.gz (261 kB, Alex Mirgorodskiy)

          People

            Assignee: Unassigned
            Reporter: Alex Mirgorodskiy (mirg)
            Votes: 0
            Watchers: 2
