Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-4643

Committed txns may be improperly truncated if follower crashes right after updating currentEpoch but before persisting txns to disk

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • 3.6.3, 3.7.0, 3.8.0, 3.7.1, 3.9.0, 3.8.1, 3.9.1
    • None
    • quorum, server
    • Important

    Description

      When a follower is processing the NEWLEADER message in SYNC phase, it will update its currentEpoch to the file before writing the txns (from the PROPOSALs sent by leader in SYNC) to the log file. Such execution order may lead to improper truncation of committed txns on other servers in later rounds.

      The critical step to trigger this problem is to make a follower node crash right after it updates its currentEpoch to the file but before writing the txns to the log file. The potential risk is that, this node with incomplete committed txns might be later elected as the leader with its larger currentEpoch, and then improperly uses TRUNC to ask other nodes to truncate their committed txns!

       

      Trace

      Trace-ZK-4643.pdf

      Here is an example to trigger the bug. (Focus on currentEpoch and lastLoggedZxid)

      Round 1 (Running nodes with their acceptedEpoch & currentEpoch set to 1):

      • Start the ensemble with three nodes: S0, S1 & S2.
      • S2 is elected leader.
      • For all of them, currentEpoch = 1, lastLoggedZxid (the last zxid in the log)= <1, 3>, lastProcessedZxid = <1, 3>.
      • S0 crashes.
      • A new txn <1, 4> is logged and committed by S1 & S2. Then, S1 & S2 have lastLoggedZxid = <1, 4>, lastProcessedZxid = <1, 4> .
      • Verify clients can read the datatree with latest zxid <1, 4>. 

      Round 2 (Running nodes with their acceptedEpoch & currentEpoch set to 2):

      • S0 & S2 restart, and S1 crashes.
      • Again, S2 is elected leader.
      • Then, during the SYNC phase, the leader S2 (maxCommittedLog = <1, 4>) uses DIFF to sync with the follower S0 (lastLoggedZxid = <1, 3>), and their currentEpoch will be set to 2 (and written to disk).
      • ( Note that the follower S0 updates its currentEpoch file before writing the txns to the log file when receiving NEWLEADER message. )
      • Unfortunately, right after the follower S0 finishes updating its currentEpoch file, it crashes.

      Round 3 (Running nodes with their acceptedEpoch & currentEpoch set to 3):

      • S0 & S1 restart, and S2 crashes.
      • Since S0 has currentEpoch = 2, S1 has currentEpoch = 1, S0 will be elected leader.
      • During the SYNC phase, the leader S0 (maxCommittedLog = <1, 3>) will use TRUNC to sync with S1 (lastLoggedZxid = <1, 4>). Then, S1 removes txn <1, 4>.  
      • ( However, <1, 4> was committed and visible by clients before, and is not supposed to be truncated! )
      • Verify clients of S0 & S1 do NOT have the view of txn <1, 4>, a violation of ZAB.

       

      Extra note: The trace can be constructed with quorum nodes alive at any moment with careful time tuning of node crash & restart, e.g., let S1 restart before S0 crashes at the end of Round 2.

       

      Analysis

      Root Cause:

      When a follower updates its current epoch, it should guarantee that it has already synced the uncommitted txns to the disk (or, taken snapshot). Otherwise, after the current epoch is updated to the file but the history (transaction log) of the follower is not updated yet, environment failures might prevent the latter from going on smoothly. It is dangerous for a node with updated current epoch but stale history to be elected leader. It might truncate committed txns on other nodes.

       

      Property Violation:

      • From the server side, the ensemble deletes a committed txn, which is not allowed; the committed log of the ensemble does not append monotonically; different nodes have inconsistent committed logs.
      • From the client side, clients connected to different nodes may have inconsistent views. A client may read stale data after a newer version is obtained. That newer version can only be obtained from certain nodes of the ensemble rather than all nodes. What's worse, that newer version may also be removed later.

      ZOOKEEPER-3911 shows similar symptoms, but its fix only mitigates the occurrance of the problem without solving it at the root. We also raise an issue related to that fix in ZOOKEEPER-4646

      Although ZOOKEEPER-4646 has similar symptoms and property violations, it should be regarded as a distinct problem because it has different root cause and risk pattern compared to this one. More specifically,

      • ZOOKEEPER-4643 : the risk lies in the order of updating currentEpoch before logging txns to disk. The bug can be triggered by interrupting the action of logging txns after currentEpoch is updated. 
      • ZOOKEEPER-4646 : the risk lies in the order of replying ACK-LD before logging txns to disk. The bug can be triggered by interrupting the action of logging txns after ACK-LD is replied. 

       

      Gap between Theory & Reality:

      The critical step here is to interrupt two updates that should be done together. However, there lie some gaps between the theory and the reality. In the Zab paper, a follower updates its current epoch and history during the SYNC phase in an atomic action, and the correctness of the Zab protocol is based on that. Unfortunately, in actual environment, a node crash (or other environment failures) may occur at any time, breaking the procedures that will not be interrupted at the protocol level. By digging into the code,  the synchronized section in the method syncWithLeader(long) in Learner.java can only prevent the procedure from interrupted by other threads, but environment failures can still interrupt the process. We think it important to keep ZooKeeper still in a correct condition even under such type of environment circumstances.

       

      Affected Versions:

      The above trace has been generated in multiple versions such as 3.7.1 & 3.8.1 (the latest stable & current version till now) by our testing tools. The affected versions might be more, because the critical update order between the follower's current epoch and its history during SYNC stay unchanged across multiple versions.

       

      Possible Fix

      In theory, this issue can be avoided by exchanging the order of writing currentEpoch to the currentEpoch file and writing the txns (in the queue packetsNotCommitted) to the log file when the follower is processing NEWLEADER message. Although it is a more conservative strategy to persist txns to disk before updating currentEpoch to disk, it prevents the problem that a node with the incomplete committed txns can win an election with its higher currentEpoch. 

      When it comes to the implementation of multi-threading, things become a little more complicated. In the procedure of processing NEWLEADER message in the method syncWithLeader(long) in Learner.java, simply exchanging the order of updating currentEpoch and writing txns to the log file cannot fix the problem, because these two actions are conducted asynchronously by different threads. When the QuorumPeer thread calls logRequest(..) to submit log persistence task to the SyncThread, the latter thread will write txns to disk asynchronously, without promising to finish that before the node's currentEpoch updated. 

      Considering this issue and ZOOKEEPER-4646, one possible fix is to guarantee the following partial orders to be satisfied:

      • A follower updates its currentEpoch only after it has persisted the txns that might be applied to the leader's datatree before the leader gets into the BROADCAST phase (so as to avoid this issue). 
      • A follower replies ACK-LD only after it has persisted the txns that might be applied to the leader's datatree before the leader gets into the BROADCAST phase  (so as to avoid the issue of ZOOKEEPER-4646 ).

      Attachments

        1. Trace-ZK-4643.pdf
          322 kB
          Sirius

        Activity

          People

            Unassigned Unassigned
            ouyang Sirius
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 2h 10m
                2h 10m