  1. ZooKeeper
  2. ZOOKEEPER-4734

FuzzySnapshotRelatedTest becomes flaky when a transient disk failure occurs

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.6.0, 3.9.0
    • Fix Version/s: None
    • Component/s: tests

    Description

      In testPZxidUpdatedWhenLoadingSnapshot(), a quorum server is stopped and restarted to test snapshot loading. While the quorum server restarts, it calls ZKDatabase#loadDataBase(), which can throw an IOException if a transient disk failure occurs.

      public long loadDataBase() throws IOException {
          // ZKDatabase.java line 240: the IOException is thrown from this call
          long zxid = snapLog.restore(dataTree, sessionsWithTimeouts, commitProposalPlaybackListener);
          initialized = true;
          return zxid;
      }

      In FileTxnSnapLog#restore, the exception comes from deserializing the snapshot:

      public long restore(DataTree dt, Map<Long, Integer> sessions,
                          PlayBackListener listener) throws IOException {
          long deserializeResult = snapLog.deserialize(dt, sessions); // IOException here!       
      ...
      }

      Here is the stacktrace: 

              at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java)
              at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
              at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:862)
              at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:848)
              at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:201)
              at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:124)
              at org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:330)
              at java.lang.Thread.run(Thread.java:748) 

      Because of this IOException, the restart fails, and so does the test.

      As for the fix, we could either retry the test, as proposed in ZOOKEEPER-3157, or add a configurable retry mechanism to ZKDatabase#loadDataBase() so that it tolerates transient disk failures.
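
      A minimal sketch of what the retry option could look like, written as a standalone helper rather than a patch to ZooKeeper. The names RetryingLoader, DatabaseLoader, loadWithRetry, maxAttempts and backoffMs are hypothetical and do not exist in the ZooKeeper code base; the loader callback only stands in for a call such as ZKDatabase#loadDataBase(). The sketch also assumes that repeating the load is safe, which would need to be verified for a partially restored DataTree.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

/** Hypothetical helper; not part of ZooKeeper. */
public class RetryingLoader {

    /** Stand-in for an operation such as ZKDatabase#loadDataBase(). */
    @FunctionalInterface
    public interface DatabaseLoader {
        long load() throws IOException;
    }

    /**
     * Calls loader.load() up to maxAttempts times, sleeping backoffMs
     * between failed attempts. Returns the first successful result and
     * rethrows the last IOException if every attempt fails.
     */
    public static long loadWithRetry(DatabaseLoader loader, int maxAttempts, long backoffMs)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return loader.load();
            } catch (IOException e) {
                last = e; // remember the failure; retry if attempts remain
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMs);
                    } catch (InterruptedException ie) {
                        // Preserve the interrupt status and stop retrying.
                        Thread.currentThread().interrupt();
                        InterruptedIOException iioe =
                                new InterruptedIOException("interrupted while waiting to retry");
                        iioe.initCause(e);
                        throw iioe;
                    }
                }
            }
        }
        throw last;
    }
}
```

      A caller would wrap the existing load, for example RetryingLoader.loadWithRetry(() -> zkDb.loadDataBase(), 3, 100L), where zkDb is an assumed ZKDatabase reference and the attempt count and backoff would come from configuration. With maxAttempts set to 1 the helper makes a single attempt and propagates the first IOException, so the current behavior could remain the default.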

       

       

            People

              Assignee: Unassigned
              Reporter: Haoze Wu (functioner)
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m