HDFS-15114

JournalNodes' committed-txid file includes aborted transaction, breaks NameNode startup


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.1
    • Fix Version/s: None
    • Component/s: journal-node, namenode
    • Labels: None

    Description

      A couple of days ago, our active NameNode in an HA setup aborted a QuorumOutputStream starting at tx 3389424 because tx 3389425 failed to be written. This was likely related to a rolling restart of the three JournalNodes that was happening at this time. The NameNode logged:

      2020-01-11 02:00:50,229 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.6.1.181
      2020-01-11 02:00:50,229 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
      2020-01-11 02:00:50,229 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 3389424, 3389424
      2020-01-11 02:00:50,229 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 1 7
      2020-01-11 02:00:50,245 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal 10.6.2.187:8485 failed to write txns 3389425-3389425. Will try to write to this JN again after the next log roll.
      org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException): Can't write, no segment open ; journal id: <journal_name>
      at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:545)
      ... rest of stacktrace ...
      
      // the same warning for the second JournalNode
      // the same warning for the third JournalNode
      
      2020-01-11 02:00:50,246 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.6.1.4:8485, 10.6.1.181:8485, 10.6.2.187:8485], stream=QuorumOutputStream starting at txid 3389424))
      org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
      
      // the same "Can't write, no segment open ; journal id: <journal_name>" error for all 3 JournalNodes
      
      2020-01-11 02:00:50,246 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 3389424
      2020-01-11 02:00:50,255 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.6.1.4:8485, 10.6.1.181:8485, 10.6.2.187:8485], stream=QuorumOutputStream starting at txid 3389424))
      

      Even though the stream was aborted, the committed-txid file on each of the three JournalNodes was updated to be 3389424.

      This caused both NameNodes to fail to start with this error:
       

      2020-01-11 02:54:35,483 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
      2020-01-11 02:54:35,491 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting recovery process for unclosed journal segments...
      2020-01-11 02:54:35,537 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Successfully started new epoch 80
      2020-01-11 02:54:35,537 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Beginning recovery of unclosed segment starting at txid 3389422
      2020-01-11 02:54:35,574 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Recovery prepare phase complete. Responses:
      10.6.1.4:8485: segmentState { startTxId: 3389422 endTxId: 3389423 isInProgress: false } lastWriterEpoch: 57 lastCommittedTxId: 3389424
      10.6.2.187:8485: segmentState { startTxId: 3389422 endTxId: 3389423 isInProgress: false } lastWriterEpoch: 57 lastCommittedTxId: 3389424
      2020-01-11 02:54:35,575 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Using longest log: 10.6.1.4:8485=segmentState {
        startTxId: 3389422
        endTxId: 3389423
        isInProgress: false
      }
      lastWriterEpoch: 57
      lastCommittedTxId: 3389424
      
      2020-01-11 02:54:35,575 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.6.1.4:8485, 10.6.1.181:8485, 10.6.2.187:8485], stream=null))
      java.lang.AssertionError: Decided to synchronize log to startTxId: 3389422
      endTxId: 3389423
      isInProgress: false
       but logger 10.6.1.4:8485 had seen txid 3389424 committed
              ... rest of stacktrace ...
      2020-01-11 02:54:35,577 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.6.1.4:8485, 10.6.1.181:8485, 10.6.2.187:8485], stream=null))
      

      A potentially relevant detail is that each of the three JournalNodes logged this at around the time of the aborted stream:

      2020-01-11 02:00:32,649 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Latest log is EditLogFile(file=<path_to_edits>/<journal_name>/current/edits_inprogress_0000000000003389424,first=0000000000003389424,last=0000000000003389424,inProgress=true,hasCorruptHeader=false) ; journal id: <journal_name>
      

      That file was later renamed to edits_inprogress_0000000000003389424.stale, and the JournalNodes no longer consider it to be the most recent edit file. Instead, they now log that edits_0000000000003389422-0000000000003389423 is the most recent file they know about.

      So what seems to have happened is the following (see the sketch after this list):

      1. The active NameNode tries to write transactions 3389424 and 3389425 to the journal.
      2. Tx 3389424 is written successfully, and the committed-txid file is updated to reflect this.
      3. Tx 3389425 fails to be written, so the NameNode aborts the QuorumOutputStream that contained 3389424 and 3389425.
      4. Because the stream was aborted, the JournalNodes rename edits_inprogress_0000000000003389424 to edits_inprogress_0000000000003389424.stale and forget about tx 3389424.
      5. The NameNodes fail to start because the latest committed tx according to the JournalNodes (which comes from the committed-txid file) is ahead of the newest transaction present in any surviving edits file, which (understandably) is treated as an illegal state.
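
      To make the suspected ordering concrete, below is a minimal, self-contained sketch in plain Java. It is not the real org.apache.hadoop.hdfs.qjournal.server.Journal code and every name in it is made up; it only assumes, as described above, that a JournalNode persists the writer-reported committed txid before deciding whether it can actually accept the write, so a rejected or aborted batch can still leave the committed-txid file ahead of every surviving edits file.

      // Hypothetical, simplified model of one JournalNode's bookkeeping. Illustrative only,
      // not the actual org.apache.hadoop.hdfs.qjournal.server.Journal implementation.
      public class CommittedTxIdSketch {

          static class FakeJournalNode {
              long committedTxId = -1;      // contents of the committed-txid file
              long lastFinalizedTxId = -1;  // end txid of the newest finalized edits file
              long inProgressStart = -1;    // start txid of edits_inprogress_*, -1 if none
              long inProgressLast = -1;     // last txid written into that file
              boolean segmentOpen = false;  // in-memory state, lost on restart

              void startLogSegment(long txId) {
                  inProgressStart = txId;
                  inProgressLast = txId - 1;
                  segmentOpen = true;
              }

              /** Accept one txn; the writer-reported committed txid is persisted first. */
              void journal(long txId, long committedTxIdFromWriter) {
                  committedTxId = Math.max(committedTxId, committedTxIdFromWriter);
                  if (!segmentOpen) {
                      throw new IllegalStateException("Can't write, no segment open");
                  }
                  inProgressLast = txId;
              }

              void finalizeLogSegment() {
                  lastFinalizedTxId = inProgressLast;
                  inProgressStart = inProgressLast = -1;
                  segmentOpen = false;
              }

              /** JN restart: in-memory segment state is lost, on-disk files survive. */
              void restart() {
                  segmentOpen = false;
              }

              /** The edits_inprogress_* file is renamed to *.stale and forgotten. */
              void markInProgressStale() {
                  inProgressStart = inProgressLast = -1;
              }
          }

          public static void main(String[] args) {
              FakeJournalNode jn = new FakeJournalNode();

              // Segment 3389422-3389423 is written and finalized normally.
              jn.startLogSegment(3389422);
              jn.journal(3389422, 3389421);
              jn.journal(3389423, 3389422);
              jn.finalizeLogSegment();

              // A new segment is opened and tx 3389424 is written into it.
              jn.startLogSegment(3389424);
              jn.journal(3389424, 3389423);

              // The JN restarts (rolling restart), losing its open-segment state.
              jn.restart();

              try {
                  // The next request carries committedTxId=3389424; it is rejected,
                  // but only after the committed-txid value has already been advanced.
                  jn.journal(3389425, 3389424);
              } catch (IllegalStateException e) {
                  System.out.println("write of 3389425 rejected: " + e.getMessage());
              }

              // The writer aborts the stream; the in-progress file later becomes *.stale.
              jn.markInProgressStale();

              System.out.println("committed-txid file        : " + jn.committedTxId);     // 3389424
              System.out.println("newest surviving edits txid: " + jn.lastFinalizedTxId); // 3389423
          }
      }

      In this sketch the stale committed txid is carried on the rejected write of 3389425; it may equally have been persisted when 3389424 itself was written, as in step 2 above. Either way the end state is the same: the committed-txid file says 3389424 while the newest surviving edits file ends at 3389423.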

      I believe that the NameNodes and the JournalNodes are in a consistent state, and that there's no corruption or data loss or anything like that. I think the problem is just that the committed-txid file was updated prematurely, and contains an incorrect value.
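
      For completeness, here is an equally hypothetical restatement of the startup-time check that this stale value then trips (not the actual QuorumJournalManager recovery code; the method name and signature are made up). Recovery refuses to synchronize to a segment whose end txid is behind the committed txid that some JournalNode has reported, which is exactly the AssertionError quoted above.

      // Illustrative restatement of the invariant that fails during segment recovery.
      // Hypothetical code, not the real QuorumJournalManager logic.
      public class RecoveryInvariantSketch {

          static void checkCommittedWithinChosenSegment(long segmentStartTxId, long segmentEndTxId,
                                                        long lastCommittedTxId, String logger) {
              if (lastCommittedTxId > segmentEndTxId) {
                  throw new AssertionError("Decided to synchronize log to startTxId: " + segmentStartTxId
                      + " endTxId: " + segmentEndTxId
                      + " but logger " + logger + " had seen txid " + lastCommittedTxId + " committed");
              }
          }

          public static void main(String[] args) {
              // Values from this incident: the best surviving segment ends at 3389423,
              // but the committed-txid file on each JournalNode says 3389424.
              checkCommittedWithinChosenSegment(3389422, 3389423, 3389424, "10.6.1.4:8485");
          }
      }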

      For this particular cluster, I ran ./service/bin/hdfs namenode -initializeSharedEdits to format the JournalNodes, but I backed up the edits directory first in case there's more information in there that's helpful for debugging.

People

    • Assignee: Unassigned
    • Reporter: Steven Rand
    • Votes: 1
    • Watchers: 5