HBase / HBASE-25536

Remove 0-length WAL file from logQueue if it belongs to old (recovered) sources.


Details

    • Reviewed

    Description

      In our production clusters, we found a case where the RS does not remove a 0-length file from the replication queue (the in-memory queue, not the ZK replication queue) when the logQueue size is 1.
      Stack trace below:

      2021-01-28 14:44:18,434 ERROR [,60020,1609950703085] regionserver.ReplicationSourceWALReaderThread - Failed to read stream of replication entries
      org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException: java.io.EOFException: hdfs://hbase/oldWALs/<rs-name>%2C60020%2C1606126266791.1606852981112 not a SequenceFile
      	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:110)
      	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:147)
      Caused by: java.io.EOFException: hdfs://hbase/oldWALs/<rs-name>%2C60020%2C1606126266791.1606852981112 not a SequenceFile
      	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1934)
      	at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1893)
      	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1842)
      	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1856)
      	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
      	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
      	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
      	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:313)
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:277)
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:265)
      	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:424)
      	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:338)
      	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:304)
      	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:295)
      	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:198)
      	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:108)
      	... 1 more
      

      The WAL in question has length 0 (verified with the hadoop ls command) and belongs to a recovered source. There is only one log file in the queue (verified via heap dump).

      We have logic to remove a 0-length log file from the queue when we encounter an EOFException and logQueue#size is greater than 1. Code snippet below.

      ReplicationSourceWALReader.java
        // if we get an EOF due to a zero-length log, and there are other logs in queue
        // (highly likely we've closed the current log), we've hit the max retries, and autorecovery is
        // enabled, then dump the log
        private void handleEofException(IOException e) {
          if ((e instanceof EOFException || e.getCause() instanceof EOFException) &&
             logQueue.size() > 1 && this.eofAutoRecovery) {
            try {
              if (fs.getFileStatus(logQueue.peek()).getLen() == 0) {
                LOG.warn("Forcing removal of 0 length log in queue: " + logQueue.peek());
                logQueue.remove();
                currentPosition = 0;
              }
            } catch (IOException ioe) {
              LOG.warn("Couldn't get file length information about log " + logQueue.peek());
            }
          }
        }
      

      This size check is valid for active sources, where the queue must always hold at least one WAL file (the current WAL file). For recovered sources, where the current WAL file is never added to the queue, we can skip the logQueue#size check.
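
The proposed behavior can be sketched as a standalone decision function. This is only an illustration under stated assumptions, not the actual patch: the class `EofRecoverySketch`, the `WalFile` type, the method `shouldDropHead`, and the `recoveredSource` flag are all hypothetical names, and file lengths are passed in directly instead of being read through `fs.getFileStatus`.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Queue;

/** Illustrative model only; none of these types are HBase classes. */
final class EofRecoverySketch {

    /** Hypothetical stand-in for a queued WAL: a name and its length in bytes. */
    static final class WalFile {
        final String name;
        final long length;

        WalFile(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    /**
     * Returns true when the head of the queue should be dropped after a read failure.
     * The recoveredSource flag is an assumption standing in for however the real
     * code distinguishes recovered sources from active ones.
     */
    static boolean shouldDropHead(IOException e, Queue<WalFile> logQueue,
                                  boolean eofAutoRecovery, boolean recoveredSource) {
        boolean isEof = e instanceof EOFException || e.getCause() instanceof EOFException;
        if (!isEof || !eofAutoRecovery || logQueue.isEmpty()) {
            return false;
        }
        // Active sources must keep the current WAL in the queue, so a lone entry
        // is never dropped. Recovered sources do not queue a current WAL, so the
        // size check is skipped for them.
        if (!recoveredSource && logQueue.size() <= 1) {
            return false;
        }
        return logQueue.peek().length == 0;
    }

    public static void main(String[] args) {
        Queue<WalFile> queue = new ArrayDeque<>();
        queue.add(new WalFile("wal-0", 0L));
        IOException eof = new EOFException("not a SequenceFile");
        System.out.println("recovered source: " + shouldDropHead(eof, queue, true, true));
        System.out.println("active source: " + shouldDropHead(eof, queue, true, false));
    }
}
```

With a single 0-length file in the queue, the sketch drops the file for a recovered source and keeps it for an active source, which is the distinction this issue asks for.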

          People

            shahrs87 Rushabh Shah