HBase / HBASE-28053

ServerCrashProcedure seems to fail when using Hadoop 3.3.1+


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: hadoop3, wal
    • Labels: None

    Description

      HBase Cluster Issue with Server Crash Procedure After Region Server Goes Down

      We are running an HBase 2.5.5 cluster (binaries taken from the hadoop3-bin download on the HBase download page) on Hadoop 3.3.2. When a region server went down and a ServerCrashProcedure was initiated, we hit an exception that prevented the cluster from recovering.

      Below is a snippet of the exception:

      2023-08-28 21:02:52,163 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] wal.WALSplitter (WALSplitter.java:splitWAL(300)) - Splitting hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056, size=15.7 K (16082bytes)
      2023-08-28 21:02:52,163 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] util.RecoverLeaseFSUtils (RecoverLeaseFSUtils.java:recoverDFSFileLease(86)) - Recover lease on dfs file hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056
      2023-08-28 21:02:52,164 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] util.RecoverLeaseFSUtils (RecoverLeaseFSUtils.java:recoverLease(175)) - Recovered lease, attempt=0 on file=hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056 after 0ms
      2023-08-28 21:02:52,167 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] wal.WALSplitter (WALSplitter.java:splitWAL(423)) - Processed 0 edits across 0 Regions in 4 ms; skipped=0; WAL=hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056, size=15.7 K, length=16082, corrupted=false, cancelled=false
      2023-08-28 21:02:52,167 ERROR [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] handler.RSProcedureHandler (RSProcedureHandler.java:process(53)) - pid=5848252
      java.lang.NoSuchMethodError: 'org.apache.hadoop.hdfs.protocol.DatanodeInfo[] org.apache.hadoop.hdfs.protocol.LocatedBlock.getLocations()'
      at org.apache.hadoop.hbase.fs.HFileSystem$ReorderWALBlocks.reorderBlocks(HFileSystem.java:428)
      at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:367)

      Upon investigation, this appears to be a consequence of HDFS-15255, which was introduced in Hadoop 3.3.1. That change altered the return type of LocatedBlock.getLocations() from DatanodeInfo[] to DatanodeInfoWithStorage[]. The HBase 2.5.5 binary still references the DatanodeInfo[] signature at HFileSystem.java:428. Because the JVM resolves a method by its name and full descriptor, including the return type, the call no longer links against Hadoop 3.3.1+ and fails with the NoSuchMethodError shown above.
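To check which signature a given classpath actually provides, the return type can be inspected by reflection. The following is a minimal sketch, not part of HBase or Hadoop: the class name ReturnTypeProbe and the helper returnTypeOf are hypothetical, and it assumes the Hadoop HDFS client jar is on the classpath when probing LocatedBlock.

```java
import java.lang.reflect.Method;

// Hypothetical diagnostic helper (not part of HBase or Hadoop).
public class ReturnTypeProbe {

    // Returns the declared return type of a public no-arg method, e.g. "byte[]".
    public static String returnTypeOf(String className, String methodName) {
        try {
            Class<?> cls = Class.forName(className);
            Method m = cls.getMethod(methodName);
            // getTypeName() renders array types readably ("Foo[]" rather than "[LFoo;").
            return m.getReturnType().getTypeName();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                "Cannot resolve " + className + "#" + methodName + "()", e);
        }
    }

    public static void main(String[] args) {
        // Defaults to the method named in the stack trace; needs Hadoop on the classpath.
        String cls = args.length > 0 ? args[0] : "org.apache.hadoop.hdfs.protocol.LocatedBlock";
        String method = args.length > 1 ? args[1] : "getLocations";
        System.out.println(cls + "#" + method + "() returns " + returnTypeOf(cls, method));
    }
}
```

Running this with the same classpath as the region server shows the type the running Hadoop version declares, which can then be compared with the descriptor in the NoSuchMethodError message.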

      One potential fix we identified is to rebuild HBase with the patch available in the repository below. This appears to resolve the issue, at least for now.
      https://github.com/aplio/hbase/tree/monkeypatch/fix-serverClashProcedure-caused-by-hbase-3-dataNodeInfo-change
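For illustration only, one general way for calling code to tolerate a narrowed array return type is to invoke the method reflectively and cast the result to the supertype array. This is an assumption about a possible approach, not a description of the linked patch or of the HBASE-26198 fix. The names ReflectiveArrayCall, callArrayGetter, and Holder are hypothetical, and Integer[]/Number[] stand in for the Hadoop types.

```java
import java.lang.reflect.Method;

// Hypothetical sketch: look a method up by name only, so the caller is not
// bound at compile time to one specific return-type descriptor.
public class ReflectiveArrayCall {

    // Stand-in for a library class whose getter returns a narrowed array type.
    public static class Holder {
        public Integer[] values() {
            return new Integer[] {1, 2};
        }
    }

    // Invokes a public no-arg method and casts its result to the expected array type.
    public static <T> T[] callArrayGetter(Object target, String methodName, Class<T[]> expected) {
        try {
            Method m = target.getClass().getMethod(methodName);
            Object result = m.invoke(target);
            // Java arrays are covariant, so an Integer[] passes a cast to Number[].
            return expected.cast(result);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Reflective call to " + methodName + "() failed", e);
        }
    }

    public static void main(String[] args) {
        Number[] values = callArrayGetter(new Holder(), "values", Number[].class);
        System.out.println("element count: " + values.length);
    }
}
```

The trade-off is that a reflective lookup moves the failure from link time to an explicit runtime check, and it costs more per call than a direct invocation, so caching the Method object would be advisable on a hot path.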

       

      The following issue helped us investigate and fix the problem:

      https://issues.apache.org/jira/browse/HBASE-26198

       

      If this bug is confirmed (that is, if my observations are accurate), I'd like to submit a PR to the HBase documentation stating that Hadoop 3.3.1 and later are not compatible with HBase, specifically version 2.5.5.

          People

            Assignee: Unassigned
            Reporter: aplio