Description
HBase cluster fails to recover: ServerCrashProcedure hits NoSuchMethodError after a region server goes down
We are running an HBase 2.5.5 cluster (using the hadoop3-bin artifact from the HBase download page) against Hadoop 3.3.2. When a region server went down and a ServerCrashProcedure was initiated, the procedure failed with an exception that prevented the cluster from recovering.
Below is a snippet of the exception:
2023-08-28 21:02:52,163 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] wal.WALSplitter (WALSplitter.java:splitWAL(300)) - Splitting hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056, size=15.7 K (16082bytes)
2023-08-28 21:02:52,163 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] util.RecoverLeaseFSUtils (RecoverLeaseFSUtils.java:recoverDFSFileLease(86)) - Recover lease on dfs file hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056
2023-08-28 21:02:52,164 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] util.RecoverLeaseFSUtils (RecoverLeaseFSUtils.java:recoverLease(175)) - Recovered lease, attempt=0 on file=hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056 after 0ms
2023-08-28 21:02:52,167 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] wal.WALSplitter (WALSplitter.java:splitWAL(423)) - Processed 0 edits across 0 Regions in 4 ms; skipped=0; WAL=hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056, size=15.7 K, length=16082, corrupted=false, cancelled=false
2023-08-28 21:02:52,167 ERROR [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] handler.RSProcedureHandler (RSProcedureHandler.java:process(53)) - pid=5848252
java.lang.NoSuchMethodError: 'org.apache.hadoop.hdfs.protocol.DatanodeInfo[] org.apache.hadoop.hdfs.protocol.LocatedBlock.getLocations()'
	at org.apache.hadoop.hbase.fs.HFileSystem$ReorderWALBlocks.reorderBlocks(HFileSystem.java:428)
	at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:367)
Upon investigation, this appears to be a consequence of HDFS-15255, introduced in Hadoop 3.3.1, which narrowed the declared return type of LocatedBlock.getLocations() from DatanodeInfo[] to its subclass DatanodeInfoWithStorage[]. Because the return type is part of a method's descriptor in Java bytecode, even this covariant change is binary-incompatible: HBase 2.5.5 was compiled against the old signature and still references it in HFileSystem.java:428, leading to the NoSuchMethodError above at link time rather than at compile time. The relevant HBase code can be viewed here.
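To illustrate the mechanism, the sketch below uses hypothetical stand-in classes (not the real Hadoop ones) that mirror the signature change. It also shows why a reflective lookup, which resolves a method by name and parameter types only and ignores the return type, keeps working across both Hadoop versions; this is one conceivable shape for a compatibility shim, not the actual HBase fix:

```java
import java.lang.reflect.Method;

// Hypothetical stand-ins for the Hadoop classes involved; the names mirror
// the real ones but the implementations are illustrative only.
class DatanodeInfo {
    final String host;
    DatanodeInfo(String host) { this.host = host; }
}

class DatanodeInfoWithStorage extends DatanodeInfo {
    DatanodeInfoWithStorage(String host) { super(host); }
}

// Models Hadoop 3.3.1+, where getLocations() declares the narrowed return type.
class LocatedBlock {
    public DatanodeInfoWithStorage[] getLocations() {
        return new DatanodeInfoWithStorage[] { new DatanodeInfoWithStorage("dn1") };
    }
}

public class ReflectiveShim {
    // Reflection matches on name and parameter types only, not return type,
    // so this call succeeds against either Hadoop signature. By contrast, a
    // direct call compiled against the old DatanodeInfo[] descriptor fails at
    // link time with NoSuchMethodError, as seen in the stack trace above.
    static DatanodeInfo[] locationsOf(LocatedBlock block) throws Exception {
        Method m = block.getClass().getMethod("getLocations");
        // Java arrays are covariant, so the upcast to DatanodeInfo[] is safe.
        return (DatanodeInfo[]) m.invoke(block);
    }

    public static void main(String[] args) throws Exception {
        DatanodeInfo[] locations = locationsOf(new LocatedBlock());
        System.out.println(locations.length + " location(s), first on " + locations[0].host);
    }
}
```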
A potential workaround we identified is to rebuild HBase using the patch available at the repository below; this appears to rectify the issue, at least for now:
https://github.com/aplio/hbase/tree/monkeypatch/fix-serverClashProcedure-caused-by-hbase-3-dataNodeInfo-change
The following issue helped us investigate and fix the problem:
https://issues.apache.org/jira/browse/HBASE-26198
Provided this bug is confirmed (or my observations are accurate), I'd like to submit a PR to the HBase documentation stating that Hadoop 3.3.1 and later are not compatible with HBase binaries (specifically version 2.5.5) built against earlier Hadoop versions.