Description
HBase cluster fails to recover: ServerCrashProcedure hits NoSuchMethodError after a region server goes down
We are running an HBase 2.5.5 cluster (using the hadoop3-bin artifact from the HBase download page) against Hadoop 3.3.2. When a region server went down and a ServerCrashProcedure was initiated, the procedure failed with an exception that prevented the cluster from recovering.
Below is a snippet of the exception:
2023-08-28 21:02:52,163 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] wal.WALSplitter (WALSplitter.java:splitWAL(300)) - Splitting hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056, size=15.7 K (16082bytes)
2023-08-28 21:02:52,163 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] util.RecoverLeaseFSUtils (RecoverLeaseFSUtils.java:recoverDFSFileLease(86)) - Recover lease on dfs file hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056
2023-08-28 21:02:52,164 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] util.RecoverLeaseFSUtils (RecoverLeaseFSUtils.java:recoverLease(175)) - Recovered lease, attempt=0 on file=hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056 after 0ms
2023-08-28 21:02:52,167 INFO [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] wal.WALSplitter (WALSplitter.java:splitWAL(423)) - Processed 0 edits across 0 Regions in 4 ms; skipped=0; WAL=hdfs://hbase:8020/hbase/WALs/HOSTNAME_HERE,16020,1693214237545-splitting/HOSTNAME_HERE%2C16020%2C1693214237545.1693214243056, size=15.7 K, length=16082, corrupted=false, cancelled=false
2023-08-28 21:02:52,167 ERROR [RS_LOG_REPLAY_OPS-regionserver/HOSTNAME_HERE:16020-1] handler.RSProcedureHandler (RSProcedureHandler.java:process(53)) - pid=5848252
java.lang.NoSuchMethodError: 'org.apache.hadoop.hdfs.protocol.DatanodeInfo[] org.apache.hadoop.hdfs.protocol.LocatedBlock.getLocations()'
	at org.apache.hadoop.hbase.fs.HFileSystem$ReorderWALBlocks.reorderBlocks(HFileSystem.java:428)
	at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:367)
Upon investigation, this appears to be a consequence of HDFS-15255, introduced in Hadoop 3.3.1, which narrowed the declared return type of LocatedBlock.getLocations() from DatanodeInfo[] to its subclass DatanodeInfoWithStorage[]. Because the return type is part of a method's descriptor in Java bytecode, even this covariant change is binary-incompatible: HBase 2.5.5 was compiled against the old signature and still references it in HFileSystem.java:428, leading to the NoSuchMethodError above at link time rather than at compile time. The relevant HBase code can be viewed here.
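To illustrate the mechanism, the sketch below uses hypothetical stand-in classes (not the real Hadoop ones) that mirror the signature change. It also shows why a reflective lookup, which resolves a method by name and parameter types only and ignores the return type, keeps working across both Hadoop versions; this is one conceivable shape for a compatibility shim, not the actual HBase fix:

```java
import java.lang.reflect.Method;

// Hypothetical stand-ins for the Hadoop classes involved; the names mirror
// the real ones but the implementations are illustrative only.
class DatanodeInfo {
    final String host;
    DatanodeInfo(String host) { this.host = host; }
}

class DatanodeInfoWithStorage extends DatanodeInfo {
    DatanodeInfoWithStorage(String host) { super(host); }
}

// Models Hadoop 3.3.1+, where getLocations() declares the narrowed return type.
class LocatedBlock {
    public DatanodeInfoWithStorage[] getLocations() {
        return new DatanodeInfoWithStorage[] { new DatanodeInfoWithStorage("dn1") };
    }
}

public class ReflectiveShim {
    // Reflection matches on name and parameter types only, not return type,
    // so this call succeeds against either Hadoop signature. By contrast, a
    // direct call compiled against the old DatanodeInfo[] descriptor fails at
    // link time with NoSuchMethodError, as seen in the stack trace above.
    static DatanodeInfo[] locationsOf(LocatedBlock block) throws Exception {
        Method m = block.getClass().getMethod("getLocations");
        // Java arrays are covariant, so the upcast to DatanodeInfo[] is safe.
        return (DatanodeInfo[]) m.invoke(block);
    }

    public static void main(String[] args) throws Exception {
        DatanodeInfo[] locations = locationsOf(new LocatedBlock());
        System.out.println(locations.length + " location(s), first on " + locations[0].host);
    }
}
```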
A potential workaround we identified is to rebuild HBase using the patch available at the repository below; this appears to rectify the issue, at least for now:
https://github.com/aplio/hbase/tree/monkeypatch/fix-serverClashProcedure-caused-by-hbase-3-dataNodeInfo-change
The following issue helped us investigate and fix the problem:
https://issues.apache.org/jira/browse/HBASE-26198
Provided this bug is confirmed (or my observations are accurate), I'd like to submit a PR to the HBase documentation stating that Hadoop 3.3.1 and later are not compatible with HBase binaries (specifically version 2.5.5) built against earlier Hadoop versions.