Details
- Bug
- Status: Resolved
- Critical
- Resolution: Fixed
- 2.2.0
- None
- None
- Reviewed
Fixes master procedure store migration issues when going from 2.0.x to 2.2.x and/or 2.3.x. Also fixes a failed heartbeat parse during a rolling upgrade from 2.0.x to 2.3.x.
Description
When we upgraded an HBase cluster from 2.2.0-RC0 to 2.3.0 or 2.3.3, the HMaster on the upgraded node failed to start.
The error message is shown below:
2020-11-02 23:04:01,998 ERROR [master/2c4006997f99:16000:becomeActiveMaster] master.HMaster: Failed to become active master
org.apache.hbase.thirdparty.com.google.protobuf.InvalidProtocolBufferException: Message missing required fields: state
    at org.apache.hbase.thirdparty.com.google.protobuf.UninitializedMessageException.asInvalidProtocolBufferException(UninitializedMessageException.java:79)
    at org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.checkMessageInitialized(AbstractParser.java:68)
    at org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:120)
    at org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:125)
    at org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:48)
    at org.apache.hbase.thirdparty.com.google.protobuf.Any.unpack(Any.java:228)
    at org.apache.hadoop.hbase.procedure2.ProcedureUtil$StateSerializer.deserialize(ProcedureUtil.java:124)
    at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.deserializeStateData(RegionRemoteProcedureBase.java:352)
    at org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure.deserializeStateData(OpenRegionProcedure.java:72)
    at org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProcedure(ProcedureUtil.java:294)
    at org.apache.hadoop.hbase.procedure2.store.ProtoAndProcedure.getProcedure(ProtoAndProcedure.java:43)
    at org.apache.hadoop.hbase.procedure2.store.InMemoryProcedureIterator.next(InMemoryProcedureIterator.java:90)
    at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore$1.load(RegionProcedureStore.java:194)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore$2.load(WALProcedureStore.java:474)
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader.finish(ProcedureWALFormatReader.java:151)
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.load(ProcedureWALFormat.java:103)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.load(WALProcedureStore.java:465)
    at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.tryMigrate(RegionProcedureStore.java:184)
    at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.recoverLease(RegionProcedureStore.java:257)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:587)
    at org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1572)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:950)
    at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2240)
    at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:622)
    at java.lang.Thread.run(Thread.java:748)
2020-11-02 23:04:01,998 ERROR [master/2c4006997f99:16000:becomeActiveMaster] master.HMaster: ***** ABORTING master 2c4006997f99,16000,1604358237412: Unhandled exception. Starting shutdown. *****
org.apache.hbase.thirdparty.com.google.protobuf.InvalidProtocolBufferException: Message missing required fields: state
    at org.apache.hbase.thirdparty.com.google.protobuf.UninitializedMessageException.asInvalidProtocolBufferException(UninitializedMessageException.java:79)
    at org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.checkMessageInitialized(AbstractParser.java:68)
    at org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:120)
    at org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:125)
    at org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:48)
    at org.apache.hbase.thirdparty.com.google.protobuf.Any.unpack(Any.java:228)
    at org.apache.hadoop.hbase.procedure2.ProcedureUtil$StateSerializer.deserialize(ProcedureUtil.java:124)
    at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.deserializeStateData(RegionRemoteProcedureBase.java:352)
    at org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure.deserializeStateData(OpenRegionProcedure.java:72)
    at org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProcedure(ProcedureUtil.java:294)
    at org.apache.hadoop.hbase.procedure2.store.ProtoAndProcedure.getProcedure(ProtoAndProcedure.java:43)
    at org.apache.hadoop.hbase.procedure2.store.InMemoryProcedureIterator.next(InMemoryProcedureIterator.java:90)
    at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore$1.load(RegionProcedureStore.java:194)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore$2.load(WALProcedureStore.java:474)
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader.finish(ProcedureWALFormatReader.java:151)
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.load(ProcedureWALFormat.java:103)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.load(WALProcedureStore.java:465)
    at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.tryMigrate(RegionProcedureStore.java:184)
    at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.recoverLease(RegionProcedureStore.java:257)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:587)
    at org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1572)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:950)
    at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2240)
    at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:622)
    at java.lang.Thread.run(Thread.java:748)
2020-11-02 23:04:01,999 INFO [master/2c4006997f99:16000:becomeActiveMaster] regionserver.HRegionServer: ***** STOPPING region server '2c4006997f99,16000,1604358237412' *****
2020-11-02 23:04:01,999 INFO [master/2c4006997f99:16000:becomeActiveMaster] regionserver.HRegionServer: STOPPED: Stopped by master/2c4006997f99:16000:becomeActiveMaster
2020-11-02 23:04:02,814 INFO [2c4006997f99:16000.splitLogManager..Chore.1] hbase.ScheduledChore: Chore: SplitLogManager Timeout Monitor was stopped
2020-11-02 23:04:03,652 INFO [master/2c4006997f99:16000] ipc.NettyRpcServer: Stopping server on /252.17.1.2:16000
2020-11-02 23:04:03,658 INFO [master/2c4006997f99:16000] regionserver.HRegionServer: Stopping infoServer
2020-11-02 23:04:03,871 INFO [master/2c4006997f99:16000] handler.ContextHandler: Stopped o.e.j.w.WebAppContext@6136998b{/,null,UNAVAILABLE}{file:/hbase/hbase-webapps/master}
2020-11-02 23:04:03,877 INFO [master/2c4006997f99:16000] server.AbstractConnector: Stopped ServerConnector@60d1b21f{HTTP/1.1,[http/1.1]}{0.0.0.0:16010}
2020-11-02 23:04:03,878 INFO [master/2c4006997f99:16000] handler.ContextHandler: Stopped o.e.j.s.ServletContextHandler@aa004a0{/static,file:///hbase/hbase-webapps/static/,UNAVAILABLE}
2020-11-02 23:04:03,878 INFO [master/2c4006997f99:16000] handler.ContextHandler: Stopped o.e.j.s.ServletContextHandler@5965844d{/logs,file:///hbase/logs/,UNAVAILABLE}
2020-11-02 23:04:03,888 INFO [master/2c4006997f99:16000] regionserver.HRegionServer: aborting server 2c4006997f99,16000,1604358237412
2020-11-02 23:04:03,889 INFO [master/2c4006997f99:16000] regionserver.HRegionServer: stopping server 2c4006997f99,16000,1604358237412; all regions closed.
2020-11-02 23:04:03,889 INFO [master/2c4006997f99:16000] hbase.ChoreService: Chore service for: master/2c4006997f99:16000 had [] on shutdown
2020-11-02 23:04:03,890 INFO [master/2c4006997f99:16000] region.RegionProcedureStore: Stopping the Region Procedure Store, isAbort=true
2020-11-02 23:04:03,894 WARN [master/2c4006997f99:16000] master.ActiveMasterManager: Failed get of master address: java.io.IOException: Can't get master address from ZooKeeper; znode data == null
2020-11-02 23:04:03,894 INFO [master/2c4006997f99:16000] region.MasterRegion: Closing local region {ENCODED => 1595e783b53d99cd5eef43b6debb2682, NAME => 'master:store,,1.1595e783b53d99cd5eef43b6debb2682.', STARTKEY => '', ENDKEY => ''}, isAbort=true
2020-11-02 23:04:03,901 INFO [master/2c4006997f99:16000] regionserver.HRegion: Closing region master:store,,1.1595e783b53d99cd5eef43b6debb2682.
2020-11-02 23:04:03,903 INFO [master/2c4006997f99:16000] regionserver.HRegion: Closed master:store,,1.1595e783b53d99cd5eef43b6debb2682.
2020-11-02 23:04:03,903 INFO [master/2c4006997f99:16000] hbase.ChoreService: Chore service for: 2c4006997f99:16000.splitLogManager. had [] on shutdown
2020-11-02 23:04:03,904 INFO [master:store-WAL-Roller] wal.AbstractWALRoller: LogRoller exiting.
2020-11-02 23:04:03,998 INFO [ReadOnlyZKClient-252.17.1.5:2181@0x0432f3aa] zookeeper.ZooKeeper: Session: 0x10138d353e5001b closed
2020-11-02 23:04:03,998 INFO [ReadOnlyZKClient-252.17.1.5:2181@0x0432f3aa-EventThread] zookeeper.ClientCnxn: EventThread shut down for session: 0x10138d353e5001b
2020-11-02 23:04:04,098 INFO [master/2c4006997f99:16000] zookeeper.ZooKeeper: Session: 0x10138d353e50018 closed
2020-11-02 23:04:04,098 INFO [master/2c4006997f99:16000] regionserver.HRegionServer: Exiting; stopping=2c4006997f99,16000,1604358237412; zookeeper connection closed.
2020-11-02 23:04:04,098 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down for session: 0x10138d353e50018
2020-11-02 23:04:04,099 ERROR [main] master.HMasterCommandLine: Master exiting
java.lang.RuntimeException: HMaster Aborted
    at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:244)
    at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:140)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
    at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:3072)
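The "Message missing required fields: state" failure in the log above is the kind of error a proto2-style parser raises when a required field is absent from the stored bytes. The following is a simplified Python model of that check, not the protobuf library and not HBase code. It assumes the procedure data was persisted by the older version without a `state` field; the field names `region` and `target_server`, the value `SOME_STATE`, and all function and class names are hypothetical.

```python
"""Simplified model of a proto2-style required-field check (not protobuf itself)."""


class InvalidMessageError(Exception):
    """Stands in for InvalidProtocolBufferException in this sketch."""


def parse_state_data(fields, required=("state",)):
    """Return the fields unchanged, or raise if a required field is absent.

    `fields` models an already-decoded message as a plain dict.
    """
    missing = [name for name in required if name not in fields]
    if missing:
        raise InvalidMessageError(
            "Message missing required fields: " + ", ".join(missing)
        )
    return fields


# Hypothetical payload persisted by an older writer that never set `state`.
old_payload = {"region": "r1", "target_server": "rs1"}

# Hypothetical payload from a newer writer that always sets `state`.
new_payload = {"region": "r1", "target_server": "rs1", "state": "SOME_STATE"}

if __name__ == "__main__":
    print(parse_state_data(new_payload))
    try:
        parse_state_data(old_payload)
    except InvalidMessageError as exc:
        print("parse failed:", exc)
    # Treating the field as optional lets the old payload load.
    print(parse_state_data(old_payload, required=()))
```

Under this assumption, data written before the field existed can only be loaded if the reader tolerates its absence. Whether the actual fix takes that route is not stated in this report.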
It can be reproduced through the following steps:
- Start a 3-node cluster running version 2.2.0-RC0.
- Use hbase pe to write data:
/hbase/bin/hbase pe --nomapred --oneCon=true --valueSize=10 --rows=100 sequentialWrite 1
- Stop the cluster:
  - Use graceful_stop.sh to stop all region servers.
  - Then run stop-hbase.sh.
- Upgrade the node to branch-2.3.
- After the upgrade, the HMaster fails to start, as the log hbase--master-2c4006997f99.log shows.
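The pre-upgrade part of the steps above can be sketched as a small Python driver. This is only an illustration: the install path `/hbase` is taken from the command in the report, while the host names `rs1`, `rs2`, `rs3` and the function names are hypothetical. It defaults to a dry run that prints the commands instead of executing them.

```python
import shlex
import subprocess


def repro_commands(hbase_home, regionservers):
    """Build the pre-upgrade commands from the steps above as argv lists."""
    hbase_bin = f"{hbase_home}/bin"
    commands = [
        # Write a small amount of data with the PerformanceEvaluation tool.
        [f"{hbase_bin}/hbase", "pe", "--nomapred", "--oneCon=true",
         "--valueSize=10", "--rows=100", "sequentialWrite", "1"],
    ]
    # Gracefully stop each region server, one host at a time.
    for host in regionservers:
        commands.append([f"{hbase_bin}/graceful_stop.sh", host])
    # Then stop the rest of the cluster, including the master.
    commands.append([f"{hbase_bin}/stop-hbase.sh"])
    return commands


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical host names; adjust for a real cluster.
    run(repro_commands("/hbase", ["rs1", "rs2", "rs3"]))
```

The upgrade to branch-2.3 and the master restart are left out because the report does not say how the binaries were replaced.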
Issue Links
- is related to
  - HBASE-25340 Protobuf Message Incompatibility Detector (Open)
  - HBASE-22074 Should use procedure store to persist the state in reportRegionStateTransition (Resolved)
- relates to
  - HBASE-25234 [Upgrade] Incompatibility in reading RS report from 2.1 RS when Master is upgraded to a version containing HBASE-21406 (Resolved)
- links to