[HDDS-3004] OM HA stability issues - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.4.0
Fix Version/s: 0.5.0
Component/s: Ozone Manager
Labels: OMHATest
Target Version/s: 0.5.0

Description

To summarize, the major issues I have found:

1. When I run a long-running s3g write workload against a cluster with OM HA and stop the OM leader to force a re-election, the writes stop and never recover.

--updates 2020-02-20:

https://issues.apache.org/jira/browse/HDDS-3031 fixes this issue.

2. If I force an OM re-election and then restart SCM, the cluster cannot see any leader datanode and no datanode is able to send pipeline reports, which also makes the cluster unavailable. I consider this a multi-failure case: the leader OM and SCM run on the same node, and that node suffers a short outage.

--updates 2020-02-20:

When you do a jar swap to a new version of Ozone and enable OM HA while keeping the same ozone-site.xml as before, SCM fails to start if data was already written to the previous Ozone cluster (so that version files and metadata for OM and SCM already exist).

Error log: "PipelineID=aae4f728-82ef-4bbb-a0a5-7b3f2af030cc not found" appears in the SCM out logs when the SCM process fails to start.

--updates 2020-02-24:

After adding some logging to the SCM starter, and assuming SCM is only bounced after the leader OM has been stopped, I see two cases:
1. If SCM is bounced after the former leader OM has been restarted, meaning all OMs are up, SCM bootstraps correctly, but the pipeline report is missing from the node that has no OM process on it (it is always that same node). This leaves all pipelines in the ALLOCATED state and the cluster in safe mode. At this point, if I restart that datanode, it comes back and sends its pipeline report to SCM, and all pipelines move to the OPEN state.
2. If SCM is bounced before the former leader OM is restarted, meaning not all OMs in the Ratis ring are up, SCM cannot bootstrap correctly and reports "Pipeline not found".

Original posting:

I use a client that keeps writing data to a specific S3 gateway endpoint. After the writer starts, I kill the OM process on the OM leader host. From then on, the S3 gateway never accepts writes again and keeps returning InternalError for every new key.

Process Process-488:

 S3UploadFailedError: Failed to upload ./20191204/file1056.dat to ozone-test-reproduce-123/./20191204/file1056.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
 Process Process-489:
 S3UploadFailedError: Failed to upload ./20191204/file9631.dat to ozone-test-reproduce-123/./20191204/file9631.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
 Process Process-490:
 S3UploadFailedError: Failed to upload ./20191204/file7520.dat to ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
 Process Process-491:
 S3UploadFailedError: Failed to upload ./20191204/file4220.dat to ozone-test-reproduce-123/./20191204/file4220.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
 Process Process-492:
 S3UploadFailedError: Failed to upload ./20191204/file5523.dat to ozone-test-reproduce-123/./20191204/file5523.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
 Process Process-493:
 S3UploadFailedError: Failed to upload ./20191204/file7520.dat to ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error

That's a partial list; note that all keys are different. I also tried restarting the OM process on the previous leader OM, but it doesn't help because the leader has changed. Partial OM logs are attached below:

 2020-02-12 14:57:11,128 [IPC Server handler 72 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 72 on 9862, call Call#4859 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
 org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
 at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
 at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
 2020-02-12 14:57:11,918 [IPC Server handler 159 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 159 on 9862, call Call#4864 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
 org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
 at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
 at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
 2020-02-12 14:57:15,395 [IPC Server handler 23 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 23 on 9862, call Call#4869 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
 org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
 at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
 at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
 at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
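
For anyone triaging a similar log, the repeated rejections can be tallied to confirm that requests keep landing on the same non-leader OM. This is only a rough sketch that assumes the exception message format shown in the excerpt above; the function name summarize_not_leader is made up for illustration and is not part of Ozone.

```python
import re
from collections import Counter

# Matches the exception message format seen in the OM log excerpt above.
NOT_LEADER = re.compile(
    r"OMNotLeaderException: OM:(?P<node>\S+) is not the leader\. "
    r"Suggested leader is OM:(?P<suggested>\w+)"
)


def summarize_not_leader(log_text):
    """Count (rejecting OM, suggested leader) pairs found in an OM log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = NOT_LEADER.search(line)
        if match:
            counts[(match.group("node"), match.group("suggested"))] += 1
    return counts
```

If every entry in the result has the same rejecting node and the same suggested leader, as in the excerpt above, the caller is apparently still sending requests to the old OM rather than following the suggestion.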

The ozone-site.xml settings used to enable OM HA are attached as well:

<property>
  <name>ozone.om.service.ids</name>
  <value>OMHA</value>
</property>
<property>
  <name>ozone.om.nodes.OMHA</name>
  <value>om1,om2,om3</value>
</property>
<property>
  <name>ozone.om.node.id</name>
  <value>om1</value>
</property>
<property>
  <name>ozone.om.address.OMHA.om1</name>
  <value>9.134.50.210:9862</value>
</property>
<property>
  <name>ozone.om.address.OMHA.om2</name>
  <value>9.134.51.215:9862</value>
</property>
<property>
  <name>ozone.om.address.OMHA.om3</name>
  <value>9.134.51.25:9862</value>
</property>
<property>
  <name>ozone.om.ratis.enable</name>
  <value>true</value>
</property>
<property>
  <name>ozone.enabled</name>
  <value>true</value>
  <tag>OZONE, REQUIRED</tag>
  <description>
    Status of the Ozone Object Storage service is enabled.
    Set to true to enable Ozone.
    Set to false to disable Ozone.
    Unless this value is set to true, Ozone services will not be started in
    the cluster.

    Please note: By default ozone is disabled on a hadoop cluster.
  </description>
</property>
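
To sanity-check that every OM listed for the service has an address configured, the properties above can be read back programmatically. A minimal sketch, assuming the properties sit under a single <configuration> root as in a complete Hadoop-style site file (the excerpt above omits that root); om_ha_layout is a hypothetical helper, not an Ozone API.

```python
import xml.etree.ElementTree as ET


def om_ha_layout(site_xml):
    """Return {service_id: {node_id: address}} from ozone-site.xml text.

    Assumes a single <configuration> root wrapping the <property> elements.
    A node with no ozone.om.address.<service>.<node> entry maps to None.
    """
    root = ET.fromstring(site_xml)

    # Collect every name/value pair, trimming surrounding whitespace.
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()

    def split_list(value):
        return [item.strip() for item in value.split(",") if item.strip()]

    layout = {}
    for service_id in split_list(props.get("ozone.om.service.ids", "")):
        nodes = split_list(props.get("ozone.om.nodes." + service_id, ""))
        layout[service_id] = {
            node: props.get("ozone.om.address.%s.%s" % (service_id, node))
            for node in nodes
        }
    return layout
```

With the settings above wrapped in a <configuration> element, this would map OMHA to om1, om2 and om3 with their respective host:port values; a None address would point at a node that is listed but not configured.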