  1. Apache Ozone
  2. HDDS-5919

In Kubernetes, OM HA has a circular dependency on service availability


Details

    Description

      In Kubernetes, OM HA requires the FQDN of each OM to be specified in the configuration. The OM address has the form <om_pod_name>.<om_service_name>. During initialization, the OM needs to resolve the FQDN <om_pod_name>.<om_service_name>, but this FQDN is only resolvable once the OM is in the ready state, because the OM service only includes pods that are ready. This is a circular dependency: the OM has to resolve its address in order to initialize, and the address only resolves after the OM is ready.
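      One way to observe this state is to check which of the configured OM host names currently resolve. The following is a minimal diagnostic sketch, not part of Ozone; the function name `unresolvable_hosts` and the injectable `resolve` parameter are hypothetical and exist only for illustration.

```python
import socket


def unresolvable_hosts(hosts, resolve=socket.gethostbyname):
    """Return the hosts whose DNS lookup fails, in the order given.

    `resolve` defaults to a real DNS lookup; it can be swapped for a
    stub so the check can be exercised outside a cluster.
    """
    failed = []
    for host in hosts:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed
```

      Run from inside an OM pod that has not yet become ready, a check like this would be expected to report the pod's own <om_pod_name>.<om_service_name> entry as unresolvable, which is the situation described above.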

       

      My current workaround is to replace the FQDN with the local host name (om-0 instead of om-0.omservice) in the ozone-site.xml config before OM initialization. The side effect is that the Recon component cannot be launched: when Recon looks up the list of OM peers, the returned list is something like om-0 (the leader), om-1.omservice, om-2.omservice, and the leader cannot be accessed under the short name om-0.
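      The workaround can be sketched roughly as follows. This is an illustrative sketch only, not the exact script used: it assumes the standard Hadoop-style <configuration>/<property>/<name>/<value> layout of ozone-site.xml, and the helper name `localize_own_om_address`, the sample property names, and the port in the sample are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample config; property names and port are illustrative.
SAMPLE = """<configuration>
  <property><name>ozone.om.address.omservice.om0</name><value>om-0.omservice:9862</value></property>
  <property><name>ozone.om.address.omservice.om1</name><value>om-1.omservice:9862</value></property>
</configuration>"""


def localize_own_om_address(xml_text, own_fqdn):
    """Rewrite values whose host part equals own_fqdn to the short host name.

    Only the local OM's entries change; peer addresses keep their FQDN.
    """
    short_name = own_fqdn.split(".", 1)[0]
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        value = prop.find("value")
        if value is None or not value.text:
            continue
        # Split "host:port"; sep and port are empty if there is no port.
        host, sep, port = value.text.strip().partition(":")
        if host == own_fqdn:
            value.text = short_name + sep + port
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(localize_own_om_address(SAMPLE, "om-0.omservice"))
```

      After such a rewrite, the local OM advertises itself under the short name while its peers keep their FQDNs, which matches the mixed peer list that Recon receives.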

      Ozone currently seems to target bare-metal deployments, where IPs do not change. We should also take Kubernetes environments into account, where IPs can be dynamic (for example, when a node is rescheduled or the whole application is redeployed for an upgrade).

      2021-11-01 18:55:55 ERROR OzoneManagerServiceProviderImpl:315 - Unable to obtain Ozone Manager DB Snapshot.
      java.net.UnknownHostException: Error while authenticating with endpoint: http://test-ozone-om-uat-0:9874/dbCheckpoint
          at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
          at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
          at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
          at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
          at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.wrapExceptionWithMessage(KerberosAuthenticator.java:232)
          at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:216)
          at org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:348)
          at org.apache.hadoop.hdfs.web.URLConnectionFactory.openConnection(URLConnectionFactory.java:186)
          at org.apache.hadoop.ozone.recon.ReconUtils.makeHttpCall(ReconUtils.java:237)
          at org.apache.hadoop.ozone.recon.spi.impl.OzoneManagerServiceProviderImpl.lambda$getOzoneManagerDBSnapshot$1(OzoneManagerServiceProviderImpl.java:298)
          at java.base/java.security.AccessController.doPrivileged(Native Method)
          at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
          at org.apache.hadoop.security.SecurityUtil.doAsUser(SecurityUtil.java:535)
          at org.apache.hadoop.security.SecurityUtil.doAsLoginUser(SecurityUtil.java:516)
          at org.apache.hadoop.ozone.recon.spi.impl.OzoneManagerServiceProviderImpl.getOzoneManagerDBSnapshot(OzoneManagerServiceProviderImpl.java:297)
          at org.apache.hadoop.ozone.recon.spi.impl.OzoneManagerServiceProviderImpl.updateReconOmDBWithNewSnapshot(OzoneManagerServiceProviderImpl.java:329)
          at org.apache.hadoop.ozone.recon.spi.impl.OzoneManagerServiceProviderImpl.syncDataFromOM(OzoneManagerServiceProviderImpl.java:427)
          at org.apache.hadoop.ozone.recon.spi.impl.OzoneManagerServiceProviderImpl.lambda$start$0(OzoneManagerServiceProviderImpl.java:233)
          at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
          at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
          at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: java.net.UnknownHostException: test-ozone-om-uat-0
          at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)
          at java.base/java.net.Socket.connect(Socket.java:609)
          at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:177)
          at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:474)
          at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:569)
          at java.base/sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
          at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:341)
          at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:362)
          at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1253)
          at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)
          at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)
          at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1015)
          at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:189)
          ... 19 more

            People

              sokui Shawn
              Votes: 1
              Watchers: 6
