  1. HBase
  2. HBASE-27957

HConnection (and ZooKeeperWatcher threads) leak in case of AUTH_FAILED exception.

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.7.2, 2.4.17
    • Fix Version/s: None
    • Component/s: Client
    • Labels: None

    Description

      Observed this in a production environment running some version of the 1.7 release line.
      The application did not have the right keytab set up for authentication. It was trying to create an HConnection, and the ZooKeeper server threw an AUTH_FAILED exception.
      After a few hours in this state, the application had thousands of zk-event-processor threads with the stack trace below.

      "zk-event-processor-pool1-t1" #1275 daemon prio=5 os_prio=0 cpu=1.04ms elapsed=41794.58s tid=0x00007fd7805066d0 nid=0x1245 waiting on condition  [0x00007fd75df01000]
         java.lang.Thread.State: WAITING (parking)
              at jdk.internal.misc.Unsafe.park(java.base@11.0.18.0.102/Native Method)
              - parking to wait for  <0x00007fd9874a85e0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
              at java.util.concurrent.locks.LockSupport.park(java.base@11.0.18.0.102/LockSupport.java:194)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.18.0.102/AbstractQueuedSynchronizer.java:2081)
              at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.18.0.102/LinkedBlockingQueue.java:433)
              at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.18.0.102/ThreadPoolExecutor.java:1054)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.18.0.102/ThreadPoolExecutor.java:1114)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.18.0.102/ThreadPoolExecutor.java:628)
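
A leak like this can also be spotted from inside the JVM, without taking a full thread dump. The following is a minimal sketch, not part of HBase: the class name ThreadLeakProbe and the method countThreadsWithPrefix are hypothetical, and the prefix is taken from the thread name in the dump above.

```java
import java.util.Set;

// Hypothetical diagnostic helper; not part of HBase.
final class ThreadLeakProbe {

    /** Counts live threads whose name starts with the given prefix. */
    static long countThreadsWithPrefix(String prefix) {
        // getAllStackTraces() returns a snapshot keyed by every live thread.
        Set<Thread> liveThreads = Thread.getAllStackTraces().keySet();
        return liveThreads.stream()
                .filter(t -> t.getName().startsWith(prefix))
                .count();
    }

    public static void main(String[] args) {
        // Prefix taken from the thread name in the dump above.
        long count = countThreadsWithPrefix("zk-event-processor");
        System.out.println("zk-event-processor threads: " + count);
    }
}
```

Logging this count periodically and watching for steady growth would flag the leak well before it reaches thousands of threads.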
      
      ConnectionManager.java
      HConnectionImplementation(Configuration conf, boolean managed,
              ExecutorService pool, User user, String clusterId) throws IOException {
              ...
              ...
              try {
                 this.registry = setupRegistry();
                 retrieveClusterId();
                 ...
                 ...
              } catch (Throwable e) {
                 // avoid leaks: registry, rpcClient, ...
                 LOG.debug("connection construction failed", e);
                 close();
                 throw e;
               }
      

      retrieveClusterId internally calls ZKConnectionRegistry#getClusterId.

      ZKConnectionRegistry.java
        private String clusterId = null;
      
        @Override
        public String getClusterId() {
          if (this.clusterId != null) return this.clusterId;
          // No synchronized here, worse case we will retrieve it twice, that's
          //  not an issue.
          try (ZooKeeperKeepAliveConnection zkw = hci.getKeepAliveZooKeeperWatcher()) {
            this.clusterId = ZKClusterId.readClusterIdZNode(zkw);
            if (this.clusterId == null) {
              LOG.info("ClusterId read in ZooKeeper is null");
            }
          } catch (KeeperException | IOException e) {
            // ---> WE ARE SWALLOWING THIS EXCEPTION AND RETURNING NULL.
            LOG.warn("Can't retrieve clusterId from Zookeeper", e);
          }
          return this.clusterId;
        }
      

      ZKConnectionRegistry#getClusterId threw the following exception. (Our logging system trims stack traces longer than 5 lines.)

      Cause: org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hbase/hbaseid
      StackTrace: 
      org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
      org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
      org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1213)
      org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:285)
      org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:470)
      

      We should throw the KeeperException from ZKConnectionRegistry#getClusterId all the way back to the HConnectionImplementation constructor, so that its catch block calls close(), shuts down all the watcher threads, and rethrows the exception to the caller.
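
The proposed behaviour can be illustrated with a dependency-free sketch. It is not the HBase code and not a patch: ClusterIdSource, SketchConnection and the injected pool are hypothetical stand-ins for the registry, HConnectionImplementation and the watcher's zk-event-processor pool. It also assumes the failure surfaces as an IOException, whereas the ticket proposes propagating the KeeperException itself.

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical model of the proposed fix; none of these types exist in HBase.
final class PropagateOnAuthFailure {

    /** Stand-in for the registry: it throws instead of logging and returning null. */
    interface ClusterIdSource {
        String getClusterId() throws IOException;
    }

    /** Stand-in for HConnectionImplementation. */
    static final class SketchConnection implements AutoCloseable {
        // Stand-in for the watcher's event pool. It is injected here only so
        // that a caller can observe the cleanup; the connection owns it.
        private final ExecutorService eventProcessor;
        private String clusterId;

        SketchConnection(ClusterIdSource registry, ExecutorService eventProcessor)
                throws IOException {
            this.eventProcessor = eventProcessor;
            try {
                // Start a worker thread, as a live watcher would.
                eventProcessor.execute(() -> { });
                // The failure reaches this constructor instead of being swallowed.
                this.clusterId = registry.getClusterId();
            } catch (Throwable e) {
                close(); // avoid leaks: stop the pool before rethrowing
                throw e;
            }
        }

        String getClusterId() {
            return clusterId;
        }

        @Override
        public void close() {
            eventProcessor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Simulate the registry failing with an authentication error.
            new SketchConnection(() -> {
                throw new IOException("AuthFailed (simulated)");
            }, pool);
        } catch (IOException e) {
            System.out.println("construction failed: " + e.getMessage());
        }
        System.out.println("pool shut down: " + pool.isShutdown());
    }
}
```

Because the exception reaches the constructor's catch block, the pool is shut down before the caller sees the failure, so a caller that keeps retrying does not accumulate idle worker threads.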

            People

              Assignee: Unassigned
              Reporter: Rushabh Shah (shahrs87)