  Hadoop YARN / YARN-913 (Umbrella: Add a way to register long-lived services in a YARN cluster) / YARN-6136

YARN registry service should avoid scanning whole ZK tree for every container/application finish


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: api, resourcemanager
    • Labels: None

    Description

      In the existing registry service implementation, a purge operation is triggered by every container finish event:

        public void onContainerFinished(ContainerId id) throws IOException {
          LOG.info("Container {} finished, purging container-level records",
              id);
          purgeRecordsAsync("/",
              id.toString(),
              PersistencePolicies.CONTAINER);
        }
      

      Since this happens on every container finish, it essentially scans all (or almost all) ZK nodes from the root.
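
      For illustration, here is a minimal sketch of the kind of depth-first walk a root-level purge implies. The class and method names are hypothetical and use a plain ZooKeeper client rather than the registry's own API; the point is only that the cost grows with the total number of znodes under the root, not with the number of records that belong to the finished container.

        import java.util.List;

        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooKeeper;

        // Hypothetical sketch, not the actual registry purge code.
        public class PurgeScanSketch {
          private final ZooKeeper zk;

          public PurgeScanSketch(ZooKeeper zk) {
            this.zk = zk;
          }

          // Visits every znode under 'path' and reads its data, the way a
          // purge starting at "/" has to in order to find matching records.
          // In the real implementation each visited node also materializes a
          // RegistryPathStatus object, which is where the object churn comes from.
          public int scan(String path) throws KeeperException, InterruptedException {
            int visited = 1;
            zk.getData(path, false, null);
            List<String> children = zk.getChildren(path, false);
            for (String child : children) {
              String childPath = path.endsWith("/") ? path + child : path + "/" + child;
              visited += scan(childPath);
            }
            return visited;
          }
        }

      With 20K+ znodes in the tree, each container finish therefore causes on the order of 20K getChildren/getData calls, which matches the ZK load and object churn described below.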

      We have a cluster with hundreds of ZK nodes used for the service registry and 20K+ ZK nodes used for other purposes. The existing implementation can generate a massive number of ZK operations, as well as internal Java objects (RegistryPathStatus). The RM becomes very unstable when there is a batch of container finish events, because of full GC pauses and ZK connection failures.


            People

              Assignee: Wangda Tan (leftnoteasy)
              Reporter: Wangda Tan (leftnoteasy)
              Votes: 0
              Watchers: 10
