SOLR-17200

"False Positive" Race conditions using "/health?requireHealthyCores=true" near startup

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: main (10.0), 9.6
    • Component/s: None
    • Labels: None

    Description

      There seem to be at least two possible thread race conditions that can cause /health?requireHealthyCores=true to return a false positive while CoreContainer is in the process of starting up.

      1. If the request comes in after CoreContainer has initialized healthCheckHandler but before initializing & running the coreLoadExecutor
      2. A more complex situation where the request comes in while coreLoadExecutor is loading cores, and all of the cores that have finished initialization are "active" in SolrCloud, but other SolrCores remain to be initialized (and may need recovery)

      In both cases, the root of the issue is that requireHealthyCores=true works by checking...

            Collection<CloudDescriptor> coreDescriptors =
                coreContainer.getCores().stream()
                    .map(c -> c.getCoreDescriptor().getCloudDescriptor())
                    .collect(Collectors.toList());
            long unhealthyCores = findUnhealthyCores(coreDescriptors, clusterState);
      

      ...but that means the only CloudDescriptors that are checked are the ones that come from loaded cores (which is what coreContainer.getCores() returns), and any currentlyLoadingCores (registered by CoreContainer calling solrCores.markCoreAsLoading(cd) before starting the coreLoadExecutor) are not considered.
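
      The gap can be illustrated with a toy model that does not use any Solr classes. Everything in it is hypothetical: the class HealthCheckSketch, both method names, and the use of a name-to-active map in place of CloudDescriptor and cluster state. It only shows why a check limited to loaded cores reports healthy during startup, and one possible stricter check that also consults the set of cores still loading. It is not necessarily what the attached patch does.

```java
import java.util.Map;
import java.util.Set;

/** Toy model of the startup race. None of these names are Solr's real API. */
class HealthCheckSketch {

  /**
   * Models the reported behavior: only cores that have finished loading
   * are inspected. Maps a core name to whether that core is "active".
   */
  static boolean healthyLoadedOnly(Map<String, Boolean> loadedCores) {
    // allMatch on an empty stream is true, so "no cores loaded yet"
    // is indistinguishable from "all cores healthy".
    return loadedCores.values().stream().allMatch(Boolean::booleanValue);
  }

  /**
   * Assumed stricter variant: a core that is still loading counts as
   * unhealthy, so the node is only healthy once nothing is pending.
   */
  static boolean healthyIncludingLoading(
      Map<String, Boolean> loadedCores, Set<String> stillLoading) {
    return stillLoading.isEmpty() && healthyLoadedOnly(loadedCores);
  }

  public static void main(String[] args) {
    // Scenario 2 from the description: core1 is loaded and active,
    // core2 is still being loaded by the executor.
    Map<String, Boolean> loaded = Map.of("core1", true);
    Set<String> stillLoading = Set.of("core2");

    System.out.println("loaded-only check:       " + healthyLoadedOnly(loaded));
    System.out.println("loading-aware check:     "
        + healthyIncludingLoading(loaded, stillLoading));
  }
}
```

      In this model the first scenario corresponds to an empty map of loaded cores, for which the loaded-only check also reports healthy. The loading-aware variant rejects it only if the pending cores have already been registered as loading at that point.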

      Attachments

        1. SOLR-17200.patch
          2 kB
          Chris M. Hostetter

          People

            Assignee: Chris M. Hostetter (hossman)
            Reporter: Chris M. Hostetter (hossman)
            Votes: 0
            Watchers: 5