Solr / SOLR-12820

Auto pick method:dvhash based on thresholds


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Facet Module
    • Labels: None

    Description

      I worked with two users last week for whom explicitly setting method:dvhash improved faceting speed drastically.

      The common theme in both use-cases was: one collection hosting data for multiple users. We always filter documents down to a single user (thereby limiting the number of documents drastically) and then perform a complex nested JSON facet.

      Both use-cases fit perfectly the criterion Yonik Seeley mentioned on SOLR-9142:

      faceting on a string field with a high cardinality compared to its domain is less efficient than it could be.

      And DVHASH was the perfect optimization for these use-cases.
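
      To illustrate, here is roughly what forcing the hash method looks like in a JSON Facet request (the collection, filter, and field names here are made up; method and its dvhash value are the actual JSON Facet API parameters):

        curl http://localhost:8983/solr/collection1/query -d '
        {
          "query": "user_id:12345",
          "facet": {
            "categories": {
              "type": "terms",
              "field": "category_s",
              "method": "dvhash"
            }
          }
        }'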

      We are using the facet stream expression in one of the use-cases, and it doesn't expose the method param. We could expose the method param on the facet stream expression, but I feel the better approach to solve this problem would be to address this TODO in the code within the JSON Facet Module:

        if (mincount > 0 && prefix == null && (ntype != null || method == FacetMethod.DVHASH)) {
          // TODO can we auto-pick for strings when term cardinality is much greater than DocSet cardinality?
          //   or if we don't know cardinality but DocSet size is very small
          return new FacetFieldProcessorByHashDV(fcontext, this, sf);
        }
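
      For reference, the facet stream expression we use looks something like this (collection, query, and field names are hypothetical); none of its parameters map to method:

        facet(collection1,
              q="user_id:12345",
              buckets="category_s",
              bucketSorts="count(*) desc",
              bucketSizeLimit=10,
              count(*))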

      I thought about this TODO a little, and here is the approach I'm currently considering:

      int matchingDocs = fcontext.base.size();
      int totalDocs = fcontext.searcher.getIndexReader().maxDoc();
      // If matchingDocs is close to totalDocs then we aren't filtering out many documents,
      // which means the array approach would probably be better than the dvhash approach.

      // Computing the cardinality of the matchingDocs set up front would be expensive.
      // And for totalDocs we don't have a global cardinality at index time, only a per-segment cardinality.

      // So would using the number of matches as an alternative heuristic do the job here?
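
      To make the heuristic concrete, here is a minimal sketch of what the auto-pick could look like. The helper names (preferDvHash, cardinalityUpperBound) and the 1024 / 5% thresholds are hypothetical placeholders I haven't benchmarked:

        import java.io.IOException;
        import org.apache.lucene.index.DocValues;
        import org.apache.lucene.index.LeafReaderContext;
        import org.apache.lucene.index.SortedSetDocValues;
        import org.apache.solr.search.SolrIndexSearcher;

        // Hypothetical helper; thresholds are placeholders, not tuned defaults.
        static boolean preferDvHash(int matchingDocs, int totalDocs) {
          if (matchingDocs < 1024) {
            return true; // tiny DocSet: hashing a handful of values beats allocating per-ord count arrays
          }
          return matchingDocs < totalDocs * 0.05; // base domain is heavily filtered relative to the index
        }

        // If we wanted a term-cardinality signal instead: per-segment value counts are cheap to read,
        // but summing them only gives an upper bound on global cardinality (terms repeat across segments).
        static long cardinalityUpperBound(SolrIndexSearcher searcher, String field) throws IOException {
          long upperBound = 0;
          for (LeafReaderContext leaf : searcher.getIndexReader().leaves()) {
            SortedSetDocValues dv = DocValues.getSortedSet(leaf.reader(), field);
            upperBound += dv.getValueCount();
          }
          return upperBound;
        }

      The condition in the TODO above would then gain one extra disjunct, something like:

        boolean autoPick = method == null
            && preferDvHash(fcontext.base.size(), fcontext.searcher.getIndexReader().maxDoc());
        if (mincount > 0 && prefix == null && (ntype != null || method == FacetMethod.DVHASH || autoPick)) {
          return new FacetFieldProcessorByHashDV(fcontext, this, sf);
        }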

      Any thoughts on whether this approach makes sense? It could be that I'm only drawn to it because both users I worked with last week fell into this category.

       

      cc David Smiley [~joel.bernstein]

      Attachments

        Activity


          People

            Assignee: Unassigned
            Reporter: Varun Thacker

            Dates

              Created:
              Updated:
