Solr / SOLR-12820

Auto pick method:dvhash based on thresholds


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Facet Module
    • Labels: None

    Description

      I worked with two users last week for whom explicitly setting method:dvhash improved faceting speed drastically.

      The common theme in both use-cases was: one collection hosting data for multiple users. We always filter documents down to a single user (thereby limiting the number of documents drastically) and then perform a complex nested JSON facet.

      Both use-cases fit perfectly the criterion Yonik Seeley mentioned on SOLR-9142:

      faceting on a string field with a high cardinality compared to its domain is less efficient than it could be.

      And DVHASH was the perfect optimization for these use-cases.
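
      To illustrate, here is roughly what forcing the hash method looks like in a JSON Facet request (the collection, filter, and field names here are made up; method and its dvhash value are the actual JSON Facet API parameters):

        curl http://localhost:8983/solr/collection1/query -d '
        {
          "query": "user_id:12345",
          "facet": {
            "categories": {
              "type": "terms",
              "field": "category_s",
              "method": "dvhash"
            }
          }
        }'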

      We are using the facet stream expression in one of the use-cases, and it doesn't expose the method param. We could expose the method param on the facet stream expression, but I feel the better approach to solve this problem would be to address this TODO in the code within the JSON Facet Module:

        if (mincount > 0 && prefix == null && (ntype != null || method == FacetMethod.DVHASH)) {
          // TODO can we auto-pick for strings when term cardinality is much greater than DocSet cardinality?
          //   or if we don't know cardinality but DocSet size is very small
          return new FacetFieldProcessorByHashDV(fcontext, this, sf);
        }
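
      For reference, the facet stream expression we use looks something like this (collection, query, and field names are hypothetical); none of its parameters map to method:

        facet(collection1,
              q="user_id:12345",
              buckets="category_s",
              bucketSorts="count(*) desc",
              bucketSizeLimit=10,
              count(*))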

      I thought about this TODO a little, and here is the approach I'm currently considering:

      int matchingDocs = fcontext.base.size();
      int totalDocs = fcontext.searcher.getIndexReader().maxDoc();
      // If matchingDocs is close to totalDocs then we aren't filtering out many documents,
      // which means the array approach would probably be better than the dvhash approach.

      // Computing the cardinality of the matchingDocs set up front would be expensive.
      // And for totalDocs we don't have a global cardinality at index time, only a per-segment cardinality.

      // So would using the number of matches as an alternative heuristic do the job here?
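
      To make the heuristic concrete, here is a minimal sketch of what the auto-pick could look like. The helper names (preferDvHash, cardinalityUpperBound) and the 1024 / 5% thresholds are hypothetical placeholders I haven't benchmarked:

        import java.io.IOException;
        import org.apache.lucene.index.DocValues;
        import org.apache.lucene.index.LeafReaderContext;
        import org.apache.lucene.index.SortedSetDocValues;
        import org.apache.solr.search.SolrIndexSearcher;

        // Hypothetical helper; thresholds are placeholders, not tuned defaults.
        static boolean preferDvHash(int matchingDocs, int totalDocs) {
          if (matchingDocs < 1024) {
            return true; // tiny DocSet: hashing a handful of values beats allocating per-ord count arrays
          }
          return matchingDocs < totalDocs * 0.05; // base domain is heavily filtered relative to the index
        }

        // If we wanted a term-cardinality signal instead: per-segment value counts are cheap to read,
        // but summing them only gives an upper bound on global cardinality (terms repeat across segments).
        static long cardinalityUpperBound(SolrIndexSearcher searcher, String field) throws IOException {
          long upperBound = 0;
          for (LeafReaderContext leaf : searcher.getIndexReader().leaves()) {
            SortedSetDocValues dv = DocValues.getSortedSet(leaf.reader(), field);
            upperBound += dv.getValueCount();
          }
          return upperBound;
        }

      The condition in the TODO above would then gain one extra disjunct, something like:

        boolean autoPick = method == null
            && preferDvHash(fcontext.base.size(), fcontext.searcher.getIndexReader().maxDoc());
        if (mincount > 0 && prefix == null && (ntype != null || method == FacetMethod.DVHASH || autoPick)) {
          return new FacetFieldProcessorByHashDV(fcontext, this, sf);
        }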

      Any thoughts on whether this approach makes sense? It could be that I'm only drawn to it because both users I worked with last week fell into this category.

       

      cc David Smiley [~joel.bernstein]

      Attachments

        Activity


          People

            Assignee: Unassigned
            Reporter: Varun Thacker

            Dates

              Created:
              Updated:
