[SOLR-9142] JSON Facet, add hash table method for terms - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 6.3
Component/s: Facet Module
Labels: None

Description

I indexed a dataset of 2M docs.

top_facet_s, the top-level facet field, has a cardinality of 1000.
For nested facets there are two fields, sub_facet_unique_s (string) and sub_facet_unique_td (double), each with a cardinality of 2M.

The nested query on the double field consistently returns in about 1s. The nested query on the string field takes roughly 10s to execute.

nested string facet

q=*:*&rows=0&json.facet=
	{
		"top_facet_s": {
			"type": "terms",
			"limit": -1,
			"field": "top_facet_s",
			"mincount": 1,
			"excludeTags": "ANY",
			"facet": {
				"sub_facet_unique_s": {
					"type": "terms",
					"limit": 1,
					"field": "sub_facet_unique_s",
					"mincount": 1
				}
			}
		}
	}

nested double facet

q=*:*&rows=0&json.facet=
	{
		"top_facet_s": {
			"type": "terms",
			"limit": -1,
			"field": "top_facet_s",
			"mincount": 1,
			"excludeTags": "ANY",
			"facet": {
				"sub_facet_unique_s": {
					"type": "terms",
					"limit": 1,
					"field": "sub_facet_unique_td",
					"mincount": 1
				}
			}
		}
	}

I tried to dig deeper to understand why nested faceting on a string field is so much slower than on a numeric field.

Since the top facet has a cardinality of 1000, we have to calculate sub-facets for each of its 1000 buckets. The key difference is in how the two cases are implemented.

For the string field, FacetField#getFieldCacheCounts calls createCollectAcc with nDocs=0 and numSlots=2M, which initializes an array of 2M slots. So we create a 2M-slot array 1000 times for this one query, which, from what I understand, is what makes the query slow.

For numeric fields, FacetFieldProcessorNumeric#calcFacets uses a CountSlotAcc which doesn't allocate a huge array. In this query it calls createCollectAcc with numDocs=2k and numSlots=1024.

In string faceting, we create the 2M array because the field's cardinality is 2M: the array position is the ordinal and the value is the count. If we could improve on this, it should speed things up significantly. For sub-facets, we know the cardinality can be at most the top-level bucket's count.
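
The trade-off described above can be sketched in isolation. This is only an illustration, not Solr's code: the class OrdCountSketch and its methods are hypothetical names, and a plain java.util.HashMap stands in for whatever hash structure an actual fix would use. It contrasts a counter array sized by the field's cardinality with a hash table sized by the number of docs in the bucket.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; these are not Solr classes.
final class OrdCountSketch {

    // Dense strategy: one counter slot per ordinal in the whole field,
    // so every bucket pays an O(fieldCardinality) allocation.
    static int[] countDense(int[] docOrds, int fieldCardinality) {
        int[] counts = new int[fieldCardinality];
        for (int ord : docOrds) {
            counts[ord]++;
        }
        return counts;
    }

    // Hash strategy: one entry per ordinal actually seen in the bucket,
    // so the allocation is bounded by the bucket's doc count.
    static Map<Integer, Integer> countHashed(int[] docOrds) {
        Map<Integer, Integer> counts = new HashMap<>(Math.max(16, docOrds.length * 2));
        for (int ord : docOrds) {
            counts.merge(ord, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int fieldCardinality = 2_000_000;          // cardinality from the report
        int[] bucketOrds = {5, 1_999_999, 5, 42};  // made-up ordinals for one bucket
        int[] dense = countDense(bucketOrds, fieldCardinality);
        Map<Integer, Integer> hashed = countHashed(bucketOrds);
        System.out.println("dense slots: " + dense.length);
        System.out.println("hash entries: " + hashed.size());
    }
}
```

With 1000 top-level buckets, the dense variant would allocate and zero 1000 arrays of 2M counters, while the hashed variant only touches as many entries as each bucket has distinct values.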

Attachments


- SOLR_9412_FacetFieldProcessorByHashDV.patch (24/Aug/16 18:31, 22 kB, David Smiley)
- SOLR_9412_FacetFieldProcessorByHashDV.patch (26/Aug/16 21:33, 22 kB, David Smiley)
- SOLR_9412_FacetFieldProcessorByHashDV.patch (29/Aug/16 19:10, 44 kB, David Smiley)
- SOLR_9412_FacetFieldProcessorByHashDV.patch (31/Aug/16 17:41, 44 kB, David Smiley)
- SOLR_9412_FacetFieldProcessorByHashDV.patch (31/Aug/16 18:12, 44 kB, David Smiley)

People

Assignee: David Smiley

Reporter: Varun Thacker

Votes: 0

Watchers: 7

Dates

Created: 21/May/16 08:19

Updated: 09/Nov/16 08:39

Resolved: 31/Aug/16 21:18