[SOLR-15260] Precompute snippet delimiter breaks for the UnifiedHighlighter - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: highlighter
Labels: None

Description

The BreakIterator implementation used by the UnifiedHighlighter can significantly affect highlighting performance. The default implementations come from the JDK, so we have no control over them; they may well be optimized, but they have a complicated job to do. I propose computing the break locations at indexing time in a Solr UpdateRequestProcessor and placing them in a pre-analyzed common field, perhaps named _highlighter_breaks_, that needs indexed=true plus offsets. In this field, the term is the actual field name, the position is meaningless, and the offset pair refers to the span produced by the break iterator (typically a sentence). This data can be stored efficiently in Lucene. The UnifiedHighlighter already has a flexible BreakIterator producer, but it is not notified of the current document, so changes would be needed there (a separate LUCENE issue).
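To make the proposal concrete, here is a minimal sketch of the index-time step only: walking a JDK sentence BreakIterator over a field value and collecting the (start, end) offset pairs that would be attached to a term named after the source field. It uses only the JDK. The class name HighlighterBreaks, the method computeSentenceSpans, and the sample field name "body" are hypothetical, and the wiring into an UpdateRequestProcessor and the pre-analyzed field is deliberately left out, since the issue does not specify it.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Solr or Lucene.
final class HighlighterBreaks {

    /**
     * Returns the sentence spans of the given text as {start, end} pairs,
     * where start is inclusive and end is exclusive. In the proposed field,
     * each pair would become the offsets of one term occurrence.
     */
    static List<int[]> computeSentenceSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        // Assumption: a locale-neutral sentence iterator; a real processor
        // would presumably make the iterator type and locale configurable.
        BreakIterator breaks = BreakIterator.getSentenceInstance(Locale.ROOT);
        breaks.setText(text);

        int start = breaks.first();
        for (int end = breaks.next(); end != BreakIterator.DONE; end = breaks.next()) {
            spans.add(new int[] {start, end});
            start = end;
        }
        return spans;
    }

    public static void main(String[] args) {
        String fieldName = "body"; // hypothetical source field name
        String value = "Solr highlights matches. Breaks can be costly to compute.";
        for (int[] span : computeSentenceSpans(value)) {
            // term = field name, offsets = span of one break-iterator segment
            System.out.println(fieldName + " " + span[0] + "-" + span[1]);
        }
    }
}
```

At query time, the highlighter would read these offsets back for the current document instead of re-running the break iterator over the text, which is the part that needs the separate Lucene change.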

People

Assignee: Unassigned

Reporter: David Smiley

Votes: 0

Watchers: 1

Dates

Created: 14/Mar/21 17:12

Updated: 12/Apr/21 20:17