Details
- Improvement
- Status: Closed
- Trivial
- Resolution: Fixed
- None
- None
- New
Description
The Elasticsearch keyword field uses SortedSet DocValues. In our applications, “keyword” is the most frequently used field type.
LUCENE-7081 added prefix compression for the docvalues terms dict. We can do better by replacing prefix compression with LZ4. In one of our applications, the *.dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
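For context, here is a minimal sketch of what prefix compression over a sorted term list looks like. It is a simplified illustration only, not Lucene's on-disk format or the code from LUCENE-7081; the class `PrefixCompressionSketch` and its methods are hypothetical names. Each term stores only the number of leading bytes it shares with the previous term plus the remaining suffix. A general-purpose compressor such as LZ4 can additionally reuse byte sequences that repeat elsewhere in a block, which is presumably where the extra savings come from.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of prefix compression on sorted terms.
public class PrefixCompressionSketch {

    /** One encoded term: bytes shared with the previous term, plus the rest. */
    public static final class Entry {
        public final int prefixLength;
        public final byte[] suffix;

        Entry(int prefixLength, byte[] suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }
    }

    /** Encodes terms that are already sorted; each entry refers to the term before it. */
    public static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        byte[] previous = new byte[0];
        for (String term : sortedTerms) {
            byte[] current = term.getBytes(StandardCharsets.UTF_8);
            int limit = Math.min(previous.length, current.length);
            int shared = 0;
            while (shared < limit && previous[shared] == current[shared]) {
                shared++;
            }
            out.add(new Entry(shared, Arrays.copyOfRange(current, shared, current.length)));
            previous = current;
        }
        return out;
    }

    /** Total suffix bytes that would have to be stored after prefix sharing. */
    public static int suffixBytes(List<Entry> entries) {
        int total = 0;
        for (Entry e : entries) {
            total += e.suffix.length;
        }
        return total;
    }
}
```

With the terms `user_1000`, `user_1001`, `user_1002`, the first term is stored in full and each later term stores a shared-prefix length of 8 and a single suffix byte.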
I've run simple tests on real application data, comparing the write/merge time cost and the on-disk *.dvd file size (after merging into 1 segment).
Metric | Before | After |
---|---|---|
Write time cost (ms) | 591972 | 618200 |
Merge time cost (ms) | 270661 | 294663 |
*.dvd file size (GB) | 1.95 | 1.15 |
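To make the trade-off in the table explicit, the relative changes can be computed directly from the reported numbers. The snippet below is only a worked-arithmetic sketch; the class name `DvdBenchmarkDelta` is hypothetical and the inputs are copied from the table.

```java
import java.util.Locale;

// Hypothetical helper that turns the before/after figures into percentages.
public class DvdBenchmarkDelta {

    /** Relative change from before to after, in percent (negative means smaller). */
    public static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "Write time: %+.1f%%%n", percentChange(591972, 618200));
        System.out.printf(Locale.ROOT, "Merge time: %+.1f%%%n", percentChange(270661, 294663));
        System.out.printf(Locale.ROOT, "*.dvd size: %+.1f%%%n", percentChange(1.95, 1.15));
    }
}
```

On these numbers, write time rises by about 4.4% and merge time by about 8.9%, while the *.dvd file shrinks by about 41%.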
This feature applies only to high-cardinality fields.
I'm running a benchmark based on luceneutil and will attach the report and patch once it finishes.
Issue Links
- blocks
  - LUCENE-9737 Flexible configuration for DocValue compressions (Open)