  1. Lucene - Core
  2. LUCENE-9663

Adding compression to terms dict from SortedSet/Sorted DocValues

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.9
    • Component/s: core/codecs
    • Labels: None
    • Lucene Fields: New

    Description

      Elasticsearch's keyword field uses SortedSet DocValues, and “keyword” is the most frequently used field type in our applications.
      LUCENE-7081 added prefix compression for the docvalues terms dict. We can do better by replacing prefix compression with LZ4. In one of our applications, the dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
      I ran simple tests on real application data, comparing the write and merge time cost and the on-disk *.dvd file size (after merging into 1 segment).

                              Before    After
      Write time cost (ms)    591972    618200
      Merge time cost (ms)    270661    294663
      *.dvd file size (GB)    1.95      1.15
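
To put the measurements in relative terms, here is a small sketch that computes the percentage change for each row. The class and method names (`BenchmarkDeltas`, `percentChange`) are hypothetical and only serve this illustration; the input figures are copied from the measurements reported in this issue.

```java
import java.util.Locale;

public class BenchmarkDeltas {
    /** Relative change from before to after, in percent (negative means a reduction). */
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // Figures copied from the measurements reported in this issue.
        System.out.printf(Locale.ROOT, "write time: %+.1f%%%n", percentChange(591972, 618200));
        System.out.printf(Locale.ROOT, "merge time: %+.1f%%%n", percentChange(270661, 294663));
        System.out.printf(Locale.ROOT, "dvd size:   %+.1f%%%n", percentChange(1.95, 1.15));
    }
}
```

In other words, write time grows by roughly 4.4% and merge time by roughly 8.9%, in exchange for a *.dvd file that is about 41% smaller.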

      This feature only applies to high-cardinality fields.
      I'm running a benchmark based on luceneutil and will attach the report and patch once it finishes.
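
As a rough illustration of why a general-purpose block compressor can beat prefix compression on a sorted terms dictionary, the sketch below encodes the same sorted terms both ways and prints the sizes. It is not the Lucene implementation: the byte layout is a toy format invented for this example, all names (`TermsDictSketch`, `prefixCompress`, `blockCompress`) are hypothetical, and `java.util.zip.Deflater` is used as a stand-in because the JDK ships no LZ4 codec.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.Deflater;

public class TermsDictSketch {
    /** Length of the common prefix of two byte arrays. */
    static int sharedPrefix(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        int i = 0;
        while (i < n && a[i] == b[i]) i++;
        return i;
    }

    /** Toy prefix compression: each entry is (prefixLen, suffixLen, suffix bytes). */
    static byte[] prefixCompress(List<String> sortedTerms) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] prev = new byte[0];
        for (String term : sortedTerms) {
            byte[] cur = term.getBytes(StandardCharsets.UTF_8);
            int p = sharedPrefix(prev, cur);
            out.write(p);                       // one-byte lengths: terms must stay under 256 bytes here
            out.write(cur.length - p);
            out.write(cur, p, cur.length - p);  // only the bytes that differ from the previous term
            prev = cur;
        }
        return out.toByteArray();
    }

    /** Whole-block compression of (length, bytes) entries. Deflater is a stand-in for LZ4. */
    static byte[] blockCompress(List<String> sortedTerms) {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        for (String term : sortedTerms) {
            byte[] cur = term.getBytes(StandardCharsets.UTF_8);
            raw.write(cur.length);
            raw.write(cur, 0, cur.length);
        }
        Deflater deflater = new Deflater();
        deflater.setInput(raw.toByteArray());
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up sorted terms whose repetition sits in the suffix, which prefix sharing cannot exploit.
        List<String> terms = Arrays.asList(
            "host-01.eu-west.example.com", "host-02.eu-west.example.com",
            "host-03.eu-west.example.com", "node-01.eu-west.example.com",
            "node-02.eu-west.example.com", "node-03.eu-west.example.com");
        System.out.println("prefix-compressed bytes: " + prefixCompress(terms).length);
        System.out.println("block-compressed bytes:  " + blockCompress(terms).length);
    }
}
```

Prefix compression only removes bytes shared with the immediately preceding term, while a block compressor can also reference repeated substrings anywhere in the block. Actual savings depend on the data and on the codec, so the sizes printed here say nothing about the figures reported in this issue.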

            People

              Assignee: Unassigned
              Reporter: Jaison.Bi
              Votes: 0
              Watchers: 8

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 11h 20m