Lucene - Core / LUCENE-10676

FieldInfo#name contributes significantly to heap usage at scale

Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 9.3
    • Fix Version/s: None
    • Component/s: core/codecs
    • Environment: Seen in Lucene 9.3.0 running on Linux using JDK 18 but seems independent of environment.
    • Lucene Fields: New, Patch Available

    Description

      We encountered an Elasticsearch user with high heap usage, a significant proportion of which was attributable to the contents of `FieldInfo#name`.

      This user was certainly pushing some scalability boundaries: this single process had thousands of active Lucene indices, many with 10k+ fields, and many indices had hundreds of segments due to an excess of flushes, so in total they had an enormous number of `FieldInfo` instances. Still, the bulk of the heap usage was just field names, and the total number of distinct field names was fairly small. That's pretty common, especially for time-based data like logs. Some kind of interning or deduplication of these strings would have reduced their heap usage by many GBs.

      Is there a way we could deduplicate these strings? Deduplicating them across segments within each index would already have helped, but ideally we'd like to deduplicate them across indices too.
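
      To make the idea concrete, here is a minimal sketch of the kind of deduplication being asked about. It is not Lucene code and not a proposed patch: `FieldNamePool` and its `dedup` method are hypothetical names, and the sketch assumes a single process-wide pool that callers consult before storing a field name.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical helper (not part of Lucene): maps equal field names to a
 * single canonical String instance so that duplicates can be garbage collected.
 */
final class FieldNamePool {
  // Each key maps to itself; the map only serves to find the canonical instance.
  private final ConcurrentMap<String, String> pool = new ConcurrentHashMap<>();

  /** Returns the canonical instance for {@code name}, registering it on first sight. */
  String dedup(String name) {
    String existing = pool.putIfAbsent(name, name);
    return existing != null ? existing : name;
  }

  /** Number of distinct names currently held. */
  int size() {
    return pool.size();
  }
}
```

      Sharing one pool across indices would cover the cross-index case, while one pool per index would only deduplicate across that index's segments. This sketch never evicts entries, so a real implementation would need some way to release names that are no longer referenced. `String#intern()` is the JVM's built-in equivalent and has its own trade-offs.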

          People

            Assignee: Unassigned
            Reporter: David Turner
            Votes: 1
            Watchers: 3
