Solr / SOLR-14167

Exact unique counts when shards contain disjoint values

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Facet Module
    • Labels: None

    Description

      Currently, for high-cardinality fields, the facet module offers two implementations (unique, hll) that give approximate results. There is one corner case in which a distributed search against a high-cardinality field should still be able to provide an exact result efficiently: when the shards are known to contain disjoint values, i.e. a value may be duplicated within a shard, but no value exists on more than one shard. In that case the exact global count is simply the sum of the per-shard unique counts.
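
      To illustrate the corner case, here is a minimal sketch in plain Java. It is not Solr code and not the attached implementation; the class and method names are hypothetical, and shards are modelled as simple lists of strings. It compares the sum of per-shard distinct counts with the distinct count over the union of all shards.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: shards are modelled as in-memory lists.
public class DisjointShardCount {

    /** Distinct values within one shard; duplicates inside the shard collapse. */
    public static long uniqueInShard(List<String> shardValues) {
        return new HashSet<>(shardValues).size();
    }

    /** Sum of per-shard distinct counts. Exact only if no value is on more than one shard. */
    public static long sumOfShardUniques(List<List<String>> shards) {
        long total = 0;
        for (List<String> shard : shards) {
            total += uniqueInShard(shard);
        }
        return total;
    }

    /** Reference answer: distinct count over the union of all shards. */
    public static long globalUnique(List<List<String>> shards) {
        Set<String> all = new HashSet<>();
        for (List<String> shard : shards) {
            all.addAll(shard);
        }
        return all.size();
    }

    public static void main(String[] args) {
        // Duplicates within a shard, but no value shared between shards.
        List<List<String>> disjoint = List.of(
                List.of("a", "a", "b"),
                List.of("c", "d", "d", "e"));
        System.out.println(sumOfShardUniques(disjoint) + " vs " + globalUnique(disjoint));

        // "b" lives on both shards, so the per-shard sum overcounts.
        List<List<String>> overlapping = List.of(
                List.of("a", "b"),
                List.of("b", "c"));
        System.out.println(sumOfShardUniques(overlapping) + " vs " + globalUnique(overlapping));
    }
}
```

      With disjoint shards the two numbers agree, and each shard only needs to send a single count rather than its values. Once a value appears on two shards the sum overcounts, which is why this would only be safe when disjointness is guaranteed by how the collection is routed.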

      That happens to be the case for the collection I have, but it feels like a very niche use case. Is this functionality too niche for inclusion in the Facet module?

      I attach a naive (untested) example implementation. It could be made slightly more efficient by using SlotAcc implementations that do not populate the first 100 values, or by making this behaviour configurable (perhaps via the FacetContext?).

      Slightly off topic, but the documentation currently says of unique: "Beyond 100 values it yields not exact estimate". My understanding is that this is only true for distributed faceting, and that the result is exact in the non-distributed case.

      UniqueAgg calculates sumUnique, but does not appear to actually use it.

      Attachments

        1. UniqueSumPerShard.java (2 kB, Daniel Lowe)

            People

              Assignee: Unassigned
              Reporter: Daniel Lowe (dan2097)