  1. Cassandra
  2. CASSANDRA-18123

Reuse of metadata collector can break key count calculation

Details

    • Degradation - Performance Bug/Regression
    • Normal
    • Normal
    • User Report
    • All
    • None

    Description

      When flushing a memtable, we currently pass a single constructed MetadataCollector to the SSTableMultiWriter that writes the sstables. The writer may decide to split the data into multiple sstables (e.g. for separate disks, or driven by the compaction strategy). If it does, the cardinality estimation component of the reused MetadataCollector contains the data for all of them, so each individual sstable records a key estimate that covers the whole flush.
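The effect can be illustrated with a minimal sketch. This is not Cassandra's code: `Collector` is a hypothetical stand-in for MetadataCollector, an exact `HashSet` replaces the real cardinality estimator, and the round-robin split stands in for whatever splitting the writer actually performs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SharedCollectorSketch {
    /** Hypothetical stand-in for MetadataCollector; an exact set replaces the cardinality estimator. */
    static final class Collector {
        private final Set<Integer> keys = new HashSet<>();

        void addKey(int key) { keys.add(key); }

        long estimatedKeys() { return keys.size(); }
    }

    /**
     * Simulates flushing keys 0..totalKeys-1 into `sstables` output files (assumed round-robin split).
     * Returns the key estimate recorded for each output file.
     */
    public static List<Long> flush(int totalKeys, int sstables, boolean reuseCollector) {
        Collector shared = new Collector();
        List<Collector> collectors = new ArrayList<>();
        for (int i = 0; i < sstables; i++) {
            // Reusing one collector models the bug; a fresh collector per sstable models the fix.
            collectors.add(reuseCollector ? shared : new Collector());
        }
        for (int key = 0; key < totalKeys; key++) {
            collectors.get(key % sstables).addKey(key);
        }
        List<Long> estimates = new ArrayList<>();
        for (Collector c : collectors) {
            estimates.add(c.estimatedKeys());
        }
        return estimates;
    }

    public static void main(String[] args) {
        System.out.println("reused collector: " + flush(1000, 4, true));  // [1000, 1000, 1000, 1000]
        System.out.println("fresh collectors: " + flush(1000, 4, false)); // [250, 250, 250, 250]
    }
}
```

With the reused collector, every output file reports the key count of the whole flush. A later compaction that picks up only some of those files therefore starts from an inflated estimate.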

      As a result, when such sstables are compacted, the estimated number of keys in the resulting sstables, which is used to size the bloom filter for the compaction result, is far too high.
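To see why an inflated key estimate translates directly into wasted bloom filter space, here is a rough sketch using the textbook sizing formula m = -n ln(p) / (ln 2)^2. This is an assumption for illustration only and is not Cassandra's actual sizing code; the key counts and false-positive rate are hypothetical.

```java
public class BloomSizeSketch {
    /** Textbook optimal number of bits for `keys` entries at false-positive rate `fpRate`. */
    public static double optimalBits(long keys, double fpRate) {
        double ln2 = Math.log(2);
        return -keys * Math.log(fpRate) / (ln2 * ln2);
    }

    public static void main(String[] args) {
        long actualKeys = 250_000;      // hypothetical true key count of a compaction result
        long inflatedKeys = 1_000_000;  // hypothetical estimate inherited from a shared collector
        double fpRate = 0.01;           // hypothetical target false-positive rate

        double needed = optimalBits(actualKeys, fpRate);
        double allocated = optimalBits(inflatedKeys, fpRate);
        System.out.printf("needed:    %.0f bits%n", needed);
        System.out.printf("allocated: %.0f bits%n", allocated);
        System.out.printf("overhead:  %.1fx%n", allocated / needed);
    }
}
```

Under this formula the filter size is linear in the key count, so an estimate that is k times too high yields a filter roughly k times larger than needed at the same target false-positive rate.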

      This makes L1 bloom filters much bigger than they should be. One example (from testing of the upcoming CEP-26, after inserting 100GB of data with 10% reads):
      (current)

       		Bloom filter false positives: 22627369
       		Bloom filter false ratio: 0.02257
       		Bloom filter space used: 1848247864
       		Bloom filter off heap memory used: 2338964088
      

      (fixed)

       		Bloom filter false positives: 24426545
       		Bloom filter false ratio: 0.02429
       		Bloom filter space used: 1118910096
       		Bloom filter off heap memory used: 1532357432
      

            People

              Assignee: Unassigned
              Reporter: Branimir Lambov (blambov)