  1. Cassandra
  2. CASSANDRA-14660

Improve TokenMetaData cache populating performance for large cluster

Details

    Description

      TokenMetaData#cachedOnlyTokenMap is the method C* uses to get a consistent view of tokens and topology on coordinators without paying the read-lock cost. On the first read, the method acquires a synchronized lock, copies the major token metadata structures, and caches the copy. Every token metadata change (driven by gossip) clears the cache, and the next read repopulates it.
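
      The caching pattern described above can be sketched roughly as follows. This is a simplified illustration, not the Cassandra implementation: the class SnapshotCache, its fields, and the Long-to-String map are hypothetical stand-ins for the real token metadata structures.

```java
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a metadata holder with a cached read-only snapshot. */
final class SnapshotCache {
    // Live structure, mutated on (simulated) gossip updates.
    private final TreeMap<Long, String> tokenToEndpoint = new TreeMap<>();
    // Cached copy; null means "invalidated, rebuild on next read".
    private volatile TreeMap<Long, String> cached;
    // Counts how often the expensive copy ran (for illustration only).
    final AtomicInteger rebuilds = new AtomicInteger();

    /** Fast path reads the cached copy without locking; slow path copies under a lock. */
    TreeMap<Long, String> cachedSnapshot() {
        TreeMap<Long, String> snapshot = cached;
        if (snapshot != null)
            return snapshot;
        synchronized (this) {
            snapshot = cached;
            if (snapshot == null) {
                // Every reader that misses the cache waits here while the copy runs.
                snapshot = new TreeMap<>(tokenToEndpoint);
                rebuilds.incrementAndGet();
                cached = snapshot;
            }
            return snapshot;
        }
    }

    /** Any metadata change clears the cache; the next read pays for the copy. */
    synchronized void update(long token, String endpoint) {
        tokenToEndpoint.put(token, endpoint);
        cached = null;
    }
}
```

      In this sketch the cost of a cache miss is exactly the cost of the copy inside the synchronized block, which is why the copy speed matters when updates arrive in bursts.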

      For small to medium-sized clusters this strategy works well. Large clusters, however, can suffer from the locking because cache population is much slower. On one of our largest clusters (~1000 nodes, 125k tokens, C* 3.0.15), each cache population takes about 500~700ms, during which no requests can go through because the synchronized lock is held. This causes waves of timeout errors whenever a large number of gossip messages propagate across the cluster, for example during a cluster restart.

      Based on profiling, we found that the cost comes mostly from copying tokenToEndpointMap. It is a SortedBiMultiValueMap made of a forward map backed by a TreeMap and a reverse map backed by a Guava TreeMultimap. TreeMap has an optimization that reduces copy complexity from O(N*log(N)) to O(N) when copying from already-ordered data. Guava's TreeMultimap copy misses that optimization, which makes it ~10 times slower than it needs to be at our cluster size.
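
      The TreeMap side of this can be observed with plain JDK classes by counting comparator calls. The following is a minimal sketch, not part of the patch or the attached benchmark; the class SortedCopyDemo and its method are hypothetical names. Guava is left out so the example needs no dependencies; the entry-by-entry loop only stands in for any copy that re-inserts elements one at a time.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

final class SortedCopyDemo {
    /** Returns {comparator calls for a sorted bulk copy, comparator calls for per-entry inserts}. */
    static long[] countComparisons(int n) {
        AtomicLong calls = new AtomicLong();
        Comparator<Integer> counting = (a, b) -> {
            calls.incrementAndGet();
            return Integer.compare(a, b);
        };

        TreeMap<Integer, String> source = new TreeMap<>(counting);
        for (int i = 0; i < n; i++)
            source.put(i, "endpoint-" + i);

        // TreeMap(SortedMap) builds the tree directly from the sorted entries.
        calls.set(0);
        TreeMap<Integer, String> bulk = new TreeMap<>(source);
        long bulkCalls = calls.get();

        // Re-inserting one entry at a time searches the tree on every put.
        calls.set(0);
        TreeMap<Integer, String> rebuilt = new TreeMap<>(counting);
        for (Map.Entry<Integer, String> e : source.entrySet())
            rebuilt.put(e.getKey(), e.getValue());
        long insertCalls = calls.get();

        return new long[] { bulkCalls, insertCalls };
    }
}
```

      The sorted bulk copy should need no key comparisons at all, while the per-entry variant performs a tree search for every key.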

      The patch attached to this issue replaces the reverse TreeMultimap<K, V> with a vanilla TreeMap<K, TreeSet<V>> in SortedBiMultiValueMap so that it can be copied in O(N) time.
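
      One way such a TreeMap<K, TreeSet<V>> can be deep-copied in linear time is sketched below. This is an assumption about how the copy could look, not the code from the attached patch; ReverseMapCopy and deepCopy are hypothetical names.

```java
import java.util.TreeMap;
import java.util.TreeSet;

final class ReverseMapCopy {
    /**
     * Deep-copies a sorted multimap represented as TreeMap<K, TreeSet<V>>.
     * Both the outer map and each inner set are built from already-sorted input,
     * so no step has to re-sort or search for insertion points.
     */
    static <K, V> TreeMap<K, TreeSet<V>> deepCopy(TreeMap<K, TreeSet<V>> reverse) {
        // Outer copy: TreeMap(SortedMap) reuses the source ordering.
        TreeMap<K, TreeSet<V>> copy = new TreeMap<>(reverse);
        // Inner copies: TreeSet(SortedSet) does the same for each value set,
        // so the copy shares no mutable state with the source.
        copy.replaceAll((key, values) -> new TreeSet<>(values));
        return copy;
    }
}
```

      The replaceAll step matters: without it the copy would share the inner TreeSet instances with the original, and later changes to the live map would leak into the cached view.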

      I also attached a benchmark (TokenMetaDataBenchmark.java) that simulates a large cluster and then measures the average latency of TokenMetaData cache population.

      Benchmark results before and after the patch:

      trunk: before 100ms, after 13ms
      3.0.x: before 199ms, after 15ms

      (On 3.0.x even the forward TreeMap copy is slow: the O(N*log(N)) to O(N) optimization is not applied because the key comparator is created dynamically, so TreeMap cannot determine that the source and destination are in the same order.)
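
      The comparator effect can be reproduced with the JDK alone. TreeMap.putAll into an empty map takes the linear path only when the source is a SortedMap whose comparator is the same as, or equals(), the destination's. The sketch below is illustrative and does not reproduce the 3.0.x code; ComparatorIdentityDemo, CountingComparator, and putAllCost are hypothetical names.

```java
import java.util.Comparator;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

final class ComparatorIdentityDemo {
    /** Counts compare() calls; deliberately does not override equals(). */
    static final class CountingComparator implements Comparator<Integer> {
        final AtomicLong calls;
        CountingComparator(AtomicLong calls) { this.calls = calls; }
        @Override public int compare(Integer a, Integer b) {
            calls.incrementAndGet();
            return Integer.compare(a, b);
        }
    }

    /** Returns the number of comparator calls made by putAll into an empty TreeMap. */
    static long putAllCost(int n, boolean reuseComparator) {
        AtomicLong calls = new AtomicLong();
        Comparator<Integer> sourceCmp = new CountingComparator(calls);
        TreeMap<Integer, String> source = new TreeMap<>(sourceCmp);
        for (int i = 0; i < n; i++)
            source.put(i, "node" + i);

        // A freshly created comparator orders keys identically but is a different
        // object, so TreeMap cannot tell that the two orderings match.
        Comparator<Integer> destCmp = reuseComparator ? sourceCmp : new CountingComparator(calls);
        TreeMap<Integer, String> dest = new TreeMap<>(destCmp);

        calls.set(0);
        dest.putAll(source);
        return calls.get();
    }
}
```

      Calling putAllCost(n, true) exercises the sorted bulk path, while putAllCost(n, false) falls back to one tree search per entry even though both comparators define the same order.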

      Attachments

        1. TokenMetaDataBenchmark.java
          3 kB
          Pengchao Wang
        2. 14660-3.0.txt
          2 kB
          Pengchao Wang
        3. 14660-trunk.txt
          3 kB
          Pengchao Wang

            People

              wpc Pengchao Wang
              wpc Pengchao Wang
              Pengchao Wang
              Benedict Elliott Smith
