Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-4358

Latency metrics showing surprising results for a keberos-enabled cluster

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Not A Problem
    • 3.6.2
    • None
    • metric system
    • None

    Description

      Hi,

      I'm trying to understand why the values of min/avg/max latency are showing surprising results. The graph below shows the max latency value of a particular node for last 7 days. The value increases gradually over time and it only ever decreases when the node gets restarted as if the metric value gets reset.

      https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/ServerStats.java#L226

      • 3 nodes
      • Keberos enabled
      • TGT ticket cashe enabled.

      I believes the values of min/avg/max latency should show more realistic variations. It's very unlikely that the max latency value is expected to always increase while the node is running.

      https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/ServerStats.java#L142

      public void updateLatency(Request request, long currentTime) {
      long latency = currentTime - request.createTime;
      if (latency < 0) {
      return;
      }
      requestLatency.addDataPoint(latency);
      if (request.getHdr() != null) {
      // Only quorum request should have header
      ServerMetrics.getMetrics().UPDATE_LATENCY.add(latency);
      } else {
      // All read request should goes here
      ServerMetrics.getMetrics().READ_LATENCY.add(latency);
      }

      The method called let me think that the max latency metric gets set if the current values happens to be lower. __ 

      https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/metric/AvgMinMaxCounter.java#L51

      private void setMax(long value) {
      long current;
      while (value > (current = max.get()) && !max.compareAndSet(current, value)) {
      // no op
      }
      }

      I put below a graph of a particular from a totally different cluster for last 2 days. The node has not been restarted and all the data is from the same process. We can see a more realistic variations of the max latency metric as it would normally. 

      Thanks for you time in advance,

      Math

      Attachments

        1. image-2021-08-27-16-37-50-112.png
          65 kB
          Mathieu Gaudin
        2. image-2021-08-27-16-10-28-783.png
          36 kB
          Mathieu Gaudin

        Activity

          People

            Unassigned Unassigned
            mgaudin Mathieu Gaudin
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: