Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-4741

High latency under heavy load when prometheus metrics enabled

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.6.0, 3.6.3, 3.6.1, 3.6.2, 3.6.4
    • None
    • server
    • zookeeper version: 3.6.4

      kernel: 3.10.0-1160.95.1.el7.x86_64

      java version "1.8.0_111"

       

      metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider

    Description

      In our production, we use zookeeper built-in PrometheusMetricsProvider to monitor zookeeper status, recently we observed very high latency in one of our zookeeper cluster which serve heavy load.

      Measured in a heavy client side, the latency could more than 25 seconds. 

      [arthas@8]$ watch org.apache.zookeeper.ClientCnxn submitRequest
      Press Q or Ctrl+C to abort.
      Affect(class count: 1 , method count: 1) cost in 294 ms, listenerId: 1
      method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit
      ts=2023-09-06 16:01:32; [cost=28180.767833ms] result=@ArrayList[
          @Object[][isEmpty=false;size=4],
          @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/<IP>:40942 remoteserver:<IP>/<IP>:2181 lastZxid:146177807315 xid:589796 sent:589796 recv:973892 queuedpkts:0 pendingresp:4398 queuedevents:0],
          @ReplyHeader[585397,146177807315,0
      ],
      ]
      method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit
      ts=2023-09-06 16:01:34; [cost=29052.726493ms] result=@ArrayList[
          @Object[][isEmpty=false;size=4],
          @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/<IP>:40942 remoteserver:<IP>/<IP>:2181 lastZxid:146177807862 xid:589989 sent:589989 recv:974165 queuedpkts:0 pendingresp:4400 queuedevents:0],
          @ReplyHeader[585588,146177807862,0
      ]

       

      Observed many connections with high Recv-Q on the server side. 

      CommitProcWorkThread BLOCKED in org.apache.zookeeper.server.ServerStats#updateLatency:

      "CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 tid=0x00007f86d804a000 nid=0x6bca waiting for monitor entry [0x00007f86deb95000]
         java.lang.Thread.State: BLOCKED (on object monitor)
              at io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91)
              - waiting to lock <0x0000000784dd1a18> (a io.prometheus.client.CKMSQuantiles)
              at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38)
              at io.prometheus.client.Summary$Child.observe(Summary.java:281)
              at io.prometheus.client.Summary.observe(Summary.java:307)
              at org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355)
              at org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153)
              at org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669)
              at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585)
              at org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:545)
              at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      
         Locked ownable synchronizers:
              - <0x0000000734923e80> (a java.util.concurrent.ThreadPoolExecutor$Worker)

       

      The wall clock profile shows that there is lock contention within `CommitProcWorkThread` threads.

      Attachments

        1. 32010.lock.profile.html
          479 kB
          Yike Xiao
        2. 32010.threaddump.001.txt
          161 kB
          Yike Xiao
        3. 32010.wallclock.profile.html
          1.05 MB
          Yike Xiao
        4. image-2023-09-11-16-17-21-166.png
          240 kB
          Yike Xiao

        Issue Links

          Activity

            People

              Unassigned Unassigned
              shawyeok Yike Xiao
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 10m
                  10m