Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-4741

High latency under heavy load when prometheus metrics enabled

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.6.0, 3.6.3, 3.6.1, 3.6.2, 3.6.4
    • None
    • server
    • zookeeper version: 3.6.4

      kernel: 3.10.0-1160.95.1.el7.x86_64

      java version "1.8.0_111"

       

      metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider

    Description

      In our production, we use zookeeper built-in PrometheusMetricsProvider to monitor zookeeper status, recently we observed very high latency in one of our zookeeper cluster which serve heavy load.

      Measured in a heavy client side, the latency could more than 25 seconds. 

      [arthas@8]$ watch org.apache.zookeeper.ClientCnxn submitRequest
      Press Q or Ctrl+C to abort.
      Affect(class count: 1 , method count: 1) cost in 294 ms, listenerId: 1
      method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit
      ts=2023-09-06 16:01:32; [cost=28180.767833ms] result=@ArrayList[
          @Object[][isEmpty=false;size=4],
          @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/<IP>:40942 remoteserver:<IP>/<IP>:2181 lastZxid:146177807315 xid:589796 sent:589796 recv:973892 queuedpkts:0 pendingresp:4398 queuedevents:0],
          @ReplyHeader[585397,146177807315,0
      ],
      ]
      method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit
      ts=2023-09-06 16:01:34; [cost=29052.726493ms] result=@ArrayList[
          @Object[][isEmpty=false;size=4],
          @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/<IP>:40942 remoteserver:<IP>/<IP>:2181 lastZxid:146177807862 xid:589989 sent:589989 recv:974165 queuedpkts:0 pendingresp:4400 queuedevents:0],
          @ReplyHeader[585588,146177807862,0
      ]

       

      Observed many connections with high Recv-Q on the server side. 

      CommitProcWorkThread BLOCKED in org.apache.zookeeper.server.ServerStats#updateLatency:

      "CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 tid=0x00007f86d804a000 nid=0x6bca waiting for monitor entry [0x00007f86deb95000]
         java.lang.Thread.State: BLOCKED (on object monitor)
              at io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91)
              - waiting to lock <0x0000000784dd1a18> (a io.prometheus.client.CKMSQuantiles)
              at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38)
              at io.prometheus.client.Summary$Child.observe(Summary.java:281)
              at io.prometheus.client.Summary.observe(Summary.java:307)
              at org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355)
              at org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153)
              at org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669)
              at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585)
              at org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:545)
              at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      
         Locked ownable synchronizers:
              - <0x0000000734923e80> (a java.util.concurrent.ThreadPoolExecutor$Worker)

       

      The wall clock profile shows that there is lock contention within `CommitProcWorkThread` threads.

      Attachments

        1. image-2023-09-11-16-17-21-166.png
          240 kB
          Yike Xiao
        2. 32010.threaddump.001.txt
          161 kB
          Yike Xiao
        3. 32010.wallclock.profile.html
          1.05 MB
          Yike Xiao
        4. 32010.lock.profile.html
          479 kB
          Yike Xiao

        Issue Links

          Activity

            People

              Unassigned Unassigned
              shawyeok Yike Xiao
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 10m
                  10m