Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • Fix Version/s: 2.5.0
    • Component/s: IPC/RPC

    Description

      We leave Netty-level resource limits unbounded. The number of threads to use for the event loop defaults to 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is INT_MAX.

      We don't do that for our own RPC handlers. We have a notion of maximum handler pool size, with a default of 30, typically raised in production by the user. We constrain the depth of the request queue in multiple ways: limits on the number of queued calls, limits on the total size of call data that can be queued (to avoid memory overrun), CoDel conditioning of the call queues if it is enabled, and so on.
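
      For reference, a minimal sketch of setting these existing RPC-side limits programmatically. The key names are the standard HBase configuration keys referenced above; the values are purely illustrative, not recommendations or defaults:

      {code:java}
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;

      public class RpcQueueLimitsSketch {
        public static void main(String[] args) {
          Configuration conf = HBaseConfiguration.create();
          // Maximum handler pool size (default 30, typically raised in production).
          conf.setInt("hbase.regionserver.handler.count", 30);
          // Cap on the number of queued calls.
          conf.setInt("hbase.ipc.server.max.callqueue.length", 300);
          // Cap on the total size of queued call data, to avoid memory overrun.
          conf.setLong("hbase.ipc.server.max.callqueue.size", 1024L * 1024 * 1024);
          // Optional CoDel conditioning of the call queues.
          conf.set("hbase.ipc.server.callqueue.type", "codel");
          System.out.println("handler.count = " + conf.getInt("hbase.regionserver.handler.count", -1));
        }
      }
      {code}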

      Under load, can we pile up an excess of pending request state, such as direct buffers containing request bytes, at the Netty layer because of downstream resource limits? Those limits act as a bottleneck, as intended, and previously they also applied backpressure through the RPC layer, because SimpleRpcServer had thread limits ("hbase.ipc.server.read.threadpool.size", default 10). Netty, by comparison, may be able to queue up a lot more, because it has been optimized to prefer concurrency.

      Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 (unbounded). I don't know what it can actually get up to in production, because we lack the metric, but there are diminishing returns when threads > cores, so a reasonable default here could be Runtime.getRuntime().availableProcessors() instead of unbounded?
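
      For illustration, a minimal sketch of what a core-count-bounded default would look like at the Netty API level (a standalone example, not HBase's actual server wiring):

      {code:java}
      import io.netty.channel.EventLoopGroup;
      import io.netty.channel.nio.NioEventLoopGroup;

      public class BoundedEventLoopSketch {
        public static void main(String[] args) throws InterruptedException {
          // Bound the event loop group at the number of available cores
          // instead of passing 0 and taking the default behavior.
          int threads = Runtime.getRuntime().availableProcessors();
          EventLoopGroup group = new NioEventLoopGroup(threads);
          try {
            System.out.println("event loop threads = " + threads);
          } finally {
            group.shutdownGracefully().sync();
          }
        }
      }
      {code}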

      maxPendingTasks probably should not be INT_MAX, but that may matter less.
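
      If we did want to bound it, the knob is Netty's io.netty.eventLoop.maxPendingTasks system property, which Netty reads into a static default, so it has to be set before the event loop classes are loaded. A hedged sketch with a purely illustrative value:

      {code:java}
      import io.netty.channel.EventLoopGroup;
      import io.netty.channel.nio.NioEventLoopGroup;

      public class MaxPendingTasksSketch {
        public static void main(String[] args) throws InterruptedException {
          // Must be set before any event loop is created; 65536 is illustrative only.
          System.setProperty("io.netty.eventLoop.maxPendingTasks", "65536");
          EventLoopGroup group = new NioEventLoopGroup();
          try {
            System.out.println("maxPendingTasks = "
              + System.getProperty("io.netty.eventLoop.maxPendingTasks"));
          } finally {
            group.shutdownGracefully().sync();
          }
        }
      }
      {code}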

      The tasks here are:

      • Instrument Netty-level resources to better understand actual resource allocations under load. Investigate what we need to plug in, and where, to gain visibility (see the probe sketch after this list).
      • Where instrumentation designed for this issue can be implemented as low-overhead metrics, consider formally adding them as metrics.
      • Based on the findings from this instrumentation, consider and implement next steps. The goal is to limit concurrency at the Netty layer in such a way that performance remains good and we don't balloon resource usage at the Netty layer under load.
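
      A minimal probe of the kind of Netty-level visibility the first task asks for, using Netty's own allocator metric and per-event-loop pending task counts. This is a standalone illustration; how (or whether) to wire such numbers into HBase's metrics system is exactly what this issue should determine:

      {code:java}
      import io.netty.buffer.PooledByteBufAllocator;
      import io.netty.buffer.PooledByteBufAllocatorMetric;
      import io.netty.channel.EventLoopGroup;
      import io.netty.channel.nio.NioEventLoopGroup;
      import io.netty.util.concurrent.EventExecutor;
      import io.netty.util.concurrent.SingleThreadEventExecutor;

      public class NettyResourceProbe {
        public static void main(String[] args) throws InterruptedException {
          // Stand-in for the RPC server's event loop group.
          EventLoopGroup group = new NioEventLoopGroup();
          try {
            // Sum tasks currently queued on each event loop.
            int pending = 0;
            for (EventExecutor executor : group) {
              if (executor instanceof SingleThreadEventExecutor) {
                pending += ((SingleThreadEventExecutor) executor).pendingTasks();
              }
            }
            // Direct memory currently held by the pooled allocator.
            PooledByteBufAllocatorMetric metric = PooledByteBufAllocator.DEFAULT.metric();
            System.out.println("pending event loop tasks = " + pending);
            System.out.println("pooled direct memory (bytes) = " + metric.usedDirectMemory());
          } finally {
            group.shutdownGracefully().sync();
          }
        }
      }
      {code}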

      If the instrumentation and experimental results indicate no changes are necessary, we can close this as Not A Problem or WontFix.

      Attachments

        1. Image 7-12-22 at 10.45 PM.jpg (543 kB, Michael Stack)
        2. Image 7-11-22 at 10.12 PM.jpg (525 kB, Michael Stack)

          People

            Assignee: Unassigned
            Reporter: Andrew Kyle Purtell (apurtell)
