Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • Fix Version/s: 2.5.0
    • Component/s: IPC/RPC

    Description

      We leave Netty-level resource limits unbounded. The number of threads to use for the event loop defaults to 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is INT_MAX.

      We don't do that for our own RPC handlers. We have a notion of maximum handler pool size, with a default of 30, typically raised in production by the user. We constrain the depth of the request queue in multiple ways: limits on the number of queued calls, limits on the total size of call data that can be queued (to avoid memory overrun), CoDel conditioning of the call queues if it is enabled, and so on.
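
      For reference, a minimal sketch of setting these existing RPC-side limits programmatically. The key names are the standard HBase configuration keys referenced above; the values are purely illustrative, not recommendations or defaults:

      {code:java}
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;

      public class RpcQueueLimitsSketch {
        public static void main(String[] args) {
          Configuration conf = HBaseConfiguration.create();
          // Maximum handler pool size (default 30, typically raised in production).
          conf.setInt("hbase.regionserver.handler.count", 30);
          // Cap on the number of queued calls.
          conf.setInt("hbase.ipc.server.max.callqueue.length", 300);
          // Cap on the total size of queued call data, to avoid memory overrun.
          conf.setLong("hbase.ipc.server.max.callqueue.size", 1024L * 1024 * 1024);
          // Optional CoDel conditioning of the call queues.
          conf.set("hbase.ipc.server.callqueue.type", "codel");
          System.out.println("handler.count = " + conf.getInt("hbase.regionserver.handler.count", -1));
        }
      }
      {code}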

      Under load, can we pile up an excess of pending request state, such as direct buffers containing request bytes, at the Netty layer because of downstream resource limits? Those limits act as a bottleneck, as intended, and previously they also applied backpressure through the RPC layer, because SimpleRpcServer had thread limits ("hbase.ipc.server.read.threadpool.size", default 10). Netty, by comparison, may be able to queue up a lot more, because it has been optimized to prefer concurrency.

      Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 (unbounded). I don't know what it can actually get up to in production, because we lack the metric, but there are diminishing returns when threads > cores, so a reasonable default here could be Runtime.getRuntime().availableProcessors() instead of unbounded?
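
      For illustration, a minimal sketch of what a core-count-bounded default would look like at the Netty API level (a standalone example, not HBase's actual server wiring):

      {code:java}
      import io.netty.channel.EventLoopGroup;
      import io.netty.channel.nio.NioEventLoopGroup;

      public class BoundedEventLoopSketch {
        public static void main(String[] args) throws InterruptedException {
          // Bound the event loop group at the number of available cores
          // instead of passing 0 and taking the default behavior.
          int threads = Runtime.getRuntime().availableProcessors();
          EventLoopGroup group = new NioEventLoopGroup(threads);
          try {
            System.out.println("event loop threads = " + threads);
          } finally {
            group.shutdownGracefully().sync();
          }
        }
      }
      {code}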

      maxPendingTasks probably should not be INT_MAX, but that may matter less.
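
      If we did want to bound it, the knob is Netty's io.netty.eventLoop.maxPendingTasks system property, which Netty reads into a static default, so it has to be set before the event loop classes are loaded. A hedged sketch with a purely illustrative value:

      {code:java}
      import io.netty.channel.EventLoopGroup;
      import io.netty.channel.nio.NioEventLoopGroup;

      public class MaxPendingTasksSketch {
        public static void main(String[] args) throws InterruptedException {
          // Must be set before any event loop is created; 65536 is illustrative only.
          System.setProperty("io.netty.eventLoop.maxPendingTasks", "65536");
          EventLoopGroup group = new NioEventLoopGroup();
          try {
            System.out.println("maxPendingTasks = "
              + System.getProperty("io.netty.eventLoop.maxPendingTasks"));
          } finally {
            group.shutdownGracefully().sync();
          }
        }
      }
      {code}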

      The tasks here are:

      • Instrument Netty-level resources to better understand actual resource allocations under load. Investigate what we need to plug in, and where, to gain visibility (see the probe sketch after this list).
      • Where instrumentation designed for this issue can be implemented as low-overhead metrics, consider formally adding them as metrics.
      • Based on the findings from this instrumentation, consider and implement next steps. The goal is to limit concurrency at the Netty layer in such a way that performance remains good and we don't balloon resource usage at the Netty layer under load.
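
      A minimal probe of the kind of Netty-level visibility the first task asks for, using Netty's own allocator metric and per-event-loop pending task counts. This is a standalone illustration; how (or whether) to wire such numbers into HBase's metrics system is exactly what this issue should determine:

      {code:java}
      import io.netty.buffer.PooledByteBufAllocator;
      import io.netty.buffer.PooledByteBufAllocatorMetric;
      import io.netty.channel.EventLoopGroup;
      import io.netty.channel.nio.NioEventLoopGroup;
      import io.netty.util.concurrent.EventExecutor;
      import io.netty.util.concurrent.SingleThreadEventExecutor;

      public class NettyResourceProbe {
        public static void main(String[] args) throws InterruptedException {
          // Stand-in for the RPC server's event loop group.
          EventLoopGroup group = new NioEventLoopGroup();
          try {
            // Sum tasks currently queued on each event loop.
            int pending = 0;
            for (EventExecutor executor : group) {
              if (executor instanceof SingleThreadEventExecutor) {
                pending += ((SingleThreadEventExecutor) executor).pendingTasks();
              }
            }
            // Direct memory currently held by the pooled allocator.
            PooledByteBufAllocatorMetric metric = PooledByteBufAllocator.DEFAULT.metric();
            System.out.println("pending event loop tasks = " + pending);
            System.out.println("pooled direct memory (bytes) = " + metric.usedDirectMemory());
          } finally {
            group.shutdownGracefully().sync();
          }
        }
      }
      {code}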

      If the instrumentation and experimental results indicate no changes are necessary, we can close this as Not A Problem or WontFix.

      Attachments

        1. Image 7-12-22 at 10.45 PM.jpg (543 kB, Michael Stack)
        2. Image 7-11-22 at 10.12 PM.jpg (525 kB, Michael Stack)

          People

            Assignee: Unassigned
            Reporter: Andrew Kyle Purtell (apurtell)
