[HBASE-22978] Online slow response log - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0-alpha-1, 2.3.0
Fix Version/s: 3.0.0-alpha-1, 2.3.0
Component/s: Admin, Operability, regionserver, shell
Labels:
None

Release Note:

Hide
get_slowlog_responses and clear_slowlog_responses are used to retrieve and clear slow RPC logs from RingBuffer maintained by RegionServers.

New Admin APIs:
1. List<SlowLogRecord> getSlowLogResponses(final Set<ServerName> serverNames,
      final SlowLogQueryFilter slowLogQueryFilter) throws IOException;

2. List<Boolean> clearSlowLogResponses(final Set<ServerName> serverNames)
      throws IOException;

Configs:

1. hbase.regionserver.slowlog.ringbuffer.size:
Default size of ringbuffer to be maintained by each RegionServer in order to store online slowlog responses. This is an in-memory ring buffer of requests that were judged to be too slow in addition to the responseTooSlow logging. The in-memory representation would be complete. For more details, please look into Doc Section: Get Slow Response Log from shell

Default
256

2. hbase.regionserver.slowlog.buffer.enabled:
Indicates whether RegionServers have ring buffer running for storing Online Slow logs in FIFO manner with limited entries. The size of the ring buffer is indicated by config: hbase.regionserver.slowlog.ringbuffer.size The default value is false, turn this on and get latest slowlog responses with complete data.

Default
false

For more details, please look into "Get Slow Response Log from shell" section from HBase book.

Show
get_slowlog_responses and clear_slowlog_responses are used to retrieve and clear slow RPC logs from RingBuffer maintained by RegionServers. New Admin APIs: 1. List<SlowLogRecord> getSlowLogResponses(final Set<ServerName> serverNames,       final SlowLogQueryFilter slowLogQueryFilter) throws IOException; 2. List<Boolean> clearSlowLogResponses(final Set<ServerName> serverNames)       throws IOException; Configs: 1. hbase.regionserver.slowlog.ringbuffer.size: Default size of ringbuffer to be maintained by each RegionServer in order to store online slowlog responses. This is an in-memory ring buffer of requests that were judged to be too slow in addition to the responseTooSlow logging. The in-memory representation would be complete. For more details, please look into Doc Section: Get Slow Response Log from shell Default 256 2. hbase.regionserver.slowlog.buffer.enabled: Indicates whether RegionServers have ring buffer running for storing Online Slow logs in FIFO manner with limited entries. The size of the ring buffer is indicated by config: hbase.regionserver.slowlog.ringbuffer.size The default value is false, turn this on and get latest slowlog responses with complete data. Default false For more details, please look into "Get Slow Response Log from shell" section from HBase book.

Description

Today when an individual RPC exceeds a configurable time bound we log a complaint by way of the logging subsystem. These log lines look like:

2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - (responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
"starttimems":1567203007549,
"responsesize":6819737,
"method":"Scan",
"param":"region { type: REGION_NAME value: \"tsdb,\\000\\000\\215\\f)o\\\\\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000<TRUNCATED>",
"processingtimems":28646,
"client":"10.253.196.215:41116",
"queuetimems":22453,
"class":"HRegionServer"}

Unfortunately we often truncate the request parameters, like in the above example. We do this because the human readable representation is verbose, the rate of too slow warnings may be high, and the combination of these things can overwhelm the log capture system. The truncation is unfortunate because it eliminates much of the utility of the warnings. For example, the region name, the start and end keys, and the filter hierarchy are all important clues for debugging performance problems caused by moderate to low selectivity queries or queries made at a high rate.

We can maintain an in-memory ring buffer of requests that were judged to be too slow in addition to the responseTooSlow logging. The in-memory representation can be complete and compressed. A new admin API and shell command can provide access to the ring buffer for online performance debugging. A modest sizing of the ring buffer will prevent excessive memory utilization for a minor performance debugging feature by limiting the total number of retained records. There is some chance a high rate of requests will cause information on other interesting requests to be overwritten before it can be read. This is the nature of a ring buffer and an acceptable trade off.

The write request types do not require us to retain all information submitted in the request. We don't need to retain all key-values in the mutation, which may be too large to comfortably retain. We only need a unique set of row keys, or even a min/max range, and total counts.

The consumers of this information will be debugging tools. We can afford to apply fast compression to ring buffer entries (if codec support is available), something like snappy or zstandard, and decompress on the fly when servicing the retrieval API request. This will minimize the impact of retaining more information about slow requests than we do today.

This proposal is for retention of request information only, the same information provided by responseTooSlow warnings. Total size of response serialization, possibly also total cell or row counts, should be sufficient to characterize the response.

Optionally persist new entries added to the ring buffer into one or more files in HDFS in a write-behind manner. If the HDFS writer blocks or falls behind and we are unable to persist an entry before it is overwritten, that is fine. Response too slow logging is best effort. If we can detect this make a note of it in the log file. Provide a tool for parsing, dumping, filtering, and pretty printing the slow logs written to HDFS. The tool and the shell can share and reuse some utility classes and methods for accomplishing that.

—

New shell commands:

get_slow_responses [ <server1> ... , <serverN> ] [ , { <filter-attributes> } ]

Retrieve, decode, and pretty print the contents of the too slow response ring buffer maintained by the given list of servers; or all servers in the cluster if no list is provided. Optionally provide a map of parameters for filtering as additional argument. The TABLE filter, which expects a string containing a table name, will include only entries pertaining to that table. The REGION filter, which expects a string containing an encoded region name, will include only entries pertaining to that region. The CLIENT_IP filter, which expects a string containing an IP address, will include only entries pertaining to that client. The USER filter, which expects a string containing a user name, will include only entries pertaining to that user. Filters are additive, for example if both CLIENT_IP and USER filters are given, entries matching either or both conditions will be included. The exception to this is if both TABLE and REGION filters are provided, REGION will be preferred and TABLE will be ignored. A server name is its host, port, and start code, e.g. "host187.example.com,60020,1289493121758".

clear_slow_responses [ <server1> ... , <serverN> ]

Clear the too slow response ring buffer maintained by the given list of servers; or all servers on the cluster if no argument is provided. A server name is its host, port, and start code, e.g. "host187.example.com,60020,1289493121758".

—

New Admin APIs:

List<ResponseDetail> Admin#getSlowResponses(@Nullable List<String> servers);

List<ResponseDetail> Admin#clearSlowResponses(@Nullable List<String> servers);

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

NamedQueue_Framework_Design_HBASE-24528_HBASE-22978_HBASE-24718.pdf
18/Oct/21 11:04
339 kB
Viraj Jasani
Screen Shot 2019-10-19 at 2.31.59 AM.png
21/Oct/19 15:00
539 kB
Viraj Jasani
Screen Shot 2019-10-19 at 2.32.54 AM.png
21/Oct/19 15:00
434 kB
Viraj Jasani
Screen Shot 2019-10-19 at 2.34.11 AM.png
21/Oct/19 15:00
491 kB
Viraj Jasani
Screen Shot 2019-10-19 at 2.36.14 AM.png
21/Oct/19 15:00
99 kB
Viraj Jasani

Issue Links

relates to

HBASE-24718 Generic NamedQueue framework for recent in-memory history

Resolved

HBASE-26345 Design doc for HBase NamedQueue framework

Resolved

links to

GitHub Pull Request #754

GitHub Pull Request #1228

Sub-Tasks

1.	Backport HBASE-22978(with sub-tasks), HBASE-24718, HBASE-24528 to branch-1	Resolved	Viraj Jasani
2.	Thrift support for get and clear slow_log APIs	Resolved	Viraj Jasani
3.	Retrieve online large RPC logs	Resolved	Viraj Jasani
4.	Replicate slow/large RPC calls to HDFS	Resolved	Viraj Jasani
5.	Dynamically turn on RingBuffer for slow/large RPC logs	Open	Unassigned
6.	get_slowlog_responses filters with AND/OR operator support	Resolved	Viraj Jasani
7.	Document hbase:slowlog in detail	Resolved	Viraj Jasani

Online slow response log

Details

Description

Attachments

Attachments

Issue Links

Sub-Tasks

Activity

People

Dates