[HBASE-21461] Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single batch - ASF JIRA

Details

Type: New Feature
Status: In Progress
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: hbase-operator-tools, Replication
Labels: None

Release Note:

This is a RegionServer coprocessor (CP) compatible with HBase version 1 only, since it relies on the *preReplicateLogEntries* method, which has been dropped from the RegionServer CP API in HBase version 2.

Code-wise, it is very similar to *ReplicationSink*, with some extra logic to analyse the WALEntry size and split it into smaller batches, plus a few additional lines to retrieve the replication sink metrics, which must be updated properly whenever the CP processes entries.

It is essentially a workaround for stuck replication problems, so it seems a good candidate for hbase-operator-tools. The idea is to place it in a submodule of hbase-operator-tools named "replication", together with any other replication-related operations tools.


Description

In deployments with replication enabled, a faulty ingestion client may produce a single WALEntry containing too many edits for the same cell. This causes ReplicationSink, in the target cluster, to attempt a single batch mutation with too many operations, which in turn can lead to very large RPC requests that may not fit in the target RS RPC queue. In this case, the messages below are seen on the target RS trying to perform the sink:

WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, attempt=4/4 failed=2ops, last exception: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException): Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size too small? on regionserver01.example.com,60020,1524334173359, tracking started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
2018-09-07 10:40:59,506 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to accept edit because:
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 actions: RemoteWithExtrasException: 2 times, 
at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996)

When this problem manifests, replication gets stuck and WAL files pile up in the source cluster's WALs/oldWALs folders. The typical workaround requires manual cleanup of the replication znodes in ZK, and manual WAL replay of the WAL files containing the large entry.

This CP would handle the issue by checking for large WAL entries and splitting them into smaller batches in the preReplicateLogEntries method hook.
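To illustrate the splitting idea only, here is a minimal sketch in plain Java. It is not the code from the attached patches and does not use any HBase classes: the class name `BatchSplitter`, the `maxCount` and `maxBytes` limits, and the `sizeOf` estimator are all hypothetical names chosen for this example. It assumes the edits of one large entry can be applied in order as several smaller consecutive batches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical helper, not part of HBase: splits a list of edits into
// consecutive batches bounded by item count and estimated byte size.
public final class BatchSplitter {

    private BatchSplitter() {
    }

    /**
     * Splits items into ordered batches. Each batch holds at most maxCount
     * items and at most maxBytes of estimated size. An item that is larger
     * than maxBytes on its own is placed in a batch by itself.
     */
    public static <T> List<List<T>> split(List<T> items, int maxCount, long maxBytes,
            ToLongFunction<? super T> sizeOf) {
        if (maxCount <= 0 || maxBytes <= 0) {
            throw new IllegalArgumentException("maxCount and maxBytes must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        List<T> current = new ArrayList<>();
        long currentBytes = 0;
        for (T item : items) {
            long size = sizeOf.applyAsLong(item);
            boolean wouldOverflow =
                    current.size() >= maxCount || currentBytes + size > maxBytes;
            // Close the current batch before adding an item that would exceed a limit.
            if (!current.isEmpty() && wouldOverflow) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(item);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

In a sketch like this, the caller would submit each returned batch as its own mutation request, in order, so that no single RPC carries the whole oversized entry. Suitable limits would depend on the target cluster's call queue settings and are left as assumptions here.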

Additional Note: HBASE-18027 introduced some safeguards against such large RPC requests, which may already help avoid this scenario. That is not available in 1.2 releases, though, so this CP tool may still be relevant for 1.2 clusters. It may also still be worth having as a workaround for any other, as yet unknown, large RPC scenarios.

Attachments

0001-Initial-version-for-WAL-entry-splitter-CP.txt, 09/Nov/18 11:57, 24 kB, Wellington Chevreuil
HBASE-21461-master.001.txt, 12/Nov/18 14:29, 29 kB, Wellington Chevreuil
