Details
- Type: Umbrella
- Status: Closed
- Priority: Critical
- Resolution: Implemented
Description
Chatting w/ Elliott, we started listing out items to fix that would help keep HBase latency approximately constant as disks go bad, get saturated by a neighbour (EC2), etc.
I made a new LatencyResilience issue category to tag issues that contribute to this project.
I have to go for the moment, but when I get back I'll start linking in existing issues that help this project along and I'll file new ones.
Here is what we chatted about:
+ Multiple WALs effort will help keep write latency roughly constant.
+ Figuring out how to get a new read started over DFSClient if the current replica read is taking too long would help keep read latency approximately constant (maybe we could exploit the nkeywal hackery messing w/ replica ordering).
+ There is an issue where clients can currently pile up on a single region because of the way we do client queues by regionserver. This needs fixing.
The above are a few ideas worth further exploration, at least.
The idea is to bring down our 95th percentiles and to make us more robust in the face of dying disks, etc. I see this issue rising to the fore now that there has been good progress on the MTTR project.
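The second idea above (start a new read against another replica when the current one is slow) can be sketched with plain `java.util.concurrent` primitives. This is only an illustration of the pattern, not the actual DFSClient code path; `readFromReplica` and `HEDGE_DELAY_MS` are hypothetical names, and the replica latencies are simulated.

```java
import java.util.concurrent.*;

// Sketch of a "hedged" read: give the primary replica a short head start,
// and if it has not answered, fire the same read at a secondary replica
// and take whichever finishes first. Illustrative only; not DFSClient API.
public class HedgedReadSketch {
    static final long HEDGE_DELAY_MS = 50; // head start before hedging

    // Stand-in for a read against one replica; latencyMs simulates a
    // healthy or a slow/saturated datanode.
    static String readFromReplica(String replica, long latencyMs)
            throws InterruptedException {
        Thread.sleep(latencyMs);
        return "data-from-" + replica;
    }

    static String hedgedRead() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CompletionService<String> cs = new ExecutorCompletionService<>(pool);
        try {
            // Primary read, simulated as slow (200 ms).
            cs.submit(() -> readFromReplica("primary", 200));
            // Wait only HEDGE_DELAY_MS for the primary to answer.
            Future<String> first = cs.poll(HEDGE_DELAY_MS, TimeUnit.MILLISECONDS);
            if (first == null) {
                // Primary is slow: hedge with a secondary replica (20 ms).
                cs.submit(() -> readFromReplica("secondary", 20));
                first = cs.take(); // whichever replica answers first wins
            }
            return first.get();
        } finally {
            pool.shutdownNow(); // abandon the losing read
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hedgedRead()); // prints "data-from-secondary"
    }
}
```

With these simulated latencies the read completes in roughly 70 ms (50 ms hedge delay + 20 ms secondary read) instead of waiting the full 200 ms on the slow primary, which is the point of the idea: one slow replica no longer sets the tail latency.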
Attachments
Issue Links
- depends upon
  - HBASE-5699 Run with > 1 WAL in HRegionServer (Closed)
  - HBASE-7509 Enable RS to query a secondary datanode in parallel, if the primary takes too long (Closed)
  - HBASE-6295 Possible performance improvement in client batch operations: presplit and send in background (Closed)