Description
We can stop or restart region servers gracefully using the graceful_stop.sh command.
This command should guarantee that all regions are moved off a region server before it is shut down.
However, I have sometimes seen many requests fail while restarting a region server with this command in our production clusters (v1.2.5).
Affected clients got many RegionServerStoppedExceptions and exhausted their retry count.
I found that it took only 0.03 sec to move a region, which is too fast. Moving (unloading) the regions off the region server had not finished, and the regions had not even been closed, when the region server got the shutdown signal.
Because the region server was stopped while it was still serving regions that had not been closed, clients got many exceptions (RegionServerStoppedException).
However, region_mover should wait until a region is served by another region server (that is, until meta has changed).
https://github.com/apache/hbase/blob/branch-1.2/bin/region_mover.rb#L153
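The waiting behaviour described above can be pictured with a small, self-contained Ruby sketch. This is not the actual region_mover.rb code: `wait_until_moved`, its parameters, and the block that stands in for a meta lookup are all hypothetical names used only for illustration.

```ruby
# Hypothetical sketch of "wait until the region is served elsewhere".
# The block stands in for a meta-table lookup; it is NOT an HBase API.
def wait_until_moved(original, max_wait_seconds: 60, interval: 0.1, &lookup)
  deadline = Time.now + max_wait_seconds
  while Time.now < deadline
    current = lookup.call
    # Exact string comparison: any difference counts as "moved".
    return true if current != original
    sleep interval
  end
  false # timed out; the region still appears on the original server
end

# Usage with a stubbed lookup that reports the move on the third poll.
answers = ["rs1,16020,1", "rs1,16020,1", "rs2,16020,1"]
moved = wait_until_moved("rs1,16020,1", interval: 0) { answers.shift }
puts moved
```

A loop like this is only as reliable as its comparison: if the two names can never be equal, the wait ends on the first poll.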
I figured out why this early shutdown happened.
a) Our clusters use upper-case hostnames.
b) The region server builds its ServerName with a lower-case hostname, and this is sent to the master.
https://github.com/apache/hbase/blob/branch-1.2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L542
c) When meta is updated, the server name keeps its original case.
https://github.com/apache/hbase/blob/branch-1.2/hbase-client/src/main/java/org/apache/hadoop/hbase/MetaTableAccessor.java#L1527
d) region_mover.rb simply compares the names from b) and c), so the comparison is always false.
https://github.com/apache/hbase/blob/branch-1.2/bin/region_mover.rb#L91
https://github.com/apache/hbase/blob/branch-1.2/bin/region_mover.rb#L52
I think region_mover should compare the server names from the master and from meta in the same case (lower case).
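As a minimal illustration of the mismatch and of the proposed comparison, here is a plain Ruby sketch. The helper `same_server?` and the sample hostnames, ports, and start codes are made up for this example, and the "hostname,port,startcode" layout is assumed; this is not the actual patch.

```ruby
# Hypothetical helper, not the actual region_mover.rb code.
# Server names are assumed to look like "hostname,port,startcode".
def same_server?(name_from_master, name_from_meta)
  # Normalise both sides to lower case before comparing.
  name_from_master.downcase == name_from_meta.downcase
end

from_master = "rs01.example.com,16020,1500000000000" # lower-cased hostname
from_meta   = "RS01.EXAMPLE.COM,16020,1500000000000" # original upper case

puts from_master == from_meta             # exact comparison: false
puts same_server?(from_master, from_meta) # case-insensitive comparison: true
```

Lower-casing the whole string is harmless here because the port and start code are numeric.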
With the patch, I confirmed that region_mover waited until all regions had been moved and only then triggered the region server shutdown. (I also observed only RegionMovedException before the shutdown log, and no exceptions after the shutdown started.)