It's a classic bug, sort of... the call to open the region times out, but the RS actually processes it just fine. It could also happen if the response didn't make it back due to a network issue.
As a result, the region ends up opened on two servers.
There are some mitigations possible to narrow down the race window.
1) Don't process expired open calls; fail them instead. This won't help with network issues.
2) Don't ignore invalid RS state; kill the RS (YouAreDead exception). That would require fixing other network races where the master kills the RS, which in turn would require adding state versioning to the protocol.
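To make mitigation 1 concrete, here is a minimal sketch of a server-side check that fails an open call whose deadline has already passed instead of processing it. The types OpenRegionRequest, ExpiredCallException and OpenRegionHandler are hypothetical and not part of HBase; the sketch also assumes the caller sends an absolute deadline and ignores clock skew between master and RS.

```java
import java.io.IOException;

// Hypothetical request type; the fields are assumptions, not the HBase protocol.
final class OpenRegionRequest {
    final String regionName;
    final long deadlineMillis; // absolute deadline chosen by the caller

    OpenRegionRequest(String regionName, long deadlineMillis) {
        this.regionName = regionName;
        this.deadlineMillis = deadlineMillis;
    }
}

// Hypothetical exception signalling that the call expired before processing.
final class ExpiredCallException extends IOException {
    ExpiredCallException(String msg) {
        super(msg);
    }
}

final class OpenRegionHandler {
    // Fails the call if the deadline passed before processing started,
    // so a call the caller already gave up on is never executed.
    static String handle(OpenRegionRequest req, long nowMillis) throws ExpiredCallException {
        if (nowMillis > req.deadlineMillis) {
            throw new ExpiredCallException("open of " + req.regionName + " expired");
        }
        return "OPENED " + req.regionName;
    }
}
```

This only narrows the window: a call that starts in time but whose response is lost still leaves the caller with an unknown outcome.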
The fundamental fix though would require either
1) on an unknown failure from open, ascertaining the state of the region from the server. Again, this would probably require protocol changes to make sure we ascertain that the region is not opened, and also that the already-failed-on-master open is NOT going to be processed if it's sitting in some queue or even in transit on the network (via a nonce-like mechanism?)
2) some form of a distributed lock per region, e.g. in ZK
3) some form of 2PC? But the participant list cannot be determined in a manner that's both scalable and guaranteed correct; theoretically it could be all RSes.
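One way the nonce-like mechanism in option 1 might look is sketched below. It assumes every open attempt carries a unique nonce and that the master, after an unknown failure, asks the RS to fence that nonce. RegionOpenTracker and its methods are hypothetical names for illustration, not an existing HBase API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical RS-side bookkeeping for open attempts, keyed by nonce.
final class RegionOpenTracker {
    private final Set<Long> fencedNonces = new HashSet<>();
    private final Map<String, Long> openRegions = new HashMap<>();

    // Called by the master after an unknown failure. Fences the nonce so a
    // late-arriving open is refused, and reports whether that attempt had
    // NOT opened the region (true means it is safe to assign elsewhere).
    synchronized boolean fence(String region, long nonce) {
        fencedNonces.add(nonce);
        Long openedBy = openRegions.get(region);
        return openedBy == null || openedBy != nonce;
    }

    // Processes an open attempt; refuses it if the master already fenced it.
    synchronized boolean open(String region, long nonce) {
        if (fencedNonces.contains(nonce)) {
            return false;
        }
        openRegions.put(region, nonce);
        return true;
    }
}
```

Because both methods are synchronized, the fence either observes the completed open or blocks the later one. The sketch does not cover the case where the fence call itself fails, which is part of why this needs real protocol changes.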
The 2nd assignment
======= by Duo Zhang ======
The actual problem here is in IPCUtil.wrapException. There we add the remote server address to the exception message to make debugging easier, and several instanceof checks keep the original exception type, since upper layers may depend on the exception type for error recovery. But the method does not check for CallTimeoutException, so that exception gets wrapped in a plain IOException, which breaks the code in RSProcedureDispatcher and causes the double assign.
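The shape of the problem can be illustrated with a small standalone sketch. This is not the actual IPCUtil code: ExceptionWrapper is a hypothetical class, and CallTimeoutException here is a local stand-in for the HBase class of the same name. It only shows how an instanceof check preserves the type while the address is added to the message.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;

// Local stand-in for the real exception type, so the sketch is self-contained.
class CallTimeoutException extends IOException {
    CallTimeoutException(String msg) {
        super(msg);
    }
}

// Hypothetical wrapper, illustrating the pattern only.
final class ExceptionWrapper {
    static IOException wrap(InetSocketAddress addr, Throwable error) {
        String msg = "Call to " + addr + " failed: " + error.getMessage();
        if (error instanceof ConnectException) {
            // Keep the type so callers can still catch ConnectException.
            return (IOException) new ConnectException(msg).initCause(error);
        }
        if (error instanceof CallTimeoutException) {
            // Without this branch the timeout falls through to the generic
            // case below and callers can no longer recognize it.
            return (IOException) new CallTimeoutException(msg).initCause(error);
        }
        // Anything else becomes a plain IOException carrying the address.
        return new IOException(msg, error);
    }
}
```

With the CallTimeoutException branch in place, a caller that checks `e instanceof CallTimeoutException` on the wrapped exception still sees a timeout rather than a generic IOException.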