[HBASE-21565] Delete dead server from dead server list too early leads to concurrent Server Crash Procedures(SCP) for a same server - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 3.0.0-alpha-1
Fix Version/s: 3.0.0-alpha-1, 2.2.0
Component/s: amv2
Labels: None

Description

Two kinds of SCP can be scheduled for the same server during a cluster restart: one is triggered by the ZK session timeout, and the other is triggered when a new server reports in, which causes the stale one to fail over. The only barrier between these two kinds of SCP is a check of whether the server is already in the dead server list.

    if (this.deadservers.isDeadServer(serverName)) {
      LOG.warn("Expiration called on {} but crash processing already in progress", serverName);
      return false;
    }

But the problem is that when the master finishes initialization, it deletes all stale servers from the dead server list. Thus, when the SCP for the ZK session timeout comes in, the barrier has already been removed.
Here are the logs showing how this problem occurs.

2018-12-07,11:42:37,589 INFO org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=9, state=RUNNABLE:SERVER_CRASH_START, hasLock=true; ServerCrashProcedure server=c4-hadoop-tst-st27.bj,29100,1544153846859, splitWal=true, meta=false
2018-12-07,11:42:58,007 INFO org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=444, state=RUNNABLE:SERVER_CRASH_START, hasLock=true; ServerCrashProcedure server=c4-hadoop-tst-st27.bj,29100,1544153846859, splitWal=true, meta=false

Now we can see that two SCPs are scheduled for the same server.
But the first procedure finishes only after the second SCP has started.

2018-12-07,11:43:08,038 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=9, state=SUCCESS, hasLock=false; ServerCrashProcedure server=c4-hadoop-tst-st27.bj,29100,1544153846859, splitWal=true, meta=false in 30.5340sec

This leads to the problem that regions are assigned twice.

2018-12-07,12:16:33,039 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: rit=OPEN, location=c4-hadoop-tst-st28.bj,29100,1544154149607, table=test_failover, region=459b3130b40caf3b8f3e1421766f4089 reported OPEN on server=c4-hadoop-tst-st29.bj,29100,1544154149615 but state has otherwise

And here we can see that the server is removed from the dead server list before the second SCP starts.

2018-12-07,11:42:44,938 DEBUG org.apache.hadoop.hbase.master.DeadServer: Removed c4-hadoop-tst-st27.bj,29100,1544153846859 ; numProcessing=3

Thus we should not delete a dead server from the dead server list immediately.
A patch to fix this problem will be uploaded later.
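
The race described above can be reproduced in a small standalone simulation. This is only a sketch and is not taken from the attached patch: the class `DeadServerBarrierSketch`, its methods, and the `deferRemoval` flag are hypothetical names, and "remove the entry only when the SCP finishes" is an assumed way to keep the barrier in place, not necessarily what the fix does.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the dead-server barrier; not HBase code.
final class DeadServerBarrierSketch {
  // Stand-in for the master's dead server list.
  private final Set<String> deadServers = new HashSet<>();
  // Assumption: when true, an entry is removed only after its SCP finishes.
  private final boolean deferRemoval;
  private int scheduledScps = 0;

  DeadServerBarrierSketch(boolean deferRemoval) {
    this.deferRemoval = deferRemoval;
  }

  // Same shape as the check quoted in the description: refuse to schedule
  // a second SCP while the server is still in the dead server list.
  boolean expireServer(String serverName) {
    if (deadServers.contains(serverName)) {
      return false;
    }
    deadServers.add(serverName);
    scheduledScps++;
    return true;
  }

  // Master finished initialization. The reported behavior drops all stale
  // entries here; the deferred variant leaves them alone.
  void onMasterInitialized() {
    if (!deferRemoval) {
      deadServers.clear();
    }
  }

  // SCP finished. Only the deferred variant removes the entry at this point.
  void onScpFinished(String serverName) {
    if (deferRemoval) {
      deadServers.remove(serverName);
    }
  }

  // Replays the order of events from the logs and returns how many SCPs
  // were scheduled for the same server.
  static int simulate(boolean deferRemoval) {
    DeadServerBarrierSketch master = new DeadServerBarrierSketch(deferRemoval);
    String server = "c4-hadoop-tst-st27.bj,29100,1544153846859";
    master.expireServer(server);   // first SCP is scheduled
    master.onMasterInitialized();  // stale entries may be dropped here
    master.expireServer(server);   // ZK session timeout arrives afterwards
    master.onScpFinished(server);  // first SCP finishes last
    return master.scheduledScps;
  }

  public static void main(String[] args) {
    System.out.println("early removal: " + simulate(false) + " SCP(s)");
    System.out.println("deferred removal: " + simulate(true) + " SCP(s)");
  }
}
```

With early removal the second `expireServer` call passes the check and two SCPs are scheduled; with the assumed deferred removal the second call is rejected and only one is scheduled.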

Attachments


HBASE-21565.master.010.patch    17/Dec/18 11:33   19 kB   Jingyun Tian
HBASE-21565.master.009.patch    17/Dec/18 08:00   19 kB   Jingyun Tian
HBASE-21565.master.008.patch    17/Dec/18 06:32   19 kB   Jingyun Tian
HBASE-21565.master.007.patch    17/Dec/18 03:18   15 kB   Jingyun Tian
HBASE-21565.master.006.patch    14/Dec/18 10:28   15 kB   Jingyun Tian
HBASE-21565.master.005.patch    13/Dec/18 04:24   16 kB   Jingyun Tian
HBASE-21565.master.004.patch    12/Dec/18 07:12   11 kB   Jingyun Tian
HBASE-21565.master.003.patch    11/Dec/18 07:21   11 kB   Jingyun Tian
HBASE-21565.master.002.patch    10/Dec/18 03:27   11 kB   Jingyun Tian
HBASE-21565.master.001.patch    07/Dec/18 10:27   7 kB    Jingyun Tian
HBASE-21565.branch-2.002.patch  19/Dec/18 06:23   19 kB   Jingyun Tian
HBASE-21565.branch-2.001.patch  18/Dec/18 09:13   19 kB   Jingyun Tian

Issue Links

is related to

HBASE-22153 Fix the flaky TestRestartCluster

Resolved

links to

Review Board (master)

People

Assignee: Jingyun Tian

Reporter: Jingyun Tian

Votes: 0

Watchers: 10

Dates

Created: 07/Dec/18 04:18

Updated: 22/Nov/20 03:15

Resolved: 04/Mar/19 09:00