HBASE-8207

Replication could have data loss when machine name contains hyphen "-"


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.94.6, 0.95.0
    • Fix Version/s: 0.98.0, 0.94.7, 0.95.0
    • Component/s: Replication
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      While investigating the recent TestReplication* test failures, I finally found the cause (or one of the causes) of the intermittent failures.

      When a machine name contains "-", it breaks the function ReplicationSource.checkIfQueueRecovered. This causes the following issue:

      The deadRegionServers list ends up badly wrong, so replication does not wait for log splitting to finish for a WAL file and moves on to the next one (data loss).

      You can see that replication uses the odd paths constructed from deadRegionServers to check whether a file exists:

      2013-03-26 21:26:51,385 INFO  [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
      2013-03-26 21:26:51,386 INFO  [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
      2013-03-26 21:26:51,387 INFO  [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
      2013-03-26 21:26:51,389 INFO  [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
      2013-03-26 21:26:51,391 INFO  [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
      2013-03-26 21:26:51,394 INFO  [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
      2013-03-26 21:26:51,396 INFO  [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
      2013-03-26 21:26:51,398 INFO  [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125] regionserver.ReplicationSource(524): Possible location hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
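
      The directory names in the log lines above ("1.compute.internal,52170,1364333181125", "west", "156.us", "0") are consistent with the queue name being split on every "-". The following is a minimal sketch of that failure mode, not the actual HBase code: the class QueueNameSplitDemo and the helper naiveDeadServers are hypothetical names, and the sketch assumes the queue name has the form "<peerId>-<serverName>", as the thread name in the log suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QueueNameSplitDemo {

    // Hypothetical helper (not HBase's implementation): treats everything
    // after the first "-"-separated token as a list of dead region servers.
    static List<String> naiveDeadServers(String queueName) {
        String[] parts = queueName.split("-");
        // parts[0] is assumed to be the peer id; the rest are taken as server names.
        return new ArrayList<>(Arrays.asList(parts).subList(1, parts.length));
    }

    public static void main(String[] args) {
        // Queue name taken from the thread name in the log excerpt above.
        String queueName =
            "2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125";
        // One real server name is broken into several bogus fragments.
        for (String fragment : naiveDeadServers(queueName)) {
            System.out.println(fragment);
        }
    }
}
```

      With a hyphen-free host name the same split would yield a single, intact server name, which is why the problem only shows up on hosts such as EC2 instances.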
      

      This happened in the recent test failure at http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false

      In that report, search for:

      File does not exist: hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540
      

      After 10 retries, the replication source gave up and moved on to the next file, so data was lost.

      Since many EC2 machine names contain "-", including our Jenkins servers, this is a high-impact issue.

      Attachments

        1. 8207-trunk-addendum.txt
          1 kB
          Ted Yu
        2. 8207_v3.patch
          9 kB
          Ted Yu
        3. hbase-8207_v2.patch
          9 kB
          Jeffrey Zhong
        4. hbase-8207-0.94-v1.patch
          9 kB
          Jeffrey Zhong
        5. hbase-8207_v2.patch
          9 kB
          Jeffrey Zhong
        6. hbase-8207_v1.patch
          9 kB
          Jeffrey Zhong
        7. hbase-8207.patch
          5 kB
          Jeffrey Zhong
        8. HBASE-8212-94.patch
          5 kB
          Jieshan Bean
        9. failed.txt
          5.06 MB
          Jeffrey Zhong

          People

            Assignee: Jeffrey Zhong (jeffreyz)
            Reporter: Jeffrey Zhong (jeffreyz)