HBase / HBASE-26347

Support detect and exclude slow DNs in fan-out of WAL


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 3.0.0-alpha-2
    • Fix Version/s: 2.5.0, 3.0.0-alpha-3
    • Component/s: wal
    • Labels: None
    • Release Note:
      This issue adds a way to detect slow datanodes by checking the packet processing time of each datanode the WAL is connected to. When a datanode is considered slow, it is added to an exclude cache on the regionserver, and every stream created during a configured period excludes all cached slow datanodes. The exclude logic cooperates with the log rolling logic, so the regionserver reacts more sensitively to slow datanodes in the underlying HDFS layer, whether the cause is hardware failure or hotspots.

      hbase.regionserver.async.wal.max.exclude.datanode.count (default 3) and hbase.regionserver.async.wal.exclude.datanode.info.ttl.hour (default 6) mean that, by default, no more than 3 slow datanodes are excluded on one regionserver, and each entry in the exclude cache stays valid for 6 hours.
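      The two limits above can be illustrated with a minimal, self-contained sketch of a bounded exclude cache with a TTL. This is not the HBase implementation: the class name ExcludeCacheSketch, its methods, the use of plain strings for datanodes, and the choice to refuse new entries when the cache is full are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not HBase code: a bounded exclude cache whose entries expire.
class ExcludeCacheSketch {
    private final int maxCount; // cf. ...max.exclude.datanode.count (default 3)
    private final long ttlMs;   // cf. ...exclude.datanode.info.ttl.hour (default 6), in ms
    private final Map<String, Long> addedAtMs = new LinkedHashMap<>();

    ExcludeCacheSketch(int maxCount, long ttlMs) {
        this.maxCount = maxCount;
        this.ttlMs = ttlMs;
    }

    // Returns true if the datanode was newly added to the cache.
    // Assumption: when the cache is full, new slow datanodes are not added.
    synchronized boolean tryAdd(String datanode, long nowMs) {
        expire(nowMs);
        if (addedAtMs.containsKey(datanode) || addedAtMs.size() >= maxCount) {
            return false;
        }
        addedAtMs.put(datanode, nowMs);
        return true;
    }

    // Datanodes a newly created stream should avoid at time nowMs.
    synchronized Set<String> excluded(long nowMs) {
        expire(nowMs);
        return new HashSet<>(addedAtMs.keySet());
    }

    // Drop entries whose TTL has elapsed, so recovered datanodes become eligible again.
    private void expire(long nowMs) {
        addedAtMs.values().removeIf(addedAt -> nowMs - addedAt >= ttlMs);
    }
}
```

      With the defaults this would be constructed as new ExcludeCacheSketch(3, 6L * 60 * 60 * 1000). Passing the current time in explicitly keeps the sketch easy to test.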

      Two conditions are used to determine whether a datanode is slow:
      1. For a small packet, there is a simple time limit (configured by hbase.regionserver.async.wal.datanode.slow.packet.process.time.millis, default 6s) that does not take the packet size into account.

      2. For a large packet, the transfer speed is calculated and compared against a minimum speed (configured by hbase.regionserver.async.wal.datanode.slow.packet.speed.min.kbs, default 20KB/s).

      The split point between small and large packets is configured by hbase.regionserver.async.wal.datanode.slow.check.speed.packet.data.length.min (default 64KB).
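
      The two conditions can be sketched as follows. This is only a reading of the release note, not the actual HBase code: the class and method names are hypothetical, the constants are the defaults quoted above, and treating a packet of exactly 64KB as large is an assumption.

```java
// Hypothetical sketch of the slow-datanode check described in the release note.
class SlowPacketCheck {
    static final long SLOW_PACKET_TIME_MS = 6_000;        // ...slow.packet.process.time.millis
    static final double MIN_SPEED_KBS = 20.0;             // ...slow.packet.speed.min.kbs
    static final long SPEED_CHECK_MIN_LENGTH = 64 * 1024; // ...packet.data.length.min, in bytes

    // packetDataLen is in bytes; processTimeMs is the time until the datanode acked the packet.
    static boolean isSlow(long packetDataLen, long processTimeMs) {
        if (packetDataLen < SPEED_CHECK_MIN_LENGTH) {
            // Small packet: a plain time limit, regardless of size.
            return processTimeMs > SLOW_PACKET_TIME_MS;
        }
        if (processTimeMs <= 0) {
            // Acked immediately; no meaningful speed to compute, so not slow.
            return false;
        }
        // Large packet: compare throughput in KB/s against the configured minimum.
        double speedKBs = (packetDataLen / 1024.0) / (processTimeMs / 1000.0);
        return speedKBs < MIN_SPEED_KBS;
    }
}
```

      For example, a 128KB packet acked after 10 seconds works out to 12.8KB/s, which is below the 20KB/s minimum, so the datanode would be flagged as slow.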

    Description

      We all know that WAL sync performance directly affects RPC processing time.

      We use the self-designed FanOutOneBlockAsyncDFSOutput to sync WAL entries, which connects directly to all the DNs where the block is located. If even one of those DNs is slow, e.g. because of a disk hardware failure, WAL syncs become slow. Moreover, hardware failure detection in the lower-layer HDFS system is not very sensitive.

      We can detect slow DNs by the ACK time of packets in FanOutOneBlockAsyncDFSOutput, and exclude them when adding new blocks after the log is rolled (a log roll can also be triggered by slow syncs). This information can also be shown in the UI. We can also invalidate the cached excluded DNs after a period of time, so that DNs that have recovered are used again.

      I think this idea can quickly reduce the impact of slow DNs and improve service availability.

       

       


          People

            Assignee: Xiaolin Ha
            Reporter: Xiaolin Ha