  1. Hadoop HDFS
  2. HDFS-17231

HA: Safemode should exit when resources are from low to available

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.4, 3.3.6
    • Fix Version/s: 3.4.0
    • Component/s: ha
    • Hadoop Flags: Reviewed

    Description

      The NameNodeResourceMonitor automatically puts the NameNode into safemode when it detects that resources are insufficient. When zkfc detects insufficient resources, it triggers a failover. Consider the following scenario:

      • Initially, nn01 is active and nn02 is standby. Resources in dfs.namenode.name.dir on nn01 become insufficient, so the NameNodeResourceMonitor detects the problem and puts nn01 into safemode. zkfc then triggers a failover.
      • At this point, nn01 is standby with safemode ON, while nn02 is active with safemode OFF.
      • After a period of time, the resources in nn01's dfs.namenode.name.dir recover, which causes a slight instability and triggers a failover again.
      • Now nn01 is active with safemode ON, while nn02 is standby with safemode OFF.
      • Because nn01 is active but still in safemode, HDFS cannot be read from or written to.
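
The scenario above can be sketched as a small state model. This is a toy illustration only, not Hadoop code: the `NameNode` class, `monitor_tick`, `failover`, and the `exit_on_recovery` switch are all hypothetical names, and the monitor is assumed to run only on the active node (the reproduction notes say its thread stops on the switch to standby). With `exit_on_recovery=False` the model ends in the stuck state described here; with `True` it shows the behaviour the issue title asks for.

```python
from dataclasses import dataclass


@dataclass
class NameNode:
    """Toy stand-in for a NameNode; hypothetical, not Hadoop code."""
    name: str
    active: bool
    safemode: bool = False
    resources_ok: bool = True

    def monitor_tick(self, exit_on_recovery: bool) -> None:
        """One pass of a simplified resource monitor (active node only)."""
        if not self.active:
            return  # assumption: the monitor does not run on a standby node
        if not self.resources_ok:
            self.safemode = True  # low resources: enter safemode
        elif exit_on_recovery:
            self.safemode = False  # requested behaviour: leave safemode on recovery


def failover(a: NameNode, b: NameNode) -> None:
    """Swap the active and standby roles of two nodes."""
    a.active, b.active = b.active, a.active


def run_scenario(exit_on_recovery: bool) -> NameNode:
    nn01 = NameNode("nn01", active=True)
    nn02 = NameNode("nn02", active=False)

    nn01.resources_ok = False
    nn01.monitor_tick(exit_on_recovery)  # nn01 enters safemode
    failover(nn01, nn02)                 # nn01 standby, nn02 active

    nn01.resources_ok = True             # resources recover on nn01
    failover(nn01, nn02)                 # nn01 active again, safemode still ON
    nn01.monitor_tick(exit_on_recovery)  # monitor runs again on the active node
    return nn01


if __name__ == "__main__":
    for flag in (False, True):
        nn = run_scenario(flag)
        print(f"exit_on_recovery={flag}: active={nn.active}, safemode={nn.safemode}")
```

The model ignores every other reason a NameNode may be in safemode (for example block-report thresholds at startup), so it says nothing about how the actual fix distinguishes those cases.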

      Reproduction

      1. Increase dfs.namenode.resource.du.reserved.
      2. Increase ha.health-monitor.check-interval.ms. This avoids switching directly to standby, which would stop the NameNodeResourceMonitor thread. The NameNodeResourceMonitor needs to put the NameNode into safemode before the switch to standby happens.
      3. On the active node nn01, use the dd command to create a file large enough to exceed the threshold, triggering a low-available-disk-space condition.
      4. If the nn01 NameNode process does not die, nn01 ends up standby with safemode ON.
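
For step 3, a rough way to size the dd file is sketched below. It assumes the low-resource condition is "free space on the volume is below dfs.namenode.resource.du.reserved"; the helper names and the target path in the usage note are hypothetical, and real filesystems add overhead, so treat the result as an approximation.

```python
MIB = 1024 * 1024


def dd_count_mib(free_bytes: int, reserved_bytes: int) -> int:
    """Number of 1 MiB blocks to write so free space drops below reserved_bytes."""
    surplus = free_bytes - reserved_bytes
    if surplus < 0:
        return 0  # the volume is already below the threshold
    # Consume the whole surplus plus one more block to end strictly below it.
    return surplus // MIB + 1


def dd_command(target: str, count: int) -> str:
    """Build a dd invocation that writes `count` blocks of 1 MiB of zeros."""
    return f"dd if=/dev/zero of={target} bs=1M count={count}"


if __name__ == "__main__":
    # Example numbers only: 10 GiB free, 9 GiB reserved.
    count = dd_count_mib(10 * 1024 * MIB, 9 * 1024 * MIB)
    print(dd_command("/path/to/name/dir/filler.bin", count))
```

The free-space figure can be read with `shutil.disk_usage(path).free` on the volume that holds dfs.namenode.name.dir. The filler file should be deleted afterwards to let the resources recover.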

            People

              Assignee: kuper kuper
              Reporter: kuper kuper