Hadoop HDFS / HDFS-14503

ThrottledAsyncChecker throws NPE during block pool initialization


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • 3.3.0

    Description

      ThrottledAsyncChecker throws an NPE during block pool initialization. The error causes block pool registration to fail.

      The exception

      2019-05-20 01:02:36,003 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Unexpected exception in block pool Block pool <registering> (Datanode Uuid xxxxx) service to xx.xx.xx.xx/xx.xx.xx.xx
      java.lang.NullPointerException
              at org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker$LastCheckResult.access$000(ThrottledAsyncChecker.java:211)
              at org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker.schedule(ThrottledAsyncChecker.java:129)
              at org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker.checkAllVolumes(DatasetVolumeChecker.java:209)
              at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3387)
              at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1508)
              at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:319)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:272)
              at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:768)
              at java.lang.Thread.run(Thread.java:745)
      

      This error appears to occur because completedChecks, which is a WeakHashMap, has already removed the target entry by the time we try to get it. Although we check that the key is present before calling get, there is still a chance that get returns null.
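The check-then-get pattern described above can be sketched in isolation. This is a minimal illustration, not the actual ThrottledAsyncChecker code: the class NullSafeCheckLookup, the record type LastCheck, and both helper methods are hypothetical names introduced here. It contrasts a containsKey-then-get lookup, which dereferences whatever get returns, with a single get followed by a null check, which treats a missing entry as "never checked".

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical sketch; names and structure are assumptions, not Hadoop code.
class NullSafeCheckLookup {

    /** Hypothetical stand-in for a per-target "last check" record. */
    static final class LastCheck {
        final long completedAtMs;

        LastCheck(long completedAtMs) {
            this.completedAtMs = completedAtMs;
        }
    }

    /**
     * Fragile pattern: two separate map operations. If the entry disappears
     * between containsKey and get, 'last' is null and the field access
     * below throws NullPointerException.
     */
    static boolean shouldScheduleFragile(Map<Object, LastCheck> completed,
                                         Object target, long nowMs, long minGapMs) {
        if (completed.containsKey(target)) {
            LastCheck last = completed.get(target); // may be null
            return nowMs - last.completedAtMs >= minGapMs;
        }
        return true;
    }

    /**
     * Defensive pattern: one lookup, then a null check. A missing entry is
     * treated the same as a target that has never been checked.
     */
    static boolean shouldSchedule(Map<Object, LastCheck> completed,
                                  Object target, long nowMs, long minGapMs) {
        LastCheck last = completed.get(target);
        if (last == null) {
            return true; // no usable record, so schedule a check
        }
        return nowMs - last.completedAtMs >= minGapMs;
    }

    public static void main(String[] args) {
        Map<Object, LastCheck> completed = new WeakHashMap<>();
        Object volume = new Object(); // strong reference keeps the key alive

        System.out.println(shouldSchedule(completed, volume, 1000L, 500L));
        completed.put(volume, new LastCheck(900L));
        System.out.println(shouldSchedule(completed, volume, 1000L, 500L));
        System.out.println(shouldSchedule(completed, volume, 2000L, 500L));
    }
}
```

The defensive variant only removes the null dereference. WeakHashMap is not synchronized, so if the map is shared between threads, callers would still need external synchronization around reads and writes.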

      We hit a corner case for this: in federation mode, with two block pools on the DN, ThrottledAsyncChecker schedules two identical health checks for the same volume.

      2019-05-20 01:02:36,000 INFO org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: Scheduling a check for /hadoop/2/hdfs/data/current
      2019-05-20 01:02:36,000 INFO org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker: Scheduling a check for /hadoop/2/hdfs/data/current
      

      completedChecks cleans up the entry after completedChecks#get has been called for one successful check. When the other check then calls get, it receives null.


          People

            Assignee: Unassigned
            Reporter: Yiqun Lin (linyiqun)
            Votes: 0
            Watchers: 7
