Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-18913

Gossip NPE due to shutdown event corrupting empty statuses

    XMLWordPrintableJSON

Details

    Description

      When an instance either disables gossip or shuts down we send a gossip shutdown message, peers ignore it if the endpoint isn’t known, else it mutates its local copy of the state to mark shutdown…
      When an instance restarts it populates gossip with the endpoints found in peers, but the state is empty (not null)

      So, there is a fun timing bug…

      • stop node1
      • start node1; at this point all known endpoints before exist in gossip but are empty
      • node2 shutdown (gossip shutdown or node, doesn’t matter)
      • node1 sees the shutdown before gossip messages, and gets corruptted
      • node3 tries to join the cluster, fails due to node1 being corrupted

      There are 2 different patterns the NPE can happen with, in this example node1 and node3 will have different stack traces

      org.apache.cassandra.distributed.shared.ShutdownException: Uncaught exceptions were thrown during test
      	Suppressed: java.lang.NullPointerException: Unable to get HOST_ID; HOST_ID is not defined, given EndpointState: HeartBeatState = HeartBeat: generation = 0, version = 2147483647, AppStateMap = {STATUS=Value(shutdown,true,37), RPC_READY=Value(false,38), STATUS_WITH_PORT=Value(shutdown,true,36)}
      		at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1218)
      		at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1208)
      		at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:3279)
      		at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2756)
      		at org.apache.cassandra.gms.Gossiper.markAsShutdown(Gossiper.java:611)
      		at org.apache.cassandra.gms.GossipShutdownVerbHandler.doVerb(GossipShutdownVerbHandler.java:39)
      		at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
      	Suppressed: java.lang.NullPointerException: Unable to get HOST_ID; HOST_ID is not defined, given EndpointState: HeartBeatState = HeartBeat: generation = 0, version = 2147483647, AppStateMap = {STATUS=Value(shutdown,true,37), RPC_READY=Value(false,38), STATUS_WITH_PORT=Value(shutdown,true,36)}
      		at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1218)
      		at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1208)
      		at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:3279)
      		at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2756)
      		at org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1762)
      		at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3793)
      		at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1465)
      		at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1678)
      		at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
      		at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
      
      

      Attachments

        Issue Links

          Activity

            People

              dcapwell David Capwell
              dcapwell David Capwell
              David Capwell
              Brandon Williams
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 50m
                  50m