  1. Apache Curator
  2. CURATOR-620

Double Leadership Issue while using Leader Latch Recipe


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.5.0, 5.2.0
    • Fix Version/s: None
    • Component/s: Recipes
    • Labels: None
    • Environment: Production

    Description

      While using the Curator LeaderLatch recipe in our application, we observed a potential issue in which two clients became leader at the same time (double leadership).

      Summary of the description below:

      • Our use case
      • Issue details
      • Timeline of events
      • Attached test code to reproduce the reported issue
      • A possible solution, on which we would like your suggestions

      Our use case:

      • Two clients compete for leadership using the Curator LeaderLatch recipe. A client becomes leader when LeaderLatchListener.isLeader() is called and gives up leadership when LeaderLatchListener.notLeader() is called

      Issue details:

      • When one of the clients receives two RECONNECTED connection state events in quick succession, the LeaderLatch event threads handling them interleave, so that the "latch node deletion" happens after the "client becomes leader" step. The client therefore still considers itself leader even though its latch node has been deleted
      • The other client, which is also trying to get leadership, creates its latch node, sees itself at the first index, and becomes leader as well
      • At this point, both clients are leaders

      Timeline of events:

      • Timeline for Client A, whose latch node is deleted while it still considers itself leader
        • At t1, 1st RECONNECTED event fired
        • At t2, [ EventThread of 1st RECONNECTED event ] Resets leadership (true -> false)
        • At t3, [ EventThread of 1st RECONNECTED event ] Fires “listener.notLeader()”
        • At t4, [ EventThread of 1st RECONNECTED event ] Deletes latch node
        • At t5, [ EventThread of 1st RECONNECTED event ] Creates new latch node
        • At t6, 2nd RECONNECTED event fired
        • At t7, [ EventThread of 2nd RECONNECTED event ] Resets leadership (false -> false), effectively a no-op
        • At t8, [ EventThread of 2nd RECONNECTED event ] Fires nothing, effectively a no-op
        • At t9, [ EventThread of 1st RECONNECTED event ] Gets children -> sorts them -> checks leadership -> sets leadership to true -> fires the “has become leader” listener event
        • At t10, [ EventThread of 2nd RECONNECTED event ] Deletes latch node (this deletes the very latch node with which Client A became leader in the previous step)
      • Timeline for Client B, which also becomes leader
        • At t11, Client B creates its latch node -> gets children -> sorts them -> checks leadership -> sets leadership to true -> fires the “has become leader” listener event

      This leaves both Client A and Client B as leader at the same time.

      As the timeline shows, over the period (t8 -> t10) Client A’s LeaderLatch EventThreads interleave with each other, so the latch node is deleted while the local state still assumes that the client is leader.
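
      The timeline above can be replayed as a small single-threaded sketch. It does not use Curator or ZooKeeper: the sorted set stands in for the latch node's siblings, and the class and method names (DoubleLeaderReplay, Client, createNode, deleteNode, checkLeadership) are hypothetical. It only illustrates the ordering of steps, not the real implementation.

```java
import java.util.TreeSet;

/** Toy, single-threaded replay of steps t1..t11; no Curator or ZooKeeper involved. */
final class DoubleLeaderReplay {
    /** Stand-in for the latch path's children, ordered by sequence number. */
    private static final TreeSet<Integer> latchNodes = new TreeSet<>();
    private static int nextSeq = 0;

    /** Hypothetical model of one client's local latch state. */
    static final class Client {
        boolean hasLeadership = false;
        Integer ourNode = null; // sequence number of our latch node, if any

        void createNode() {
            ourNode = nextSeq++;
            latchNodes.add(ourNode);
        }

        void deleteNode() {
            if (ourNode != null) {
                latchNodes.remove(ourNode);
                ourNode = null;
            }
        }

        /** Get children -> sort -> become leader if our node is first. */
        void checkLeadership() {
            if (ourNode != null && ourNode.equals(latchNodes.first())) {
                hasLeadership = true;
            }
        }
    }

    /** Replays the reported interleaving; returns {A is leader, B is leader}. */
    static boolean[] replay() {
        latchNodes.clear();
        nextSeq = 0;
        Client a = new Client();
        Client b = new Client();

        // Starting point: Client A holds leadership.
        a.createNode();
        a.checkLeadership();

        // Steps handled for the 1st RECONNECTED event.
        a.hasLeadership = false; // t2, t3: reset leadership, notLeader()
        a.deleteNode();          // t4
        a.createNode();          // t5

        // Steps for the 2nd RECONNECTED event, interleaved with the 1st.
        a.hasLeadership = false; // t7, t8: false -> false, no-op
        a.checkLeadership();     // t9 (1st event): A becomes leader again
        a.deleteNode();          // t10 (2nd event): deletes A's current node

        // t11: Client B creates its node and finds itself first.
        b.createNode();
        b.checkLeadership();

        return new boolean[] {a.hasLeadership, b.hasLeadership};
    }

    public static void main(String[] args) {
        boolean[] r = replay();
        System.out.println("A leader: " + r[0] + ", B leader: " + r[1]);
    }
}
```

      In this toy model both flags end up true after t11, which matches the double leadership described above.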

      Reproducing the issue:

      Possible Solution (where we would like to hear your thoughts/suggestions):

      • The current Curator code in reset() calls
        • setLeadership(false) first followed by
        • setNode(null) (i.e. deleting its latch node)
      • Swapping these two calls should resolve the issue, because leadership would then be set to false only after the latch node has been deleted
        • setNode(null) (i.e. deleting its latch node) first followed by
        • setLeadership(false)
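
      As a rough check of this proposal, the sketch below tries every position at which the first event's leadership check (t9) can fall relative to the two reset steps of the second event, for both orderings. It is a toy model under stated assumptions, not Curator code: the names (ResetOrderSketch, stalePositions) are hypothetical, Client A's node is assumed to sort first whenever it exists, and only this one interleaving is covered.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of Client A's local state during the 2nd reset; not Curator code. */
final class ResetOrderSketch {
    private boolean hasLeadership = false; // already cleared by the 1st reset
    private boolean nodeExists = true;     // t5: latch node was re-created

    private void setLeadershipFalse() { hasLeadership = false; }

    private void deleteNode() { nodeExists = false; }

    /** 1st event's check (t9); assumes our node sorts first if it exists. */
    private void checkLeadership() {
        if (nodeExists) {
            hasLeadership = true;
        }
    }

    /** Stale leadership: still leader locally, but the latch node is gone. */
    private boolean isStale() { return hasLeadership && !nodeExists; }

    /**
     * Runs the 2nd reset's two steps with the check inserted before step 0,
     * between the steps (1), or after both (2). Returns the positions that
     * end in stale leadership.
     */
    static List<Integer> stalePositions(boolean deleteNodeFirst) {
        List<Integer> bad = new ArrayList<>();
        for (int pos = 0; pos <= 2; pos++) {
            ResetOrderSketch a = new ResetOrderSketch();
            for (int step = 0; step <= 2; step++) {
                if (step == pos) {
                    a.checkLeadership();
                }
                if (step == 2) {
                    break; // both reset steps are done
                }
                boolean deleteNow = (step == 0) == deleteNodeFirst;
                if (deleteNow) {
                    a.deleteNode();
                } else {
                    a.setLeadershipFalse();
                }
            }
            if (a.isStale()) {
                bad.add(pos);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println("current order: " + stalePositions(false)); // [1]
        System.out.println("swapped order: " + stalePositions(true));  // []
    }
}
```

      In this model only the current order leaves stale leadership, and only when the check lands between the two steps. That is consistent with the proposal, but it does not rule out other races in the real background callbacks.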

       

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Viswanathan Rajagopal