Apache NiFi / NIFI-12232

Frequent "failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption"


Details

    Description

      This is an issue that we have been observing in NiFi 1.23.2 when we try to upgrade.

      Since rolling upgrades are not supported in NiFi, we scale out the running revision and run a helm upgrade.

      We have NiFi running in k8s cluster mode, and there is a post-upgrade job that calls the Tenants and Policies APIs. On a successful run the output looks like this:

      set_policies() Action: 'read' Resource: '/flow' entity_id: 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' entity_type: 'USER'
      set_policies() status: '200'
      'read' '/flow' policy already exists. It will be updated...
      set_policies() fetching policy inside -eq 200 status: '200'
      set_policies() after update PUT: '200'
      set_policies() Action: 'read' Resource: '/tenants' entity_id: 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' entity_type: 'USER'
      set_policies() status: '200'
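
      For context, here is a minimal sketch of what one set_policies() iteration does. The helper names and curl flags are our assumptions; only the GET /nifi-api/policies/{action}{resource} endpoint is from NiFi's documented REST API.

      ```shell
      #!/usr/bin/env bash
      # Hypothetical reconstruction of one set_policies() iteration.
      # NiFi exposes GET /nifi-api/policies/{action}{resource} to look up the
      # access policy for an action+resource pair (e.g. read /flow).
      policy_url() {
        local base="$1" action="$2" resource="$3"
        printf '%s/nifi-api/policies/%s%s' "$base" "$action" "$resource"
      }

      fetch_policy_status() {
        # -k is used only because the cluster CA is private in this sketch;
        # production jobs should pass --cacert instead.
        local base="$1" action="$2" resource="$3"
        curl -sk -o /dev/null -w '%{http_code}' "$(policy_url "$base" "$action" "$resource")"
      }

      # Prints the URL the job would hit for the 'read' '/flow' policy:
      policy_url "https://nifi-headless.doc-norc.svc.cluster.local:9443" read /flow
      ```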

      This job ran fine in 1.23.0, 1.22, and earlier versions. In 1.23.2 we are noticing that the job fails very frequently with the following error logs:

      set_policies() Action: 'read' Resource: '/flow' entity_id: 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' entity_type: 'USER'
      set_policies() status: '200'
      'read' '/flow' policy already exists. It will be updated...
      set_policies() fetching policy inside -eq 200 status: '200'
      set_policies() after update PUT: '400'
      An error occurred getting 'read' '/flow' policy: 'This node is disconnected from its configured cluster. The requested change will only be allowed if the flag to acknowledge the disconnected node is set.'


      The job is configured to run only after all the pods are up and running. Though the pods are up, we see this exception inside the pods:

      org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
      at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1059)
      at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:667)
      at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:107)
      at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:396)
      at java.base/java.lang.Thread.run(Thread.java:833)
      Caused by: org.apache.nifi.controller.serialization.FlowSynchronizationException: java.lang.IllegalStateException: Cannot change destination of Connection because the current destination is running
      at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:448)
      at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:206)
      at org.apache.nifi.controller.serialization.StandardFlowSynchronizer.sync(StandardFlowSynchronizer.java:42)
      at org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1530)
      at org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:104)
      at org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:817)
      at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1028)
      ... 4 common frames omitted
      Caused by: java.lang.IllegalStateException: Cannot change destination of Connection because the current destination is running
      at org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:310)
      at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:700)
      at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:405)
      at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronizeChildGroups(StandardVersionedComponentSynchronizer.java:543)
      at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:427)
      at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:266)
      at org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:550)
      at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:261)
      at org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3977)
      at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:439)
      ... 10 common frames omitted

      Attaching screenshots of the UI as well. The issue is also frequently visible via the CLI:

      ./cli.sh nifi cluster-summary -u https://nifi-headless.doc-norc.svc.cluster.local:9443 -ts /opt/nifi/cert_mgr/truststore.jks -tst jks -tsp changeit -ks /opt/nifi/cert_mgr/keystore.jks -kst jks -ksp changeit
      Total node count: 0
      Connected node count: 0
      Clustered: true
      Connected to cluster: false
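
      The connected/total counts that cluster-summary prints are also exposed over REST at GET /nifi-api/flow/cluster/summary, so the post job could gate itself on cluster convergence before issuing any PUT. A minimal sketch, assuming the response shape {"clusterSummary":{"connectedNodeCount":N,"totalNodeCount":N,...}}; the helper name and sed-based parsing are ours:

      ```shell
      #!/usr/bin/env bash
      # Succeeds only when every known node is connected and at least one node
      # exists -- exactly the condition the failing run violates
      # (Total node count: 0, Connected node count: 0).
      all_connected() {
        local json="$1" connected total
        connected=$(printf '%s' "$json" | sed -n 's/.*"connectedNodeCount":\([0-9][0-9]*\).*/\1/p')
        total=$(printf '%s' "$json" | sed -n 's/.*"totalNodeCount":\([0-9][0-9]*\).*/\1/p')
        [ -n "$connected" ] && [ "$connected" -gt 0 ] && [ "$connected" = "$total" ]
      }

      # Usage inside the post job (curl flags are assumptions, as above):
      # until all_connected "$(curl -sk "$NIFI_URL/nifi-api/flow/cluster/summary")"; do sleep 10; done
      ```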

       
      We tried a workaround:

      1. Exec into the pod that has the flow file issue and delete the flow file, so that it is removed from the PVC
      2. Exit the pod
      3. Delete the pod that had the problem

      The pod respawns and the cluster coordinator recreates the flow file from the connected nodes.
      This reconnected all the nodes, but it does not feel like an ideal solution: we are seeing this issue quite often and cannot run this workaround every time.
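
      The manual steps above can be sketched as a script. The pod name is hypothetical, the namespace is taken from the cluster URL above, and the flow location assumes the NiFi 1.x Docker default ($NIFI_HOME/conf); DRY_RUN=1 by default so the steps can be reviewed before execution.

      ```shell
      #!/usr/bin/env bash
      # Sketch of the workaround: wipe the partially-updated flow and recycle the pod.
      POD="${1:-nifi-2}"          # hypothetical pod name; pass the affected pod
      NS="${2:-doc-norc}"         # namespace from the cluster URL in this issue
      FLOW_DIR="/opt/nifi/nifi-current/conf"   # assumed default flow location

      run() {
        # With DRY_RUN=1 (the default) only print the command; otherwise execute it.
        if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
      }

      # 1./2. Remove the flow from the pod's PVC (covers both XML and JSON formats)
      run kubectl exec -n "$NS" "$POD" -- rm -f "$FLOW_DIR/flow.xml.gz" "$FLOW_DIR/flow.json.gz"
      # 3. Delete the pod; on respawn it inherits the flow from the cluster coordinator
      run kubectl delete pod -n "$NS" "$POD"
      ```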

       

      We also sometimes see this exception:

      org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /nifi/leaders/Cluster Coordinator
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
              at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2480)
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:243)
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:232)
              at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:94)
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:229)
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:220)
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:42)
              at org.apache.curator.framework.recipes.locks.LockInternals.getSortedChildren(LockInternals.java:155)
              at org.apache.curator.framework.recipes.locks.LockInternals.getParticipantNodes(LockInternals.java:135)
              at org.apache.curator.framework.recipes.locks.InterProcessMutex.getParticipantNodes(InterProcessMutex.java:170)
              at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:336)
              at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.getLeader(CuratorLeaderElectionManager.java:281)
              at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.verifyLeader(CuratorLeaderElectionManager.java:572)
              at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.isLeader(CuratorLeaderElectionManager.java:526)
              at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$LeaderRole.isLeader(CuratorLeaderElectionManager.java:467)
              at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.isLeader(CuratorLeaderElectionManager.java:262)
              at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.isActiveClusterCoordinator(NodeClusterCoordinator.java:824)
              at org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor.monitorHeartbeats(AbstractHeartbeatMonitor.java:132)
              at org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor$1.run(AbstractHeartbeatMonitor.java:84)
              at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
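
      The ConnectionLoss during Cluster Coordinator leader election suggests the embedded Curator client is losing its ZooKeeper session while pods restart. One mitigation worth trying is raising the ZooKeeper timeouts in nifi.properties; the property names are from the NiFi Administration Guide, but the values below are illustrative, not recommendations:

      ```properties
      # nifi.properties – ZooKeeper client tuning (illustrative values)
      nifi.zookeeper.connect.timeout=30 secs
      nifi.zookeeper.session.timeout=30 secs
      ```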

      Attachments

        1. image-2023-10-16-16-12-31-027.png (113 kB, John Joseph)
        2. image-2024-02-14-13-33-44-354.png (50 kB, René Zeidler)


          People

            Assignee: markap14 Mark Payne
            Reporter: scoutjohn John Joseph
            Votes: 1
            Watchers: 6


              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 50m