Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.23.2
Fix Version/s: None
Description
We have been observing this issue in NiFi 1.23.2 when we try to upgrade.
Since rolling upgrades are not supported in NiFi, we scale out the revision that is running and then run a helm upgrade.
We run NiFi in cluster mode on k8s, with a post job that calls the Tenants and Policies APIs. On a successful run, the job logs look like this:
set_policies() Action: 'read' Resource: '/flow' entity_id: 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' entity_type: 'USER'
set_policies() status: '200'
'read' '/flow' policy already exists. It will be updated...
set_policies() fetching policy inside -eq 200 status: '200'
set_policies() after update PUT: '200'
set_policies() Action: 'read' Resource: '/tenants' entity_id: 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' entity_type: 'USER'
set_policies() status: '200'
This job ran fine in 1.23.0, 1.22 and earlier versions. In 1.23.2, the job fails very frequently with the following error logs:
set_policies() Action: 'read' Resource: '/flow' entity_id: 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' entity_type: 'USER'
set_policies() status: '200'
'read' '/flow' policy already exists. It will be updated...
set_policies() fetching policy inside -eq 200 status: '200'
set_policies() after update PUT: '400'
An error occurred getting 'read' '/flow' policy: 'This node is disconnected from its configured cluster. The requested change will only be allowed if the flag to acknowledge the disconnected node is set.'
'This node is disconnected from its configured cluster. The requested change will only be allowed if the flag to acknowledge the disconnected node is set.'
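The 400 response above means the node that served the request was not connected to the cluster when the PUT arrived. One way the post job could tolerate a transient disconnection is to retry the update while that specific message is returned. The following is only a minimal sketch under assumptions: `put_policy` is a hypothetical callable standing in for however the job issues the PUT, assumed to return a `(status_code, body_text)` pair, and the attempt count and delay are arbitrary.

```python
import time
from typing import Callable, Tuple

# Substring of the error message quoted in the job log above.
DISCONNECTED_MARKER = "This node is disconnected from its configured cluster"


def is_disconnected_error(status: int, body: str) -> bool:
    """Return True when a response looks like the disconnected-node rejection."""
    return status == 400 and DISCONNECTED_MARKER in body


def put_with_retry(
    put_policy: Callable[[], Tuple[int, str]],  # hypothetical: performs the PUT
    attempts: int = 5,                          # arbitrary default
    delay_seconds: float = 10.0,                # arbitrary default
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call put_policy until it stops returning the disconnected-node error.

    Any other response, success or failure, is returned immediately.
    If every attempt is rejected, the last (status, body) pair is returned.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    status, body = put_policy()
    for _ in range(attempts - 1):
        if not is_disconnected_error(status, body):
            break
        sleep(delay_seconds)
        status, body = put_policy()
    return status, body
```

A retry like this would only help when the disconnection is temporary; it would not recover a node that stays disconnected.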
The job is configured to run only after all the pods are up and running. Although the pods are up, we see the following exception inside the pods:
org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1059)
at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:667)
at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:107)
at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:396)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.apache.nifi.controller.serialization.FlowSynchronizationException: java.lang.IllegalStateException: Cannot change destination of Connection because the current destination is running
at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:448)
at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:206)
at org.apache.nifi.controller.serialization.StandardFlowSynchronizer.sync(StandardFlowSynchronizer.java:42)
at org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1530)
at org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:104)
at org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:817)
at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1028)
... 4 common frames omitted
Caused by: java.lang.IllegalStateException: Cannot change destination of Connection because the current destination is running
at org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:310)
at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:700)
at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:405)
at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronizeChildGroups(StandardVersionedComponentSynchronizer.java:543)
at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:427)
at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:266)
at org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:550)
at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:261)
at org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3977)
at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:439)
... 10 common frames omitted
Screenshots of the UI are attached as well. We also observe this issue frequently when checking the cluster with the CLI command:
./cli.sh nifi cluster-summary -u https://nifi-headless.doc-norc.svc.cluster.local:9443 -ts /opt/nifi/cert_mgr/truststore.jks -tst jks -tsp changeit -ks /opt/nifi/cert_mgr/keystore.jks -kst jks -ksp changeit
Total node count: 0
Connected node count: 0
Clustered: true
Connected to cluster: false
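Because "pods are up" evidently does not imply "nodes are connected", the post job could check the cluster summary before touching policies. The sketch below parses output shaped like the four fields shown above and decides whether every node is connected. It assumes the CLI prints one `Key: value` pair per line; the function names and the `expected_nodes` parameter are illustrative and not part of NiFi.

```python
from typing import Dict


def parse_cluster_summary(output: str) -> Dict[str, str]:
    """Parse 'Key: value' lines from cluster-summary output into a dict."""
    summary: Dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        # Keep only lines that have a separator and a non-empty value.
        if sep and value.strip():
            summary[key.strip()] = value.strip()
    return summary


def all_nodes_connected(summary: Dict[str, str], expected_nodes: int) -> bool:
    """Return True only if this node is connected and every expected node is too."""
    return (
        summary.get("Connected to cluster") == "true"
        and int(summary.get("Connected node count", "0")) == expected_nodes
        and int(summary.get("Total node count", "0")) == expected_nodes
    )


# The output reported in this issue, one field per line.
SAMPLE = """Total node count: 0
Connected node count: 0
Clustered: true
Connected to cluster: false"""

if __name__ == "__main__":
    # expected_nodes=3 is an arbitrary example value.
    print(all_nodes_connected(parse_cluster_summary(SAMPLE), expected_nodes=3))
```

A job could poll this check with a timeout and only call the policy APIs once it returns true.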
Workaround we tried:
1. Exec into the pod that has the flow file issue and delete the flow file, so that it is removed from the PVC.
2. Exit from the pod.
3. Delete the pod that had the problem.
The pod will respawn, and the cluster coordinator will recreate the flow file from the connected nodes.
This connected all the nodes. However, it does not feel like an ideal solution, as we are seeing this issue quite often and cannot run this workaround every time.
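For reference, the manual steps above could be scripted roughly as follows. This is a sketch under assumptions rather than a recommended fix: the pod name and flow file path are hypothetical placeholders, and the namespace `doc-norc` is inferred from the service URL in the CLI command. The function only builds the `kubectl` invocations so that they can be reviewed before anything is deleted.

```python
import shlex
from typing import List


def workaround_commands(pod: str, namespace: str, flow_path: str) -> List[List[str]]:
    """Build the kubectl invocations for the manual workaround, without running them."""
    return [
        # Steps 1-2: remove the persisted flow file inside the pod (and so from the PVC).
        ["kubectl", "exec", pod, "-n", namespace, "--", "rm", "-f", flow_path],
        # Step 3: delete the pod so that it respawns and rejoins the cluster.
        ["kubectl", "delete", "pod", pod, "-n", namespace],
    ]


if __name__ == "__main__":
    # "nifi-0" and "/path/to/flow-file" are placeholders, not values from this report.
    for argv in workaround_commands("nifi-0", "doc-norc", "/path/to/flow-file"):
        print(shlex.join(argv))
```

Printing the commands instead of executing them keeps the destructive step under operator control.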
We also sometimes see this exception:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /nifi/leaders/Cluster Coordinator
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2480)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:243)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:232)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:94)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:229)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:220)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:42)
at org.apache.curator.framework.recipes.locks.LockInternals.getSortedChildren(LockInternals.java:155)
at org.apache.curator.framework.recipes.locks.LockInternals.getParticipantNodes(LockInternals.java:135)
at org.apache.curator.framework.recipes.locks.InterProcessMutex.getParticipantNodes(InterProcessMutex.java:170)
at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:336)
at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.getLeader(CuratorLeaderElectionManager.java:281)
at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.verifyLeader(CuratorLeaderElectionManager.java:572)
at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.isLeader(CuratorLeaderElectionManager.java:526)
at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$LeaderRole.isLeader(CuratorLeaderElectionManager.java:467)
at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.isLeader(CuratorLeaderElectionManager.java:262)
at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.isActiveClusterCoordinator(NodeClusterCoordinator.java:824)
at org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor.monitorHeartbeats(AbstractHeartbeatMonitor.java:132)
at org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor$1.run(AbstractHeartbeatMonitor.java:84)
at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)