Apache NiFi / NIFI-12232

Frequent "failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption"


Details

    Description

      This is an issue that we have been observing in NiFi 1.23.2 when we try to upgrade.

      Since rolling upgrades are not supported in NiFi, we scale out the running revision and run a helm upgrade.

      We have NiFi running in k8s cluster mode, and there is a post-upgrade job that calls the Tenants and Policies APIs. On a successful run the output looks like this:

      set_policies() Action: 'read' Resource: '/flow' entity_id: 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' entity_type: 'USER'
      set_policies() status: '200'
      'read' '/flow' policy already exists. It will be updated...
      set_policies() fetching policy inside -eq 200 status: '200'
      set_policies() after update PUT: '200'
      set_policies() Action: 'read' Resource: '/tenants' entity_id: 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' entity_type: 'USER'
      set_policies() status: '200'
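
      For context, here is a minimal sketch of what one set_policies() iteration does. The helper names and curl flags are our assumptions; only the GET /nifi-api/policies/{action}{resource} endpoint is from NiFi's documented REST API.

      ```shell
      #!/usr/bin/env bash
      # Hypothetical reconstruction of one set_policies() iteration.
      # NiFi exposes GET /nifi-api/policies/{action}{resource} to look up the
      # access policy for an action+resource pair (e.g. read /flow).
      policy_url() {
        local base="$1" action="$2" resource="$3"
        printf '%s/nifi-api/policies/%s%s' "$base" "$action" "$resource"
      }

      fetch_policy_status() {
        # -k is used only because the cluster CA is private in this sketch;
        # production jobs should pass --cacert instead.
        local base="$1" action="$2" resource="$3"
        curl -sk -o /dev/null -w '%{http_code}' "$(policy_url "$base" "$action" "$resource")"
      }

      # Prints the URL the job would hit for the 'read' '/flow' policy:
      policy_url "https://nifi-headless.doc-norc.svc.cluster.local:9443" read /flow
      ```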

      This job ran fine in 1.23.0, 1.22, and earlier versions. In 1.23.2 we are noticing that the job fails very frequently with the following error logs:

      set_policies() Action: 'read' Resource: '/flow' entity_id: 'ad2d3ad6-5d69-3e0f-95e9-c7feb36e2de5' entity_name: 'CN=nifi-api-admin' entity_type: 'USER'
      set_policies() status: '200'
      'read' '/flow' policy already exists. It will be updated...
      set_policies() fetching policy inside -eq 200 status: '200'
      set_policies() after update PUT: '400'
      An error occurred getting 'read' '/flow' policy: 'This node is disconnected from its configured cluster. The requested change will only be allowed if the flag to acknowledge the disconnected node is set.'


      The job is configured to run only after all the pods are up and running. Though the pods are up, we see this exception inside the pods:

      org.apache.nifi.controller.serialization.FlowSynchronizationException: Failed to connect node to cluster because local flow controller partially updated. Administrator should disconnect node and review flow for corruption.
      at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1059)
      at org.apache.nifi.controller.StandardFlowService.handleReconnectionRequest(StandardFlowService.java:667)
      at org.apache.nifi.controller.StandardFlowService.access$200(StandardFlowService.java:107)
      at org.apache.nifi.controller.StandardFlowService$1.run(StandardFlowService.java:396)
      at java.base/java.lang.Thread.run(Thread.java:833)
      Caused by: org.apache.nifi.controller.serialization.FlowSynchronizationException: java.lang.IllegalStateException: Cannot change destination of Connection because the current destination is running
      at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:448)
      at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.sync(VersionedFlowSynchronizer.java:206)
      at org.apache.nifi.controller.serialization.StandardFlowSynchronizer.sync(StandardFlowSynchronizer.java:42)
      at org.apache.nifi.controller.FlowController.synchronize(FlowController.java:1530)
      at org.apache.nifi.persistence.StandardFlowConfigurationDAO.load(StandardFlowConfigurationDAO.java:104)
      at org.apache.nifi.controller.StandardFlowService.loadFromBytes(StandardFlowService.java:817)
      at org.apache.nifi.controller.StandardFlowService.loadFromConnectionResponse(StandardFlowService.java:1028)
      ... 4 common frames omitted
      Caused by: java.lang.IllegalStateException: Cannot change destination of Connection because the current destination is running
      at org.apache.nifi.connectable.StandardConnection.setDestination(StandardConnection.java:310)
      at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.updateConnectionDestinations(StandardVersionedComponentSynchronizer.java:700)
      at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:405)
      at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronizeChildGroups(StandardVersionedComponentSynchronizer.java:543)
      at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:427)
      at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.lambda$synchronize$0(StandardVersionedComponentSynchronizer.java:266)
      at org.apache.nifi.controller.flow.AbstractFlowManager.withParameterContextResolution(AbstractFlowManager.java:550)
      at org.apache.nifi.flow.synchronization.StandardVersionedComponentSynchronizer.synchronize(StandardVersionedComponentSynchronizer.java:261)
      at org.apache.nifi.groups.StandardProcessGroup.synchronizeFlow(StandardProcessGroup.java:3977)
      at org.apache.nifi.controller.serialization.VersionedFlowSynchronizer.synchronizeFlow(VersionedFlowSynchronizer.java:439)
      ... 10 common frames omitted

      Attaching screenshots of the UI as well. The issue is also frequently visible via the CLI:

      ./cli.sh nifi cluster-summary -u https://nifi-headless.doc-norc.svc.cluster.local:9443 -ts /opt/nifi/cert_mgr/truststore.jks -tst jks -tsp changeit -ks /opt/nifi/cert_mgr/keystore.jks -kst jks -ksp changeit
      Total node count: 0
      Connected node count: 0
      Clustered: true
      Connected to cluster: false
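
      The connected/total counts that cluster-summary prints are also exposed over REST at GET /nifi-api/flow/cluster/summary, so the post job could gate itself on cluster convergence before issuing any PUT. A minimal sketch, assuming the response shape {"clusterSummary":{"connectedNodeCount":N,"totalNodeCount":N,...}}; the helper name and sed-based parsing are ours:

      ```shell
      #!/usr/bin/env bash
      # Succeeds only when every known node is connected and at least one node
      # exists -- exactly the condition the failing run violates
      # (Total node count: 0, Connected node count: 0).
      all_connected() {
        local json="$1" connected total
        connected=$(printf '%s' "$json" | sed -n 's/.*"connectedNodeCount":\([0-9][0-9]*\).*/\1/p')
        total=$(printf '%s' "$json" | sed -n 's/.*"totalNodeCount":\([0-9][0-9]*\).*/\1/p')
        [ -n "$connected" ] && [ "$connected" -gt 0 ] && [ "$connected" = "$total" ]
      }

      # Usage inside the post job (curl flags are assumptions, as above):
      # until all_connected "$(curl -sk "$NIFI_URL/nifi-api/flow/cluster/summary")"; do sleep 10; done
      ```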

       
      We tried a workaround:

      1. Exec into the pod that has the flow file issue and delete the flow file, so that it is removed from the PVC
      2. Exit the pod
      3. Delete the pod that had the problem

      The pod respawns and the cluster coordinator recreates the flow file from the connected nodes.
      This reconnected all the nodes, but it does not feel like an ideal solution: we are seeing this issue quite often and cannot run this workaround every time.
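
      The manual steps above can be sketched as a script. The pod name is hypothetical, the namespace is taken from the cluster URL above, and the flow location assumes the NiFi 1.x Docker default ($NIFI_HOME/conf); DRY_RUN=1 by default so the steps can be reviewed before execution.

      ```shell
      #!/usr/bin/env bash
      # Sketch of the workaround: wipe the partially-updated flow and recycle the pod.
      POD="${1:-nifi-2}"          # hypothetical pod name; pass the affected pod
      NS="${2:-doc-norc}"         # namespace from the cluster URL in this issue
      FLOW_DIR="/opt/nifi/nifi-current/conf"   # assumed default flow location

      run() {
        # With DRY_RUN=1 (the default) only print the command; otherwise execute it.
        if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
      }

      # 1./2. Remove the flow from the pod's PVC (covers both XML and JSON formats)
      run kubectl exec -n "$NS" "$POD" -- rm -f "$FLOW_DIR/flow.xml.gz" "$FLOW_DIR/flow.json.gz"
      # 3. Delete the pod; on respawn it inherits the flow from the cluster coordinator
      run kubectl delete pod -n "$NS" "$POD"
      ```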

       

      We also sometimes see this exception:

      org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /nifi/leaders/Cluster Coordinator
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
              at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
              at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2480)
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:243)
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:232)
              at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:94)
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:229)
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:220)
              at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:42)
              at org.apache.curator.framework.recipes.locks.LockInternals.getSortedChildren(LockInternals.java:155)
              at org.apache.curator.framework.recipes.locks.LockInternals.getParticipantNodes(LockInternals.java:135)
              at org.apache.curator.framework.recipes.locks.InterProcessMutex.getParticipantNodes(InterProcessMutex.java:170)
              at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:336)
              at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.getLeader(CuratorLeaderElectionManager.java:281)
              at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.verifyLeader(CuratorLeaderElectionManager.java:572)
              at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.isLeader(CuratorLeaderElectionManager.java:526)
              at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$LeaderRole.isLeader(CuratorLeaderElectionManager.java:467)
              at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.isLeader(CuratorLeaderElectionManager.java:262)
              at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.isActiveClusterCoordinator(NodeClusterCoordinator.java:824)
              at org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor.monitorHeartbeats(AbstractHeartbeatMonitor.java:132)
              at org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor$1.run(AbstractHeartbeatMonitor.java:84)
              at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
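
      The ConnectionLoss during Cluster Coordinator leader election suggests the embedded Curator client is losing its ZooKeeper session while pods restart. One mitigation worth trying is raising the ZooKeeper timeouts in nifi.properties; the property names are from the NiFi Administration Guide, but the values below are illustrative, not recommendations:

      ```properties
      # nifi.properties – ZooKeeper client tuning (illustrative values)
      nifi.zookeeper.connect.timeout=30 secs
      nifi.zookeeper.session.timeout=30 secs
      ```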

      Attachments

        1. image-2023-10-16-16-12-31-027.png (113 kB, John Joseph)
        2. image-2024-02-14-13-33-44-354.png (50 kB, René Zeidler)


          People

            Assignee: markap14 Mark Payne
            Reporter: scoutjohn John Joseph
            Votes: 1
            Watchers: 6


              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 50m