Flink / FLINK-30444

State recovery error not handled correctly and always causes JM failure


Details

    Description

      When you submit a job in Application mode and try to restore from an incompatible savepoint, the behaviour is very unexpected.

      Even with the following config:

      execution.shutdown-on-application-finish: false
      execution.submit-failed-job-on-application-error: true

      The job goes into a FAILED state and the JobManager fails. In a Kubernetes environment (when using the native Kubernetes integration) this means the JobManager is restarted automatically.

      This means that if you have the JobResultStore enabled, you end up with an empty application cluster after the JM comes back.

      I think the correct behaviour, depending on the above-mentioned config, would be:

      1. If there is a job recovery error and execution.submit-failed-job-on-application-error is configured, the job should show up as FAILED and the JM should not exit (if execution.shutdown-on-application-finish is false).
      2. If execution.shutdown-on-application-finish is true, the JobManager should exit cleanly, as on a normal job terminal state, and thus stop the deployment in Kubernetes, preventing a JM restart cycle.
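      For reference, a minimal flink-conf.yaml sketch of the setup that triggers the reported behaviour (both keys are taken from the description above; the comments describe the intent of each option, not the buggy behaviour observed):

      ```yaml
      # Keep the application cluster alive after the job reaches a terminal
      # state, so a FAILED job remains visible via the UI / REST API.
      execution.shutdown-on-application-finish: false

      # On a job submission/recovery error (e.g. an incompatible savepoint),
      # register the job as FAILED instead of failing the JobManager process.
      execution.submit-failed-job-on-application-error: true
      ```

      With this configuration, case 1 above is what one would expect; instead, the JobManager currently exits and is restarted by Kubernetes.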


            People

              dmvk David Morávek
              gyfora Gyula Fora