FLINK-34416

"Local recovery and sticky scheduling end-to-end test" still doesn't work with AdaptiveScheduler


Details

    • Type: Technical Debt
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.19.0, 1.18.1, 1.20.0
    • Fix Version/s: None
    • Component/s: Runtime / Coordination

    Description

      In FLINK-34409 we tried to enable all AdaptiveScheduler-related tests because every Jira issue referenced for them appeared to be resolved. That is not the case for the "Local recovery and sticky scheduling end-to-end test", though.

      With the AdaptiveScheduler enabled, the test runs forever because a NullPointerException keeps triggering a task failure:

      Feb 07 19:02:59 2024-02-07 19:02:21,706 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Flat Map -> Sink: Unnamed (3/4) (54075d3d22edb729e5f396726f777860_20ba6b65f97481d5570070de90e4e791_2_16292) switched from INITIALIZING to FAILED on localhost:40893-09ff7>
      Feb 07 19:02:59 java.lang.NullPointerException: Expected to find info here.
      Feb 07 19:02:59         at org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:76) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      Feb 07 19:02:59         at org.apache.flink.streaming.tests.StickyAllocationAndLocalRecoveryTestJob$StateCreatingFlatMap.initializeState(StickyAllocationAndLocalRecoveryTestJob.java:340) ~[?:?]
      Feb 07 19:02:59         at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:187) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      Feb 07 19:02:59         at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:169) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      Feb 07 19:02:59         at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      Feb 07 19:02:59         at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.initializeOperatorState(StreamOperatorStateHandler.java:134) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      Feb 07 19:02:59         at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:285) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      Feb 07 19:02:59         at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      Feb 07 19:02:59         at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateAndGates(StreamTask.java:799) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      Feb 07 19:02:59         at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$restoreInternal$3(StreamTask.java:753) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      Feb 07 19:02:59         at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      Feb 07 19:02:59         at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:753) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      Feb 07 19:02:59         at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:712) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      Feb 07 19:02:59         at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      Feb 07 19:02:59         at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      Feb 07 19:02:59         at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:751) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      Feb 07 19:02:59         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) ~[flink-dist-1.20-SNAPSHOT.jar:1.20-SNAPSHOT]
      Feb 07 19:02:59         at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_402]
      

      The error is raised by the Preconditions.checkNotNull call at StickyAllocationAndLocalRecoveryTestJob.java:340, inside StateCreatingFlatMap.initializeState.
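
To illustrate the failure mode, here is a minimal, self-contained sketch of a checkNotNull-style precondition that throws the NullPointerException seen in the log when an expected entry is missing after restore. Everything in it is an assumption made for illustration: the class PreconditionSketch, the TaskInfo type, the findInfo helper, and the lookup by task name are hypothetical and are not taken from the test job, and Objects.requireNonNull stands in for Flink's Preconditions.checkNotNull so that the example runs without Flink on the classpath. It does not claim to explain why the info is missing under the AdaptiveScheduler.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class PreconditionSketch {

    /** Hypothetical stand-in for the per-subtask info restored from state. */
    static final class TaskInfo {
        final String taskName;

        TaskInfo(String taskName) {
            this.taskName = taskName;
        }
    }

    /** Returns the restored entry for taskName, or fails if there is none. */
    static TaskInfo findInfo(List<TaskInfo> restored, String taskName) {
        TaskInfo match = null;
        for (TaskInfo info : restored) {
            if (info.taskName.equals(taskName)) {
                match = info;
            }
        }
        // Stand-in for Preconditions.checkNotNull: throws a
        // NullPointerException carrying the message when match is null.
        return Objects.requireNonNull(match, "Expected to find info here.");
    }

    public static void main(String[] args) {
        List<TaskInfo> restored =
                Arrays.asList(new TaskInfo("Flat Map (1/4)"), new TaskInfo("Flat Map (2/4)"));

        // An entry that exists is returned normally.
        System.out.println(findInfo(restored, "Flat Map (1/4)").taskName);

        // A missing entry trips the precondition.
        try {
            findInfo(restored, "Flat Map (3/4)");
        } catch (NullPointerException e) {
            System.out.println("NPE: " + e.getMessage());
        }
    }
}
```

If initialization fails this way on every restart attempt, the task never leaves the INITIALIZING to FAILED cycle, which would be consistent with the test running forever.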

          People

            Assignee: Unassigned
            Reporter: Matthias Pohl (mapohl)
            Votes: 0
            Watchers: 1
