Flink / FLINK-35159

CreatingExecutionGraph can leak CheckpointCoordinator and cause JM crash

Details

    Description

      When a TaskManager (TM) dies while the JobManager (JM) is generating an ExecutionGraph in the background, CreatingExecutionGraph#handleExecutionGraphCreation can transition back into WaitingForResources if the TM hosted one of the slots that we planned to use in tryToAssignSlots.

      At this point the ExecutionGraph has already been transitioned to running, which implicitly kicks off periodic checkpointing by the CheckpointCoordinator, even though the operator coordinator holders have not been initialized yet (this happens only after the slots have been assigned).
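
The ordering described above can be sketched as a simplified model. Everything in it (the class name `GraphCreationOrderSketch`, the method `handleGraphCreation`, and the event strings) is hypothetical and only mirrors the sequence of steps named in this description; it is not Flink's actual code.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of the step ordering; not Flink's implementation. */
public class GraphCreationOrderSketch {

    /**
     * Records the steps taken for one graph creation attempt.
     *
     * @param slotsStillAvailable false models a TM dying before slot assignment
     */
    public static List<String> handleGraphCreation(boolean slotsStillAvailable) {
        List<String> events = new ArrayList<>();

        // The graph is moved to running first, which starts periodic checkpointing.
        events.add("transitionToRunning");
        events.add("startCheckpointScheduler");

        // Slot assignment fails if a TM that hosted a planned slot is gone.
        if (!slotsStillAvailable) {
            // Nothing in this path stops the scheduler started above.
            events.add("goToWaitingForResources");
            return events;
        }

        // Coordinator holders are initialized only after slots were assigned.
        events.add("initializeOperatorCoordinators");
        events.add("goToExecuting");
        return events;
    }
}
```

In this model, `handleGraphCreation(false)` records a started scheduler but no coordinator initialization, which is the state the rest of this description is concerned with.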

      This effectively leaks that CheckpointCoordinator, including its timer thread, which keeps trying to trigger checkpoints; these attempts naturally fail.
      This can crash the JM, because a failed trigger results in OperatorCoordinatorHolder#abortCurrentTriggering being called, which fails with an NPE since the mainThreadExecutor has not been initialized yet.

      java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.NullPointerException
      	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$8(CheckpointCoordinator.java:707)
      	at java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
      	at java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
      	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
      	at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:610)
      	at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:910)
      	at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:829)
      Caused by: java.util.concurrent.CompletionException: java.lang.NullPointerException
      	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
      	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
      	at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:932)
      	at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
      	... 7 more
      Caused by: java.lang.NullPointerException
      	at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.abortCurrentTriggering(OperatorCoordinatorHolder.java:388)
      	at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
      	at java.base/java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085)
      	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:985)
      	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:961)
      	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:693)
      	at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
      	... 8 more
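
The shape of this failure can be reproduced in isolation. The following is a minimal sketch, not Flink code: `LeakedTimerSketch`, `HolderSketch`, and `runTriggerAttempt` are hypothetical names, and the holder is reduced to a single executor field that stays null until `lazyInitialize` is called. A timer thread fails a trigger future, and the `handle` stage then calls `abortCurrentTriggering` on the uninitialized holder.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LeakedTimerSketch {

    /** Hypothetical stand-in for an operator coordinator holder; not Flink's class. */
    public static final class HolderSketch {
        // Stays null until lazyInitialize is called (in the report: after slot assignment).
        private Executor mainThreadExecutor;

        public void lazyInitialize(Executor executor) {
            this.mainThreadExecutor = executor;
        }

        public void abortCurrentTriggering() {
            // Throws NullPointerException if the holder was never initialized.
            mainThreadExecutor.execute(() -> { });
        }
    }

    /** Runs one failing trigger attempt; returns the failure seen by the caller, or null. */
    public static Throwable runTriggerAttempt(HolderSketch holder) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            CompletableFuture<Void> trigger = new CompletableFuture<>();

            // On trigger failure, notify the holder (models the onTriggerFailure path).
            CompletableFuture<Void> handled = trigger.handle((ignored, failure) -> {
                if (failure != null) {
                    holder.abortCurrentTriggering();
                }
                return null;
            });

            // The timer thread fails the trigger, as no task is running in this model.
            timer.schedule(
                    () -> { trigger.completeExceptionally(new IllegalStateException("no tasks deployed")); },
                    10, TimeUnit.MILLISECONDS);

            handled.get(5, TimeUnit.SECONDS);
            return null;
        } catch (ExecutionException e) {
            return e.getCause();
        } catch (InterruptedException | TimeoutException e) {
            throw new IllegalStateException(e);
        } finally {
            timer.shutdownNow();
        }
    }
}
```

Calling `runTriggerAttempt(new HolderSketch())` returns a `NullPointerException` raised on the timer thread, while a holder that had `lazyInitialize(Runnable::run)` called first completes without a failure. The sketch only illustrates the failure mode; it does not show how the issue was resolved in Flink.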
      

            People

              chesnay Chesnay Schepler