Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-34006

Flink terminates the execution of an application when there is a network problem between TaskManagers

Agile BoardRank to TopRank to BottomAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsAdd voteVotersWatch issueWatchersCreate sub-taskConvert to sub-taskLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      Flink terminates an application when two TaskManager are disconnected although there are enough resources in the cluster to run the application and we use checkpoint restart.

       

      We deploy Flink v(1.17.1) on a cluster of six nodes with Ubuntu 18.04, the cluster consists of a JobManager and five TaskManagers. We use Flink's Standalone resource manager. We set the number of slots per TaskManager to one, and submit a WordCount application with a level of parallelism equal to three. We enable Flink checkpointing and restart failover strategy to attempt a restart in case of failure three times before termination and the time between attempts to 10 seconds.

      The application starts running on the first 3 TaskManager.

      If the communication is broken between two of the TaskManager that run the application, the job fails, and the JobManager tries to restart the job again. When the job fails the resources on the TaskManager are free. When the JobManager restarts the job, it selects the same three TaskManager it choose in the first attempt, and the job fails again. After three trials, Flink terminates the job with an exception: Connecting to remote task manager has failed.

       

      These are the JobManager Configurations:

      • taskmanager.numberOfTaskSlots: 1
      • Enable checkpointing: --checkpointing
      • execution.checkpointing.interval: 3min
      • Enabling restart failover strategy
      • restart-strategy.type: fixed-delay
      • restart-strategy.fixed-delay.attempts: 3
      • restart-strategy.fixed-delay.delay: 10 s

       

      command: ./bin/flink run -p 3 examples/streaming/WordCount.jar --checkpointing --input ~/flink/alice.txt

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            sophia1997 Sophia

            Dates

              Created:
              Updated:

              Slack

                Issue deployment