Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-13184

Starting a TaskExecutor blocks the YarnResourceManager's main thread

    XMLWordPrintableJSON

Details

    Description

      Currently, YarnResourceManager starts all task executors in main thread. This could cause RM to become unresponsive when launching a large number of TEs (e.g. > 1000) because it involves blocking I/O operations (writing files to HDFS, communicating with the node manager using a synchronous NMClient). As a consequence, TE registration/heartbeat timeouts can occur and Flink might allocate too many excessive containers (see FLINK-12342) because it cannot process the YarnResourceManager#onContainersAllocated calls.

      There are different solution approaches but the end goal should be to not execute any blocking calls in the ResourceManager's main thread:

      1. Start the TaskExecutors from a different thread (potentially thread pool) which is responsible for uploading the files and communicating with the NodeManager
      2. Don't upload files (avoid blocking file system operations) and use the NMClientAsync for the communication with Yarn's NodeManager.
      3. Upload files in a separate I/O thread and use the NMClientAsync for the communication with Yarn's NodeManager.

      Attachments

        Issue Links

          Activity

            People

              wangyang0918 Yang Wang
              xtsong Xintong Song
              Votes:
              2 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 20m
                  1h 20m