Flink / FLINK-24303

SourceCoordinator exception may fail Session Cluster

    Description

      The SourceCoordinator currently forwards all exceptions from `Source#createEnumerator` up the stack, triggering a JobMaster failover. However, JobMaster failover only works if HA is enabled [1]. If HA is not enabled, the fatal error handler simply exits the JM process, killing the entire cluster. This is problematic for a session cluster, where multiple jobs may be running. It also does not play well with external tooling that does not expect a single job failure to cause a full cluster failure.
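The failure path described above can be modeled as a toy simulation. This is not Flink code: `SessionCluster`, `onUnhandledCoordinatorException`, and `Outcome` are hypothetical names, and the HA / non-HA branch only mirrors the behavior stated in this issue. It shows why, without HA, one job's enumerator failure takes every other job in the session down with it.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Toy model of the failure path described in this issue. All names are hypothetical;
// this is not Flink's Dispatcher/JobMaster code.
public class EnumeratorFailurePropagation {

    enum Outcome { JOB_MASTER_FAILOVER, PROCESS_EXIT }

    /** A session cluster: several jobs share one JobManager (JM) process. */
    static final class SessionCluster {
        final boolean haEnabled;
        final Set<String> runningJobs = new LinkedHashSet<>();
        final Set<String> failedOverJobs = new LinkedHashSet<>();
        boolean processAlive = true;

        SessionCluster(boolean haEnabled) {
            this.haEnabled = haEnabled;
        }

        /** Invoked when an exception (e.g. from createEnumerator) escapes a job's source coordinator. */
        Outcome onUnhandledCoordinatorException(String jobId) {
            if (haEnabled) {
                // With HA enabled, the failing job's JobMaster can fail over;
                // the other jobs in the session are untouched.
                failedOverJobs.add(jobId);
                return Outcome.JOB_MASTER_FAILOVER;
            }
            // Without HA, the fatal error handler exits the JM process,
            // so every job in the session cluster is lost, not only jobId.
            processAlive = false;
            runningJobs.clear();
            return Outcome.PROCESS_EXIT;
        }
    }

    public static void main(String[] args) {
        SessionCluster cluster = new SessionCluster(false); // HA disabled
        cluster.runningJobs.add("job-a");
        cluster.runningJobs.add("job-b");

        Outcome outcome = cluster.onUnhandledCoordinatorException("job-a");
        // Prints: PROCESS_EXIT alive=false jobs=[]
        System.out.println(outcome + " alive=" + cluster.processAlive + " jobs=" + cluster.runningJobs);
    }
}
```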

       

      It would be preferable if failure to create an enumerator did not take down the entire cluster, but instead failed that particular job. 
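A minimal sketch of the preferred behavior, assuming the coordinator has access to some callback that fails only its own job: enumerator creation is wrapped in a try/catch, and a failure is routed to that callback instead of escaping to the JobMaster. `Coordinator`, `EnumeratorFactory`, and the `failJob` callback here are hypothetical stand-ins, not Flink's `SourceCoordinator` API. Catching `Exception` but not `Error` (so that JVM-level failures such as `OutOfMemoryError` still propagate) is an assumed design choice for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Sketch of "fail the job, not the cluster". Names are hypothetical stand-ins,
// not Flink's SourceCoordinator API.
public class FailJobOnEnumeratorError {

    enum JobStatus { RUNNING, FAILED }

    /** Plays the role of Source#createEnumerator, which may throw. */
    @FunctionalInterface
    interface EnumeratorFactory<E> {
        E create() throws Exception;
    }

    static final class Coordinator<E> {
        private final EnumeratorFactory<E> factory;
        private final Consumer<Throwable> failJob; // fails only the owning job
        private E enumerator;

        Coordinator(EnumeratorFactory<E> factory, Consumer<Throwable> failJob) {
            this.factory = factory;
            this.failJob = failJob;
        }

        void start() {
            try {
                enumerator = factory.create();
            } catch (Exception e) {
                // Do not rethrow: route the failure to this job instead of letting it
                // escape to the JobMaster (and, without HA, to the fatal error handler).
                // Errors such as OutOfMemoryError are deliberately not caught here.
                failJob.accept(e);
                return;
            }
            // Normal startup with the created enumerator would continue here.
        }

        boolean hasEnumerator() {
            return enumerator != null;
        }
    }

    public static void main(String[] args) {
        Map<String, JobStatus> jobs = new LinkedHashMap<>();
        jobs.put("job-a", JobStatus.RUNNING);
        jobs.put("job-b", JobStatus.RUNNING);

        // job-a's source cannot create its enumerator.
        Coordinator<Object> broken = new Coordinator<>(
                () -> { throw new IllegalArgumentException("bad source config"); },
                cause -> jobs.put("job-a", JobStatus.FAILED));
        broken.start();

        // Prints: {job-a=FAILED, job-b=RUNNING}
        System.out.println(jobs);
    }
}
```

Only the job whose enumerator could not be created ends up `FAILED`; `job-b` in the same session keeps running.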

       

      [1] https://github.com/apache/flink/blob/7f69331294ab2ab73f77b40a4320cdda53246afe/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L898-L903

            People

              Stephan Ewen (sewen)
              Seth Wiesman (sjwiesman)
              Votes: 0
              Watchers: 12
