  1. Oozie
  2. OOZIE-3670

Actions can get stuck while running in a Fork-Join workflow

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.2.1
    • Fix Version/s: 5.3.0
    • Component/s: core
    • Labels: None

    Description

      A fork node splits one path of execution into multiple concurrent paths, and a join node waits until every concurrent path of the preceding fork node arrives at it. If one of the paths [action] fails for an unusual reason (in our case, see the attachments, an EL error), the workflow job itself fails as well. However, the other actions running in parallel under the same workflow job get stuck in the RUNNING state until they are purged, which can slow Oozie down in extreme cases.

      This behaviour can be reproduced using the attached forkjoin.xml, job.properties, and helloworld.sh.
      In the above workflow, [action2] will fail due to ELError because

      <value>${variableThatWillCauseELError}</value> 

      cannot be evaluated. Meanwhile, [action1] tries to complete but remains in the RUNNING state.
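
      For illustration, a workflow of roughly the following shape would match the scenario described. This is a hypothetical sketch, not the attached forkjoin.xml: the use of shell actions, the schema versions, the property name, and every node name other than action1 and action2 are assumptions, and ${jobTracker} and ${nameNode} are assumed to be defined in job.properties.

```xml
<!-- Hypothetical reproduction sketch; NOT the attached forkjoin.xml. -->
<workflow-app xmlns="uri:oozie:workflow:0.5" name="forkjoin-el-error-sketch">
    <start to="fork1"/>

    <!-- The fork starts both paths concurrently. -->
    <fork name="fork1">
        <path start="action1"/>
        <path start="action2"/>
    </fork>

    <!-- Healthy path: reported to remain in RUNNING after the job fails. -->
    <action name="action1">
        <shell xmlns="uri:oozie:shell-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>helloworld.sh</exec>
            <file>helloworld.sh</file>
        </shell>
        <ok to="join1"/>
        <error to="fail"/>
    </action>

    <!-- Failing path: the property value references a variable that is
         never defined, so EL evaluation fails (property name is assumed). -->
    <action name="action2">
        <shell xmlns="uri:oozie:shell-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>hypothetical.property</name>
                    <value>${variableThatWillCauseELError}</value>
                </property>
            </configuration>
            <exec>helloworld.sh</exec>
            <file>helloworld.sh</file>
        </shell>
        <ok to="join1"/>
        <error to="fail"/>
    </action>

    <!-- The join waits for every path started by fork1. -->
    <join name="join1" to="end"/>

    <kill name="fail">
        <message>Workflow failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

      The point of the sketch is only that the two actions sit on sibling paths of the same fork, so a failure on one path ends the workflow job while the other path is still in flight.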

      We have examined the situation only at the surface level; a deeper understanding of the mechanism of fork-join workflows is needed to proceed further.

      Suspected classes, as a starting point:

      • org.apache.oozie.workflow.lite.LiteWorkflowInstance
      • org.apache.oozie.command.wf.ActionCheckXCommand
      • What if we do not throw an exception in org.apache.oozie.command.wf.ActionCheckXCommand#verifyPrecondition?

      Attachments

        1. OOZIE-3670-001.patch
          27 kB
          János Makai
        2. job.properties
          0.1 kB
          János Makai
        3. helloworld.sh
          0.0 kB
          János Makai
        4. forkjoin.xml
          2 kB
          János Makai

          People

            Assignee: János Makai (jmakai)
            Reporter: János Makai (jmakai)
            Votes: 0
            Watchers: 4