  1. Oozie
  2. OOZIE-3721

Sub-workflows hang in "RUNNING" status under high cluster load


Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: 5.2.0
    • Fix Version/s: None
    • Component/s: core
    • Labels: None

    Description

      When my cluster is under load, sub-workflows hang in the "RUNNING" status. I hit this error when working with Hive tables, but I also managed to reproduce it by launching the usual pi calculation in many sub-workflows to simulate load.

      I launch an Oozie workflow with the following structure:

      -- Oozie workflow
      ------> subworkflow_1
      ---------- fork_1
      ---------- fork_2
      ---------- ...
      ---------- fork_n
      ------> subworkflow_2
      ---------- fork_1
      ---------- fork_2
      ---------- ...
      ---------- fork_n 
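
      For anyone trying to reproduce this, a parent workflow of the shape above could be generated with something like the sketch below. This is not the reporter's actual workflow.xml: the schema version (uri:oozie:workflow:0.5), the node names "join", "fail" and "end", and the helper build_parent_workflow are all assumptions made for illustration. Only the workflow name and the sub-workflow path come from the output in this issue.

```python
import xml.etree.ElementTree as ET

# Assumed schema version; the issue does not say which one is used.
NS = "uri:oozie:workflow:0.5"


def build_parent_workflow(n_branches, subwf_path):
    """Build a workflow-app whose fork starts n_branches sub-workflow actions.

    Hypothetical helper: node names other than "fork" are illustrative.
    """
    app = ET.Element("workflow-app", {"name": "test-subworkflow", "xmlns": NS})
    ET.SubElement(app, "start", {"to": "fork"})

    # One <path> per parallel branch.
    fork = ET.SubElement(app, "fork", {"name": "fork"})
    for i in range(1, n_branches + 1):
        ET.SubElement(fork, "path", {"start": f"fork{i}"})

    # Each branch is a sub-workflow action that joins back when done.
    for i in range(1, n_branches + 1):
        action = ET.SubElement(app, "action", {"name": f"fork{i}"})
        subwf = ET.SubElement(action, "sub-workflow")
        ET.SubElement(subwf, "app-path").text = subwf_path
        ET.SubElement(subwf, "propagate-configuration")
        ET.SubElement(action, "ok", {"to": "join"})
        ET.SubElement(action, "error", {"to": "fail"})

    ET.SubElement(app, "join", {"name": "join", "to": "end"})
    kill = ET.SubElement(app, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "sub-workflow failed"
    ET.SubElement(app, "end", {"name": "end"})
    return app


workflow_xml = ET.tostring(
    build_parent_workflow(3, "hdfs://mycluster:8020/user/cecyl/subwf"),
    encoding="unicode",
)
print(workflow_xml)
```

      Raising n_branches is one way to imitate the load described here; the sub-workflow itself would need its own fork of shell actions.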

      One of the forks has status "RUNNING", but if you open that fork, it has status "SUCCESS".

      Parent workflow:

      Job ID : 0061971-240125161152217-oozie-oozi-W
      ------------------------------------------------------------------------------------------------------------------------
      Workflow Name : test-subworkflow
      App Path      : hdfs://mycluster:8020/user/cecyl/subwf/job
      Status        : RUNNING
      Run           : 0
      User          : cecyl
      Group         : -
      Created       : 2024-01-25 15:55 GMT
      Started       : 2024-01-25 15:55 GMT
      Last Modified : 2024-01-30 06:24 GMT
      Ended         : -
      CoordAction ID: -

      Actions
      -------------------------------------------------------------------------------------------------------------------------
      ID                                                       Status    Ext ID                 Ext Status Err Code
      -------------------------------------------------------------------------------------------------------------------------
      0061971-240125161152217-oozie-oozi-W@:start:             OK        -                      OK         -
      -------------------------------------------------------------------------------------------------------------------------
      0061971-240125161152217-oozie-oozi-W@fork                OK        -                      OK         -
      -------------------------------------------------------------------------------------------------------------------------
      0061971-240125161152217-oozie-oozi-W@fork7               OK        0067643-240125161152217-oozie-oozi-W  SUCCEEDED  -
      -------------------------------------------------------------------------------------------------------------------------
      0061971-240125161152217-oozie-oozi-W@fork9               OK        0067640-240125161152217-oozie-oozi-W  SUCCEEDED  -
      -------------------------------------------------------------------------------------------------------------------------
      0061971-240125161152217-oozie-oozi-W@fork10              RUNNING   0067641-240125161152217-oozie-oozi-W  RUNNING    -
      -------------------------------------------------------------------------------------------------------------------------
      0061971-240125161152217-oozie-oozi-W@fork5               OK        0067645-240125161152217-oozie-oozi-W  SUCCEEDED  -
      -------------------------------------------------------------------------------------------------------------------------
       

      Running subworkflow:

      Job ID : 0067641-240125161152217-oozie-oozi-W
      ------------------------------------------------------------------------------------------------------------------------------------
      Workflow Name : test-subworkflow
      App Path      : hdfs://mycluster:8020/user/cecyl/subwf
      Status        : RUNNING
      Run           : 0
      User          : cecyl
      Group         : -
      Created       : 2024-01-26 04:20 GMT
      Started       : 2024-01-26 04:20 GMT
      Last Modified : 2024-01-26 08:23 GMT
      Ended         : -
      CoordAction ID: 0061971-240125161152217-oozie-oozi-W

      Actions
      -------------------------------------------------------------------------------------------------------------------------
      ID                                                       Status    Ext ID                 Ext Status Err Code
      -------------------------------------------------------------------------------------------------------------------------
      0067641-240125161152217-oozie-oozi-W@:start:             OK        -                      OK         -
      -------------------------------------------------------------------------------------------------------------------------
      0067641-240125161152217-oozie-oozi-W@fork                OK        -                      OK         -
      -------------------------------------------------------------------------------------------------------------------------
      0067641-240125161152217-oozie-oozi-W@fork21              RUNNING   application_1706187939089_147514  RUNNING    -
      -------------------------------------------------------------------------------------------------------------------------
      0067641-240125161152217-oozie-oozi-W@fork22              RUNNING   application_1706187939089_147519  RUNNING    -
      -------------------------------------------------------------------------------------------------------------------------
      0067641-240125161152217-oozie-oozi-W@fork18              RUNNING   application_1706187939089_147518  RUNNING    -
      -------------------------------------------------------------------------------------------------------------------------

      However, the YARN application behind a running action is in state "FINISHED" with final state "SUCCEEDED":

      Application Report :
              Application-Id : application_1706187939089_147514
              Application-Name : oozie:launcher:T=shell:W=test-subworkflow:A=fork21:ID=0067641-240125161152217-oozie-oozi-W
              Application-Type : Oozie Launcher
              User : cecyl
              Queue : default
              Application Priority : 0
              Start-Time : 1706259786568
              Finish-Time : 1706259853156
              Progress : 100%
              State : FINISHED
              Final-State : SUCCEEDED 
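
      The mismatch can be checked mechanically. The sketch below parses a report like the one above and flags an action that Oozie still shows as RUNNING although YARN has already finished the application. The helpers parse_report and is_stale are hypothetical names, not part of Oozie or YARN, and the report text is pasted in by hand rather than fetched from the cluster.

```python
# Fields copied from the application report in this issue.
REPORT = """\
Application-Id : application_1706187939089_147514
Application-Type : Oozie Launcher
User : cecyl
Start-Time : 1706259786568
Finish-Time : 1706259853156
Progress : 100%
State : FINISHED
Final-State : SUCCEEDED
"""

# YARN application states in which the application is no longer running.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def parse_report(text):
    """Parse 'Key : Value' lines of an application report into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # skip lines that are not key/value pairs
            fields[key.strip()] = value.strip()
    return fields


def is_stale(oozie_action_status, report):
    """True when Oozie shows RUNNING but YARN reports a terminal state."""
    return (
        oozie_action_status == "RUNNING"
        and report.get("State") in TERMINAL_STATES
    )


report = parse_report(REPORT)

# Start-Time and Finish-Time are epoch milliseconds.
runtime_s = (int(report["Finish-Time"]) - int(report["Start-Time"])) / 1000

print(is_stale("RUNNING", report))  # action fork21 is RUNNING in Oozie
print(runtime_s)                    # launcher runtime in seconds
```

      For fork21 this reports a stale action: the launcher ran for about 67 seconds and finished, yet the action stayed in RUNNING.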

      The problem began to appear more often after tuning HA. The only workaround is to reduce the load and restart the application, but that is not an acceptable solution for me.

      Neither the job logs nor the server logs show any sign that something is going wrong. Does anyone have an idea why this behavior occurs?

          People

            Assignee: Unassigned
            Reporter: Cecily Myles (cecylim)