  1. Apache YuniKorn
  2. YUNIKORN-2099

[Umbrella] State initialisation simplification (phase 2)


Details

    Description

      Startup rebuilds the full state of the cluster. This is currently called recovery, which is misleading: nothing is recovered, the current state is simply loaded. State initialisation is a better term to use.

      The current recovery code ties the loading of applications and tasks (pods) to node loading. This makes the recovery code complex and therefore fragile. In the worst case, it could lead to a pod not being recovered correctly.

      Recovery should be a step-by-step process with clear boundaries between the steps:

      • load nodes
        • register nodes with the core
      • load pods
        • create applications in core
        • register running pods as allocations with the core
        • register pending pods as asks with the core
      • process changes for nodes and pods
      • start scheduling
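
      The ordering above can be illustrated with a minimal Go sketch. It is not YuniKorn code: the types `Pod` and `Core`, the function `InitialiseState`, and the use of callbacks to represent queued changes are all hypothetical, chosen only to show each phase finishing before the next one starts.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod.
type Pod struct {
	Name  string
	AppID string
	Node  string // empty while the pod is still pending
}

// Core is a hypothetical in-memory stand-in for the scheduler core.
type Core struct {
	Nodes       map[string]bool
	Apps        map[string]bool
	Allocations []string // running pods
	Asks        []string // pending pods
	Scheduling  bool
}

// NewCore returns an empty core with its maps initialised.
func NewCore() *Core {
	return &Core{Nodes: map[string]bool{}, Apps: map[string]bool{}}
}

// InitialiseState runs the phases strictly in order. The changes slice models
// (as an assumption) events that arrived while nodes and pods were loading.
func InitialiseState(core *Core, nodes []string, pods []Pod, changes []func(*Core)) {
	// Phase 1: load nodes and register them with the core.
	for _, n := range nodes {
		core.Nodes[n] = true
	}
	// Phase 2: load pods; create applications, then allocations or asks.
	for _, p := range pods {
		core.Apps[p.AppID] = true
		if p.Node != "" {
			core.Allocations = append(core.Allocations, p.Name)
		} else {
			core.Asks = append(core.Asks, p.Name)
		}
	}
	// Phase 3: process changes for nodes and pods.
	for _, change := range changes {
		change(core)
	}
	// Phase 4: start scheduling only after all state is loaded.
	core.Scheduling = true
}

func main() {
	core := NewCore()
	pods := []Pod{
		{Name: "pod-a", AppID: "app-1", Node: "node-1"},
		{Name: "pod-b", AppID: "app-1"},
	}
	InitialiseState(core, []string{"node-1"}, pods, nil)
	fmt.Println(len(core.Nodes), len(core.Apps), core.Allocations, core.Asks, core.Scheduling)
}
```

      Because pod loading no longer depends on node loading, each phase in this sketch can be reasoned about and tested separately.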

      No nodes, applications, or asks on existing applications should be rejected. Even if the queue does not exist, a running application must be added and handled. The current behaviour of rejecting an application when it cannot be placed in the queue is incorrect.
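
      One possible way to honour the "never reject" rule is sketched below in Go. The issue does not say how an application with a missing queue should be handled, so the fallback queue, its name `root.recovery-fallback`, and the `Registry` type are assumptions made purely for illustration.

```go
package main

import "fmt"

// fallbackQueue is a hypothetical name; the issue does not specify one.
const fallbackQueue = "root.recovery-fallback"

// Registry is a hypothetical stand-in mapping queue names to application IDs.
type Registry struct {
	Queues map[string][]string
}

// AddRunningApp never rejects: if the requested queue is unknown, the
// application is tracked under the fallback queue instead. It returns the
// name of the queue that was actually used.
func (r *Registry) AddRunningApp(appID, queue string) string {
	if _, ok := r.Queues[queue]; !ok {
		queue = fallbackQueue
	}
	r.Queues[queue] = append(r.Queues[queue], appID)
	return queue
}

func main() {
	reg := &Registry{Queues: map[string][]string{"root.default": nil}}
	fmt.Println(reg.AddRunningApp("app-1", "root.default"))
	fmt.Println(reg.AddRunningApp("app-2", "root.deleted"))
}
```

      The point of the sketch is only that the add operation has no failure path during state initialisation; the actual placement policy is left open.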

            People

              ccondit Craig Condit
              Votes: 0
              Watchers: 4
