NVIDIA has open-sourced KAI Scheduler, a Kubernetes solution designed to optimise the scheduling of GPU resources.
The solution, previously a core component of the Run:ai platform, is now available to the wider community under the Apache 2.0 licence. According to NVIDIA, the move is part of its ongoing commitment to the advancement of both open-source initiatives and enterprise-grade AI infrastructure.
KAI Scheduler is specifically engineered to tackle the unique challenges associated with managing AI workloads that utilise both GPUs and CPUs. Traditional resource schedulers often fall short in addressing the dynamic demands of modern AI development.
NVIDIA highlights several key advantages offered by KAI Scheduler:
- Managing fluctuating GPU demands: AI workloads are often characterised by unpredictable resource requirements. A project might initially require a single GPU for tasks like data exploration, only to suddenly demand multiple GPUs for distributed training or parallel experiments. Traditional schedulers often struggle to adapt to such rapid changes. The KAI Scheduler addresses this by continuously recalculating fair-share values and adjusting quotas and limits in real time, ensuring that GPU allocation aligns with current workload demands without constant manual intervention from system administrators (a simplified sketch of this fair-share recalculation follows this list).
- Reduced wait times for compute access: For machine learning engineers, time is a critical factor. The KAI Scheduler minimises delays by employing a combination of gang scheduling, GPU sharing, and a hierarchical queuing system. This allows users to submit batches of jobs and be confident that these tasks will launch as soon as the necessary resources become available, while also respecting defined priorities and fairness principles.
- Resource guarantees for GPU allocation: In shared computing clusters, it is not uncommon for researchers to secure more GPUs than they immediately need simply to guarantee availability later in the day. This practice can leave resources significantly underutilised even while other teams have unmet computational demands. The KAI Scheduler tackles this issue by enforcing resource guarantees, ensuring that AI teams receive their allocated GPUs while dynamically reallocating idle resources to other waiting workloads. This prevents resource hoarding and improves overall cluster efficiency.
- Seamlessly connecting AI tools and frameworks: Integrating AI workloads with various AI frameworks can often be a complex undertaking. Traditionally, teams face a convoluted process of manual configurations to link workloads with popular tools such as Kubeflow, Ray, Argo, and the Training Operator. This complexity can significantly slow down the initial prototyping phase. The KAI Scheduler simplifies this process with a built-in podgrouper that automatically detects and connects with these tools and frameworks, thereby reducing configuration overhead and accelerating the development lifecycle.
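To make the quota mechanics concrete, here is a minimal sketch of how a fair-share recalculation over baseline quotas, over-quota weights, limits, and current demand might look. The data shapes and the one-GPU-at-a-time heuristic are illustrative assumptions, not KAI Scheduler's actual algorithm or API:

```python
def fair_share(queues, total_gpus):
    """Recompute per-queue GPU shares: honour each queue's guaranteed baseline
    quota first, then hand spare GPUs out one at a time to the queue whose
    share-to-weight ratio is currently lowest, never exceeding its limit or
    its actual demand. An illustrative heuristic only."""
    shares = {q["name"]: min(q["quota"], q["demand"]) for q in queues}
    spare = total_gpus - sum(shares.values())
    while spare > 0:
        eligible = [q for q in queues
                    if shares[q["name"]] < min(q["limit"], q["demand"])]
        if not eligible:
            break  # every queue is at its limit or has no further demand
        winner = min(eligible, key=lambda e: shares[e["name"]] / e["over_quota_weight"])
        shares[winner["name"]] += 1
        spare -= 1
    return shares


queues = [
    {"name": "project-a", "quota": 4, "over_quota_weight": 1.0, "limit": 16, "demand": 20},
    {"name": "project-b", "quota": 4, "over_quota_weight": 2.0, "limit": 16, "demand": 20},
]
print(fair_share(queues, total_gpus=24))  # {'project-a': 8, 'project-b': 16}
```

With these made-up numbers, the queue with the larger over-quota weight ends up with the bigger slice of the surplus once both baseline quotas are met.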
Scheduling with KAI Scheduler
KAI Scheduler operates in a continuous loop, performing a series of crucial steps to ensure efficient resource allocation and management.
The process begins with taking a comprehensive snapshot of the Kubernetes cluster, encompassing the current state of GPUs, CPUs, nodes, podgroups, and queues. This snapshot is then used as the basis for dividing available resources according to defined fairness policies.
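Conceptually, the cycle can be pictured as the skeleton below; the function names, arguments, and fixed polling interval are assumptions made for illustration rather than the project's actual code:

```python
import time

def scheduling_cycle(take_snapshot, divide_resources, actions):
    """One pass: snapshot the cluster, divide resources under the fairness
    policy, then run the scheduling actions in their fixed order."""
    snapshot = take_snapshot()                  # GPUs, CPUs, nodes, podgroups, queues
    fair_shares = divide_resources(snapshot)    # per-queue shares under the fairness policy
    for run_action in actions:                  # allocate, consolidate, reclaim, preempt
        run_action(snapshot, fair_shares)

def run_scheduler(take_snapshot, divide_resources, actions, interval_seconds=1.0):
    """Repeat the cycle continuously, as the scheduler does."""
    while True:
        scheduling_cycle(take_snapshot, divide_resources, actions)
        time.sleep(interval_seconds)
```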
Following resource division, the scheduler executes four scheduling actions in a fixed order (a simplified sketch of the allocation step follows this list):
- Allocation: Pending jobs are evaluated based on their allocated-to-fair-share ratio. Jobs that can be accommodated by currently available resources are immediately bound to those resources. Jobs requiring resources that are in the process of being freed up are queued for allocation once those resources become available.
- Consolidation: For training workloads, the scheduler builds an ordered queue of remaining pending jobs. It then attempts to allocate these jobs by potentially moving currently running pods to different nodes. This process aims to minimise resource fragmentation, creating larger contiguous blocks of resources for the pending jobs.
- Reclamation: To maintain fairness across the cluster, the scheduler identifies queues that are consuming more resources than their calculated fair share. It then evicts selected jobs from these over-allocated queues based on predefined strategies, ensuring that under-served queues receive the resources they need.
- Preemption: Within the same queue, the scheduler can preempt lower-priority jobs in favour of higher-priority pending jobs. This ensures that critical workloads are not stalled due to resource contention from less important tasks.
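As an illustration of the allocation step, the sketch below orders pending jobs by their queue's allocated-to-fair-share ratio and binds a job only when a single node can host all of the GPUs it requests, reflecting the all-or-nothing nature of gang scheduling. The data shapes and numbers are assumptions, and intra-node GPU contiguity is ignored for brevity:

```python
def allocation_order(pending, allocated, fair_shares):
    """Queues consuming the smallest fraction of their fair share go first."""
    def ratio(job):
        share = fair_shares.get(job["queue"], 0)
        return allocated.get(job["queue"], 0) / share if share else float("inf")
    return sorted(pending, key=ratio)

def try_allocate(job, nodes):
    """Bind the job only if a single node can host every GPU it requests;
    otherwise leave it pending for the later consolidation, reclamation,
    or preemption steps."""
    for node in nodes:
        if node["free_gpus"] >= job["gpus"]:
            node["free_gpus"] -= job["gpus"]
            return node["name"]
    return None

pending = [
    {"name": "training-job-a", "queue": "project-a", "gpus": 4},
    {"name": "training-job-b", "queue": "project-b", "gpus": 3},
]
nodes = [{"name": "node-1", "free_gpus": 2}, {"name": "node-3", "free_gpus": 3}]
allocated = {"project-a": 6, "project-b": 4}
fair_shares = {"project-a": 8, "project-b": 12}

for job in allocation_order(pending, allocated, fair_shares):
    print(job["name"], "->", try_allocate(job, nodes))
# training-job-b -> node-3
# training-job-a -> None  (no single node has 4 free GPUs yet)
```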
KAI Scheduler revolves around two fundamental entities:
- Podgroups: These represent the atomic unit for scheduling. A podgroup consists of one or more interdependent pods that must be executed together as a single unit – a concept known as gang scheduling. This is particularly important for distributed AI training frameworks like TensorFlow or PyTorch. Key attributes of podgroups include the minimum number of members required to be scheduled together, their association with a specific scheduling queue, and their priority class, which determines their scheduling order relative to other podgroups.
- Queues: Queues form the foundation for enforcing resource fairness. Each queue has specific properties that govern its resource allocation, including a baseline quota (the guaranteed minimum resource allocation), an over-quota weight (which influences how surplus resources are distributed beyond the baseline), a limit (the maximum resources the queue can consume), and a queue priority (which determines its scheduling order relative to other queues). Both entities are sketched in code below.
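Paraphrasing the attributes above into code, the two entities might be modelled roughly as follows; the field names are illustrative and are not taken from the project's actual custom resource definitions:

```python
from dataclasses import dataclass

@dataclass
class PodGroup:
    name: str
    queue: str                # scheduling queue the gang belongs to
    min_members: int          # pods that must start together (gang scheduling)
    priority_class: str       # ordering relative to other podgroups

@dataclass
class Queue:
    name: str
    quota: float              # guaranteed baseline resource allocation
    over_quota_weight: float  # how surplus beyond the baseline is shared
    limit: float              # maximum resources the queue may consume
    priority: int             # ordering relative to other queues

# Example: a distributed training gang of four pods in a high-priority queue
pg = PodGroup("training-job-b", queue="project-b", min_members=4, priority_class="high")
q = Queue("project-b", quota=4, over_quota_weight=2.0, limit=16, priority=100)
```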
NVIDIA provides an example scenario
To illustrate the KAI Scheduler’s capabilities, NVIDIA provides a scenario involving a cluster with three nodes, each equipped with eight GPUs, totalling 24 GPUs. Two projects, “project-a” (medium priority) and “project-b” (high priority), are running on this cluster.
Initially, various training and interactive jobs are running, consuming a portion of the available GPU resources. Subsequently, two new training jobs are submitted: Training Job A (requiring four contiguous GPUs, medium priority) and Training Job B (requiring three contiguous GPUs, high priority).
The scheduler first prioritises Training Job B due to its higher priority queue. Finding three contiguous free GPUs on Node 3, it allocates the job there. Next, it attempts to allocate Training Job A, but no single node has four contiguous free GPUs.
This triggers the consolidation phase. The scheduler identifies that by relocating one of the already-running workloads, Training Job 2, from Node 1 to Node 2, it can free up four contiguous GPUs on Node 1. This allows Training Job A to be successfully allocated to Node 1.
This example demonstrates how the KAI Scheduler can intelligently manage and consolidate resources to accommodate new workloads, even when contiguous resources are initially unavailable.
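A minimal sketch of that consolidation move, using illustrative GPU counts and treating each node's free GPUs as a simple count rather than modelling contiguity, might look like this:

```python
def find_node(nodes, gpus_needed):
    """Return the first node with enough free GPUs, or None."""
    return next((n for n in nodes if n["free"] >= gpus_needed), None)

def consolidate(nodes, running, pending_job):
    """Try relocating a single running job so the pending job fits on the
    node it vacates. Illustrative only; the real scheduler's logic differs."""
    for job in running:
        src = next(n for n in nodes if n["name"] == job["node"])
        dst = find_node([n for n in nodes if n is not src], job["gpus"])
        if dst is None:
            continue
        if src["free"] + job["gpus"] >= pending_job["gpus"]:
            src["free"] += job["gpus"]     # the move frees GPUs on the source node
            dst["free"] -= job["gpus"]
            job["node"] = dst["name"]
            return src["name"]
    return None

nodes = [
    {"name": "node-1", "free": 2},   # Training Job 2 also runs here
    {"name": "node-2", "free": 3},
    {"name": "node-3", "free": 0},   # Training Job B was just placed here
]
running = [{"name": "training-job-2", "gpus": 2, "node": "node-1"}]
job_a = {"name": "training-job-a", "gpus": 4}

if find_node(nodes, job_a["gpus"]) is None:
    target = consolidate(nodes, running, job_a)
    print("Training Job A placed on", target)   # node-1, after moving Training Job 2
```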
In more complex scenarios, the reclaim and preemption actions would be invoked if necessary to further optimise resource utilisation and fairness.
NVIDIA emphasises that the KAI Scheduler is not just a theoretical concept but a battle-tested component of the NVIDIA Run:ai platform, trusted by numerous enterprises for their critical AI operations.
The company is now inviting enterprises, startups, research institutions, and the wider open-source community to explore the KAI Scheduler in their own environments and share their experiences.
(Photo by Raphael Schaller)
See also: Eclipse Foundation unveils open-source AI development tools
