Running jobs with Kueue
As a data scientist or ML engineer, you can submit various types of workloads to Alauda Build of Kueue for quota-managed scheduling. This page shows how to run different workload types with Kueue.
Prerequisites
- The Alauda Build of Kueue cluster plugin is installed.
- A ClusterQueue and a LocalQueue have been configured by your administrator.
- The Alauda Container Platform Web CLI can communicate with your cluster.
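The queue objects referenced above are created by your administrator, not by you. As a rough sketch of what that configuration might look like (the names, flavor, and quota values here are illustrative assumptions, not values from your cluster), a ClusterQueue/LocalQueue pair could be:

```yaml
# Illustrative only: your administrator chooses the real names and quotas,
# and "default-flavor" assumes a ResourceFlavor of that name exists.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-ml-cq
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: cpu
              nominalQuota: 32
            - name: memory
              nominalQuota: 64Gi
            - name: nvidia.com/gpu
              nominalQuota: 8
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-ml-queue
  namespace: team-ml
spec:
  clusterQueue: team-ml-cq
```

The LocalQueue is the namespaced handle your workloads reference; it forwards them to the cluster-scoped ClusterQueue, which holds the actual quota.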
Identifying available local queues
Before submitting a job, identify the local queues available in your namespace:
```bash
kubectl get localqueues -n <your-namespace>
```
If a default local queue (named `default`) exists, you do not need to add the `kueue.x-k8s.io/queue-name` label to your workload. Otherwise, you must specify the local queue name.
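For example, assuming a LocalQueue named `default` exists in your namespace, a minimal Job can omit the label entirely and still be queued (a sketch, not a manifest from this product's samples):

```yaml
# Sketch: no queue-name label; relies on a LocalQueue named "default"
# existing in this namespace, as described above.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: default-queue-job-
  namespace: team-ml
spec:
  template:
    spec:
      containers:
        - name: worker
          image: registry.k8s.io/e2e-test-images/agnhost:2.53
          args: ["entrypoint-tester", "hello"]
          resources:
            requests:
              cpu: 1
      restartPolicy: Never
```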
Running a batch Job
To run a standard Kubernetes batch Job with Kueue, add the `kueue.x-k8s.io/queue-name` label to the Job manifest:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  namespace: team-ml
  labels:
    kueue.x-k8s.io/queue-name: team-ml-queue
spec:
  parallelism: 3
  completions: 3
  template:
    spec:
      containers:
        - name: worker
          image: registry.k8s.io/e2e-test-images/agnhost:2.53
          args: ["entrypoint-tester", "hello", "world"]
          resources:
            requests:
              cpu: 1
              memory: "200Mi"
      restartPolicy: Never
```

`kueue.x-k8s.io/queue-name`: Specifies the LocalQueue that manages this Job. Replace `team-ml-queue` with the name of a LocalQueue in your namespace.
Submit the Job:
```bash
kubectl create -f job.yaml
```
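For quota purposes, Kueue accounts for the Job as a whole: with `parallelism: 3`, the resulting Workload asks the ClusterQueue for three times the per-pod requests. A quick back-of-the-envelope check (plain Python arithmetic, not a Kueue API; values copied from the manifest above):

```python
# Rough sketch of Kueue's quota accounting for the Job above:
# total request = parallelism x per-pod requests.
parallelism = 3
per_pod = {"cpu": 1, "memory_mi": 200}

total = {resource: qty * parallelism for resource, qty in per_pod.items()}
print(total)  # {'cpu': 3, 'memory_mi': 600}
```

So the queue must have 3 CPUs and 600Mi of memory free before the Job is admitted.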
Running a RayJob
To run a Ray-based distributed job with Kueue, add the `kueue.x-k8s.io/queue-name` label to the RayJob manifest:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: ray-training-job
  namespace: team-ml
  labels:
    kueue.x-k8s.io/queue-name: team-ml-queue
spec:
  entrypoint: python /home/ray/train.py
  runtimeEnvYAML: |
    pip:
      - torch
      - transformers
  rayClusterSpec:
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0-py310-gpu
              resources:
                requests:
                  cpu: "2"
                  memory: "4Gi"
                limits:
                  cpu: "2"
                  memory: "4Gi"
    workerGroupSpecs:
      - replicas: 2
        minReplicas: 2
        maxReplicas: 2
        groupName: gpu-workers
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0-py310-gpu
                resources:
                  requests:
                    cpu: "4"
                    memory: "8Gi"
                    nvidia.com/gpu: "1"
                  limits:
                    cpu: "4"
                    memory: "8Gi"
                    nvidia.com/gpu: "1"
```

`kueue.x-k8s.io/queue-name`: Specifies the LocalQueue for this RayJob. Kueue admits the entire RayJob (head and workers) as a single unit using gang scheduling.
Running a RayCluster
To create a Ray cluster managed by Kueue:
```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster
  namespace: team-ml
  labels:
    kueue.x-k8s.io/queue-name: team-ml-queue
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0-py310-gpu
            resources:
              requests:
                cpu: "2"
                memory: "4Gi"
              limits:
                cpu: "2"
                memory: "4Gi"
  workerGroupSpecs:
    - replicas: 2
      minReplicas: 2
      maxReplicas: 2
      groupName: gpu-workers
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0-py310-gpu
              resources:
                requests:
                  cpu: "4"
                  memory: "8Gi"
                  nvidia.com/gpu: "1"
                limits:
                  cpu: "4"
                  memory: "8Gi"
                  nvidia.com/gpu: "1"
```

`kueue.x-k8s.io/queue-name`: Specifies the LocalQueue for this RayCluster. The cluster's pods are not created until Kueue admits the workload based on available quota.
Running a PyTorchJob
To run a distributed PyTorch training job with Kueue:
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-training
  namespace: team-ml
  labels:
    kueue.x-k8s.io/queue-name: team-ml-queue
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
              command:
                - python
                - -m
                - torch.distributed.launch
                - --nproc_per_node=1
                - train.py
              resources:
                requests:
                  cpu: "4"
                  memory: "8Gi"
                  nvidia.com/gpu: "1"
                limits:
                  cpu: "4"
                  memory: "8Gi"
                  nvidia.com/gpu: "1"
          restartPolicy: OnFailure
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
              command:
                - python
                - -m
                - torch.distributed.launch
                - --nproc_per_node=1
                - train.py
              resources:
                requests:
                  cpu: "4"
                  memory: "8Gi"
                  nvidia.com/gpu: "1"
                limits:
                  cpu: "4"
                  memory: "8Gi"
                  nvidia.com/gpu: "1"
          restartPolicy: OnFailure
```

`kueue.x-k8s.io/queue-name`: Kueue will admit all replicas (Master and Workers) together using gang scheduling, ensuring the entire training job starts only when all required GPUs are available.
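To see why gang scheduling matters here, total up what the PyTorchJob needs before anything can start. A minimal sketch (plain Python arithmetic, not a Kueue API; replica counts and requests copied from the manifest above):

```python
# Gang admission sketch: Kueue admits Master + Workers together,
# so the whole bundle must fit the queue's free quota at once.
replicas = {"Master": 1, "Worker": 3}
per_pod = {"cpu": 4, "memory_gi": 8, "gpu": 1}

pods = sum(replicas.values())
totals = {resource: qty * pods for resource, qty in per_pod.items()}
print(pods, totals)  # 4 {'cpu': 16, 'memory_gi': 32, 'gpu': 4}
```

All 4 GPUs (and 16 CPUs, 32Gi of memory) must be available in the queue simultaneously; partial admission, where the master starts and then waits for workers, does not happen.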
Monitoring your workloads
After submitting a workload, you can monitor its status:
- Check whether the workload was admitted:

  ```bash
  kubectl get workloads -n <your-namespace>
  ```

- View your workload's position in the queue:

  ```bash
  kubectl get --raw "/apis/visibility.kueue.x-k8s.io/v1beta2/namespaces/<your-namespace>/localqueues/<queue-name>/pendingworkloads"
  ```

- Check the workload's admission status:

  ```bash
  kubectl get workload <workload-name> -n <your-namespace> -o jsonpath='{.status.conditions}'
  ```
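The jsonpath command above prints a JSON array of condition objects. A small sketch of checking that output for an `Admitted` condition (the sample payload below is illustrative, not output captured from a real cluster):

```python
import json

# Illustrative payload, shaped like a Workload's .status.conditions array.
raw = """[
  {"type": "QuotaReserved", "status": "True", "reason": "QuotaReserved"},
  {"type": "Admitted", "status": "True", "reason": "Admitted"}
]"""

conditions = json.loads(raw)
admitted = any(
    c["type"] == "Admitted" and c["status"] == "True" for c in conditions
)
print("admitted:", admitted)  # admitted: True
```

If the workload is still queued, the `Admitted` condition is absent or `False`, and the `QuotaReserved` condition's reason typically explains what it is waiting for.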