Troubleshooting common problems with Kueue

If you are experiencing errors in Alauda AI relating to Kueue workload management, read this section to understand what could be causing the problem, and how to resolve it.

If the problem is not documented here or in the release notes, contact Alauda Support.

I see a "failed to call webhook" error message

Problem

When creating or updating a workload (such as a Job, RayCluster, or InferenceService), you see an error similar to:

ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",
"message":"Internal error occurred: failed calling webhook ... no endpoints available for service \"kueue-webhook-service\"",
"reason":"InternalError","code":500}

Diagnosis

The Kueue controller pod might not be running, or the webhook service has no available endpoints.

Resolution

  1. Check the status of the Kueue controller pod:
    kubectl get pods -n cpaas-system | grep kueue
  2. If the pod is not running, check the pod events for errors:
    kubectl describe pod -n cpaas-system -l app=kueue-controller-manager
  3. Restart the Kueue controller pod if necessary:
    kubectl delete pod -n cpaas-system -l app=kueue-controller-manager
  4. Check the webhook service and its endpoints:
    kubectl get endpoints -n cpaas-system kueue-webhook-service

I see a "Default Local Queue not found" error message

Problem

After submitting a workload, you see an error similar to:

Default Local Queue with kueue.x-k8s.io/default-queue: true annotation not found
please create a default Local Queue or provide the local_queue name in Cluster Configuration.

Diagnosis

No default local queue is defined in the namespace, and a local queue was not specified in the workload configuration.

Resolution

Resolve the problem in one of the following ways:

  • If a local queue exists in the namespace, add the kueue.x-k8s.io/queue-name label to your workload manifest:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: <local_queue_name>
  • If no local queue exists, create a default local queue in the namespace. The kueue.x-k8s.io/default-queue: "true" annotation marks the queue as the default:
    apiVersion: kueue.x-k8s.io/v1beta2
    kind: LocalQueue
    metadata:
      namespace: <your-namespace>
      name: default
      annotations:
        kueue.x-k8s.io/default-queue: "true"
    spec:
      clusterQueue: <cluster-queue-name>
  • Contact your administrator to request that a local queue be created for your namespace.

I see a "local_queue provided does not exist" error message

Problem

After submitting a workload, you see an error similar to:

local_queue provided does not exist or is not in this namespace.
Please provide the correct local_queue name in Cluster Configuration.

Diagnosis

An incorrect value is specified for the local queue, or the local queue exists in a different namespace.

Resolution

  1. Verify the local queue exists in the correct namespace:
    kubectl get localqueues -n <your-namespace>
  2. Ensure the local queue name in the kueue.x-k8s.io/queue-name label matches exactly.
  3. If no local queue exists in your namespace, contact your administrator to request one.

My workload is stuck in a suspended state

Problem

A workload (such as a Job, RayCluster, or InferenceService) remains in a Suspended or SchedulingGated state and its pods are not created.

Diagnosis

The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.

Resolution

  1. Check the Workload resource status:
    kubectl get workloads -n <your-namespace>
  2. Inspect the Workload YAML for detailed status messages:
    kubectl get workload <workload-name> -n <your-namespace> -o yaml
    Check the status.conditions.message field, which provides the reason for the suspended state:
    status:
      conditions:
        - message: "couldn't assign flavors to pod set main: insufficient quota for nvidia.com/gpu in flavor default-flavor in ClusterQueue"
  3. Verify the ClusterQueue has sufficient quota:
    kubectl get clusterqueues -o yaml
  4. Either reduce the requested resources in your workload, or contact your administrator to increase the quota.
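For reference when requesting a quota change, the quota lives in the nominalQuota fields of the ClusterQueue's resource groups. The following is a minimal sketch, assuming a hypothetical cluster queue and flavor name; adjust the resource names and amounts to your environment:

```yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: <cluster-queue-name>        # hypothetical name
spec:
  namespaceSelector: {}             # admit workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor      # must exist as a ResourceFlavor
          resources:
            - name: "cpu"
              nominalQuota: 16
            - name: "memory"
              nominalQuota: 64Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 4       # raise this value if GPU quota is insufficient
```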

My workload pod is terminated before the image pull completes

Problem

Kueue waits a set period of time for all of a workload's pods to become provisioned and running before marking the workload as ready. By default, Kueue waits 5 minutes. If the pod image is very large and is still being pulled when this waiting period elapses, Kueue fails the workload and terminates the related pods.

Diagnosis

  1. Check the events on the pod:
    kubectl describe pod <pod-name> -n <your-namespace>
  2. Look for events indicating the image pull was still in progress when the pod was terminated.

Resolution

To resolve this issue, use one of the following approaches:

  • Add an OnFailure restart policy to your workload pod template so that the pod restarts and the image layers that were already pulled are reused:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
  • Contact your administrator to increase the waitForPodsReady timeout in the Kueue deployment configuration.
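For reference, the timeout is defined in the Kueue Configuration that the controller reads at startup. A minimal sketch of the relevant fields, following the upstream Kueue Configuration API (where this configuration is stored in an Alauda AI installation may differ):

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  timeout: 10m    # default is 5m; increase for workloads with large images
```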

My ClusterQueue is not ready

Problem

A ClusterQueue exists but is not admitting any workloads.

Diagnosis

The ClusterQueue might reference a ResourceFlavor that does not exist.

Resolution

  1. Check the ClusterQueue status:
    kubectl get clusterqueues
  2. Verify all referenced ResourceFlavors exist:
    kubectl get resourceflavors
  3. Create any missing ResourceFlavors. A ClusterQueue is not ready until all referenced ResourceFlavors are created.
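A ResourceFlavor with no node matching rules is the simplest way to satisfy a ClusterQueue that references a missing flavor. A minimal sketch, assuming the flavor name reported in the ClusterQueue status:

```yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: default-flavor    # must match the name referenced by the ClusterQueue
```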

Workloads are not being admitted in order

Problem

Workloads are not being admitted in the expected order (e.g., first-in-first-out).

Diagnosis

This can happen when workloads request different resource amounts, or when fair sharing and preemption policies are configured.

Resolution

  1. Check the workload priorities:
    kubectl get workloads -n <your-namespace> -o custom-columns=NAME:.metadata.name,PRIORITY:.spec.priority
  2. Review the ClusterQueue's fair sharing weight and preemption configuration.
  3. Use the Visibility API to check the pending workload order:
    kubectl get --raw "/apis/visibility.kueue.x-k8s.io/v1beta2/clusterqueues/<queue-name>/pendingworkloads"
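If a specific workload must be admitted ahead of others, Kueue supports workload priority classes. A sketch, assuming a hypothetical class name high-priority (API version per the other examples in this document):

```yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: WorkloadPriorityClass
metadata:
  name: high-priority
value: 1000                # higher values are admitted first
description: "Admitted before lower-value workloads in the same queue"
```

Reference the class from your workload by adding the kueue.x-k8s.io/priority-class: high-priority label to its metadata.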