Troubleshooting common problems with distributed workloads for administrators
If users report errors related to distributed workloads in Alauda AI, read this section to understand what could be causing the problem and how to resolve it as an administrator.
If the problem is not documented here or in the release notes, contact Alauda Support.
TOC
- Ray cluster is in a suspended state
- Ray cluster is in a failed state
- Ray cluster does not start
- PyTorchJob is not being admitted
- Kueue webhook service has no endpoints
- Workload pod terminated before image pull completes
- Insufficient resources across the cohort

Ray cluster is in a suspended state
Problem
The Ray cluster head pod or worker pods remain in a suspended state.
Diagnosis
The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.
Resolution
- Check the Workload resource status.
- Inspect the Workload YAML and check the `status.conditions.message` field for the detailed reason.
- Check the ClusterQueue configuration and verify that the requested resources are within the limits defined in the ClusterQueue:
  - If the quota is insufficient, increase the `nominalQuota` for the relevant resource.
  - If the ResourceFlavor does not exist, create it.
  - If the user requested more resources than are available, ask them to reduce their request.
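The checks above can be run with kubectl. This is a sketch; the bracketed names are placeholders for your environment, not fixed values.

```shell
# Placeholders: replace <namespace>, <workload-name>, and <cluster-queue> with your values.

# 1. Check the Workload resource status in the user's namespace.
kubectl get workloads -n <namespace>

# 2. Inspect the Workload YAML; the suspension reason appears in status.conditions.
kubectl get workload <workload-name> -n <namespace> -o yaml

# 3. Compare the requested resources against the quota defined in the ClusterQueue.
kubectl get clusterqueue <cluster-queue> -o yaml
```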
Ray cluster is in a failed state
Problem
The Ray cluster head pod or worker pods are not running. When a Ray cluster is first created, it might initially enter a failed state; this usually resolves after the reconciliation process completes and the pods start running.
Diagnosis
The cluster might have insufficient resources, or a component might be misconfigured.
Resolution
If the failed state persists:
- Check the pod events.
- Check the RayCluster resource status and review the `status.conditions.message` field.
- Common causes:
- Insufficient node resources: The cluster does not have enough physical resources. Scale up the cluster or reduce the workload request.
- Image pull failure: The container image cannot be pulled. Check image registry access and image name.
- Scheduling failure: Nodes do not match the required labels or tolerations. Verify the ResourceFlavor configuration.
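These checks can be sketched with kubectl as follows; `<namespace>` and `<raycluster-name>` are placeholders, and the `ray.io/cluster` pod label assumes a standard KubeRay deployment.

```shell
# Placeholders: replace <namespace> and <raycluster-name> with your values.

# Check events on the head and worker pods for scheduling or image pull errors.
# KubeRay labels cluster pods with ray.io/cluster=<raycluster-name>.
kubectl describe pods -l ray.io/cluster=<raycluster-name> -n <namespace>

# Review the RayCluster status conditions for the failure reason.
kubectl get raycluster <raycluster-name> -n <namespace> -o yaml
```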
Ray cluster does not start
Problem
After creating a Ray cluster, it remains in the Starting state and no pods are created.
Diagnosis
- Check the Workload resource.
- Inspect the `status.conditions.message` field of both the Workload and RayCluster resources.
Resolution
- Verify the KubeRay operator pod is running:
- If the KubeRay operator pod is not running, restart it:
- Check the KubeRay operator logs for errors:
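A minimal sketch of these steps with kubectl. The operator namespace, deployment name, and labels depend on how KubeRay was installed; the values below are common upstream defaults and may differ in your cluster.

```shell
# Assumed defaults: adjust <operator-namespace> and the deployment name for your installation.

# Verify the KubeRay operator pod is running.
kubectl get pods -n <operator-namespace> -l app.kubernetes.io/name=kuberay-operator

# Restart the operator if it is not running.
kubectl rollout restart deployment/kuberay-operator -n <operator-namespace>

# Check the operator logs for errors.
kubectl logs deployment/kuberay-operator -n <operator-namespace>
```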
PyTorchJob is not being admitted
Problem
A PyTorchJob remains in a pending state and its pods are not created.
Diagnosis
- Check if a Workload was created for the PyTorchJob:
- If a Workload exists, check its status conditions:
Resolution
- Verify that the PyTorchJob has the `kueue.x-k8s.io/queue-name` label. If the label is missing, add it to the PyTorchJob manifest.
- Verify that the LocalQueue exists in the namespace and is backed by a ClusterQueue with sufficient quota.
- Ensure that all resources requested by the PyTorchJob (CPU, memory, GPU) are covered in the ClusterQueue's `coveredResources`.
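The diagnosis and resolution steps can be sketched with kubectl; `<namespace>`, `<job-name>`, and `<local-queue>` are placeholders.

```shell
# Placeholders: replace <namespace>, <job-name>, and <local-queue> with your values.

# Check whether a Workload was created for the PyTorchJob.
kubectl get workloads -n <namespace>

# Verify the queue-name label on the PyTorchJob (dots in the label key are escaped).
kubectl get pytorchjob <job-name> -n <namespace> \
  -o jsonpath='{.metadata.labels.kueue\.x-k8s\.io/queue-name}'

# Add the label if it is missing; the job may need to be recreated for Kueue to admit it.
kubectl label pytorchjob <job-name> -n <namespace> kueue.x-k8s.io/queue-name=<local-queue>

# Confirm the LocalQueue exists and identify the ClusterQueue that backs it.
kubectl get localqueue <local-queue> -n <namespace> -o yaml
```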
Kueue webhook service has no endpoints
Problem
When creating distributed workloads, you see a 500 error about "failed to call webhook" with "no endpoints available for service".
Diagnosis
The Kueue controller manager pod is not running.
Resolution
- Check the Kueue pod status.
- If the pod is in `CrashLoopBackOff` or not running, check the logs.
- Restart the Kueue controller.
- Verify that the webhook service endpoints are available.
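These steps can be sketched with kubectl. The namespace and resource names below are the upstream Kueue defaults (`kueue-system`, `kueue-controller-manager`, `kueue-webhook-service`); the Alauda Build of Kueue may use different names, so verify them in your cluster first.

```shell
# Assumed upstream defaults: adjust the namespace and names for your installation.

# Check the Kueue controller manager pod status.
kubectl get pods -n kueue-system

# Inspect the logs if the pod is crash-looping or not running.
kubectl logs -n kueue-system deployment/kueue-controller-manager

# Restart the Kueue controller.
kubectl rollout restart deployment/kueue-controller-manager -n kueue-system

# Verify the webhook service has endpoints.
kubectl get endpoints kueue-webhook-service -n kueue-system
```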
Workload pod terminated before image pull completes
Problem
Kueue's `waitForPodsReady` timeout (default: 5 minutes) is too short for the large container images commonly used in distributed workloads (for example, CUDA images or large model images).
Diagnosis
- Check the pod events:
- Look for events indicating the image was still being pulled when the pod was terminated.
Resolution
- For workloads that use large images, add an `OnFailure` restart policy to the pod template so that partially pulled images can be reused.
- Increase the `waitForPodsReady` timeout in the Alauda Build of Kueue deployment configuration. Contact Alauda Support for guidance on modifying this setting.
- Pre-pull large images on GPU nodes to reduce image pull time.
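A sketch of setting the restart policy on an existing job, assuming the Kubeflow PyTorchJob API, where `restartPolicy` is set per replica spec; `<job-name>` and `<namespace>` are placeholders, and for new jobs the field is better set directly in the manifest.

```shell
# Assumption: Kubeflow PyTorchJob API; restartPolicy lives on each replica spec.
# Replace <job-name> and <namespace> with your values.
kubectl patch pytorchjob <job-name> -n <namespace> --type merge -p '
spec:
  pytorchReplicaSpecs:
    Worker:
      restartPolicy: OnFailure
'
```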
Insufficient resources across the cohort
Problem
Distributed workloads are not being admitted even though other ClusterQueues in the same cohort have unused resources.
Diagnosis
The ClusterQueue might not be part of a cohort, or borrowing limits might be configured too restrictively.
Resolution
- Check whether the ClusterQueue belongs to a cohort.
- If the ClusterQueue does not have a `spec.cohort` field, it cannot borrow resources. Add a cohort.
- If borrowing limits are set, verify that they allow sufficient borrowing for the workload's resource requirements.
- Check other ClusterQueues in the cohort to verify that they have unused resources and that their `lendingLimit` allows lending.
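The cohort checks above can be sketched with kubectl; `<cluster-queue>` and `<cohort-name>` are placeholders.

```shell
# Placeholders: replace <cluster-queue> and <cohort-name> with your values.

# Check whether the ClusterQueue belongs to a cohort (empty output means no cohort).
kubectl get clusterqueue <cluster-queue> -o jsonpath='{.spec.cohort}'

# Add the ClusterQueue to a cohort so it can borrow unused quota from its peers.
kubectl patch clusterqueue <cluster-queue> --type merge -p '{"spec":{"cohort":"<cohort-name>"}}'

# Review quotas, borrowingLimit, and lendingLimit across all ClusterQueues in the cohort.
kubectl get clusterqueues -o yaml
```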