Configuring quota management for distributed workloads
As an administrator, you can configure quota management for distributed workloads by creating the required Kueue resources: ResourceFlavor, ClusterQueue, and LocalQueue. These resources control how distributed workloads (such as Ray jobs and PyTorch training jobs) consume cluster resources.
Prerequisites
- You have cluster administrator permissions.
- The Alauda Build of Kueue cluster plugin is installed.
- The KubeRay operator is installed (for Ray-based distributed workloads).
- The Alauda Container Platform Web CLI can communicate with your cluster.
Procedure
1. Create a ResourceFlavor
A ResourceFlavor represents a set of node resources. For distributed workloads that require GPUs, create a ResourceFlavor with node labels that match your GPU nodes.
Example: ResourceFlavor for GPU nodes
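The manifest below is a sketch of such a ResourceFlavor; the flavor name, the node label key and value, and the taint key are assumptions that you should adapt to your GPU nodes:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor                       # assumed name; referenced by the ClusterQueue
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-A100  # assumed label; use a label present on your GPU nodes
  tolerations:
  - key: nvidia.com/gpu                  # assumed taint key; match the taints on your GPU nodes
    operator: Exists
    effect: NoSchedule
```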
- `nodeLabels`: Targets nodes with a specific GPU model. Workloads admitted with this flavor are automatically scheduled to matching nodes.
- `tolerations`: Allows workloads to be scheduled on GPU-tainted nodes. Add tolerations that match the taints on your GPU nodes.
Apply the ResourceFlavor with `kubectl apply -f <file>`.
To check existing ResourceFlavors, run `kubectl get resourceflavors`.
2. Create a ClusterQueue
A ClusterQueue defines the total resource quota available for distributed workloads.
Example: ClusterQueue for distributed workloads with GPU resources
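For illustration, a ClusterQueue along these lines covers both compute and GPU resources; the queue name, flavor name, and quota values are assumptions to adjust to your cluster capacity:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: distributed-workloads-queue  # assumed name
spec:
  namespaceSelector: {}              # empty selector: all namespaces may use this queue
  resourceGroups:
  - coveredResources: ["cpu", "memory", "pods", "nvidia.com/gpu"]
    flavors:
    - name: gpu-flavor               # must match the name of an existing ResourceFlavor
      resources:
      - name: "cpu"
        nominalQuota: 64             # general compute quotas (illustrative values)
      - name: "memory"
        nominalQuota: 256Gi
      - name: "pods"
        nominalQuota: 50
      - name: "nvidia.com/gpu"
        nominalQuota: 8              # total GPUs available for distributed workloads
```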
- `namespaceSelector: {}`: An empty selector allows all namespaces to use this ClusterQueue. To restrict access to specific namespaces, use `matchLabels`.
- General compute resources: CPU, memory, and pod count quotas for the distributed workload infrastructure (head nodes, driver pods, and so on).
- GPU resources: The total number of GPUs available for distributed workloads. Adjust based on your cluster capacity.
Note: Every resource that a distributed workload might request must be listed in `coveredResources` with a `nominalQuota` value (even if 0). If a workload requests a resource that is not covered, it will not be admitted.
Apply the ClusterQueue with `kubectl apply -f <file>`.
3. Create a LocalQueue
A LocalQueue allocates resources from the ClusterQueue to a specific namespace where distributed workloads will run.
Example: LocalQueue for a team namespace
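A minimal sketch, assuming a team namespace named `team-a` and a ClusterQueue named `distributed-workloads-queue` (both names are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue                         # assumed name; workloads reference this queue
  namespace: team-a                          # namespace where distributed workloads run
spec:
  clusterQueue: distributed-workloads-queue  # must match the ClusterQueue name
```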
Apply the LocalQueue with `kubectl apply -f <file>`.
Verify the LocalQueue with `kubectl get localqueues -n <namespace>`.
4. Verify the configuration
- Check that the ClusterQueue is active. The ClusterQueue should show as Active (for example, inspect its conditions with `kubectl describe clusterqueue <name>`).
- Check that the LocalQueue is connected to its ClusterQueue (for example, with `kubectl get localqueues -n <namespace>`).
Using HAMi virtual GPU resources
If you use Alauda Build of HAMi for GPU virtualization and sharing, configure the ClusterQueue with HAMi-specific resource names:
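As a sketch, the resource group of such a ClusterQueue might look like the following; the queue and flavor names and all quota values are illustrative and should be adjusted to your virtual GPU capacity:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: hami-gpu-queue                    # assumed name
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpualloc", "nvidia.com/total-gpucores", "nvidia.com/total-gpumem"]
    flavors:
    - name: hami-gpu-flavor               # assumed ResourceFlavor matching your HAMi nodes
      resources:
      - name: "cpu"
        nominalQuota: 64                  # illustrative compute quotas
      - name: "memory"
        nominalQuota: 256Gi
      - name: "nvidia.com/gpualloc"       # HAMi virtual GPU resources; quota values are illustrative
        nominalQuota: 16
      - name: "nvidia.com/total-gpucores"
        nominalQuota: 800
      - name: "nvidia.com/total-gpumem"
        nominalQuota: 81920
```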
Note: If you use Alauda Build of HAMi, use `nvidia.com/gpualloc`, `nvidia.com/total-gpucores`, and `nvidia.com/total-gpumem` as the GPU resource names. If you use the Alauda Build of NVIDIA GPU Device Plugin (physical GPUs), use `nvidia.com/gpu` instead.
Next steps
- See the example GPU configurations for multi-GPU scenarios.
- Set up fair sharing and cohorts to share resources between teams.
- See Running Ray-based distributed workloads for user-facing instructions on submitting workloads.