Example Kueue resource configurations for distributed workloads

This page provides example Kueue resource configurations for distributed workloads that use GPU accelerators. These examples demonstrate how to configure ResourceFlavor, ClusterQueue, and LocalQueue objects for different GPU scenarios.

NVIDIA GPUs without shared cohort

In this scenario, you have two types of NVIDIA GPU nodes and you want a separate ClusterQueue for each GPU type. The queues do not belong to a cohort, so neither queue can borrow unused quota from the other.

ResourceFlavor for NVIDIA Tesla T4 GPU

apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: t4-flavor
spec:
  nodeLabels:
    nvidia.com/gpu.product: Tesla-T4
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

ResourceFlavor for NVIDIA A30 GPU

apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: a30-flavor
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-A30
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

ClusterQueue for Tesla T4 GPU nodes

apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: t4-cluster-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "t4-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 4

ClusterQueue for A30 GPU nodes

apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: a30-cluster-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "a30-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 32
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
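
LocalQueue pointing to the Tesla T4 ClusterQueue

Users submit workloads through a LocalQueue in their own namespace, which points to one of the ClusterQueues above. A minimal sketch; the namespace name team-a and the LocalQueue name are illustrative:

apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: t4-local-queue
  namespace: team-a
spec:
  clusterQueue: t4-cluster-queue

A similar LocalQueue referencing a30-cluster-queue would expose the A30 quota in the same way.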

NVIDIA GPUs with HAMi virtual GPU sharing

In this scenario, you use Alauda Build of HAMi to enable GPU sharing and slicing. Different GPU models are configured with HAMi-specific resource names.

ResourceFlavor for HAMi-managed Tesla T4 GPU

apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: t4-hami-flavor
spec:
  nodeLabels:
    nvidia.com/gpu.product: Tesla-T4

ResourceFlavor for HAMi-managed A100 GPU

apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: a100-hami-flavor
spec:
  nodeLabels:
    nvidia.com/gpu.product: A100-SXM4-80GB

ClusterQueue with multiple HAMi GPU flavors

apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: hami-cluster-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory", "pods"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 64
      - name: "memory"
        nominalQuota: 256Gi
      - name: "pods"
        nominalQuota: 50
  - coveredResources: ["nvidia.com/gpualloc", "nvidia.com/total-gpucores", "nvidia.com/total-gpumem"]
    flavors:
    - name: "t4-hami-flavor"
      resources:
      - name: "nvidia.com/gpualloc"
        nominalQuota: "8"
      - name: "nvidia.com/total-gpucores"
        nominalQuota: "800"
      - name: "nvidia.com/total-gpumem"
        nominalQuota: "65536"
    - name: "a100-hami-flavor"
      resources:
      - name: "nvidia.com/gpualloc"
        nominalQuota: "4"
      - name: "nvidia.com/total-gpucores"
        nominalQuota: "400"
      - name: "nvidia.com/total-gpumem"
        nominalQuota: "327680"

1. t4-hami-flavor: Quotas for Tesla T4 GPU nodes managed by HAMi: up to 8 virtual GPU allocations, 800 GPU core units in total, and 65536 MiB (64 Gi) of total GPU memory.
2. a100-hami-flavor: Quotas for A100 GPU nodes managed by HAMi: up to 4 virtual GPU allocations with larger per-GPU memory (80 Gi per GPU, 327680 MiB in total).

Mixed physical and virtual GPU management

In this scenario, some GPU nodes are managed by the NVIDIA GPU Device Plugin (physical GPUs) while others are managed by HAMi (virtual GPUs).

ResourceFlavor for physical GPU nodes

apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: pgpu-flavor
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-A30
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

ResourceFlavor for HAMi virtual GPU nodes

apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: vgpu-flavor
spec:
  nodeLabels:
    nvidia.com/gpu.product: Tesla-T4

ClusterQueue with both physical and virtual GPU flavors

apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: mixed-gpu-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory", "pods"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 64
      - name: "memory"
        nominalQuota: 256Gi
      - name: "pods"
        nominalQuota: 50
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: "pgpu-flavor"
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8
  - coveredResources: ["nvidia.com/gpualloc", "nvidia.com/total-gpucores", "nvidia.com/total-gpumem"]
    flavors:
    - name: "vgpu-flavor"
      resources:
      - name: "nvidia.com/gpualloc"
        nominalQuota: "16"
      - name: "nvidia.com/total-gpucores"
        nominalQuota: "1600"
      - name: "nvidia.com/total-gpumem"
        nominalQuota: "131072"

INFO

Physical GPU resources (nvidia.com/gpu) and virtual GPU resources (nvidia.com/gpualloc, nvidia.com/total-gpucores, nvidia.com/total-gpumem) must be placed in separate resourceGroups: every flavor in a resource group must define quota for all of that group's covered resources, and a node provides either physical or virtual GPU resources, not both.
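
Workloads select a queue with the kueue.x-k8s.io/queue-name label. The following sketch assumes a LocalQueue named mixed-gpu-local-queue exists in the submitting namespace and points to mixed-gpu-queue; the Job name and container image are illustrative:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training-job
  labels:
    kueue.x-k8s.io/queue-name: mixed-gpu-local-queue
spec:
  suspend: true
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04
        resources:
          requests:
            cpu: "4"
            memory: 16Gi
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1

The Job is created suspended; Kueue unsuspends it once quota is available. Because the Job requests nvidia.com/gpu, the workload is admitted against the pgpu-flavor quota in this ClusterQueue.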

Restricting ClusterQueues to specific namespaces

By default, namespaceSelector: {} allows all namespaces to submit workloads to the ClusterQueue. To restrict access to specific namespaces, use matchLabels:

apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: team-ml-queue
spec:
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: team-ml
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "a30-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 32
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8

1. matchLabels: Restricts this ClusterQueue to accepting workloads only from the team-ml namespace. The kubernetes.io/metadata.name label is set automatically on every namespace by Kubernetes, but you can match on any label that exists on the target namespace.
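
Because only the team-ml namespace matches the selector, workloads reach this ClusterQueue through a LocalQueue created in that namespace. A sketch; the LocalQueue name is illustrative:

apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: team-ml-local-queue
  namespace: team-ml
spec:
  clusterQueue: team-ml-queue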