Integrate with JobSet

This page shows how to use Alauda Build of Kueue to manage JobSet workloads. A JobSet allows you to define a group of related Kubernetes Jobs that are managed together as a single unit.

JobSet is useful for distributed workloads that consist of multiple job components, such as a driver job and multiple worker jobs, where all components must run together.

Prerequisites

  • You have installed Alauda Build of Kueue.
  • You have installed the JobSet controller.
  • The JobSet framework is enabled in the Kueue configuration.
  • The Alauda Container Platform Web CLI can communicate with your cluster.
  • You have created a ClusterQueue, ResourceFlavor, and LocalQueue.
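
The queues from the last prerequisite can be created with manifests along these lines. This is a minimal sketch: the names default-flavor and team-ml-cq, and the quota values, are assumptions chosen to fit the example in the procedure; only team-ml-queue must match the queue-name label on the JobSet.

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: default-flavor        # assumed name
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: team-ml-cq            # assumed name
    spec:
      namespaceSelector: {}       # admit workloads from any namespace
      resourceGroups:
      - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
        flavors:
        - name: default-flavor
          resources:
          - name: "cpu"
            nominalQuota: 16      # assumed quota; covers 5 pods x 2 CPU
          - name: "memory"
            nominalQuota: 64Gi
          - name: "nvidia.com/gpu"
            nominalQuota: 4       # covers 4 workers x 1 GPU
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      name: team-ml-queue         # referenced by the queue-name label
      namespace: team-ml
    spec:
      clusterQueue: team-ml-cq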

Procedure

  1. Create a JobSet resource with the kueue.x-k8s.io/queue-name label:

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: training-jobset
      namespace: team-ml
      labels:
        kueue.x-k8s.io/queue-name: team-ml-queue
    spec:
      replicatedJobs:
      - name: workers
        replicas: 1
        template:
          spec:
            parallelism: 4
            completions: 4
            backoffLimit: 1
            template:
              spec:
                containers:
                - name: worker
                  image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
                  resources:
                    requests:
                      cpu: "2"
                      memory: "4Gi"
                      nvidia.com/gpu: "1"
                  command: ["python", "train.py"]
      - name: driver
        template:
          spec:
            parallelism: 1
            completions: 1
            backoffLimit: 0
            template:
              spec:
                containers:
                - name: driver
                  image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
                  resources:
                    requests:
                      cpu: "2"
                      memory: "4Gi"
                  command: ["python", "driver.py"]

    In this example:

    • kueue.x-k8s.io/queue-name: Specifies the LocalQueue that manages this JobSet. Kueue admits all replicated jobs together as a single unit.
    • workers: A replicated job that runs 4 parallel worker pods, each requesting 1 GPU.
    • driver: A single driver pod that coordinates the workers.
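
    Until the ClusterQueue can admit the whole group, Kueue keeps the JobSet suspended. You can check the suspend flag directly (a sketch; assumes the JobSet above):

    kubectl get jobset training-jobset -n team-ml -o jsonpath='{.spec.suspend}'

    A value of true means the JobSet is queued but not yet admitted; Kueue sets it to false on admission.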
  2. Apply the JobSet:

    kubectl apply -f training-jobset.yaml
  3. Monitor the JobSet admission:

    kubectl get workloads -n team-ml
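
    Kueue creates a Workload object for the JobSet; its Admitted condition shows whether quota has been reserved. Because the Workload name is generated, one way to inspect it is to describe the namespace's workloads and look for the Admitted condition:

    kubectl describe workloads -n team-ml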
  4. Check the JobSet status:

    kubectl get jobset training-jobset -n team-ml
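  5. Optional: wait for the JobSet to finish. The Completed condition type comes from the JobSet API:

    kubectl wait jobset/training-jobset -n team-ml --for=condition=Completed --timeout=30m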