Integrate with LeaderWorkerSet

This page shows how to use Alauda Build of Kueue to manage LeaderWorkerSet (LWS) workloads. LeaderWorkerSet is a Kubernetes API for deploying groups of pods in which a single leader coordinates multiple workers.

LeaderWorkerSet is particularly useful for distributed inference and training scenarios where a leader process manages the execution of multiple worker processes.

Prerequisites

  • You have installed the Alauda Build of Kueue.
  • You have installed the LeaderWorkerSet controller.
  • The LeaderWorkerSet framework is enabled in the Kueue configuration.
  • The Alauda Container Platform Web CLI can communicate with your cluster.
  • You have created a ClusterQueue, ResourceFlavor, and LocalQueue.
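
The queue objects from the last prerequisite might look like the following sketch. It assumes the upstream Kueue v1beta1 API; the names default-flavor and team-ml-cq and all quota values are hypothetical, sized only so that the team-ml-queue LocalQueue in the team-ml namespace can fit the example manifest used in the procedure.

```shell
# Write hypothetical queue manifests to a file.
# Object names (default-flavor, team-ml-cq) and quotas are assumptions.
cat > kueue-queues.yaml <<'EOF'
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-ml-cq
spec:
  namespaceSelector: {}   # accept workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 12
      - name: memory
        nominalQuota: 24Gi
      - name: nvidia.com/gpu
        nominalQuota: 4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-ml-queue
  namespace: team-ml
spec:
  clusterQueue: team-ml-cq
EOF
```

Running kubectl apply -f kueue-queues.yaml would then create the three objects, provided the team-ml namespace already exists.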

Procedure

  1. Create a LeaderWorkerSet resource with the kueue.x-k8s.io/queue-name label:

    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    metadata:
      name: lws-training
      namespace: team-ml
      labels:
        kueue.x-k8s.io/queue-name: team-ml-queue
    spec:
      replicas: 2
      leaderWorkerTemplate:
        size: 3
        restartPolicy: RecreateGroupOnPodRestart
        leaderTemplate:
          metadata: {}
          spec:
            containers:
            - name: leader
              image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
              resources:
                requests:
                  cpu: "2"
                  memory: "4Gi"
              command: ["python", "leader.py"]
        workerTemplate:
          metadata: {}
          spec:
            containers:
            - name: worker
              image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
              resources:
                requests:
                  cpu: "2"
                  memory: "4Gi"
                  nvidia.com/gpu: "1"
              command: ["python", "worker.py"]
      rolloutStrategy:
        type: RollingUpdate
        rollingUpdateConfiguration:
          maxSurge: 1
          maxUnavailable: 1
      startupPolicy: LeaderCreated
    1. kueue.x-k8s.io/queue-name: Specifies the LocalQueue that manages this LeaderWorkerSet. Kueue admits all pods of a leader-worker group together.
    2. replicas: 2: Creates 2 leader-worker groups, each managed independently.
    3. size: 3: Each group consists of 1 leader and 2 workers (size = total pods including leader).
  2. Apply the LeaderWorkerSet:

    kubectl apply -f lws-training.yaml
  3. Monitor the admission of the resulting Kueue workloads:

    kubectl get workloads -n team-ml
  4. Check the LeaderWorkerSet status:

    kubectl get leaderworkersets lws-training -n team-ml
  5. View individual pods:

    kubectl get pods -n team-ml -l leaderworkerset.sigs.k8s.io/name=lws-training
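
When sizing the ClusterQueue quota, it can help to work out the totals that the manifest in step 1 requests. The following sketch only reproduces that arithmetic in shell; the variable names are illustrative, and the values are copied from the lws-training manifest.

```shell
# Values copied from the lws-training manifest in step 1.
replicas=2          # number of leader-worker groups
size=3              # pods per group, including the leader
cpu_per_pod=2       # CPU request of the leader and of each worker
mem_gi_per_pod=4    # memory request in Gi of the leader and of each worker
gpu_per_worker=1    # only the workers request a GPU

total_pods=$((replicas * size))
workers_per_group=$((size - 1))
total_cpu=$((total_pods * cpu_per_pod))
total_mem_gi=$((total_pods * mem_gi_per_pod))
total_gpu=$((replicas * workers_per_group * gpu_per_worker))

echo "pods=${total_pods} cpu=${total_cpu} memory=${total_mem_gi}Gi gpus=${total_gpu}"
# prints: pods=6 cpu=12 memory=24Gi gpus=4
```

If the quota available to team-ml-queue is below these totals, some or all of the pods can be expected to stay pending until enough quota is free.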