Managing distributed workloads
As an administrator, you can manage distributed workloads in Alauda AI by configuring quota management with Kueue, setting up resource flavors for GPU nodes, and troubleshooting common issues.
Distributed workloads such as RayJob, RayCluster, and PyTorchJob are created by their respective operators (KubeRay, Kubeflow). Alauda Build of Kueue provides the queueing and admission control layer, deciding when these workloads are allowed to run based on the cluster-wide quotas you configure.
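To illustrate the quota configuration this admission control relies on, the following is a minimal sketch of the three Kueue objects typically involved, assuming the upstream `kueue.x-k8s.io/v1beta1` API; all names, node labels, and quota values here are illustrative, not prescribed by Alauda AI.

```yaml
# Illustrative sketch only: object names, the node label, and quota values
# are assumptions, not values mandated by Alauda AI.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor          # describes a class of nodes, e.g. GPU nodes
metadata:
  name: gpu-flavor
spec:
  nodeLabels:
    nvidia.com/gpu.present: "true"   # hypothetical label selecting GPU nodes
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue            # holds the cluster-wide quota
metadata:
  name: team-a-cluster-queue
spec:
  namespaceSelector: {}       # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: gpu-flavor
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue              # namespaced entry point workloads submit to
metadata:
  namespace: team-a
  name: team-a-queue
spec:
  clusterQueue: team-a-cluster-queue
```

With objects like these in place, a RayJob or PyTorchJob opts into queueing by carrying the `kueue.x-k8s.io/queue-name` label pointing at the LocalQueue; Kueue then admits it only when the ClusterQueue's quota can accommodate its resource requests.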