Alauda Build of NVIDIA DRA Driver for GPUs
Introduction
Dynamic Resource Allocation (DRA) is a Kubernetes feature that provides a more flexible and extensible way to request and allocate hardware resources such as GPUs. Unlike traditional device plugins, which only support simple counting of identical resources, DRA enables fine-grained device selection based on device attributes and capabilities.
Alauda Build of NVIDIA DRA Driver for GPUs is delivered as a cluster plugin that brings the upstream NVIDIA DRA driver to your ACP cluster, allowing workloads to claim GPUs through ResourceClaim and ResourceClaimTemplate objects.
Prerequisites
- NVIDIA driver v565+ installed on every GPU node.
- Kubernetes v1.32+.
- ACP v4.1+.
- Cluster administrator access to the target ACP cluster.
- CDI enabled in the underlying container runtime (such as containerd).
- DRA and the corresponding API groups enabled on the cluster.
The sections below walk through enabling CDI and DRA if they are not yet configured.
Installation
Install the NVIDIA driver on GPU nodes
Refer to the NVIDIA CUDA Installation Guide for Linux.
Install the NVIDIA Container Toolkit
Refer to the NVIDIA Container Toolkit installation guide.
Enable CDI in containerd
CDI (Container Device Interface) provides a standard mechanism for device vendors to describe everything required to provide access to a specific resource — such as a GPU — beyond a simple device name.
CDI is enabled by default in containerd 2.0 and later. For earlier versions (from 1.7.0), it must be activated manually.
The following steps are only required on GPU nodes running containerd v1.7.x.
- Edit the containerd configuration file:
  Add or modify the following section:
  NOTE: Setting enable_cdi = true is sufficient. containerd's default cdi_spec_dirs already include /etc/cdi and /var/run/cdi, which is where the NVIDIA Container Toolkit writes its CDI specs. Only set cdi_spec_dirs explicitly if your toolkit is configured to emit specs to a different location.
- Restart containerd and confirm it is running correctly:
- Verify that CDI is enabled:
  If matching log lines appear, CDI was enabled successfully.
Enable DRA in Kubernetes
DRA is enabled by default in Kubernetes 1.34 and later. For earlier versions (from 1.32), it must be activated manually.
The following steps apply to Kubernetes 1.32–1.33. Apply the control-plane changes on all master nodes and the kubelet change on all nodes.
- Edit the kube-apiserver manifest at /etc/kubernetes/manifests/kube-apiserver.yaml.
  For Kubernetes 1.32:
  For Kubernetes 1.33:
- Edit the kube-controller-manager manifest at /etc/kubernetes/manifests/kube-controller-manager.yaml:
- Edit the kube-scheduler manifest at /etc/kubernetes/manifests/kube-scheduler.yaml:
- Edit the kubelet configuration at /var/lib/kubelet/config.yaml on all nodes:
  Restart the kubelet:
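As a rough sketch of the edits described in this section, the fragments below show where the DRA settings would typically go. The DynamicResourceAllocation feature gate and the resource.k8s.io API groups are standard Kubernetes names. The assumption that Kubernetes 1.33 differs from 1.32 only by additionally enabling resource.k8s.io/v1beta2 is ours, so check it against your cluster version before applying anything. The fragments belong to four separate files and are shown in one block only for comparison.

```yaml
# Sketch only: fragments of four separate files, not a single manifest.
# Keep every existing flag and field; add only the DRA-related entries.
---
# /etc/kubernetes/manifests/kube-apiserver.yaml (fragment)
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # ... existing flags stay unchanged ...
    - --feature-gates=DynamicResourceAllocation=true
    # Assumed for Kubernetes 1.32: enable the v1beta1 group only.
    # Assumed for Kubernetes 1.33: also enable v1beta2, as shown here.
    - --runtime-config=resource.k8s.io/v1beta1=true,resource.k8s.io/v1beta2=true
---
# /etc/kubernetes/manifests/kube-controller-manager.yaml (fragment)
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    # ... existing flags stay unchanged ...
    - --feature-gates=DynamicResourceAllocation=true
---
# /etc/kubernetes/manifests/kube-scheduler.yaml (fragment)
spec:
  containers:
  - name: kube-scheduler
    command:
    - kube-scheduler
    # ... existing flags stay unchanged ...
    - --feature-gates=DynamicResourceAllocation=true
---
# /var/lib/kubelet/config.yaml (fragment)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  DynamicResourceAllocation: true
```

If a --feature-gates or --runtime-config flag already exists in a manifest, merge the new value into it as a comma-separated entry rather than adding a second flag. The kubelet recreates the static control-plane pods when their manifests change, whereas the kubelet configuration change takes effect only after the kubelet itself is restarted.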
Download the cluster plugin
The Alauda Build of NVIDIA DRA Driver for GPUs cluster plugin can be retrieved from the Customer Portal. Contact Customer Support for more information.
Upload the cluster plugin
Upload the downloaded package with the violet command-line tool. For details, see Upload Packages.
Install Alauda Build of NVIDIA DRA Driver for GPUs
- Label each GPU node so the nvidia-dra-driver-gpu-kubelet-plugin is scheduled onto it:
  WARNING: On the same node you can set only one of the following labels: gpu=on, nvidia-device-enable=pgpu, or nvidia-device-enable=pgpu-dra.
- Navigate to Administrator > Marketplace > Cluster Plugins, switch to the target cluster, and deploy the Alauda Build of NVIDIA DRA Driver for GPUs cluster plugin.
Verify the DRA setup
- Check the DRA driver and controller pods:
  The output should be similar to:
- Verify the ResourceSlice objects:
  For a GPU node, the output should be similar to:
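The exact output depends on the node and the GPU model. As a purely illustrative sketch of the general shape of a ResourceSlice in the resource.k8s.io/v1beta1 API, the fragment below shows where device attributes such as productName appear. Every value is a hypothetical placeholder rather than real output, and the attribute set shown is an assumption about what the driver publishes.

```yaml
# Illustrative shape only; all values are placeholders, not captured output.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: example-resourceslice          # real names are generated
spec:
  driver: gpu.nvidia.com
  nodeName: example-gpu-node           # hypothetical node name
  pool:
    name: example-gpu-node
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    basic:                             # newer API versions drop this wrapper
      attributes:
        productName:
          string: EXAMPLE-PRODUCT-NAME # the value a CEL selector can match on
        uuid:
          string: GPU-00000000-0000-0000-0000-000000000000
      capacity:
        memory:
          value: 16Gi                  # placeholder capacity
```

The productName string reported for your own devices is the value to use when writing a selector expression for a ResourceClaim or ResourceClaimTemplate.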
Validate the installation
This section assumes that you have completed the installation steps above and that all relevant GPU components are running and in a Ready state. The following workload confirms that the Alauda Build of NVIDIA DRA Driver for GPUs is working end to end.
Run a validation workload
- Create the workload spec. Adjust the selector expression to match a productName reported in your own ResourceSlice output:
- Apply the spec:
- Inspect the container logs:
  The output should show the GPU UUID from inside the container, for example:
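For reference, the workload spec created in the first step might look like the following minimal sketch. It assumes that the resource.k8s.io/v1beta1 API is served, that the driver registers a DeviceClass named gpu.nvidia.com, and that devices expose a productName attribute under that driver name. The object names, the container image, and the product name in the expression are hypothetical placeholders.

```yaml
# Hypothetical validation workload; names, image, and product name are placeholders.
# Assumes resource.k8s.io/v1beta1; v1beta2 and v1 use a different request schema.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            # Replace the placeholder with a productName from your ResourceSlice output.
            expression: "device.attributes['gpu.nvidia.com'].productName == 'EXAMPLE-PRODUCT-NAME'"
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-validation
spec:
  restartPolicy: Never
  containers:
  - name: ctr
    image: ubuntu:22.04
    # nvidia-smi -L lists the visible GPUs together with their UUIDs.
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: gpu                      # refers to the pod-level claim below
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```

The pod requests no GPU through resources.limits. The device is granted only through the claim, so a pod that stays Pending usually points to a selector expression that matches no device in any ResourceSlice.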