Installation
Prerequisites
Common requirements
- ACP version: v4.0 or later.
- Cluster administrator access to the target ACP cluster.
- Supported NPU hardware. NPU worker nodes must carry one of:
  - Ascend 910B
  - Ascend 310P
- The Alauda Build of Node Feature Discovery cluster plugin must be installed. The operator reads NPU presence and kernel/OS labels from NFD to decide which driver image to pull on each node.
MindIO SDK (optional)
If you plan to enable MindIO TFT or MindIO ACP, separately stage the matching MindIO SDK package on each NPU node under /opt/openFuyao/mindio/. Skip this step otherwise.
Procedure
Step 1: Sync driver images and configure ImageWhiteList
Skip this entire step if your NPU nodes already have a Huawei driver installed out-of-band (typically via the .run package, at /usr/local/Ascend/driver or /var/lib/Ascend/driver). In that case, disable Driver in Step 5.3 — the operator will configure CDI / device-plugin / runtime against the host's existing driver and never pull a driver image.
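To decide whether the skip condition applies, you could check each NPU node for one of the two driver directories named above. The following is a minimal sketch, not an official check: the function name and its optional root-prefix argument are hypothetical, and a directory left behind by an earlier operator-managed driver would match as well.

```shell
# Print the first out-of-band driver directory that exists, if any.
# The two candidate paths come from the text above; the optional
# root prefix (first argument) is a hypothetical aid for dry runs.
find_host_driver() {
  root="${1:-}"
  for dir in /usr/local/Ascend/driver /var/lib/Ascend/driver; do
    if [ -d "${root}${dir}" ]; then
      echo "${dir}"
      return 0
    fi
  done
  return 1
}

if driver_dir="$(find_host_driver)"; then
  echo "Host driver found at ${driver_dir}: consider skipping Step 1 and disabling Driver in Step 5.3."
else
  echo "No host driver directory found: continue with Step 1."
fi
```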
A missing or non-whitelisted driver image is the most common cause of a failed install, so complete this step first. The driver image is not bundled with the operator: the driver DaemonSet pulls it at runtime from the mlops/ascend-driver path, choosing the tag from each node's kernel label. If the matching tag isn't in your cluster registry, or isn't listed in an ImageWhiteList, the DaemonSet stays in ImagePullBackOff and the operator never reaches Ready. If no tag at the source Docker Hub repository matches your kernel, contact Customer Support to have one built (see §1.1).
You do NOT need to edit /etc/containerd/config.toml manually to enable CDI. The operator ships an ascend-runtime-containerd DaemonSet that runs on every NPU node and idempotently flips enable_cdi = true and adds the default cdi_spec_dirs (/var/run/cdi, /etc/cdi) in containerd's config, then SIGHUPs containerd. This works on both containerd 1.7.x (where CDI is off by default) and 2.x (where it is on by default).
Upgrading the host's containerd package may replace /etc/containerd/config.toml with the package's default template, which reverts the operator's enable_cdi = true patch. CDI device injection then fails silently for any NPU pod scheduled afterward. Restart the containerd-config sidecar so it re-applies the patch:
(Replace npu-operator with your install namespace if you chose a different one.)
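The check and restart described above might look like the following sketch. It assumes the DaemonSet object is named exactly ascend-runtime-containerd, as the text calls it; confirm the name with kubectl -n npu-operator get daemonset before relying on it. Both function names are hypothetical.

```shell
# Return 0 if the containerd config has an uncommented "enable_cdi = true".
# Run this on the NPU node; the default path is the one named in the text.
cdi_enabled() {
  grep -Eq '^[[:space:]]*enable_cdi[[:space:]]*=[[:space:]]*true' \
    "${1:-/etc/containerd/config.toml}"
}

# Restart the DaemonSet so its pods re-apply the CDI patch.
# Assumption: the DaemonSet is named ascend-runtime-containerd.
restart_cdi_patcher() {
  kubectl -n "${1:-npu-operator}" rollout restart daemonset/ascend-runtime-containerd
}
```

On a node, run cdi_enabled || echo "CDI patch missing" after a containerd upgrade; from a machine with cluster access, run restart_cdi_patcher (optionally passing your install namespace).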
1.1 Pull driver images from Docker Hub
The driver image is shipped independently on Docker Hub at docker.io/alaudadockerhub/ascend-driver — not as part of the operator bundle — because the driver .ko binaries are kernel-specific and the kernel list grows over time. Each tag follows the pattern <HDK>-<chip>-<kernel>-<os-stem>, for example 25.5.0-910b-6.6.0-145.0.4.135-oe2403sp3. Pick the tag matching your nodes' uname -r and chip.
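To make the tag pattern concrete, here is a small sketch that splits a tag into its four parts using shell parameter expansion. It assumes the HDK, chip, and os-stem parts never contain a hyphen (the kernel part does, as the example shows); the function name is hypothetical.

```shell
# Split a driver tag of the form <HDK>-<chip>-<kernel>-<os-stem>.
# Assumption: only the kernel part contains hyphens.
parse_driver_tag() {
  tag="$1"
  hdk="${tag%%-*}"        # text before the first hyphen
  rest="${tag#*-}"        # drop the HDK part
  chip="${rest%%-*}"      # next hyphen-delimited field
  rest="${rest#*-}"       # drop the chip part
  os_stem="${rest##*-}"   # text after the last hyphen
  kernel="${rest%-*}"     # everything in between
  echo "hdk=${hdk} chip=${chip} kernel=${kernel} os=${os_stem}"
}

parse_driver_tag "25.5.0-910b-6.6.0-145.0.4.135-oe2403sp3"
# prints: hdk=25.5.0 chip=910b kernel=6.6.0-145.0.4.135 os=oe2403sp3
```

Compare the kernel and os fields against the output of uname -r on the node when choosing a tag.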
No matching tag for your kernel? Do not try to compile a driver image yourself or fall back to a .run install on the host — contact Customer Support with the output of uname -r and your chip model (e.g. Ascend 910B4, Ascend 310P3). A new tag for your kernel will be built and published to the same Docker Hub repository, no operator code changes required. Proceeding without a matching image will leave the driver DaemonSet in ImagePullBackOff indefinitely.
List available tags:
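One way to list the tags is with skopeo, shown here as a sketch; skopeo is an assumption of this example rather than a tool the procedure requires, and the function names are hypothetical. skopeo list-tags prints JSON with one tag per line, so a plain grep is enough to narrow the list to a kernel version.

```shell
# List every tag in the Docker Hub driver repository named above.
list_driver_tags() {
  skopeo list-tags docker://docker.io/alaudadockerhub/ascend-driver
}

# Keep only the lines that contain the given kernel version fragment.
filter_tags_by_kernel() {
  grep -F -- "$1"
}
```

For example, list_driver_tags | filter_tags_by_kernel "6.6.0-145" shows the tags built for kernels in that series.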
On a machine with internet access, mirror each selected tag into your cluster registry:
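The mirroring step could be sketched with skopeo copy as below. The tool choice, the function name, and the example registry address are assumptions; the target path mlops/ascend-driver is the default repository the operator expects. A docker pull, docker tag, docker push sequence would achieve the same result.

```shell
# Copy one driver tag from Docker Hub into the cluster registry.
# $1: registry host[:port] of your cluster registry (hypothetical example below)
# $2: driver tag to mirror
mirror_driver_tag() {
  registry="$1"
  tag="$2"
  skopeo copy \
    "docker://docker.io/alaudadockerhub/ascend-driver:${tag}" \
    "docker://${registry}/mlops/ascend-driver:${tag}"
}
```

Example call: mirror_driver_tag registry.example.com 25.5.0-910b-6.6.0-145.0.4.135-oe2403sp3. Depending on your registry you may need to log in first or pass credentials to skopeo.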
The operator configuration defaults spec.driver.image.repository to mlops/ascend-driver; override it in the deployment form if your registry uses a different namespace.
1.2 Allow the driver image in ImageWhiteList
ACP gates which images pods are allowed to pull. Every driver-image tag the DaemonSet may pull must be listed explicitly in an ImageWhiteList.
Create one (or extend the existing ascend-driver ImageWhiteList in cpaas-system):
Add one repoList entry per <chip, kernel> tag you mirrored. Each entry is a full image reference including the tag (the API does not accept bare repository paths). If you override spec.driver.image.repository later, list the new path instead.
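When several tags were mirrored, the entries can be generated rather than typed. This sketch only builds the image references; it does not produce the ImageWhiteList manifest itself, and whether an entry should carry a registry host prefix depends on how your existing ImageWhiteList is written. The function name is hypothetical.

```shell
# Print one full image reference (repository:tag) per mirrored tag.
# $1: repository path as it should appear in repoList
# remaining arguments: the tags you mirrored
repo_list_entries() {
  repository="$1"
  shift
  for tag in "$@"; do
    echo "${repository}:${tag}"
  done
}

repo_list_entries mlops/ascend-driver 25.5.0-910b-6.6.0-145.0.4.135-oe2403sp3
# prints: mlops/ascend-driver:25.5.0-910b-6.6.0-145.0.4.135-oe2403sp3
```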
If your platform doesn't enforce ImageWhiteList (Allow policy by default), this sub-step is a no-op — kubelet still authenticates to the registry, so credentials gate the actual pull.
1.3 Verify
On each NPU node, the kubelet should be able to pull a driver image (use the tag matching the node's kernel):
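A pull test on the node could be sketched with crictl, which talks to the same container runtime the kubelet uses. This is an assumption about tooling: crictl does not reuse the kubelet's image pull secrets, so a private registry may need credentials passed separately. The function name and registry address are hypothetical.

```shell
# Pull one driver tag through the node's container runtime.
# $1: registry host[:port] of your cluster registry
# $2: driver tag matching this node's kernel
verify_driver_pull() {
  crictl pull "$1/mlops/ascend-driver:$2"
}
```

Example call on the node: verify_driver_pull registry.example.com 25.5.0-910b-6.6.0-145.0.4.135-oe2403sp3.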
If this succeeds, Step 1 is complete. A later ImagePullBackOff on the driver DaemonSet usually means a missing tag in the registry, a registry credential issue, or a missing repoList entry in the ImageWhiteList.
Step 2: Download packages
From the Marketplace on the Customer Portal website, download:
- The Alauda Build of NPU Operator operator package (delivered as an OLM OperatorBundle).
- The Alauda Build of Node Feature Discovery cluster plugin package.
- (Optional) The Volcano cluster plugin package — only needed if you plan to enable the ClusterD component during deployment.
Step 3: Upload packages
The platform provides the violet command-line tool for uploading both operator packages and cluster plugin packages downloaded from the Customer Portal Marketplace.
For details, see Upload Packages.
Step 4: Install the Node Feature Discovery cluster plugin
Alauda Build of Node Feature Discovery is a cluster plugin, not an operator. Install it first because the NPU Operator depends on its node labelling.
- Navigate to Administrator > Marketplace > Cluster Plugins.
- Switch to the target cluster.
- Locate Alauda Build of Node Feature Discovery and click Install.
The Volcano cluster plugin can be left uninstalled for now. Install it from the same Cluster Plugins page only if you later enable the ClusterD component of the NPU Operator.
Step 5: Install the Alauda Build of NPU Operator
Alauda Build of NPU Operator is delivered as an operator (OLM bundle). Installation has two distinct sub-steps on the platform UI:
- Install the operator: the OperatorHub flow only brings up the operator's controller pods (npu-operator-controller-manager + npu-operator). It does not deploy any driver, device plugin, or other NPU components.
- Create an NPUOperatorCtl instance: only at this step do you fill in the deployment form, and only after the instance is created do the controller pods start reconciling and rolling out the NPU components onto the nodes.
5.1 Label nodes
Apply the label masterselector=dls-master-node to all master nodes and the label workerselector=dls-worker-node to the worker nodes that should host NPU components:
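The labelling can be done with kubectl label, sketched below. The label keys and values are the ones given above; the function names are hypothetical, and --overwrite is added so the commands can be re-run safely.

```shell
# Label a master node with the selector described in this step.
label_master() {
  kubectl label node "$1" masterselector=dls-master-node --overwrite
}

# Label a worker node that should host NPU components.
label_worker() {
  kubectl label node "$1" workerselector=dls-worker-node --overwrite
}
```

Call label_master once per master node and label_worker once per NPU worker node, passing the node name shown by kubectl get node.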
5.2 Install the operator
- Navigate to Administrator > Marketplace > OperatorHub, switch to the target cluster, and locate the Alauda Build of NPU Operator entry. If the status is Absent, confirm the operator package was uploaded with violet in Step 3.
- Click the operator to open its details page, then click Install.
- On the install page, leave Channel unchanged, confirm Version, leave Installation Location as npu-operator (the default; all NPU components created in the next sub-step land here), and select Manual for Upgrade Strategy. Click Install.
- Wait for the subscription to reach Succeeded. The Alauda Build of NPU Operator tile should transition from Installing to Installed, and kubectl -n npu-operator get pod will show the two controller pods (npu-operator and npu-operator-controller-manager) Running.
At this point no driver pod, device plugin, or other NPU pod is running yet. The controller pods are idle and waiting for an NPUOperatorCtl instance. If you stop here the NPU nodes will not be configured.
5.3 Create the NPUOperatorCtl instance
The deployment form opens when you create the instance, not when you install the operator above.
- On the Installed Operators page, click the Alauda Build of NPU Operator tile, then click Create Instance (or open the NPUOperatorCtl tab and click Create NPUOperatorCtl).
- Fill in the form (see the table below) and click Create.
- The operator immediately reconciles: driver / device plugin / runtime sidecar / exporter / rebooter DaemonSets land on every NPU node, and the controller updates the NPUOperatorCtl status.conditions to Deployed=True / UpgradeSuccessful once everything is up.
Deployment form parameter description:
If a component listed in the table below is already installed on the cluster by some other path (for example a hand-rolled Ascend Operator), disable the corresponding switch here so the NPU Operator does not fight it.
Ascend Operator, NodeD, ClusterD, Resilience Controller, MindIO TFT, and MindIO ACP are not deployed by default. Please deploy them only when there is a clear need for them.
Verification
- Confirm the NPUOperatorCtl instance is reconciling cleanly:
  The Deployed condition should be True with reason UpgradeSuccessful. (Replace npu-operator below and in subsequent commands with the namespace you chose at install time if it differs.)
- Wait for the npu-driver pod to become Running. First-time install takes a few minutes for the driver image to pull and the modules to be inserted into the host kernel:
- Check that the NPU node is now reporting allocatable Ascend devices:
- (Optional) Run npu-smi info on the host. The operator does not symlink npu-smi into the host PATH (/usr is read-only on KubeOS), so call the binary directly with its libraries loaded:
  Each card should report Health: OK and a non-zero Bus-Id.
- Validate end-to-end with a sample NPU workload. v1.2.4 no longer requires runtimeClassName: ascend; the resource request alone triggers CDI device injection. In air-gapped or image-whitelist-enforced clusters, mirror the sample image to your cluster registry first, or replace it with an equivalent internal test image that includes npu-smi.
  The pod should reach Running. ls /dev/davinci* should show /dev/davinci_manager plus one per-card device node (e.g. /dev/davinci0), and npu-smi info should print the card's status. Both confirm that CDI injected the device into the container.
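The first three checks above could be scripted roughly as follows. This is a sketch under assumptions: it relies on kubectl resolving the resource from the kind name NPUOperatorCtl (written lowercase here), and on the driver pod name containing npu-driver. The function names and the NS variable are hypothetical.

```shell
# Namespace chosen at install time; npu-operator is the default.
NS="${NS:-npu-operator}"

# Print the status of the Deployed condition (expected: True).
# Assumption: kubectl accepts "npuoperatorctl" as the resource name.
deployed_condition() {
  kubectl -n "$NS" get npuoperatorctl \
    -o jsonpath='{.items[*].status.conditions[?(@.type=="Deployed")].status}'
}

# List pods with their nodes; look for the npu-driver pods in Running state.
driver_pods() {
  kubectl -n "$NS" get pod -o wide
}

# Print the allocatable resources of one node; look for huawei.com/Ascend* keys.
node_allocatable() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable}'
}
```

For example, driver_pods | grep npu-driver narrows the pod list to the driver pods, and node_allocatable followed by a node name shows whether Ascend devices are being advertised.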
Step 6: Verify monitoring
If the NPU Exporter component was deployed when installing the Alauda Build of NPU Operator, the operator automatically deploys a ServiceMonitor named npu-exporter-servicemonitor in the operator namespace, wired up to the npu-exporter Service. No manual ServiceMonitor creation is required. You can verify it with:
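A check along these lines should confirm that the ServiceMonitor exists. The object name comes from the paragraph above; the function name is hypothetical, and the namespace defaults to npu-operator.

```shell
# Look up the ServiceMonitor the operator creates for the NPU Exporter.
check_npu_servicemonitor() {
  kubectl -n "${1:-npu-operator}" get servicemonitor npu-exporter-servicemonitor
}
```

If the command reports that the object is not found, confirm that the NPU Exporter switch was enabled in the deployment form.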
To get a Grafana dashboard, import the JSON file by following Import Dashboard.
The JSON file is available in ascend-npu-dashboard.
Tags in the Grafana dashboard JSON file cannot contain non-ASCII characters and need to be edited out. For example:
After modification:
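To locate the tags that need editing, a sketch like the following lists every line of the dashboard JSON that contains a byte outside printable ASCII and whitespace. The function name is hypothetical, and the sketch assumes a grep that supports the -a option (GNU and BSD grep both do).

```shell
# Print "line-number:content" for each line holding a non-ASCII byte.
# LC_ALL=C makes grep compare single bytes, so every byte of a
# multi-byte UTF-8 character falls outside [:print:] and [:space:].
find_non_ascii_lines() {
  LC_ALL=C grep -an '[^[:print:][:space:]]' "$1"
}
```

Run find_non_ascii_lines against the downloaded JSON file, edit the reported tags so they contain only ASCII characters, and re-run until nothing is printed.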
What's next
- Driver upgrade and self-healing — how to roll the driver version forward and how the chip self-healing path works.
FAQ
Where is npu-smi installed on the host?
In v1.2.4 the driver pod stages the Huawei tools tree to /var/lib/Ascend/driver/, so the binary is at /var/lib/Ascend/driver/tools/npu-smi. No host PATH symlink is created (KubeOS keeps /usr read-only). Call it with the matching LD_LIBRARY_PATH:
If you prefer a PATH-resolvable command, write a small wrapper into a writable location (e.g. /opt/bin/npu-smi) that exports LD_LIBRARY_PATH and execs the real binary.
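The wrapper described above could be generated with a sketch like this. The function name is hypothetical, and the library directories are deliberately left as a parameter: pass the same value you use for LD_LIBRARY_PATH when calling the binary directly.

```shell
# Write a small wrapper that sets LD_LIBRARY_PATH and execs npu-smi.
# $1: library directories to put on LD_LIBRARY_PATH (colon-separated)
# $2: wrapper location (defaults to the example path from the text)
write_npu_smi_wrapper() {
  lib_dirs="$1"
  dest="${2:-/opt/bin/npu-smi}"
  mkdir -p "$(dirname "$dest")"
  cat > "$dest" <<EOF
#!/bin/sh
export LD_LIBRARY_PATH="${lib_dirs}\${LD_LIBRARY_PATH:+:\$LD_LIBRARY_PATH}"
exec /var/lib/Ascend/driver/tools/npu-smi "\$@"
EOF
  chmod +x "$dest"
}
```

After running the function, make sure the wrapper's directory is on PATH; npu-smi info then resolves to the wrapper.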
Do I still need runtimeClassName: ascend on workload pods?
No. v1.2.4 uses CDI for device injection: requesting huawei.com/Ascend910 (or Ascend310P) is enough. Existing manifests that still set runtimeClassName: ascend continue to work — the RuntimeClass is kept for backwards compatibility — but no new manifests need it.
What should I pay attention to when uninstalling Alauda Build of NPU Operator?
Uninstalling the operator removes the driver DaemonSet, but the driver modules already loaded into the host kernel stay loaded — rmmod would risk leaving the chip in an unrecoverable state. To fully remove the driver from a host, reboot the node after the operator is uninstalled; the modules will not auto-reload because the DaemonSet is gone.
Files staged to the host can be cleaned up manually if needed:
Run these only after the operator and its driver pod have been removed, and after the node has been rebooted (or before, if you intend to reboot anyway).