Installation

Prerequisites

Common requirements

  1. ACP version: v4.0 or later.
  2. Cluster administrator access to the target ACP cluster.
  3. Supported NPU hardware. NPU worker nodes must carry one of:
    • Ascend 910B
    • Ascend 310P
  4. The Alauda Build of Node Feature Discovery cluster plugin must be installed (see Step 4). The operator reads NPU presence and kernel/OS labels from NFD to decide which driver image to pull on each node.

MindIO SDK (optional)

If you plan to enable MindIO TFT or MindIO ACP, separately stage the matching MindIO SDK package on each NPU node under /opt/openFuyao/mindio/. Skip this step otherwise.

Procedure

Step 1: Sync driver images and configure ImageWhiteList

TIP

Skip this entire step if your NPU nodes already have a Huawei driver installed out-of-band (typically via the .run package, at /usr/local/Ascend/driver or /var/lib/Ascend/driver). In that case, disable Driver in Step 5.3 — the operator will configure CDI / device-plugin / runtime against the host's existing driver and never pull a driver image.

WARNING

This is the most common cause of a failed install — do it first. The driver image is pulled at runtime by the driver DaemonSet based on each node's kernel label, and lives at the mlops/ascend-driver path — not bundled with the operator. If the matching tag isn't in your cluster registry, or isn't listed in an ImageWhiteList, the DaemonSet stays in ImagePullBackOff and the operator never reaches Ready. If no tag matches your kernel at the source Docker Hub repository, contact Customer Support to have one built — see §1.1.

INFO

You do NOT need to edit /etc/containerd/config.toml manually to enable CDI. The operator ships an ascend-runtime-containerd DaemonSet that runs on every NPU node and idempotently flips enable_cdi = true and adds the default cdi_spec_dirs (/var/run/cdi, /etc/cdi) in containerd's config, then SIGHUPs containerd. This works on both containerd 1.7.x (where CDI is off by default) and 2.x (where it is on by default).

CAUTION

Upgrading the host's containerd package may replace /etc/containerd/config.toml with the package's default template, which reverts the operator's enable_cdi = true patch. CDI device injection then fails silently for any NPU pod scheduled afterward. Restart the ascend-runtime-containerd DaemonSet so it re-applies the patch:

kubectl -n npu-operator rollout restart ds/ascend-runtime-containerd

(Replace npu-operator with your install namespace if you chose a different one.)
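
To check from a node's shell whether the patch is currently in place, a small test along these lines may help. This is only a sketch: the function name cdi_enabled is hypothetical, and it looks for an enable_cdi = true line in the file rather than parsing the TOML.

```shell
# Hypothetical helper: succeed if the given containerd config sets enable_cdi = true.
# Defaults to /etc/containerd/config.toml. This is a plain line match, not a TOML parser.
cdi_enabled() {
  grep -Eq '^[[:space:]]*enable_cdi[[:space:]]*=[[:space:]]*true' \
    "${1:-/etc/containerd/config.toml}"
}

# Usage on an NPU node:
#   cdi_enabled || echo "enable_cdi is not set; restart ds/ascend-runtime-containerd"
```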

1.1 Pull driver images from Docker Hub

The driver image is shipped independently on Docker Hub at docker.io/alaudadockerhub/ascend-driver — not as part of the operator bundle — because the driver .ko binaries are kernel-specific and the kernel list grows over time. Each tag follows the pattern <HDK>-<chip>-<kernel>-<os-stem>, for example 25.5.0-910b-6.6.0-145.0.4.135-oe2403sp3. Pick the tag matching your nodes' uname -r and chip.

WARNING

No matching tag for your kernel? Do not try to compile a driver image yourself or fall back to a .run install on the host — contact Customer Support with the output of uname -r and your chip model (e.g. Ascend 910B4, Ascend 310P3). A new tag for your kernel will be built and published to the same Docker Hub repository, no operator code changes required. Proceeding without a matching image will leave the driver DaemonSet in ImagePullBackOff indefinitely.

List available tags:

curl -s 'https://hub.docker.com/v2/repositories/alaudadockerhub/ascend-driver/tags/?page_size=100' \
  | jq -r '.results[].name'
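
When the tag list is long, it can be narrowed to one chip and kernel with a filter like the sketch below. The function name filter_driver_tags is hypothetical, and the kernel argument is assumed to be the version string exactly as it appears inside a tag (6.6.0-145.0.4.135 in the example above); how uname -r output maps onto that string is not specified here, so compare the two by eye.

```shell
# Hypothetical helper: read one tag per line on stdin and keep only those
# that contain both "-<chip>-" and "-<kernel>-" as literal substrings.
filter_driver_tags() {
  chip="$1"
  kernel="$2"
  grep -F -- "-${chip}-" | grep -F -- "-${kernel}-"
}

# Usage, piping the tag list from the command above:
#   curl -s '<tags URL>' | jq -r '.results[].name' | filter_driver_tags 910b 6.6.0-145.0.4.135
```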

On a machine with internet access, mirror each selected tag into your cluster registry:

TAG='25.5.0-<chip>-<kernel>-<os-stem>'   # replace with a real tag from the list above

skopeo copy --all \
  "docker://docker.io/alaudadockerhub/ascend-driver:$TAG" \
  "docker://<your-cluster-registry>/mlops/ascend-driver:$TAG"

The operator configuration defaults spec.driver.image.repository to mlops/ascend-driver; override it in the deployment form if your registry uses a different namespace.
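
If you need several tags (for example one per chip and kernel combination), the copy can be wrapped in a loop. The following is a sketch under assumptions: mirror_driver_tags and the DRY_RUN switch are hypothetical names, and the destination path assumes the default mlops/ascend-driver repository.

```shell
# Hypothetical wrapper around the skopeo command above.
# Arguments: <cluster-registry> <tag> [<tag> ...]
# Set DRY_RUN=1 to print each command instead of running it.
mirror_driver_tags() {
  registry="$1"
  shift
  for tag in "$@"; do
    ${DRY_RUN:+echo} skopeo copy --all \
      "docker://docker.io/alaudadockerhub/ascend-driver:${tag}" \
      "docker://${registry}/mlops/ascend-driver:${tag}"
  done
}
```

Running it once with DRY_RUN=1 lets you review the source and destination references before any image is transferred.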

1.2 Allow the driver image in ImageWhiteList

ACP gates which images pods are allowed to pull. Every driver-image tag the DaemonSet may pull must be listed explicitly in an ImageWhiteList.

Create one (or extend the existing ascend-driver ImageWhiteList in cpaas-system):

apiVersion: app.alauda.io/v1alpha1
kind: ImageWhiteList
metadata:
  name: ascend-driver
  namespace: cpaas-system
spec:
  repoList:
  # one entry per tag you mirrored in step 1.1 — full image reference, not bare repo path
  - <your-cluster-registry>/mlops/ascend-driver:25.5.0-910b-<kernel>-<os-stem>
  - <your-cluster-registry>/mlops/ascend-driver:25.5.0-310p-<kernel>-<os-stem>

Add one repoList entry per <chip, kernel> tag you mirrored. Each entry is a full image reference including the tag (the API does not accept bare repository paths). If you override spec.driver.image.repository later, list the new path instead.
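
To avoid typing each entry by hand, the repoList lines can be generated from the same tag list you mirrored. This is a minimal sketch; repolist_entries is a hypothetical helper, and it assumes the default mlops/ascend-driver repository path.

```shell
# Hypothetical helper: print one repoList entry (full image reference) per tag,
# indented to sit under "repoList:" in the manifest above.
# Arguments: <cluster-registry> <tag> [<tag> ...]
repolist_entries() {
  registry="$1"
  shift
  for tag in "$@"; do
    printf '  - %s/mlops/ascend-driver:%s\n' "$registry" "$tag"
  done
}
```

Paste the output under spec.repoList and review it before applying the manifest.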

TIP

If your platform doesn't enforce ImageWhiteList (Allow policy by default), this sub-step is a no-op — kubelet still authenticates to the registry, so credentials gate the actual pull.

1.3 Verify

On each NPU node, the kubelet should be able to pull a driver image (use the tag matching the node's kernel):

crictl pull <your-cluster-registry>/mlops/ascend-driver:$TAG

If this succeeds, Step 1 is complete. A later ImagePullBackOff on the driver DaemonSet usually means a missing tag in the registry, a registry credential issue, or a missing repoList entry in the ImageWhiteList.

Step 2: Download packages

INFO

From the Marketplace on the Customer Portal website, download:

  • The Alauda Build of NPU Operator package (delivered as an OLM OperatorBundle).
  • The Alauda Build of Node Feature Discovery cluster plugin package.
  • (Optional) The Volcano cluster plugin package — only needed if you plan to enable the ClusterD component during deployment.

Step 3: Upload packages

The platform provides the violet command-line tool for uploading both operator packages and cluster plugin packages downloaded from the Customer Portal Marketplace.

For details, see Upload Packages.

Step 4: Install the Node Feature Discovery cluster plugin

Alauda Build of Node Feature Discovery is a cluster plugin, not an operator. Install it first because the NPU Operator depends on its node labelling.

  1. Navigate to Administrator > Marketplace > Cluster Plugins.
  2. Switch to the target cluster.
  3. Locate Alauda Build of Node Feature Discovery and click Install.

TIP

The Volcano cluster plugin can be left uninstalled for now. Install it from the same Cluster Plugins page only if you later enable the ClusterD component of the NPU Operator.

Step 5: Install the Alauda Build of NPU Operator

Alauda Build of NPU Operator is delivered as an operator (OLM bundle). Installation has two distinct sub-steps on the platform UI:

  1. Install the operator — the OperatorHub flow only brings up the operator's controller pods (npu-operator-controller-manager + npu-operator). It does not deploy any driver, device plugin, or other NPU components.
  2. Create an NPUOperatorCtl instance — only at this step do you fill in the deployment form, and only after the instance is created do the controller pods start reconciling and rolling out the NPU components onto the nodes.

5.1 Label nodes

Apply the label masterselector=dls-master-node to all master nodes and the label workerselector=dls-worker-node to the worker nodes that should host NPU components:

kubectl label nodes <master-node-id> masterselector=dls-master-node
kubectl label nodes <worker-node-id> workerselector=dls-worker-node
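
Before moving on, it may be worth confirming that the labels landed on the intended nodes. The sketch below uses kubectl's standard label selector; the function name list_labelled_nodes is hypothetical.

```shell
# Hypothetical helper: list the nodes that carry each selector label.
list_labelled_nodes() {
  kubectl get nodes -l masterselector=dls-master-node
  kubectl get nodes -l workerselector=dls-worker-node
}
```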

5.2 Install the operator

  1. Navigate to Administrator > Marketplace > OperatorHub, switch to the target cluster, and locate the Alauda Build of NPU Operator entry. If the status is Absent, confirm the operator package was uploaded with violet in Step 3.

  2. Click the operator to open its details page, then click Install.

  3. On the install page, leave Channel unchanged, confirm Version, leave Installation Location as npu-operator (the default; all NPU components created in the next sub-step land here), and select Manual for Upgrade Strategy. Click Install.

  4. Wait for the subscription to reach Succeeded. The Alauda Build of NPU Operator tile should transition from Installing to Installed, and kubectl -n npu-operator get pod will show the two controller pods (npu-operator and npu-operator-controller-manager) Running.

WARNING

At this point no driver pod, device plugin, or other NPU pod is running yet. The controller pods are idle and waiting for an NPUOperatorCtl instance. If you stop here the NPU nodes will not be configured.

5.3 Create the NPUOperatorCtl instance

The deployment form opens when you create the instance, not when you install the operator above.

  1. On the Installed Operators page, click the Alauda Build of NPU Operator tile, then click Create Instance (or open the NPUOperatorCtl tab and click Create NPUOperatorCtl).

  2. Fill in the form (see the table below) and click Create.

  3. The operator immediately reconciles: driver / device plugin / runtime sidecar / exporter / rebooter DaemonSets land on every NPU node, and the controller updates the NPUOperatorCtl status.conditions to Deployed=True / UpgradeSuccessful once everything is up.

Deployment form parameter description:

WARNING

If a component listed in the table below is already installed on the cluster by some other path (for example a hand-rolled Ascend Operator), disable the corresponding switch here so the NPU Operator does not fight it.

TIP

Ascend Operator, NodeD, ClusterD, Resilience Controller, MindIO TFT, and MindIO ACP are not deployed by default. Deploy them only when you have a clear need for them.

| Component | Default | Description |
| --- | --- | --- |
| Driver | Enabled | Whether the operator manages the Ascend driver. Disable on nodes that already have a Huawei .run driver installed (at /usr/local/Ascend/driver or /var/lib/Ascend/driver); the operator then skips driver staging and upgrades and reuses the host's existing tree for CDI / device-plugin / runtime. |
| Driver Version | 25.5.0 | Driver and firmware HDK version. Pick a version that you have a matching pre-staged image for (see Step 1). Currently supported: 25.5.0 (default), 25.3.RC1. Hidden when Driver is disabled. |
| Auto Driver Upgrade Reboot | Disabled | When a driver upgrade needs a node reboot, reboot automatically (cordon + drain first). Off (recommended for production) emits a RebootRequired Event and waits for an administrator to approve via the node annotation npu.openfuyao.com/approve-reboot=true. See Driver upgrade and self-healing. Hidden when Driver is disabled. |
| Auto Chip-Failure Recovery Reboot | Disabled | When the driver health-watch detects a wedged chip at runtime, reboot the node automatically to recover it (cordon + drain first). Off (default) emits a RebootRequired Event and waits for the same admin annotation. Turn this On only for inference clusters whose clients can retry requests; keep it Off for long-running training jobs. Hidden when Driver is disabled. |
| Ascend Device Plugin | Enabled | Whether to install Ascend Device Plugin. |
| Ascend Docker Runtime | Enabled | Whether to install the container-runtime CDI generator. In v1.2.4 this component runs npu-container-toolkit generate-cdi --watch as a sidecar that emits the CDI spec for the device plugin to reference, so workloads no longer need runtimeClassName: ascend. |
| NPU Exporter | Enabled | Whether to install NPU Exporter. |
| Ascend Operator | Disabled | Whether to install Ascend Operator. |
| NodeD | Disabled | Whether to install NodeD. |
| ClusterD | Disabled | Whether to install ClusterD. Requires the Volcano cluster plugin to be installed first. |
| Resilience Controller | Disabled | Whether to install Resilience Controller. |
| MindIO TFT | Disabled | Whether to install MindIO TFT. |
| MindIO ACP | Disabled | Whether to install MindIO ACP. |
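
With both automatic reboot switches left Off, an administrator approves a pending reboot by setting the node annotation named in the table. A minimal sketch using the standard kubectl annotate command follows; approve_reboot is a hypothetical name, and whether the operator clears the annotation after the reboot is not stated here, so --overwrite is included as a precaution.

```shell
# Hypothetical helper: approve a pending reboot on one node by setting the
# annotation the operator waits for. Argument: <node-name>
approve_reboot() {
  kubectl annotate node "$1" npu.openfuyao.com/approve-reboot=true --overwrite
}
```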

Verification

  1. Confirm the NPUOperatorCtl instance is reconciling cleanly:

    kubectl -n npu-operator get npuoperatorctl

    The Deployed condition should be True with reason UpgradeSuccessful. (If you installed into a different namespace, replace npu-operator in this and all subsequent commands.)

  2. Wait for the npu-driver pod to become Running. First-time install takes a few minutes for the driver image to pull and the modules to be inserted into the host kernel:

    kubectl -n npu-operator get pod -w | grep npu-driver

  3. Check that the NPU node is now reporting allocatable Ascend devices:

    kubectl get node ${nodeName} -o jsonpath='{.status.allocatable}'
    # Example output includes:
    #   "huawei.com/Ascend910":"8"   (910B nodes; specific value depends on card count)
    #   "huawei.com/Ascend310P":"1"  (310P nodes)

  4. (Optional) Run npu-smi info on the host. The operator does not symlink npu-smi into the host PATH (/usr is read-only on KubeOS), so call the binary directly with its libraries loaded:

    LD_LIBRARY_PATH=/var/lib/Ascend/driver/lib64/driver:/var/lib/Ascend/driver/lib64/common \
      /var/lib/Ascend/driver/tools/npu-smi info

    Each card should report Health: OK and a non-zero Bus-Id.

  5. Validate end-to-end with a sample NPU workload. v1.2.4 no longer requires runtimeClassName: ascend — the resource request alone triggers CDI device injection. In air-gapped or image-whitelist-enforced clusters, mirror the sample image to your cluster registry first, or replace it with an equivalent internal test image that includes npu-smi.

    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: npu-smoke
    spec:
      restartPolicy: Never
      containers:
      - name: probe
        image: ascendai/pytorch:ubuntu-python3.8-cann8.0.rc1.beta1-pytorch2.1.0
        command: ["bash", "-c"]
        args:
          - |
            ls /dev/davinci*
            npu-smi info
            sleep 3600
        resources:
          limits:
            huawei.com/Ascend910: 1   # Change to huawei.com/Ascend310P on a 310P node
    EOF
    
    kubectl logs npu-smoke

    The pod should reach Running. ls /dev/davinci* should show /dev/davinci_manager plus one per-card device node (e.g. /dev/davinci0), and npu-smi info should print the card's status. Both confirm that CDI injected the device into the container.
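
Parts of the checks above can be scripted with kubectl wait. The sketch below rests on assumptions: the function names and timeouts are hypothetical, and it presumes the NPUOperatorCtl status exposes Deployed as a standard condition that kubectl wait can match. The second function also deletes the smoke pod, which would otherwise keep its NPU allocated for the full sleep 3600.

```shell
# Hypothetical helper: block until every NPUOperatorCtl in the namespace
# reports the Deployed condition. Argument: [namespace], default npu-operator.
wait_deployed() {
  kubectl -n "${1:-npu-operator}" wait npuoperatorctl --all \
    --for=condition=Deployed --timeout=10m
}

# Hypothetical helper: wait for the smoke pod, print its output, then remove it.
smoke_check() {
  kubectl wait --for=condition=Ready pod/npu-smoke --timeout=5m &&
    kubectl logs npu-smoke &&
    kubectl delete pod npu-smoke
}
```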

Step 6: Verify monitoring

If the NPU Exporter component was deployed when installing the Alauda Build of NPU Operator, the operator automatically deploys a ServiceMonitor named npu-exporter-servicemonitor in the operator namespace, wired up to the npu-exporter Service. No manual ServiceMonitor creation is required. You can verify it with:

kubectl -n npu-operator get servicemonitor npu-exporter-servicemonitor

To get a Grafana dashboard, import the JSON file by following Import Dashboard.

The JSON file is available in ascend-npu-dashboard.

NOTE

Tags in the Grafana dashboard JSON file cannot contain non-ASCII characters; remove any such tags before importing. For example:

{
  "tags": [
    "ascend",
    "昇腾"
  ]
}

After modification:

{
  "tags": [
    "ascend"
  ]
}
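
To locate the offending lines in a large dashboard file, a byte-level grep can help. This sketch only reports line numbers; the edit itself is still made by hand as shown above. The function name find_non_ascii is hypothetical, and the check flags any line containing a byte outside printable ASCII and whitespace, wherever it occurs in the file, not only inside "tags".

```shell
# Hypothetical helper: print "<line-number>:<line>" for every line of the file
# that contains a byte outside printable ASCII and whitespace.
# LC_ALL=C makes grep treat the file as raw bytes; -a forces text mode.
find_non_ascii() {
  LC_ALL=C grep -an '[^[:print:][:space:]]' "$1"
}
```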

What's next

FAQ

Where is npu-smi installed on the host?

In v1.2.4 the driver pod stages the Huawei tools tree to /var/lib/Ascend/driver/, so the binary is at /var/lib/Ascend/driver/tools/npu-smi. No host PATH symlink is created (KubeOS keeps /usr read-only). Call it with the matching LD_LIBRARY_PATH:

LD_LIBRARY_PATH=/var/lib/Ascend/driver/lib64/driver:/var/lib/Ascend/driver/lib64/common \
  /var/lib/Ascend/driver/tools/npu-smi info

If you prefer a PATH-resolvable command, write a small wrapper into a writable location (e.g. /opt/bin/npu-smi) that exports LD_LIBRARY_PATH and execs the real binary.
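
One way to write such a wrapper is sketched below. The function name install_npu_smi_wrapper is hypothetical, /opt/bin/npu-smi is just the example location mentioned above, and the destination directory is assumed to exist and be writable.

```shell
# Hypothetical helper: write an executable wrapper that loads the driver
# libraries and then hands over to the real npu-smi binary.
# Argument: [destination path], default /opt/bin/npu-smi
install_npu_smi_wrapper() {
  dest="${1:-/opt/bin/npu-smi}"
  cat > "$dest" <<'EOF'
#!/bin/sh
# Wrapper: set the library path, then exec the real npu-smi with all arguments.
export LD_LIBRARY_PATH=/var/lib/Ascend/driver/lib64/driver:/var/lib/Ascend/driver/lib64/common
exec /var/lib/Ascend/driver/tools/npu-smi "$@"
EOF
  chmod +x "$dest"
}
```

If the destination directory is on PATH, npu-smi info then behaves like the direct invocation shown above.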

Do I still need runtimeClassName: ascend on workload pods?

No. v1.2.4 uses CDI for device injection: requesting huawei.com/Ascend910 (or Ascend310P) is enough. Existing manifests that still set runtimeClassName: ascend continue to work — the RuntimeClass is kept for backwards compatibility — but no new manifests need it.

What should I watch for when uninstalling Alauda Build of NPU Operator?

Uninstalling the operator removes the driver DaemonSet, but the driver modules already loaded into the host kernel stay loaded — rmmod would risk leaving the chip in an unrecoverable state. To fully remove the driver from a host, reboot the node after the operator is uninstalled; the modules will not auto-reload because the DaemonSet is gone.

Files staged to the host can be cleaned up manually if needed:

rm -rf /var/lib/Ascend /var/lib/ascend /home/bios/driver /etc/ascend_install.info /run/ascend

Run this command only after the operator and its driver pod have been removed and the node has been rebooted. If a reboot is already planned, you can also run it before the reboot.