Set Up a GPU-Enabled Kubernetes Cluster

HPE Machine Learning Data Management leverages Kubernetes Device Plugins to let Kubernetes Pods access specialized hardware such as GPUs. For instructions on how to set up a GPU-enabled Kubernetes cluster through device plugins, see the Kubernetes documentation.

HPE Machine Learning Data Management on NVIDIA DGX A100

Let’s walk through the main steps that allow HPE Machine Learning Data Management to leverage the AI performance of your DGX A100 GPUs.


Scheduling GPU workloads in Kubernetes requires a fair amount of trial and error. To ease the process:

  • This setup page will walk you through very detailed installation steps to prepare your Kubernetes cluster.
  • Take advantage of a user’s past experience in this blog.

Here is a quick recap of what will be needed:

  • Have a working Kubernetes control plane and worker nodes attached to your cluster.
  • Install the DGX system in a hosting environment.
  • Add the DGX to your K8s API server as a worker node.
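The last item in the recap can be confirmed from the node list. The following is a minimal sketch, assuming a hypothetical node name `dgx-a100` (substitute the name your DGX registered under). The helper `node_is_ready` is illustrative only; it reads the output of `kubectl get nodes --no-headers` from standard input, so it can be tried without touching a cluster.

```shell
# Succeed when the named node is listed with STATUS exactly "Ready".
# Expects `kubectl get nodes --no-headers` output on stdin
# (columns: NAME STATUS ROLES AGE VERSION).
node_is_ready() {
  awk -v node="$1" '$1 == node && $2 == "Ready" { found = 1 } END { exit !found }'
}
```

Typical use would be `kubectl get nodes --no-headers | node_is_ready dgx-a100 && echo "DGX node is Ready"`. A node shown as `NotReady` or `Ready,SchedulingDisabled` makes the helper fail, which is the intended behavior for a node that cannot yet accept pods.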

Now that the DGX is added to your API server, you can then proceed to:

  1. Enable the GPU worker node in the Kubernetes cluster by installing NVIDIA’s dependencies:

    Dependency packages and deployment methods may vary. The following list is not exhaustive and is intended to serve as a general guideline.

    • NVIDIA drivers

      For complete instructions on setting up NVIDIA drivers, visit this quickstart guide or check this summary of the steps.

    • NVIDIA Container Toolkit (nvidia-docker2)

      You may need to use different packages depending on your container engine.

    • NVIDIA Kubernetes Device Plugin

      To use GPUs in Kubernetes, the NVIDIA Device Plugin is required. The NVIDIA Device Plugin is a daemonset that enumerates the GPUs on each node of the cluster and allows pods to run on them. Follow these steps to deploy the device plugin as a daemonset using Helm.

    Checkpoint: Run NVIDIA System Management Interface (nvidia-smi) on the CLI. It should return the list of NVIDIA GPUs.

  2. Test a sample container with GPU:

    To test whether CUDA jobs can be deployed, run a sample CUDA (vectorAdd) application.

    For reference, find the pod spec below:

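    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-vector-add
        image: "nvidia/samples:vectoradd-cuda10.2"
        resources:
          limits:
            nvidia.com/gpu: 1  # request one GPU exposed by the NVIDIA device plugin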

    Save it as gpu-pod.yaml then deploy the application:

    kubectl apply -f gpu-pod.yaml

    Check the pod status and logs to make sure that the app completed successfully:

    kubectl get pods gpu-test
    kubectl logs gpu-test

  3. If the container above is scheduled and completes successfully, install HPE Machine Learning Data Management. You are then ready to start leveraging NVIDIA’s GPUs in your HPE Machine Learning Data Management pipelines.
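The device plugin deployment and the two checkpoints above can be scripted. The sketch below is one way to do it, not a definitive procedure. The helper names are hypothetical. The Helm repository URL, chart name, release name, and namespace are the ones used in NVIDIA's k8s-device-plugin README at the time of writing; treat them as assumptions and check the current README before running. The two check helpers read command output from standard input.

```shell
# Deploy the NVIDIA device plugin as a daemonset with Helm.
# Defined as a function so nothing touches the cluster until it is called.
# Repo URL, chart, release, and namespace are assumptions (see lead-in).
deploy_device_plugin() {
  helm repo add nvdp https://nvidia.github.io/k8s-device-plugin &&
  helm repo update &&
  helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin --create-namespace
}

# Count GPUs in `nvidia-smi -L` output read from stdin.
# Each GPU is listed on a line such as
# "GPU 0: NVIDIA A100-SXM4-40GB (UUID: ...)".
count_gpus() {
  grep -c '^GPU [0-9][0-9]*:'
}

# Succeed when the vectorAdd sample reported success.
# Expects `kubectl logs gpu-test` output on stdin.
vectoradd_passed() {
  grep -q 'Test PASSED'
}
```

On the DGX node, `nvidia-smi -L | count_gpus` should print the number of installed GPUs (a DGX A100 ships with eight). After the test pod finishes, `kubectl logs gpu-test | vectoradd_passed && echo "GPU test passed"` confirms that the CUDA sample ran to completion.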


Configure GPUs in Pipelines

Once your GPU-enabled Kubernetes cluster is set up, you can request GPUs in your pipeline specifications by setting GPU resource limits, including the GPU type and the number of GPUs.


By default, HPE Machine Learning Data Management workers are spun up and wait for new input. That works well for pipelines that process a steady stream of incoming commits. With a lower volume of input commits, however, your pipeline workers can hold on to the GPU resource as far as Kubernetes is concerned while sitting idle as far as you are concerned.

  • Make sure to set the autoscaling field to true so that if your pipeline is not getting used, the worker pods get spun down and the GPU resource freed.
  • Additionally, specify how much GPU your pipeline worker needs via the resourceRequests field in your pipeline specification, keeping resourceRequests <= resourceLimits.
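As an illustration of the two recommendations above, the sketch below writes a hypothetical spec fragment to a temporary file and runs a small check against it. The `nvidia.com/gpu` type is an assumption based on the resource name the NVIDIA device plugin advertises, and `lint_gpu_spec` is an illustrative helper, not part of any CLI. It uses plain text matching rather than a JSON parser, so it is a rough guard only.

```shell
# Hypothetical fragment showing where the fields sit in a pipeline spec.
fragment="$(mktemp)"
cat > "$fragment" <<'EOF'
{
  "autoscaling": true,
  "resourceRequests": { "memory": "4G", "cpu": 1 },
  "resourceLimits": { "gpu": { "type": "nvidia.com/gpu", "number": 1 } }
}
EOF

# Fail when a spec mentions a GPU but does not set "autoscaling": true.
# Text match only: this does not validate the JSON itself.
lint_gpu_spec() {
  spec="$1"
  if grep -q '"gpu"' "$spec" &&
     ! grep -Eq '"autoscaling"[[:space:]]*:[[:space:]]*true' "$spec"; then
    echo "warning: $spec requests a GPU but autoscaling is not true" >&2
    return 1
  fi
  return 0
}

lint_gpu_spec "$fragment" && echo "autoscaling is enabled"
```

Running the same check on a spec without the autoscaling field prints the warning and returns a non-zero status, which makes it easy to wire into a pre-submit script.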

Below is an example of a pipeline spec for a GPU-enabled pipeline from our market sentiment analysis example:

  {
    "pipeline": {
      "name": "train_model"
    },
    "description": "Fine tune a BERT model for sentiment analysis on financial data.",
    "input": {
      "cross": [
        {
          "pfs": {
            "repo": "dataset",
            "glob": "/"
          }
        },
        {
          "pfs": {
            "repo": "language_model",
            "glob": "/"
          }
        }
      ]
    },
    "transform": {
      "cmd": [
        "python", "", "--lm_path", "/pfs/language_model/", "--cl_path", "/pfs/out", "--cl_data_path", "/pfs/dataset/"
      ],
      "image": "pachyderm/market_sentiment:dev0.25"
    },
    "resourceLimits": {
      "gpu": {
        "type": "",
        "number": 1
      }
    },
    "resourceRequests": {
      "memory": "4G",
      "cpu": 1
    }
  }
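Once a spec like the one above is saved to a file, it can be submitted with `pachctl`. The following is a minimal sketch, assuming the spec was saved under the hypothetical file name `train_model.json`; `submit_gpu_pipeline` is an illustrative wrapper, not part of the CLI.

```shell
# Create the pipeline from a spec file, then show its details.
# The default file name is an assumption; pass another path as $1.
submit_gpu_pipeline() {
  spec="${1:-train_model.json}"
  if [ ! -f "$spec" ]; then
    echo "pipeline spec not found: $spec" >&2
    return 1
  fi
  pachctl create pipeline -f "$spec" &&
  pachctl inspect pipeline train_model
}
```

Calling `submit_gpu_pipeline` with no argument looks for `train_model.json` in the current directory. Checking for the file first gives a clearer error than letting the CLI fail on a missing path.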