Heterogeneous GPU Support
This guide explains how to set up HPE Machine Learning Inferencing Software to support heterogeneous GPU environments.
With heterogeneous GPU support enabled, users can specify the desired GPU type for their deployed inference service by passing the --gpu-type argument when defining a packaged model. The platform then schedules the inference service on a node equipped with the specified GPU type.
- If the Kubernetes cluster is configured to use taints to restrict scheduling on nodes with a particular GPU type, then the gpuType is required to enable access to that GPU (you can check a node’s taints as shown after this list).
- If the specified GPU type is not available, the platform will fail to schedule the inference service.
- If the Kubernetes cluster is not configured to use taints and no GPU type is specified, the platform will schedule the inference service on any node with the requested number of GPUs.
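If you are not sure whether your cluster taints its GPU nodes, you can inspect a node’s taints with a standard kubectl command, for example:
kubectl describe node <NODE_NAME> | grep -A2 Taints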
How to Add Heterogeneous GPU Support #
Label Nodes with GPU Type Names #
Use the following command to label the cluster’s nodes with the GPU type names you want to surface to your users:
kubectl label nodes <NODE_NAME> cloud.google.com/gke-accelerator=<GPU_TYPE_NAME>
Label Requirements #
- Label Name: Must be cloud.google.com/gke-accelerator
- Value: The GPU type name you want to surface to your users
Example #
If a node named MYNODE has an NVIDIA A100 GPU, you could enable selection of the GPU type name nvidia-tesla-a100 with the following command:
kubectl label nodes MYNODE cloud.google.com/gke-accelerator=nvidia-tesla-a100
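To confirm that the label was applied, you can list nodes with the accelerator label shown as a column (this uses a standard kubectl option, not a platform-specific command):
kubectl get nodes -L cloud.google.com/gke-accelerator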
Configure Helm Chart #
Configure the Helm chart’s gpuSelector section. The required configuration for this section depends on the environment in which you are deploying the platform.
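As a minimal sketch for a GKE deployment, based on the gpuSelector.gke setting referenced in the Test section below (other environments may require different keys under gpuSelector), the relevant values.yaml excerpt could look like this:
gpuSelector:
  gke: true   # setting assumed by the GKE test steps later in this guide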
Install or Upgrade via Helm #
After configuring the Helm chart, perform a Helm upgrade to apply the changes:
helm upgrade mlis \
--set 'global.imagePullSecrets[0].name=regcred' \
--set 'global.imagePullSecrets[1].name=hpe-mlis-registry' \
--set imageRegistry=hub.myenterpriselicense.hpe.com/hpe-mlis/<SKU> \
--set defaultPassword=<CREATE_ADMIN_PASSWORD> \
--values values.yaml \
<SKU>_aioli-helm-chart<release/majorMinorPatchNumber>.tgz
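After the upgrade completes, you can confirm that the release picked up your values.yaml settings (including the gpuSelector section) by inspecting the user-supplied values for the mlis release:
helm get values mlis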
Test #
GKE #
The following assumes that gpuSelector.gke: true is set in the Helm chart’s values.yaml file.
Test the setup by deploying a packaged model configured with the --gpu-type nvidia-tesla-a100 argument. The platform should then schedule the inference service on a node labeled with cloud.google.com/gke-accelerator=nvidia-tesla-a100.
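One way to confirm where the inference service pod actually landed is to list pods together with the nodes they are running on:
kubectl get pods -o wide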
You can list pods together with their labels:
kubectl get pods --show-labels
To see which nodes a pod may be scheduled on, inspect the pod’s nodeSelector. A pod that targets the labeled GPU nodes includes a selector such as:
nodeSelector:
  cloud.google.com/gke-accelerator: nvidia-tesla-a100
You can display the pod names and their nodeSelectors together with the following command:
kubectl get pods -Ao jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeSelector}{"\n"}{end}' | grep nvidia-tesla-a100
starcoder-predictor-00004-deployment-5c75bfbdbb-mcbz6 {"cloud.google.com/gke-accelerator":"nvidia-tesla-a100"}
Taints & Tolerations #
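The tolerations shown below correspond to a node taint of the following form (an illustrative example; the actual taint key and value depend on how your cluster’s GPU nodes were tainted):
kubectl taint nodes <NODE_NAME> accelerator=nvidia-tesla-a100:NoSchedule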
You can check the tolerations for a specific inference service (or pod) by running the following command:
kubectl get inferenceservices.serving.kserve.io -o yaml starcoder | grep -A4 tolerations
tolerations:
- effect: NoSchedule
key: accelerator
operator: Equal
value: nvidia-tesla-a100
Alternatively, you can check the tolerations for all pods by running the following command:
kubectl get pods -Ao jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.tolerations}{"\n"}{end}' | grep nvidia-tesla-a100
starcoder-predictor-00004-deployment-5c75bfbdbb-mcbz6 [{"effect":"NoSchedule","key":"accelerator","operator":"Equal","value":"nvidia-tesla-a100"},{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":300},{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":300},{"effect":"NoSchedule","key":"nvidia.com/gpu","operator":"Exists"}]