Create Deployment

By creating a deployment, you are deploying an inference service to a Kubernetes cluster. The end result is a service with an available endpoint that can be accessed by clients to make predictions.

Before You Start

Ensure that you have already:

Ephemeral Storage Considerations

Default disk sizes on cloud providers may not be sufficient for large models, which often require significant ephemeral storage. This can cause the inference service to fail to start serving, in some cases without any error message indicating that the node is out of disk space.

Ephemeral storage is normally provided by the boot disk of the compute nodes. You can inspect the amount of ephemeral storage on your nodes using the kubectl describe node <node-name> command.
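
For example, assuming a node named gpu-node-1 (substitute your own node name), either of the following commands reports the node's allocatable ephemeral storage:

    kubectl describe node gpu-node-1 | grep -i ephemeral-storage
    kubectl get node gpu-node-1 -o jsonpath='{.status.allocatable.ephemeral-storage}'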


How to Create a Deployment

Improve Startup Time
For optimal startup time, use a custom container image. This combines code and model in one container, eliminating model download time, which is important for scaling inference services. You can build a custom container image using BentoML or OpenLLM’s build --containerize command, then reference it from a packaged model using the custom model format and the container image.
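
For instance, with BentoML the containerization step might look like the following sketch. The bento tag, registry, and image name are placeholders rather than values defined by this product; bentoml containerize builds a local image whose tag typically matches the bento tag, which you then retag and push to a registry your cluster can pull from:

    # Build a container image from an existing bento (placeholder names)
    bentoml containerize <BENTO_NAME>:<BENTO_VERSION>
    # Retag and push the image to a registry reachable by your Kubernetes cluster
    docker tag <BENTO_NAME>:<BENTO_VERSION> <REGISTRY>/<IMAGE_NAME>:<TAG>
    docker push <REGISTRY>/<IMAGE_NAME>:<TAG>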

Via the UI

  1. Sign in to HPE Machine Learning Inferencing Software.

  2. Navigate to Deployments.

  3. Select Create new deployment.

  4. Provide a Deployment Name.

  5. Select Next.

  6. Choose a Packaged Model from the dropdown.

  7. Select Next.

  8. Select a Kubernetes Namespace from the dropdown.

  9. Optionally, require endpoint security by toggling the switch. When enabled, every request to the deployment’s endpoint must include an access token in the header (e.g., “Authorization: Bearer <YOUR_ACCESS_TOKEN>”); an example request is sketched after these steps. See deployment tokens for more information on setting up a deployment token.

  10. Select Next.

  11. Choose an Auto scaling targets template or provide custom values for all of the following:

    • Auto scaling target templates:

      | Name                       | Description                                                        | Min replicas | Max replicas | Metric      | Target |
      |----------------------------|--------------------------------------------------------------------|--------------|--------------|-------------|--------|
      | fixed-1                    | One inference service replica, always available.                   | 1            | 1            | rps         | 0      |
      | fixed-2                    | Two inference service replicas, always available.                  | 2            | 2            | rps         | 0      |
      | scale-0-to-1-concurrency-3 | Scale from 0 to 1 replicas on a concurrency target of 3.           | 0            | 1            | concurrency | 3      |
      | scale-0-to-4-rps-10        | Scale from 0 to 4 replicas on a requests-per-second target of 10.  | 0            | 4            | rps         | 10     |
      | scale-0-to-8-rps-20        | Scale from 0 to 8 replicas on a requests-per-second target of 20.  | 0            | 8            | rps         | 20     |
      | scale-1-to-4-rps-10        | Scale from 1 to 4 replicas on a requests-per-second target of 10.  | 1            | 4            | rps         | 10     |
      | scale-1-to-8-concurrency-3 | Scale from 1 to 8 replicas on a concurrency target of 3.           | 1            | 8            | concurrency | 3      |
    • Minimum instances: The minimum number of instances to run.
    • Maximum instances: The maximum number of instances to run.
    • Auto scaling target: The target metric and metric value that trigger scaling (for example, rps or concurrency, as used in the templates above). The possible metric types that can be configured per revision depend on the autoscaler implementation you are using.
  12. Select Next.

  13. Provide any needed Environment Variables or Arguments. See the Advanced Configuration Options reference article for more information.

  14. Select Done.
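
If you enabled endpoint security in step 9, requests to the deployed endpoint must carry the deployment token. The following is a minimal sketch only; <YOUR_DEPLOYMENT_ENDPOINT>, the route, and the JSON body are placeholders whose actual values depend on your packaged model:

    curl -X 'POST' \
      '<YOUR_DEPLOYMENT_ENDPOINT>/<ROUTE>' \
      -H 'Content-Type: application/json' \
      -H 'Authorization: Bearer <YOUR_ACCESS_TOKEN>' \
      -d '{"<INPUT_FIELD>": "<INPUT_VALUE>"}'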

Via the CLI

  1. Sign in to HPE Machine Learning Inferencing Software.
    aioli login <YOUR_USERNAME>
  2. Create a new deployment with the following command (a filled-in example appears after these steps):
    aioli deployment create <DEPLOYMENT_NAME> \
       --model <PACKAGED_MODEL_NAME> \
       --namespace <K8S_NAMESPACE> \
       --authentication-required <BOOLEAN> \
       --auto-scaling-max-replicas <MAX_REPLICAS> \
       --auto-scaling-min-replicas <MIN_REPLICAS> \
       --auto-scaling-metric <METRIC> \
       --auto-scaling-target <TARGET_VALUE> \
       --environment <VAR_1>=<VALUE_1> <VAR_2>=<VALUE_2> \
       --arguments <ARG_1> <ARG_2> 
  3. Wait for your deployment to reach Ready state.
    aioli deployment show <DEPLOYMENT_NAME>
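
For illustration, a filled-in version of the command in step 2 might look like the following. The deployment name, model name, namespace, and environment variable are hypothetical; the scaling values mirror the scale-1-to-4-rps-10 template:

    aioli deployment create chat-llm \
       --model chat-llm-packaged \
       --namespace inference \
       --authentication-required true \
       --auto-scaling-min-replicas 1 \
       --auto-scaling-max-replicas 4 \
       --auto-scaling-metric rps \
       --auto-scaling-target 10 \
       --environment LOG_LEVEL=debug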

Via the API

  1. Sign in to HPE Machine Learning Inferencing Software.
    curl -X 'POST' \
      '<YOUR_EXT_CLUSTER_IP>/api/v1/login' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "username": "<YOUR_USERNAME>",
      "password": "<YOUR_PASSWORD>"
    }'
  2. Obtain the Bearer token from the response. (A sketch for capturing the token into a shell variable appears after these steps.)
  3. Use the following cURL command to add a new deployment.
    curl -X 'POST' \
         '<YOUR_EXT_CLUSTER_IP>/api/v1/deployments' \
         -H 'accept: application/json' \
         -H 'Content-Type: application/json' \
         -H 'Authorization: Bearer <YOUR_ACCESS_TOKEN>' \
         -d '{
           "arguments": [
             "--debug"
           ],
           "autoScaling": {
             "maxReplicas": <MAX_REPLICAS>,
             "metric": "<METRIC>",
             "minReplicas": <MIN_REPLICAS>,
             "target": <TARGET_VALUE>
           },
           "canaryTrafficPercent": <CANARY_PERCENT>,
           "environment": {
             "<VAR_1>": "<VALUE_1>",
             "<VAR_2>": "<VALUE_2>"
           },
           "goalStatus": "<GOAL_STATUS>",
           "model": "<PACKAGED_MODEL_NAME>",
           "name": "<DEPLOYMENT_NAME>",
           "namespace": "<K8S_NAMESPACE>",
           "security": {
             "authenticationRequired": <BOOLEAN>
           }
         }'
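
To avoid copying the token by hand in step 2, you can capture it into a shell variable. This sketch assumes the login response returns the token in a top-level field named token (verify against your actual response) and that jq is installed:

    TOKEN=$(curl -s -X 'POST' \
      '<YOUR_EXT_CLUSTER_IP>/api/v1/login' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{"username": "<YOUR_USERNAME>", "password": "<YOUR_PASSWORD>"}' \
      | jq -r '.token')   # the field name "token" is an assumption

    # Reference the captured token in subsequent requests:
    #   -H "Authorization: Bearer $TOKEN"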