Create Deployment

By creating a deployment, you are deploying an inference service to a Kubernetes cluster. The end result is a service with an available endpoint that can be accessed by clients to make predictions.

Before You Start

Ensure that you have already:

Warning: Ephemeral Storage Considerations

Default disk sizes on cloud providers may not be sufficient for large models, which often require significant ephemeral storage. This can cause the inference service to fail to start serving, in some instances without any error message indicating that the node is out of disk space.

Ephemeral storage is normally provided by the boot disk of the compute nodes. You can inspect the amount of ephemeral storage on your nodes using the kubectl describe node <node-name> command.
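
For example, the following command filters that output down to the ephemeral-storage values reported under Capacity and Allocatable (a minimal sketch; the exact output format can vary by Kubernetes version):

    # Show the ephemeral-storage values reported for a node
    kubectl describe node <node-name> | grep -i ephemeral-storage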


How to Create a Deployment

Tip: Improve Startup Time
For optimal startup time, use a custom container image. This combines code and model in one container, eliminating model download time, which is important for scaling inference services. You can build a custom container image using BentoML or OpenLLM’s build --containerize command, then reference it from a packaged model using the custom model format and the container image.
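
As one possible workflow (a sketch, not the only option), you can containerize a bento with BentoML and push the resulting image to a registry your cluster can pull from; the bento tag, image name, and registry below are placeholders, and the exact flags may differ by BentoML version:

    # Build an OCI image from a packaged bento, then push it to your registry
    # (bento tag, image name, and registry are placeholders)
    bentoml containerize my_service:latest -t registry.example.com/my-service:0.1.0
    docker push registry.example.com/my-service:0.1.0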

Via the UI

  1. Sign in to HPE Machine Learning Inferencing Software.

  2. Navigate to Deployments.

  3. Select Create new deployment.

  4. Provide a Deployment Name.

  5. Select Next.

  6. Choose a Packaged Model from the dropdown.

  7. Select Next.

  8. Select a Kubernetes Namespace from the dropdown. If this option is not available, the Namespace is already set for you.

  9. Optionally, toggle endpoint security. If enabled, all endpoint interactions will require a deployment token in the header (e.g., “Authorization: Bearer <YOUR_ACCESS_TOKEN>”); see the example call after these steps. In some cases, endpoint security is enabled by default and cannot be disabled.

  10. Select Next.

  11. Optionally, specify a node selector label and value, and select a deployment priority (not available to all clusters). Select Next.

  12. Choose an Auto scaling target template or provide custom values for all of the following:

    • Auto scaling target templates:
      | name | description | autoscaling_min_replicas | autoscaling_max_replicas | autoscaling_metric | autoscaling_target |
      | --- | --- | --- | --- | --- | --- |
      | fixed-1 | One inference service replica, always available. | 1 | 1 | rps | 0 |
      | fixed-2 | Two inference service replicas, always available. | 2 | 2 | rps | 0 |
      | scale-0-to-1-concurrency-3 | Scale from 0 to 1 replicas with metric concurrency 3. | 0 | 1 | concurrency | 3 |
      | scale-0-to-4-rps-10 | Scale from 0 to 4 replicas with metric requests-per-second 10. | 0 | 4 | rps | 10 |
      | scale-0-to-8-rps-20 | Scale from 0 to 8 replicas with metric requests-per-second 20. | 0 | 8 | rps | 20 |
      | scale-1-to-4-rps-10 | Scale from 1 to 4 replicas with metric requests-per-second 10. | 1 | 4 | rps | 10 |
      | scale-1-to-8-concurrency-3 | Scale from 1 to 8 replicas with metric concurrency 3. | 1 | 8 | concurrency | 3 |
    • Minimum instances: The minimum number of instances to run.
    • Maximum instances: The maximum number of instances to run.
    • Auto scaling target: The target metric and metric-value that trigger scaling. The possible metric types that can be configured per revision (for example, rps or concurrency) depend on the type of Autoscaler implementation you are using.

  13. Select Next.

  14. Provide any needed Environment Variables or Arguments. See the Advanced Configuration Options reference article for more information.

  15. Select Done.
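
If endpoint security is enabled (step 9), every client call to the deployed endpoint must include the deployment token. A minimal sketch of such a request is shown below; the endpoint URL and request body are placeholders that depend on your model's serving API:

    # Call a secured inference endpoint; URL and payload are placeholders
    curl -X 'POST' \
         '<YOUR_DEPLOYMENT_ENDPOINT>' \
         -H 'Content-Type: application/json' \
         -H 'Authorization: Bearer <YOUR_ACCESS_TOKEN>' \
         -d '<REQUEST_BODY>'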

Via the CLI

  1. Sign in to HPE Machine Learning Inferencing Software.
    aioli user login <YOUR_USERNAME>
  2. Create a new deployment with the following command:
    aioli deployment create <DEPLOYMENT_NAME> \
       --model <PACKAGED_MODEL_NAME> \
       --namespace <K8S_NAMESPACE> \
       --authentication-required <BOOLEAN> \
       --auto-scaling-max-replicas <MAX_REPLICAS> \
       --auto-scaling-min-replicas <MIN_REPLICAS> \
       --auto-scaling-metric <METRIC> \
       --auto-scaling-target <TARGET_VALUE> \
       --environment <VAR_1>=<VALUE_1> <VAR_2>=<VALUE_2> \
       --arguments <ARG_1> <ARG_2> 
  3. Wait for your deployment to reach the Ready state.
    aioli deployment show <DEPLOYMENT_NAME>
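
For illustration, a filled-in create command matching the scale-1-to-4-rps-10 template might look like the following; the deployment, model, and namespace names are placeholders, not defaults:

    # Illustrative values only; substitute your own names and targets
    aioli deployment create my-deployment \
       --model my-packaged-model \
       --namespace my-namespace \
       --authentication-required true \
       --auto-scaling-min-replicas 1 \
       --auto-scaling-max-replicas 4 \
       --auto-scaling-metric rps \
       --auto-scaling-target 10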

Via the API

  1. Sign in to HPE Machine Learning Inferencing Software.
    curl -X 'POST' \
      '<YOUR_EXT_CLUSTER_IP>/api/v1/login' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "username": "<YOUR_USERNAME>",
      "password": "<YOUR_PASSWORD>"
    }'
  2. Obtain the Bearer token from the response; a sketch for extracting it in a script follows these steps.
  3. Use the following cURL command to add a new deployment.
    curl -X 'POST' \
         '<YOUR_EXT_CLUSTER_IP>/api/v1/deployments' \
         -H 'accept: application/json' \
         -H 'Content-Type: application/json' \
         -H 'Authorization: Bearer <YOUR_ACCESS_TOKEN>' \
         -d '{
           "arguments": [
             "--debug"
           ],
           "autoScaling": {
             "maxReplicas": <MAX_REPLICAS>,
             "metric": "<METRIC>",
             "minReplicas": <MIN_REPLICAS>,
             "target": <TARGET_VALUE>
           },
           "canaryTrafficPercent": <CANARY_PERCENT>,
           "environment": {
             "<VAR_1>": "<VALUE_1>",
             "<VAR_2>": "<VALUE_2>"
           },
           "goalStatus": "<GOAL_STATUS>",
           "model": "<PACKAGED_MODEL_NAME>",
           "name": "<DEPLOYMENT_NAME>",
           "namespace": "<K8S_NAMESPACE>",
           "security": {
             "authenticationRequired": <BOOLEAN>
           }
         }'
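
When scripting these API calls, you can capture the token from the login response and reuse it in the deployment request. The sketch below assumes the token is returned in a JSON field named token and that jq is installed; inspect your actual login response to confirm the field name:

    # Log in and capture the bearer token for subsequent requests.
    # The .token field name is an assumption; adjust it to match your response.
    ACCESS_TOKEN=$(curl -s -X 'POST' \
      '<YOUR_EXT_CLUSTER_IP>/api/v1/login' \
      -H 'Content-Type: application/json' \
      -d '{"username": "<YOUR_USERNAME>", "password": "<YOUR_PASSWORD>"}' \
      | jq -r '.token')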