Create Deployment

By creating a deployment, you are deploying an inference service to a Kubernetes cluster. The end result is a service with an available endpoint that can be accessed by clients to make predictions.

Before You Start

Ensure that you have already:

Ephemeral Storage Considerations

Default disk sizes on cloud providers may not be sufficient for large models, which often require significant ephemeral storage. This can cause the inference service to fail to start serving, in some cases without any error message indicating that the node is out of disk space.

Ephemeral storage is normally provided by the boot disk of the compute nodes. You can inspect the amount of ephemeral storage on your nodes using the kubectl describe node <node-name> command.
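
For example, assuming a node named gpu-node-1 (substitute your own node name), either of the following commands reports the node's allocatable ephemeral storage:

    kubectl describe node gpu-node-1 | grep -i ephemeral-storage
    kubectl get node gpu-node-1 -o jsonpath='{.status.allocatable.ephemeral-storage}'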


How to Create a Deployment

Improve Startup Time
For optimal startup time, use a custom container image. This combines code and model in one container, eliminating model download time, which is important for scaling inference services. You can build a custom container image using BentoML or OpenLLM’s build --containerize command, then reference it from a packaged model using the custom model format and the container image.
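
For instance, with BentoML the containerization step might look like the following sketch. The bento tag, registry, and image name are placeholders rather than values defined by this product; bentoml containerize builds a local image whose tag typically matches the bento tag, which you then retag and push to a registry your cluster can pull from:

    # Build a container image from an existing bento (placeholder names)
    bentoml containerize <BENTO_NAME>:<BENTO_VERSION>
    # Retag and push the image to a registry reachable by your Kubernetes cluster
    docker tag <BENTO_NAME>:<BENTO_VERSION> <REGISTRY>/<IMAGE_NAME>:<TAG>
    docker push <REGISTRY>/<IMAGE_NAME>:<TAG>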

Via the UI

  1. Sign in to HPE Machine Learning Inferencing Software.

  2. Navigate to Deployments.

  3. Select Create new deployment.

  4. Provide a Deployment Name.

  5. Select Next.

  6. Choose a Packaged Model from the dropdown.

  7. Select Next.

  8. Select a Kubernetes Namespace from the dropdown.

  9. Optionally, require endpoint security by toggling the switch. When enabled, every request to the deployment’s endpoint must include an access token in the header (e.g., “Authorization: Bearer <YOUR_ACCESS_TOKEN>”); an example request is sketched after these steps. See deployment tokens for more information on setting up a deployment token.

  10. Select Next.

  11. Choose an Auto scaling targets template or provide custom values for all of the following:

    • Auto scaling target templates:

      | Name                       | Description                                                        | Min replicas | Max replicas | Metric      | Target |
      |----------------------------|--------------------------------------------------------------------|--------------|--------------|-------------|--------|
      | fixed-1                    | One inference service replica, always available.                   | 1            | 1            | rps         | 0      |
      | fixed-2                    | Two inference service replicas, always available.                  | 2            | 2            | rps         | 0      |
      | scale-0-to-1-concurrency-3 | Scale from 0 to 1 replicas on a concurrency target of 3.           | 0            | 1            | concurrency | 3      |
      | scale-0-to-4-rps-10        | Scale from 0 to 4 replicas on a requests-per-second target of 10.  | 0            | 4            | rps         | 10     |
      | scale-0-to-8-rps-20        | Scale from 0 to 8 replicas on a requests-per-second target of 20.  | 0            | 8            | rps         | 20     |
      | scale-1-to-4-rps-10        | Scale from 1 to 4 replicas on a requests-per-second target of 10.  | 1            | 4            | rps         | 10     |
      | scale-1-to-8-concurrency-3 | Scale from 1 to 8 replicas on a concurrency target of 3.           | 1            | 8            | concurrency | 3      |
    • Minimum instances: The minimum number of instances to run.
    • Maximum instances: The maximum number of instances to run.
    • Auto scaling target: The target metric and metric value that trigger scaling (for example, rps or concurrency, as used in the templates above). The possible metric types that can be configured per revision depend on the autoscaler implementation you are using.
  12. Select Next.

  13. Provide any needed Environment Variables or Arguments. See the Advanced Configuration Options reference article for more information.

  14. Select Done.
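
If you enabled endpoint security in step 9, requests to the deployed endpoint must carry the deployment token. The following is a minimal sketch only; <YOUR_DEPLOYMENT_ENDPOINT>, the route, and the JSON body are placeholders whose actual values depend on your packaged model:

    curl -X 'POST' \
      '<YOUR_DEPLOYMENT_ENDPOINT>/<ROUTE>' \
      -H 'Content-Type: application/json' \
      -H 'Authorization: Bearer <YOUR_ACCESS_TOKEN>' \
      -d '{"<INPUT_FIELD>": "<INPUT_VALUE>"}'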

Via the CLI

  1. Sign in to HPE Machine Learning Inferencing Software.
    aioli login <YOUR_USERNAME>
  2. Create a new deployment with the following command (a filled-in example appears after these steps):
    aioli deployment create <DEPLOYMENT_NAME> \
       --model <PACKAGED_MODEL_NAME> \
       --namespace <K8S_NAMESPACE> \
       --authentication-required <BOOLEAN> \
       --auto-scaling-max-replicas <MAX_REPLICAS> \
       --auto-scaling-min-replicas <MIN_REPLICAS> \
       --auto-scaling-metric <METRIC> \
       --auto-scaling-target <TARGET_VALUE> \
       --environment <VAR_1>=<VALUE_1> <VAR_2>=<VALUE_2> \
       --arguments <ARG_1> <ARG_2> 
  3. Wait for your deployment to reach Ready state.
    aioli deployment show <DEPLOYMENT_NAME>
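
For illustration, a filled-in version of the command in step 2 might look like the following. The deployment name, model name, namespace, and environment variable are hypothetical; the scaling values mirror the scale-1-to-4-rps-10 template:

    aioli deployment create chat-llm \
       --model chat-llm-packaged \
       --namespace inference \
       --authentication-required true \
       --auto-scaling-min-replicas 1 \
       --auto-scaling-max-replicas 4 \
       --auto-scaling-metric rps \
       --auto-scaling-target 10 \
       --environment LOG_LEVEL=debug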

Via the API

  1. Sign in to HPE Machine Learning Inferencing Software.
    curl -X 'POST' \
      '<YOUR_EXT_CLUSTER_IP>/api/v1/login' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "username": "<YOUR_USERNAME>",
      "password": "<YOUR_PASSWORD>"
    }'
  2. Obtain the Bearer token from the response. (A sketch for capturing the token into a shell variable appears after these steps.)
  3. Use the following cURL command to add a new deployment.
    curl -X 'POST' \
         '<YOUR_EXT_CLUSTER_IP>/api/v1/deployments' \
         -H 'accept: application/json' \
         -H 'Content-Type: application/json' \
         -H 'Authorization: Bearer <YOUR_ACCESS_TOKEN>' \
         -d '{
           "arguments": [
             "--debug"
           ],
           "autoScaling": {
             "maxReplicas": <MAX_REPLICAS>,
             "metric": "<METRIC>",
             "minReplicas": <MIN_REPLICAS>,
             "target": <TARGET_VALUE>
           },
           "canaryTrafficPercent": <CANARY_PERCENT>,
           "environment": {
             "<VAR_1>": "<VALUE_1>",
             "<VAR_2>": "<VALUE_2>"
           },
           "goalStatus": "<GOAL_STATUS>",
           "model": "<PACKAGED_MODEL_NAME>",
           "name": "<DEPLOYMENT_NAME>",
           "namespace": "<K8S_NAMESPACE>",
           "security": {
             "authenticationRequired": <BOOLEAN>
           }
         }'
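
To avoid copying the token by hand in step 2, you can capture it into a shell variable. This sketch assumes the login response returns the token in a top-level field named token (verify against your actual response) and that jq is installed:

    TOKEN=$(curl -s -X 'POST' \
      '<YOUR_EXT_CLUSTER_IP>/api/v1/login' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{"username": "<YOUR_USERNAME>", "password": "<YOUR_PASSWORD>"}' \
      | jq -r '.token')   # the field name "token" is an assumption

    # Reference the captured token in subsequent requests:
    #   -H "Authorization: Bearer $TOKEN"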