Create Deployment
By creating a deployment, you deploy an inference service to a Kubernetes cluster. The result is a running service with an endpoint that clients can call to make predictions.
Before You Start #
Ensure that you have already:
- Added a registry and uploaded a model image to that registry
- Added a packaged model
Default disk sizes on cloud providers may not be sufficient for large models, which often require significant ephemeral storage. This can result in the inference service failing to start serving—in some instances, without providing an error message about being out of disk space.
Ephemeral storage is normally provided by the boot disk of the compute nodes. You can inspect the amount of ephemeral storage on your nodes using the `kubectl describe node <node-name>` command.
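For example, to print just the ephemeral-storage capacity a node reports (standard kubectl, using a JSONPath query over the node's status):

```bash
# Print the ephemeral-storage capacity the node reports to the scheduler.
kubectl get node <node-name> -o jsonpath='{.status.capacity.ephemeral-storage}'
```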
How to Create a Deployment #
You can containerize a model with the `build --containerize` command, then reference it from a packaged model using the custom model format and the image container.
Via the UI #
- Sign in to HPE Machine Learning Inferencing Software.
- Navigate to Deployments.
- Select Create new deployment.
- Provide a Deployment Name.
- Select Next.
- Choose a Packaged Model from the dropdown.
- Select Next.
- Select a Kubernetes Namespace from the dropdown. If this option is not available, the namespace is already set for you.
- Optionally, toggle endpoint security. If enabled, all endpoint interactions will require a deployment token in the header (e.g., `Authorization: Bearer <YOUR_ACCESS_TOKEN>`); see the example request after these steps. In some cases, endpoint security is enabled by default and cannot be disabled.
- Select Next.
- Optionally, specify a node selector label and value, and select a deployment priority (not available on all clusters). Select Next.
- Choose an Auto scaling targets template or provide custom values for all of the following:
  - Auto scaling target templates:

    | name | description | autoscaling_min_replicas | autoscaling_max_replicas | autoscaling_metric | autoscaling_target |
    |---|---|---|---|---|---|
    | fixed-1 | One inference service replica, always available. | 1 | 1 | rps | 0 |
    | fixed-2 | Two inference service replicas, always available. | 2 | 2 | rps | 0 |
    | scale-0-to-1-concurrency-3 | Scale from 0 to 1 replicas with metric concurrency 3. | 0 | 1 | concurrency | 3 |
    | scale-0-to-4-rps-10 | Scale from 0 to 4 replicas with metric requests-per-second 10. | 0 | 4 | rps | 10 |
    | scale-0-to-8-rps-20 | Scale from 0 to 8 replicas with metric requests-per-second 20. | 0 | 8 | rps | 20 |
    | scale-1-to-4-rps-10 | Scale from 1 to 4 replicas with metric requests-per-second 10. | 1 | 4 | rps | 10 |
    | scale-1-to-8-concurrency-3 | Scale from 1 to 8 replicas with metric concurrency 3. | 1 | 8 | concurrency | 3 |
  - Minimum instances: The minimum number of instances to run.
  - Maximum instances: The maximum number of instances to run.
  - Auto scaling target: The target metric and metric value that trigger scaling. For example, with the scale-0-to-4-rps-10 template, a sustained load of 25 requests per second steers the autoscaler toward roughly ceil(25 / 10) = 3 replicas, within the configured 0-4 range. The possible metric types that can be configured per revision depend on the type of Autoscaler implementation you are using:
    - The default KPA Autoscaler supports the `concurrency` and `rps` metrics.
    - The HPA Autoscaler supports the `cpu` metric.
- Select Next.
- Provide any needed Environment Variables or Arguments. See the Advanced Configuration Options reference article for more information.
- Select Done.
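If endpoint security is enabled, clients must present the deployment token on every request. A minimal sketch of such a call (the endpoint URL and JSON body are placeholders; the actual path and payload depend on the model server behind your deployment):

```bash
# Hypothetical prediction request against a secured endpoint.
# <DEPLOYMENT_ENDPOINT> and the request body are placeholders; adjust
# them to match your model server's inference API.
curl -X POST '<DEPLOYMENT_ENDPOINT>' \
  -H 'Authorization: Bearer <YOUR_ACCESS_TOKEN>' \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "<YOUR_INPUT>"}'
```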
Via the CLI #
- Sign in to HPE Machine Learning Inferencing Software.

```bash
aioli user login <YOUR_USERNAME>
```
- Create a new deployment with the following command (a filled-in example appears after this list):

```bash
aioli deployment create <DEPLOYMENT_NAME> \
  --model <PACKAGED_MODEL_NAME> \
  --namespace <K8S_NAMESPACE> \
  --authentication-required <BOOLEAN> \
  --auto-scaling-max-replicas <MAX_REPLICAS> \
  --auto-scaling-min-replicas <MIN_REPLICAS> \
  --auto-scaling-metric <METRIC> \
  --auto-scaling-target <TARGET_VALUE> \
  --environment <VAR_1>=<VALUE_1> <VAR_2>=<VALUE_2> \
  --arguments <ARG_1> <ARG_2>
```
- Wait for your deployment to reach the `Ready` state.

```bash
aioli deployment show <DEPLOYMENT_NAME>
```
- For more information on the `aioli deployment create` command, see the CLI command reference.
- For information on setting environment variables and arguments, see the Advanced Configuration reference article.
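For illustration, here is a hypothetical invocation using the scale-0-to-4-rps-10 values from the template table above (all names are placeholders, not defaults):

```bash
# Example only: deploy packaged model "my-model" into namespace "models",
# scaling from 0 to 4 replicas at a target of 10 requests per second.
aioli deployment create my-model-deployment \
  --model my-model \
  --namespace models \
  --authentication-required true \
  --auto-scaling-min-replicas 0 \
  --auto-scaling-max-replicas 4 \
  --auto-scaling-metric rps \
  --auto-scaling-target 10
```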
Via the API #
- Sign in to HPE Machine Learning Inferencing Software.

```bash
curl -X 'POST' \
  '<YOUR_EXT_CLUSTER_IP>/api/v1/login' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "username": "<YOUR_USERNAME>",
    "password": "<YOUR_PASSWORD>"
  }'
```
- Obtain the Bearer token from the response.
- Use the following cURL command to add a new deployment.
```bash
curl -X 'POST' \
  '<YOUR_EXT_CLUSTER_IP>/api/v1/deployments' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <YOUR_ACCESS_TOKEN>' \
  -d '{
    "arguments": ["--debug"],
    "autoScaling": {
      "maxReplicas": <MAX_REPLICAS>,
      "metric": "<METRIC>",
      "minReplicas": <MIN_REPLICAS>,
      "target": <TARGET_VALUE>
    },
    "canaryTrafficPercent": <CANARY_PERCENT>,
    "environment": {
      "<VAR_1>": "<VALUE_1>",
      "<VAR_2>": "<VALUE_2>"
    },
    "goalStatus": "<GOAL_STATUS>",
    "model": "<PACKAGED_MODEL_NAME>",
    "name": "<DEPLOYMENT_NAME>",
    "namespace": "<K8S_NAMESPACE>",
    "security": {
      "authenticationRequired": <BOOLEAN>
    }
  }'
```
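In scripts, it can be convenient to capture the token in one step. A sketch, assuming the login response returns the token in a top-level token field (verify the field name against your actual response before relying on it):

```bash
# Hypothetical helper: log in and capture the Bearer token with jq.
# Assumes the response JSON has a top-level "token" field.
TOKEN=$(curl -s -X POST '<YOUR_EXT_CLUSTER_IP>/api/v1/login' \
  -H 'Content-Type: application/json' \
  -d '{"username": "<YOUR_USERNAME>", "password": "<YOUR_PASSWORD>"}' \
  | jq -r '.token')
```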