Inference Service Deployment Quickstart
This guide walks you through the entire service deployment journey, from registry setup to requesting a prediction from your deployed service.
Before You Start #
- Ensure you have completed the platform deployment quickstart steps if necessary.
How to Launch a Service Deployment #
1. Set Up a Registry #
For this quickstart, we’ll use Hugging Face as our OpenLLM-compatible registry. For other registry setups, see the Registry documentation.
- Go to the Hugging Face website and sign in to your account.
- Navigate to your Profile and select Settings.
- Go to Access Tokens and select New Token.
- Fill in the following details:
  - Name: Example: `my-hf-token`.
  - Type: Select `read`.
- Select Generate a Token and copy it.
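Before adding the token to the platform, you can optionally confirm it works by calling the Hugging Face `whoami-v2` API with it. This is just a sanity check; `<YOUR_HF_TOKEN>` below is a placeholder for the token you just generated.

```shell
# Optional: verify the Hugging Face token before registering it.
# A valid read token returns JSON describing your account;
# an invalid token returns an "Invalid credentials" error.
curl -s -H "Authorization: Bearer <YOUR_HF_TOKEN>" \
  https://huggingface.co/api/whoami-v2
```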
2. Add a Registry #
Now, add the Hugging Face registry to the platform.
- Sign in to HPE Machine Learning Inferencing Software.
- Go to Registries and select Add new registry.
- Provide the following details:
  - Name: e.g., `hf-registry`.
  - Type: Choose `OpenLLM` from the dropdown.
  - Hugging Face Token: Paste the token from the previous step.
- Select Create registry.
3. Add a Packaged Model #
Hugging Face provides a wide range of pre-trained models that you can use. For this quickstart, we’ll use the `facebook/opt-125m` model.
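As an optional check, you can confirm the model repository is reachable from your environment by querying the Hugging Face model API. This is only a sketch of a sanity check; the `Authorization` header can be omitted for public models such as `facebook/opt-125m`.

```shell
# Optional: confirm the facebook/opt-125m repository is reachable.
# The response is JSON metadata for the model repository.
curl -s -H "Authorization: Bearer <YOUR_HF_TOKEN>" \
  https://huggingface.co/api/models/facebook/opt-125m
```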
- Go to Packaged models and select Add new model.
- Provide the following details:
  - Name: Example: `opt-125m-inference-model`.
  - Description: Example: `OPT-125M model for inference tasks`.
- Select Next.
- Registry: Select the registry you created previously.
- In Pick a model from a list, choose the `facebook/opt-125m` model.
- Select the default image by leaving the image (advanced) field blank and select Next.
- From the Resource Template dropdown, choose `gpu-tiny` and select Next.
- Skip the Environment Variables and Arguments and select Create model.
4. Create a Deployment #
Deploy the model as an inference service on your Kubernetes cluster.
A deployment is the final step in the process. It’s the actual instantiation of the inference service on your Kubernetes cluster. Once you’ve created a deployment, you can start sending requests to the service’s endpoint.
- Go to Deployments and select Create new deployment.
- Provide a Deployment Name, for example: `opt-inference-deployment`, and select Next.
- From the Packaged Model dropdown, choose the model you created earlier and select Next.
- Namespace: Choose `default`, or your selected namespace. If this option is not available, the Namespace is already set for you.
- Optionally, toggle endpoint security. If enabled, all endpoint interactions will require a deployment token in the header (e.g., `Authorization: Bearer <YOUR_ACCESS_TOKEN>`). In some cases, endpoint security is enabled by default and cannot be disabled.
- Select Next twice to skip to the Scaling tab.
- From the Auto scaling targets template dropdown, choose `fixed-1` and select Next.
- Provide any needed Environment Variables or Arguments and select Done.
Wait a few minutes for the deployment to reach a `Ready` status. Once it’s ready, you can start sending requests to the service’s endpoint.
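If you have `kubectl` access to the cluster, you can also watch the deployment’s pods come up while you wait. This optional sketch assumes the `default` namespace chosen earlier; adjust it to match your deployment.

```shell
# Optional: watch the inference service pods until they are Running and Ready.
# Press Ctrl+C to stop watching.
kubectl get pods --namespace default --watch
```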
5. Send a Request to the Service #
- Open the deployment you created and copy its Endpoint URL.
- Send a request to the endpoint using the following `openllm` command:

  ```shell
  openllm query --timeout=600 --endpoint <DEPLOYMENT_ENDPOINT> "What is aioli?"
  ```

  Example output:

  ```
  What is aioli? It's a kind of "coconut milk" that has been made with a yeast called aioli. It's pretty sweet. I have it in my fridge.
  ```
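If you prefer to call the endpoint directly over HTTP instead of through the `openllm` CLI, a request along the following lines should work. This is a sketch, not the definitive API: it assumes the deployed OpenLLM service exposes an OpenAI-compatible `/v1/completions` route, and the JSON field names are standard OpenAI-style parameters. Include the `Authorization` header only if you enabled endpoint security on the deployment.

```shell
# Hypothetical direct HTTP call, assuming an OpenAI-compatible completions route.
# <DEPLOYMENT_ENDPOINT> and <YOUR_ACCESS_TOKEN> are placeholders from earlier steps.
curl -s -X POST "<DEPLOYMENT_ENDPOINT>/v1/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR_ACCESS_TOKEN>" \
  -d '{
        "model": "facebook/opt-125m",
        "prompt": "What is aioli?",
        "max_tokens": 64
      }'
```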