Inference Service Deployment Quickstart
This guide walks you through the entire service deployment journey, from registry setup to requesting a prediction from your deployed service.
Before You Start #
- Ensure you have completed the platform deployment quickstart steps if necessary.
How to Launch a Service Deployment #
1. Set Up a Registry #
For this quickstart, we’ll use Hugging Face as our OpenLLM-compatible registry. For other registry setups, see the Registry documentation.
- Go to the Hugging Face website and sign in to your account.
- Navigate to your Profile and select Settings.
- Go to Access Tokens and select New Token.
- Fill in the following details:
  - Name: Example: `my-hf-token`.
  - Type: Select `read`.
- Select Generate a Token and copy it.
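Before adding the token to the platform, you can optionally confirm it works by calling the Hugging Face `whoami-v2` API with it. This is just a sanity check; `<YOUR_HF_TOKEN>` below is a placeholder for the token you just generated.

```shell
# Optional: verify the Hugging Face token before registering it.
# A valid read token returns JSON describing your account;
# an invalid token returns an "Invalid credentials" error.
curl -s -H "Authorization: Bearer <YOUR_HF_TOKEN>" \
  https://huggingface.co/api/whoami-v2
```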
2. Add a Registry #
Now, add the Hugging Face registry to the platform.
- Sign in to HPE Machine Learning Inferencing Software.
- Go to Registries and select Add new registry.
- Provide the following details:
  - Name: e.g., `hf-registry`.
  - Type: Choose `OpenLLM` from the dropdown.
  - Hugging Face Token: Paste the token from the previous step.
- Select Create registry.
3. Add a Packaged Model #
Hugging Face provides a wide range of pre-trained models that you can use. For this quickstart, we’ll use the `facebook/opt-125m` model.
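As an optional check, you can confirm the model repository is reachable from your environment by querying the Hugging Face model API. This is only a sketch of a sanity check; the `Authorization` header can be omitted for public models such as `facebook/opt-125m`.

```shell
# Optional: confirm the facebook/opt-125m repository is reachable.
# The response is JSON metadata for the model repository.
curl -s -H "Authorization: Bearer <YOUR_HF_TOKEN>" \
  https://huggingface.co/api/models/facebook/opt-125m
```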
- Go to Packaged models and select Add new model.
- Provide the following details:
  - Name: Example: `opt-125m-inference-model`.
  - Description: Example: `OPT-125M model for inference tasks`.
- Select Next.
- Registry: Select the registry you created previously.
- In Pick a model from a list, choose the `facebook/opt-125m` model.
- Select the default image by leaving the image (advanced) field blank and select Next.
- From the Resource Template dropdown, choose `gpu-tiny` and select Next.
- Skip the Environment Variables and Arguments and select Create model.
4. Create a Deployment #
Deploy the model as an inference service on your Kubernetes cluster.
A deployment is the final step in the process. It’s the actual instantiation of the inference service on your Kubernetes cluster. Once you’ve created a deployment, you can start sending requests to the service’s endpoint.
- Go to Deployments and select Create new deployment.
- Provide a Deployment Name, for example: `opt-inference-deployment`, and select Next.
- From the Packaged Model dropdown, choose the model you created earlier and select Next.
- Namespace: Choose `default`, or your selected namespace. If this option is not available, the Namespace is already set for you.
- Optionally, toggle endpoint security. If enabled, all endpoint interactions will require a deployment token in the header (e.g., `Authorization: Bearer <YOUR_ACCESS_TOKEN>`). In some cases, endpoint security is enabled by default and cannot be disabled.
- Select Next twice to skip to the Scaling tab.
- From the Auto scaling targets template dropdown, choose `fixed-1` and select Next.
- Provide any needed Environment Variables or Arguments and select Done.
Wait a few minutes for the deployment to reach a `Ready` status. Once it’s ready, you can start sending requests to the service’s endpoint.
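If you have `kubectl` access to the cluster, you can also watch the deployment’s pods come up while you wait. This optional sketch assumes the `default` namespace chosen earlier; adjust it to match your deployment.

```shell
# Optional: watch the inference service pods until they are Running and Ready.
# Press Ctrl+C to stop watching.
kubectl get pods --namespace default --watch
```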
5. Send a Request to the Service #
- Open the deployment you created and copy its Endpoint URL.
- Send a request to the endpoint using the following `openllm` command:

  ```shell
  openllm query --timeout=600 --endpoint <DEPLOYMENT_ENDPOINT> "What is aioli?"
  ```

  Example output:

  ```
  What is aioli? It's a kind of "coconut milk" that has been made with a yeast called aioli. It's pretty sweet. I have it in my fridge.
  ```
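If you prefer to call the endpoint directly over HTTP instead of through the `openllm` CLI, a request along the following lines should work. This is a sketch, not the definitive API: it assumes the deployed OpenLLM service exposes an OpenAI-compatible `/v1/completions` route, and the JSON field names are standard OpenAI-style parameters. Include the `Authorization` header only if you enabled endpoint security on the deployment.

```shell
# Hypothetical direct HTTP call, assuming an OpenAI-compatible completions route.
# <DEPLOYMENT_ENDPOINT> and <YOUR_ACCESS_TOKEN> are placeholders from earlier steps.
curl -s -X POST "<DEPLOYMENT_ENDPOINT>/v1/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR_ACCESS_TOKEN>" \
  -d '{
        "model": "facebook/opt-125m",
        "prompt": "What is aioli?",
        "max_tokens": 64
      }'
```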