Inference Service Deployment Quickstart

This guide walks you through the entire service deployment journey, from registry setup to requesting a prediction from your deployed service.

Before You Start

To follow this quickstart, make sure you have:

  • A Hugging Face account.
  • Access to an HPE Machine Learning Inferencing Software installation.
  • A Kubernetes namespace where the inference service can be deployed.
  • The openllm CLI installed, for sending a test request in the final step.

How to Launch a Service Deployment

1. Set Up a Registry

For this quickstart, we’ll use Hugging Face as our OpenLLM-compatible registry. For other registry setups, see the Registry documentation.

  1. Go to the Hugging Face website and sign in to your account.
  2. Navigate to your Profile and select Settings.
  3. Go to Access Tokens and select New Token.
  4. Fill in the following details:
    • Name: A name for your token.
    • Type: Select read.
  5. Select Generate a token, then copy the token value.
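
Optionally, you can confirm the token works before adding it to the platform. A minimal check with curl against the Hugging Face whoami-v2 endpoint, replacing <HF_TOKEN> with the token you just copied:

    # Returns your account details if the token is valid.
    curl -s -H "Authorization: Bearer <HF_TOKEN>" https://huggingface.co/api/whoami-v2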

2. Add a Registry

Now, add the Hugging Face registry to the platform.

  1. Sign in to HPE Machine Learning Inferencing Software.
  2. Go to Registries and select Add new registry.
  3. Provide the following details:
    • Name: e.g., hf-registry.
    • Type: Choose OpenLLM from the dropdown.
    • Hugging Face Token: Paste the token from the previous step.
  4. Select Create registry.

3. Add a Packaged Model

Hugging Face provides a wide range of pre-trained models that you can use. For this quickstart, we’ll use the facebook/opt-125m model.
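
If you'd like to inspect the model before packaging it, the Hugging Face Hub API exposes its metadata. A quick sketch with curl (no token needed for public models):

    # Fetch metadata for facebook/opt-125m from the Hugging Face Hub API.
    curl -s https://huggingface.co/api/models/facebook/opt-125m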

  1. Go to Packaged models and select Add new model.
  2. Provide the following details:
    • Name: The model name within HPE Machine Learning Inferencing Software.
    • Description: A brief model description.
  3. Select Next.
  4. For Registry, select the registry you created earlier.
  5. In Pick a model from a list, choose the facebook/opt-125m model and select Next.
  6. From the Resource Template dropdown, choose gpu-tiny and select Next.
  7. Skip the Environment Variables and Arguments steps and select Create model.

4. Create a Deployment

Deploy the model as an inference service on your Kubernetes cluster.

A deployment is the final step in the process: it’s the actual instantiation of the inference service on your Kubernetes cluster.

  1. Go to Deployments and select Create new deployment.
  2. Provide a Deployment Name and select Next.
  3. From the Packaged Model dropdown, choose the model you created earlier and select Next.
  4. Select a Kubernetes Namespace (e.g., default) and select Next.
  5. From the Scaling Template dropdown, choose fixed-1 and select Next.
  6. Provide any needed Environment Variables or Arguments and select Done.

Wait a few minutes for the deployment to reach a Ready status. Once it’s ready, you can start sending requests to the service’s endpoint.
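
If you have kubectl access to the cluster, you can also watch the service come up from the command line. A minimal sketch, assuming you deployed to the default namespace selected earlier:

    # Watch the deployment's pods until they report a Running status and pass readiness checks.
    kubectl get pods --namespace default --watch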

5. Send a Request to the Service

  1. Open the deployment you created and copy its Endpoint URL.
  2. Send a request to the endpoint using the following openllm command:

    openllm query --timeout=600 --endpoint <DEPLOYMENT_ENDPOINT> "What is aioli?"

    The command prints the prompt followed by the model's completion. Since facebook/opt-125m is a very small model, the answer may be nonsensical:

    What is aioli?
    It's a kind of "coconut milk" that has been made with a yeast called aioli.  It's pretty sweet.  I have it in my fridge.
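
You can also query the service over raw HTTP. The exact route and payload depend on the OpenLLM version backing your deployment, so treat this curl sketch (which assumes an OpenAI-compatible completions route) as a starting point and check your deployment's API documentation:

    # Hypothetical direct request; adjust the path and JSON body to match your service's API.
    curl -s -X POST "<DEPLOYMENT_ENDPOINT>/v1/completions" \
      -H "Content-Type: application/json" \
      -d '{"model": "facebook/opt-125m", "prompt": "What is aioli?", "max_tokens": 64}'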