Object Model Reference Schema

Description and control of your inference service is accomplished via two primary objects: Packaged Model and Deployment. Additionally, when referencing a Packaged Model via an external hosting method, such as an S3 bucket or the huggingface.co registry, you may need to configure a Registry to enable that access.

Deployment

The Deployment object controls how a Packaged Model is deployed. The Deployment object has the following attributes:

Input Attributes

  • name: The name of the deployment, used to access it via the REST interface or CLI. This is also the name used for the associated KServe inference service that will be created.
  • namespace: The Kubernetes namespace into which the service is deployed. It must already exist.
  • model: The name (or id) of the packaged model to be deployed.
  • security: Encapsulates the security option (authenticationRequired) for the deployed service.
  • autoScaling: Controls the scaling limits minReplicas/maxReplicas, metric to control scaling, and the target value.
  • canaryTrafficPercent: The percentage of traffic to route to this particular model version. The default is 100.
  • goalStatus: Specifies the intended status to be achieved by the deployment. The default is Ready.
  • environment: Environment variables to be provided to the container image when started.
  • arguments: Arguments to be passed to the container image when started. These are in addition to any configured on the packaged model.

Managed Attributes

  • id: A unique identifier for this deployment.
  • status: Summary status of the deployed service.
  • state: State details of the current service configuration requested. See the DeploymentStateDetails component for details.
  • secondaryState: State details of a prior service configuration until the currently requested configuration has been fully rolled out.
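Taken together, the input attributes above can be pictured as a single request payload. The following sketch uses the attribute names from this list, but every value (names, namespace, metric, arguments) is a hypothetical example rather than a product default, and the JSON-style shape is illustrative only:

```python
# Hypothetical Deployment payload; field names follow the attribute list above,
# all values are made-up examples.
deployment = {
    "name": "sentiment-api",              # also used for the KServe inference service
    "namespace": "models",                # Kubernetes namespace; must already exist
    "model": "sentiment-classifier",      # name (or id) of the packaged model
    "security": {"authenticationRequired": True},
    "autoScaling": {
        "minReplicas": 1,
        "maxReplicas": 4,
        "metric": "concurrency",          # metric that drives scaling
        "target": 10,                     # target value for that metric
    },
    "canaryTrafficPercent": 100,          # default: all traffic to this version
    "goalStatus": "Ready",                # default intended status
    "environment": {"LOG_LEVEL": "info"},
    "arguments": ["--workers", "2"],      # added to any args on the packaged model
}
```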

PackagedModel

The Packaged Model object identifies the model and code that make up your inference service. The code may be provided via a container image, or via an external hosting method (S3 or huggingface.co registry). The Packaged Model object has the following attributes:

Input Attributes

  • name: The name of the model.
  • description: A text description of the model.
  • modelFormat: Model format for downloaded models (e.g. from S3, HTTP, etc.).
  • registry: The name or id of a registry object. If the model data is not provided via a container image, this must be specified.
  • url: Reference to the Bento or model to be served.
  • resources: The resource requirements for running the service (requests/limits) for cpu/memory/gpu.
  • image: The container image (containerized Bento) from which the inference service is deployed.
  • environment: Environment variables to be provided to the container image when started. See Packaged Model Environment Variables for a list of default options.
  • arguments: Arguments to be passed to the container image when started.

Managed Attributes

  • id: A unique identifier for this particular model version.
  • version: An automatically incrementing integer version of the model as you make changes.
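As with Deployment, the Packaged Model attributes can be illustrated as one payload. The attribute names below come from the list above; the values, the `modelFormat` string, and the resource quantities are all hypothetical examples:

```python
# Hypothetical PackagedModel payload; names follow the attribute list above,
# values are invented for illustration.
packaged_model = {
    "name": "sentiment-classifier",
    "description": "BERT-based sentiment model",
    "modelFormat": "bento-archive",        # example format value, not a product default
    "registry": "team-s3",                 # required when no container image is given
    "url": "s3://ml-models/sentiment/v3",  # reference to the Bento or model to serve
    "resources": {
        "requests": {"cpu": "1", "memory": "2Gi"},
        "limits": {"cpu": "2", "memory": "4Gi", "gpu": "1"},
    },
    "environment": {"BATCH_SIZE": "8"},    # passed to the container at start
    "arguments": ["--timeout", "30"],
}
```

Note that `registry` and `url` are only meaningful for externally hosted models; a model delivered entirely as a container image would instead set `image`.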

Registry

The Registry object provides the metadata that describes how to download a Packaged Model for deployment.

Input Attributes

  • name: The name of the registry, used to access it via the REST interface or CLI.
  • description: A text description of the registry.
  • type: The type of this model registry.
  • endpointUrl: The registry endpoint (host).
  • bucket: The bucket name (for S3-style registries) or organization name (for huggingface.co), depending on the registry type.
  • accessKey: The access key, username or team name for the registry.
  • secretKey: The password, secret key, or access token for the registry.
  • insecureHttps: If true, the server certificate for HTTPS endpoints is accepted without validation.

Managed Attributes

  • id: A unique identifier for this registry.
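A Registry definition for an S3-hosted model might look like the sketch below. Field names follow the attribute list above; the `type` string, endpoint, and credentials are all hypothetical placeholders:

```python
# Hypothetical Registry payload for an S3-style backend; all values are examples.
registry = {
    "name": "team-s3",
    "description": "Team model bucket",
    "type": "s3",                       # example registry type, not a product default
    "endpointUrl": "s3.example.com",    # registry endpoint (host)
    "bucket": "ml-models",              # bucket name for an S3-style registry
    "accessKey": "AKIAEXAMPLE",         # access key / username / team name
    "secretKey": "example-secret",      # password / secret key / access token
    "insecureHttps": False,             # validate the server certificate
}
```

A Packaged Model would then reference this object by its `name` (here, `team-s3`) in its `registry` attribute.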

DeploymentStateDetails

The state details of an inference service Deployment are described with the following attributes:

Attributes

  • endpoint: The endpoint URI used to access the inference service.
  • nativeAppName: The name of the Kubernetes application for the specific service version. Use this name to match the app value in Grafana/Prometheus to obtain logs and metrics for this deployed service version.
  • status: The status of a particular inference service revision.
  • trafficPercentage: Percent of traffic being processed by this service/model version.
  • failureInfo: A list of any failures associated with the deployment of this service/model version.
  • modelId: The id of the deployed packaged model associated with this state.
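A reported state object might therefore look like the following sketch. The attribute names come from the list above; every value is a hypothetical example of what a healthy, fully rolled-out revision could return:

```python
# Hypothetical DeploymentStateDetails as reported for one service/model version;
# all values are invented examples.
state = {
    "endpoint": "https://sentiment-api.models.example.com",
    "nativeAppName": "sentiment-api-predictor-00002",  # match the "app" value in Grafana/Prometheus
    "status": "Ready",
    "trafficPercentage": 100,    # percent of traffic handled by this version
    "failureInfo": [],           # empty when no deployment failures occurred
    "modelId": "pm-12345",       # id of the deployed packaged model
}
```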