Failed Deployment

Deploying an inference service can fail for a variety of reasons. This guide helps you troubleshoot the most common causes.

Before You Start

Use kubectl to describe the revision and the deployment to get more information about the failure.

kubectl describe revision.serving.knative.dev/<REVISION_NAME>
kubectl describe deployment <DEPLOYMENT_NAME>
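
Events recorded in the namespace often contain the most specific failure reason, such as scheduling or image pull errors. A minimal check, with the namespace shown as a placeholder:

kubectl get events -n <NAMESPACE> --sort-by=.lastTimestamp
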
Ephemeral Storage Considerations

Default disk sizes on cloud providers may not be sufficient for large models, which often require significant ephemeral storage. This can cause the inference service to fail to start serving, sometimes without any error message indicating that the node is out of disk space.

Ephemeral storage is normally provided by the boot disk of the compute nodes. You can inspect the amount of ephemeral storage available on a node with the kubectl describe node <NODE_NAME> command.
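
To compare allocatable ephemeral storage across all nodes at once, the following sketch uses standard node fields; the exact units in the output depend on how the kubelet reports them:

kubectl get nodes -o custom-columns=NODE:.metadata.name,EPHEMERAL_STORAGE:.status.allocatable.ephemeral-storage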


Errors

FailedToRetrieveImagePullSecret

This warning means Kubernetes could not retrieve the image pull Secret referenced by the Pod or its service account, typically because the Secret does not exist in the namespace or its name is misspelled.
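
As a quick check, confirm that the referenced Secret exists in the same namespace as the service; the Secret and namespace names below are placeholders:

kubectl get secret <SECRET_NAME> -n <NAMESPACE>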

Insufficient nvidia.com/gpu

This scheduling error means no node in the cluster has enough free nvidia.com/gpu resources to satisfy the Pod's GPU request, either because the GPUs are already allocated to other Pods or because no node advertises that resource.
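
To compare a node's GPU capacity with what is currently allocated, one option is to filter its resource summary; the node name is a placeholder:

kubectl describe node <NODE_NAME> | grep -i "nvidia.com/gpu"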

Internal Error

This is a generic failure reported on the revision or deployment. Inspect the revision's status conditions and the Pod's container logs to find the underlying cause.
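
One way to pull logs from every container in the failing Pod; the Pod and namespace names are placeholders:

kubectl logs <POD_NAME> -n <NAMESPACE> --all-containers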

Model Load Failed

This usually means the model server could not load the model artifact. Check the server logs for the specific error, and verify that the model URI is correct and that the Pod has enough memory and ephemeral storage to hold the model.
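
If you are using KServe, the model is typically downloaded by a storage-initializer init container, and its logs often reveal download or disk-space failures; the Pod and namespace names are placeholders:

kubectl logs <POD_NAME> -n <NAMESPACE> -c storage-initializer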

RevisionMissing, RevisionFailed

These conditions indicate that the Knative revision backing the service was not created or never became ready. Describe the failing revision, as shown in Before You Start, to see why it did not become ready.
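
To list the revisions that belong to the service and see which are ready, the sketch below assumes Knative Serving's standard labels; the service and namespace names are placeholders:

kubectl get revisions.serving.knative.dev -n <NAMESPACE> -l serving.knative.dev/service=<SERVICE_NAME>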