Failed Deployment

Deploying an inference service can fail for various reasons. This guide helps you troubleshoot common issues that may cause a deployment to fail.

Before You Start

Use kubectl to describe the revision and the deployment to get more information about the failure. If you do not know the revision name, list the revisions with kubectl get revisions.

kubectl describe revision.serving.knative.dev/<REVISION_NAME>
kubectl describe deployment <DEPLOYMENT_NAME>

Ephemeral Storage Considerations

Default disk sizes on cloud providers may not be sufficient for large models, which often require significant ephemeral storage. As a result, the inference service may fail to start serving, in some instances without any error message indicating that the disk is full.

Ephemeral storage is normally provided by the boot disk of the compute nodes. You can inspect the amount of ephemeral storage on your nodes using the kubectl describe node <node-name> command.
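As a minimal sketch, the ephemeral storage lines can be filtered out of the describe output. The function name ephemeral_storage is a hypothetical helper, not part of kubectl or MLIS, and it assumes the node reports its storage under the standard ephemeral-storage resource name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of kubectl or MLIS): read
# `kubectl describe node` output on stdin and keep only the
# ephemeral-storage lines, with leading indentation trimmed.
ephemeral_storage() {
  grep 'ephemeral-storage' | sed 's/^[[:space:]]*//'
}
```

For example, kubectl describe node <node-name> | ephemeral_storage prints the Capacity and Allocatable values, which you can compare against the size of the model you are deploying.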

Errors

Insufficient nvidia.com/gpu

A service with a GPU request and limit of 1 cannot be scheduled on a cluster that has no GPU nodes, so the deployment remains in the deploying state. Describing the revision with kubectl shows a status with messages like the following:

Revision Output Example

...
    Resources:
      Limits:
        Cpu:             1
        Memory:          1Gi
        nvidia.com/gpu:  1
      Requests:
        Cpu:             1
        Memory:          1Gi
        nvidia.com/gpu:  1
...
Status:
  Actual Replicas:  0
  Conditions:
    Last Transition Time:  2024-02-23T21:35:21Z
    Message:               Requests to the target are being buffered as resources are provisioned.
    Reason:                Queued
    Severity:              Info
    Status:                Unknown
    Type:                  Active
    Last Transition Time:  2024-02-23T21:35:21Z
    Reason:                Deploying
    Status:                Unknown
    Type:                  ContainerHealthy
    Last Transition Time:  2024-02-23T21:35:21Z
    Message:               0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
    Reason:                Unschedulable
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-02-23T21:35:21Z
    Message:               0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
    Reason:                Unschedulable
    Status:                False
    Type:                  ResourcesAvailable
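
In long describe output, the failing conditions can be hard to spot. The following is a rough sketch of a filter that prints only the conditions whose Status is False. The function name failed_conditions is hypothetical, and the sketch assumes each condition lists Reason and Status before Type, as in the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl describe revision` output on stdin
# and print "<Type>: <Reason>" for every condition whose Status is False.
# Assumes Reason and Status appear before Type within each condition.
failed_conditions() {
  awk '
    $1 == "Reason:" { reason = $2 }
    $1 == "Status:" { status = $2 }
    $1 == "Type:" {
      if (status == "False") print $2 ": " reason
      reason = ""; status = ""   # start fresh for the next condition
    }
  '
}
```

Piping the describe output for the revision through failed_conditions would reduce the example above to the Ready and ResourcesAvailable conditions, both with the reason Unschedulable. To confirm the cause, run kubectl describe node <node-name> and check whether nvidia.com/gpu is listed under Allocatable.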

Internal Error

KServe generates InternalError events to report unexpected behavior within the KServe infrastructure. These events typically show up as Kubernetes events (kubectl get events). MLIS surfaces these errors both in a deployment’s Timeline tab and when you run the aioli d events command.

These are not MLIS issues, and in most cases KServe resolves them automatically with no user action required.

In cases where these errors appear in the Errors tab of the deployment (as opposed to the Timeline tab), there may be a configuration problem that needs to be resolved.

The following examples were all informational and were automatically resolved by KServe:

kubectl get events | grep InternalError
59m           Warning   InternalError            revision/nim-jjh-predictor-00001                     failed to update deployment "nim-jjh-predictor-00001-deployment": Operation cannot be fulfilled on deployments.apps "nim-jjh-predictor-00001-deployment": the object has been modified; please apply your changes to the latest version and try again
5m39s       Warning   InternalError            revision/nim-jjh-predictor-00001                     Unable to fetch image "determinedai/aioli-logger": failed to resolve image to digest: Get "https://index.docker.io/v2/": context deadline exceeded
7m23s       Warning   InternalError            route/nim-jjh-predictor                                      failed to remove route annotation to /, Kind= "nim-jjh-predictor": configurations.serving.knative.dev "nim-jjh-predictor" not found
5m35s       Warning   InternalError            inferenceservice/nim-jjh                                    fails to update InferenceService status: Operation cannot be fulfilled on inferenceservices.serving.kserve.io "nim-jjh": the object has been modified; please apply your changes to the latest version and try again
14m           Warning   InternalError            revision/nim-jjjh2-predictor-00001                  Unable to fetch image "determinedai/aioli-runtimes:utils.1": failed to resolve image to digest: Head "https://index.docker.io/v2/determinedai/aioli-runtimes/manifests/utils.1": context deadline exceeded
14m           Warning   InternalError            revision/nim-jjjh2-predictor-00001                  failed to update deployment "nim-jjjh2-predictor-00001-deployment": Operation cannot be fulfilled on deployments.apps "nim-jjjh2-predictor-00001-deployment": the object has been modified; please apply your changes to the latest version and try again
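
To see whether one object keeps reporting InternalError events, it can help to count them per object. The sketch below is one possible way to do that. The function name internal_errors_by_object is hypothetical, and it assumes the default kubectl get events columns (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get events` output on stdin and print
# "<count> <object>" for each object that has InternalError events.
# Assumes the default column order: LAST SEEN, TYPE, REASON, OBJECT, MESSAGE.
internal_errors_by_object() {
  awk '$3 == "InternalError" { count[$4]++ }
       END { for (obj in count) print count[obj], obj }' | LC_ALL=C sort -k2
}
```

For example, kubectl get events | internal_errors_by_object. A count that keeps growing for a single object is more likely to point to a configuration problem than a handful of isolated events. As an alternative to grep, kubectl get events --field-selector reason=InternalError filters on the server side.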

Model Load Failed

RevisionMissing, RevisionFailed

If a deployment reports opaque or “unknown” error messages with the failureType RevisionMissing or RevisionFailed, the memory request set on your packaged model may be too low. Try submitting a new version of the model with a higher memory request to resolve the issue.

Example error message:

'Revision "my-service-predictor-00001" failed with message: Container
failed with: failed to start containerd task "f445020dbaa45c92f65a28cf8daad2b3aadaa00bc8e61decee2c498d8e576886":
cannot start a stopped process: unknown.'

kubectl get pods
...
my-service-predictor-00001-deployment-c5b55bb9f-g6484   1/4     CrashLoopBackOff   8 (111s ago)   18m
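
When many pods are running, a small filter can pick out the ones that are crash looping. This is only a sketch: the function name crashlooping_pods is hypothetical, and it assumes the default kubectl get pods columns, with STATUS as the third column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# the name of every pod whose STATUS column is CrashLoopBackOff.
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}
```

For example, kubectl get pods | crashlooping_pods. You can then run kubectl describe pod <pod-name> on each result. A container whose Last State is Terminated with the reason OOMKilled exceeded its memory limit, which would be consistent with the memory settings being too low.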
