Default disk sizes on cloud providers may not be sufficient for large models, which often require significant ephemeral storage. This can prevent the inference service from starting, in some instances without any error message indicating that the disk is full.
Ephemeral storage is normally provided by the boot disk of the compute nodes. You can inspect the amount of ephemeral storage on your nodes using the kubectl describe node <node-name> command.
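As a rough sketch, the relevant lines can be pulled out of the describe output with a small shell filter. The function name ephemeral_storage_lines is hypothetical, the commented commands assume kubectl is already configured for your cluster, and <node-name> remains a placeholder.

```shell
# Hypothetical helper: keep only the ephemeral-storage lines from
# `kubectl describe node` output read on stdin, with leading spaces removed.
ephemeral_storage_lines() {
  grep 'ephemeral-storage' | sed 's/^[[:space:]]*//'
}

# Usage (needs cluster access):
#   kubectl describe node <node-name> | ephemeral_storage_lines
#
# Alternative that lists every node at once using kubectl's custom-columns output:
#   kubectl get nodes -o custom-columns='NAME:.metadata.name,EPHEMERAL:.status.allocatable.ephemeral-storage'
```

The describe output typically reports the value under both Capacity and Allocatable; the Allocatable figure is the amount available to pods.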
A service with a GPU request and limit of 1 will not finish deploying on a cluster that has no GPU nodes; it remains in the deploying state. Using kubectl to describe the revision shows the status with the following messages:
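Separately from those status messages, one way to confirm the cluster side of this failure is to check whether any node advertises a GPU. The sketch below is illustrative only: count_gpu_nodes is a hypothetical helper, and it assumes the GPUs are exposed under the NVIDIA resource name nvidia.com/gpu, which may differ on your cluster.

```shell
# Hypothetical helper: count nodes whose GPU column is a non-zero number.
# Reads the two-column table (NAME, GPU) produced by the command below on stdin.
# Nodes without the resource show "<none>" and are not counted.
count_gpu_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { n++ } END { print n + 0 }'
}

# Usage (needs cluster access; assumes the resource name nvidia.com/gpu):
#   kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' | count_gpu_nodes
```

A result of 0 suggests that no node can satisfy a GPU request of 1.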
KServe generates InternalError events to report unexpected behavior within the KServe infrastructure. These events typically show up as Kubernetes events (kubectl get events). MLIS surfaces these errors both in a deployment’s Timeline tab and when you run the aioli d events command.
These are not MLIS issues and, in most cases, KServe resolves them automatically with no user action required.
In cases where these errors appear in the Errors tab of the deployment (as opposed to the Timeline tab), there may be a configuration problem that needs to be resolved.
The following examples were all informational and were automatically resolved by KServe:
kubectl get events | grep InternalError
59m Warning InternalError revision/nim-jjh-predictor-00001 failed to update deployment "nim-jjh-predictor-00001-deployment": Operation cannot be fulfilled on deployments.apps "nim-jjh-predictor-00001-deployment": the object has been modified; please apply your changes to the latest version and try again
5m39s Warning InternalError revision/nim-jjh-predictor-00001 Unable to fetch image "determinedai/aioli-logger": failed to resolve image to digest: Get "https://index.docker.io/v2/": context deadline exceeded
7m23s Warning InternalError route/nim-jjh-predictor failed to remove route annotation to /, Kind="nim-jjh-predictor": configurations.serving.knative.dev "nim-jjh-predictor" not found
5m35s Warning InternalError inferenceservice/nim-jjh fails to update InferenceService status: Operation cannot be fulfilled on inferenceservices.serving.kserve.io "nim-jjh": the object has been modified; please apply your changes to the latest version and try again
14m Warning InternalError revision/nim-jjjh2-predictor-00001 Unable to fetch image "determinedai/aioli-runtimes:utils.1": failed to resolve image to digest: Head "https://index.docker.io/v2/determinedai/aioli-runtimes/manifests/utils.1": context deadline exceeded
14m Warning InternalError revision/nim-jjjh2-predictor-00001 failed to update deployment "nim-jjjh2-predictor-00001-deployment": Operation cannot be fulfilled on deployments.apps "nim-jjjh2-predictor-00001-deployment": the object has been modified; please apply your changes to the latest version and try again
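When many such events accumulate, a per-object tally can make it easier to see which resources are affected. The following is a minimal sketch, not part of MLIS or KServe: internal_error_counts is a hypothetical helper, and it assumes the default kubectl get events column order (age, type, reason, object, message) seen in the output above.

```shell
# Hypothetical helper: count InternalError events per object.
# Reads `kubectl get events` output on stdin; field 3 is the reason
# and field 4 is the object, as in the default column layout.
internal_error_counts() {
  awk '$3 == "InternalError" { count[$4]++ }
       END { for (obj in count) print count[obj], obj }' | LC_ALL=C sort -k2
}

# Usage (needs cluster access):
#   kubectl get events | internal_error_counts
#
# kubectl can also filter by reason on the server side:
#   kubectl get events --field-selector reason=InternalError
```

Each output line is a count followed by the object name, sorted by object.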
When rolling out changes to a deployment that previously encountered errors, KServe may surface failure messages from a prior version of the inference service. This can cause the deployment to show a Failed status while the new deployment is in progress.
To determine if this is the case:
Check the Latest event column to see if a new deployment is in progress.
Select the deployment and review the Timeline tab to check previous events.
These errors will clear on their own as the new deployment reaches a Ready status.
If a deployment reports opaque or “unknown” error messages with the failureType RevisionMissing or RevisionFailed, the memory requests set on your packaged model may be insufficient. To resolve the issue, try submitting a new version of the model with higher memory requests.
Example error message:
'Revision "my-service-predictor-00001" failed with message: Container
failed with: failed to start containerd task "f445020dbaa45c92f65a28cf8daad2b3aadaa00bc8e61decee2c498d8e576886":
cannot start a stopped process: unknown.'
kubectl get pods
...
my-service-predictor-00001-deployment-c5b55bb9f-g6484 1/4 CrashLoopBackOff 8 (111s ago) 18m
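To follow up on a pod listing like this one, the sketch below picks out the crash-looping pods and shows a standard kubectl query for the reason a container last terminated. The helper name crashlooping_pods is hypothetical, <pod-name> is a placeholder, and an OOMKilled reason is only one possible sign of memory pressure; it may not appear in every case described here.

```shell
# Hypothetical helper: print the names of pods in CrashLoopBackOff.
# Reads `kubectl get pods` output on stdin; field 3 is the STATUS column.
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Usage (needs cluster access):
#   kubectl get pods | crashlooping_pods
#
# Show why each container in a pod last terminated:
#   kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
```

If the reported reason is OOMKilled, the container exceeded its memory limit, which is consistent with the memory settings being too low.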