Pre-load Packaged Model (PVC)

When deploying an inference service, it can take a significant amount of time for the service to reach the Ready state if it has to download and load a large AI model during startup. To optimize this process, you can preload the model onto a Persistent Volume Claim (PVC), allowing the service to access the model immediately upon startup. This approach improves the responsiveness of your deployment, particularly in scenarios requiring rapid scaling or high availability.

Before You Start

  • Create a PVC: It is your responsibility to create a PVC in the same Kubernetes namespace where you intend to deploy your inference service (a minimal sample manifest is shown after this list).
  • Sufficient Storage: Make sure the PVC has enough capacity to store the entire model. The storage requirements vary based on the size of the model you are using.
  • Model-Specific Requirements: Review any resources specific to the model you want to preload, such as access approvals or authentication tokens (see the note below).
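As a reference, a minimal PVC manifest might look like the following. The access mode, 50Gi capacity, and commented-out storage class are assumptions; adjust them to your cluster and to the size of your model.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models-cache-pvc
spec:
  accessModes:
    - ReadWriteOnce          # assumption: a single node downloads and serves the model
  resources:
    requests:
      storage: 50Gi          # assumption: size this to the model you are preloading
  # storageClassName: standard   # assumption: set this to a storage class available in your cluster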

Note on HuggingFace Tokens

If the model you wish to download from HuggingFace requires authentication, you must provide a HuggingFace access token. The following HuggingFace example assumes that this token is available to the download container through the HF_TOKEN environment variable.
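Rather than hardcoding the token in a manifest, you can store it in a Kubernetes Secret and reference it from the Job (an alternative env snippet is shown after the Job manifest below). The secret name hf-secret is an example; any name works.

# Assumes your token is in the HF_TOKEN shell variable; "hf-secret" is an example name
kubectl create secret generic hf-secret --from-literal=HF_TOKEN="$HF_TOKEN"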


Examples

HuggingFace

The following example demonstrates how to preload the meta-llama/Meta-Llama-3-8B-Instruct model from HuggingFace onto a PVC named models-cache-pvc. This model requires a HuggingFace token and access permissions, which can be requested at huggingface.co. The example uses a Kubernetes Job that runs the huggingface-cli command to download the model into the /mnt/models/meta-llama3-8b-instruct directory on the PVC; however, you can customize the Job to use any method that suits your needs for preloading models.

apiVersion: batch/v1
kind: Job
metadata:
  name: download-llama3-8b-instruct-model
spec:
  template:
    spec:
      containers:
      - name: model-installer
        image: kserve/huggingfaceserver:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            # Target directory on the mounted PVC
            MODEL_DIR="/mnt/models/meta-llama3-8b-instruct"

            # Exit non-zero on failure so the Job retries (restartPolicy: OnFailure)
            if huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --token "$HF_TOKEN" --local-dir "$MODEL_DIR"; then
              echo "Model download complete"
            else
              echo "Model download failed" >&2
              exit 1
            fi
        env:
        - name: HF_TOKEN
          value: hf_XXXXXXXXXXXXXXXXXXXXXX  # replace with your token, or use a Secret (see below)
        volumeMounts:
        - name: models-cache
          mountPath: /mnt/models
      restartPolicy: OnFailure
      volumes:
      - name: models-cache
        persistentVolumeClaim:
          claimName: models-cache-pvc
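
If you created the hf-secret Secret suggested earlier, you can replace the hardcoded HF_TOKEN entry in the Job above with a reference to that Secret. The secretKeyRef mechanism is standard Kubernetes; only the secret name is an assumption.

        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret   # the example Secret created earlier
              key: HF_TOKEN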

Applying the Kubernetes Job

After creating the YAML file (e.g., download-llama3-8b-instruct-model.yaml) with the content provided in the example above, you can apply it to your Kubernetes cluster using the following command:

kubectl apply -f download-llama3-8b-instruct-model.yaml
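
Large models can take several minutes to download. If you want to block until the Job finishes, you can optionally wait on its completion condition; the 30-minute timeout below is an assumption, so size it to your model and network.

kubectl wait --for=condition=complete --timeout=30m job/download-llama3-8b-instruct-model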
Tip

You can monitor the Job's logs to follow the progress of the model download onto the PVC or to check for any errors that may have occurred:

kubectl logs job/download-llama3-8b-instruct-model

When the model has been successfully downloaded, the last few lines of the log should look similar to the following output:

Download complete. Moving file to /mnt/models/meta-llama3-8b-instruct/model-00002-of-00004.safetensors
Fetching 17 files:  47%|████▋     | 8/17 [00:52<01:07,  7.49s/it]Download complete. Moving file to /mnt/models/meta-llama3-8b-instruct/original/consolidated.00.pth
Fetching 17 files: 100%|██████████| 17/17 [02:02<00:00,  7.22s/it]
/mnt/models/meta-llama3-8b-instruct
Model download complete

You can now reference the model stored on the PVC in your inference service deployment, as sketched below.
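
For instance, a minimal KServe InferenceService that loads the preloaded model through a pvc://<claim-name>/<path> storage URI might look like the following sketch. The service name and the huggingface model format are assumptions based on this example's model; adjust them to your runtime.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-8b-instruct         # example name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface          # assumption: serving with KServe's HuggingFace runtime
      storageUri: pvc://models-cache-pvc/meta-llama3-8b-instruct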