Google Kubernetes Engine (GKE) Cluster

GenAI Studio can be installed on a cluster that is hosted on a managed Kubernetes service such as Google Kubernetes Engine (GKE). To understand how Machine Learning Development Environment runs on Kubernetes, visit Deploy on Kubernetes.

Prerequisites

Before starting the installation, you’ll need the following:

  • A Google Cloud account.
  • Access to a service account with permissions needed to create instances and clusters.
  • Identify a region with A100 GPUs available.
  • Ensure that you have either a filesystem PVC designated for GenAI Studio or a StorageClass in your cluster that supports ReadWriteMany for GenAI Studio models and datasets.

GPU Finder

You can use the GPU Finder to find and provision Compute Engine instances with GPUs. Alternatively, a gcloud query that lists GPU availability by zone is sketched below.
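
If you prefer the command line, the following sketch lists the zones where A100 GPUs are offered. The accelerator type name and output format are illustrative assumptions; run the command without the filter to see every GPU type.

# List zones that offer NVIDIA A100 GPUs.
gcloud compute accelerator-types list \
  --filter="name:nvidia-tesla-a100" \
  --format="table(name, zone)"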

Supported GPUs

Model configuration testing was conducted on AWS instances. Review the tested AWS instances and choose GKE machine types that match the GPU type and GPU count and have a similar memory size when setting up your cluster on GKE.

In general, GenAI Studio supports the following GPU types:

  • A100
  • V100
  • T4

See also: Supported hardware reference.

How to Install GenAI Studio via GKE

Create a GKE Cluster

  • Go to Google Cloud Kubernetes Engine and select Create Cluster under your project.
  • Choose Standard: You manage your cluster, then select Configure.
  • Set the region to the region you identified in the prerequisites.
  • Add a node pool with A100 GPUs. Avoid using autoscaling to prevent potential issues with node availability. (A gcloud sketch of the cluster and node pool creation follows this list.)
    • Ensure that any additional GPU node pools you add have a boot disk size of at least 400 GB.
  • Establish a large shared network drive with ReadWriteMany enabled for GenAI Studio models and datasets.
    • To do this, you can select Enable Filestore CSI Driver under Features.
    • Or, you can use any other form of shared PVC with ReadWriteMany enabled.
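
As a reference, here is a minimal gcloud sketch of the steps above. The cluster name, node pool name, region, and machine type are placeholders; adjust them to the region and GPU count you identified in the prerequisites.

# Create a Standard GKE cluster with the Filestore CSI driver enabled
# (cluster name and region are placeholders).
gcloud container clusters create genai-cluster \
  --region us-central1 \
  --addons=GcpFilestoreCsiDriver

# Add an A100 node pool without autoscaling and with a 400 GB boot disk.
# A2 machine types such as a2-highgpu-1g come with A100 GPUs attached.
gcloud container node-pools create a100-pool \
  --cluster genai-cluster \
  --region us-central1 \
  --machine-type a2-highgpu-1g \
  --disk-size 400 \
  --num-nodes 1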

Connect and Enable NVIDIA Drivers

  • Once the cluster is deployed, connect to it using kubectl (an example of fetching credentials and verifying the GPU nodes follows the command below).
  • Then, execute the following command to install the NVIDIA drivers:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
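
For reference, here is a minimal sketch of connecting to the cluster and confirming that the GPUs are exposed. The cluster name and region are placeholders, and the label used to find the driver installer pods is an assumption based on the manifest above.

# Fetch kubectl credentials for the cluster (name and region are placeholders).
gcloud container clusters get-credentials genai-cluster --region us-central1

# Check that the driver installer pods are running
# (the label selector is an assumption; adjust it if needed).
kubectl get pods -n kube-system -l k8s-app=nvidia-driver-installer

# Confirm that the GPUs are allocatable on the GPU nodes.
kubectl describe nodes | grep "nvidia.com/gpu"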

Configure Helm Values.yaml

  1. Update the default values.yaml with your preferred settings. See the configuration reference guide for an exhaustive list of options.
  2. Add the genai section to the values.yaml file.
    ## Configure GenAI Deployment
    genai:
      ## Version of GenAI to use. If unset, GenAI will not be deployed
      version: "0.2.4"
    
      ## Port for GenAI backend to use
      port: 9011
    
      ## Port for GenAI message queue
      messageQueuePort: 9013
    
      ## Secret to pull the GenAI image
      # imagePullSecretName:
    
      ## GenAI pod memory request
      memRequest: 1Gi
    
      ## GenAI pod cpu request
      cpuRequest: 100m
    
      ## GenAI pod memory limit
      # memLimit: 1Gi
    
      ## GenAI pod cpu limit
      # cpuLimit: 2
    
      ## Host Path for the Shared File System for GenAI.
      ## If you are providing your own shared file system to use
      ## in GenAI, specify its host path here.
      ## Note: This takes precedence over creating a PVC with
      ## `generatedPVC`.
      # sharedFSHostPath:
    
      ## Internal path to mount the shared_fs drive to.
      ## If you are using multiple shared file systems, it can help 
      ## to be able to configure where the file systems mount.
      ## When unset, defaults to `/run/determined/workdir/shared_fs`
      # sharedFSMountPath: /run/determined/workdir/shared_fs
    
      ## PVC Name for the shared file system for GenAI.
      ## Note: Either `sharedPVCName` or `generatedPVC.storageSize` (to
      ## generate a new PVC) is required for GenAI deployment
      # sharedPVCName:
    
      ## Spec for the generated PVC for GenAI
      ## Note: In order to generate a shared PVC, you will need access to a
      ## StorageClass that can provide a ReadWriteMany volume
      generatedPVC:
        ## Storage class name for the generated PVC
        storageClassName: standard-rwx
    
        ## Size of the generated PVC
        storageSize: 1Ti
    
      ## Unix Agent Group ID for the Shared Filesystem.
      ## This setting is required to run your cluster with unprivileged users.
      ## This setting is not required if users will be running their experiments as root.
      ## Note: All users that work with GenAI need to have this assigned as their
      ## Agent Group ID in the User Admin settings.
      ## More info here: 
      ## https://hpe-ai-solutions-documentation.netlify.app/products/gen-ai/latest/admin/set-up/deployment-guides/kubernetes/enforce-shared-user-agent-ids/
      # agentGroupID: 1100
    
      ## Whether or not we should attempt to run a Helm hook to initialize
      ## the shared filesystem to use the agentGroupID as its group.
      ## This must be turned off on clusters that disable pods that can run as root.
      ## More info here: 
      ## https://hpe-ai-solutions-documentation.netlify.app/products/gen-ai/latest/admin/set-up/deployment-guides/kubernetes/enforce-shared-user-agent-ids/
      shouldInitializeSharedFSGroupPermissions: false
    
      ## Extra Resource Pool Metadata is hardcoded information about the
      ## GPUs available to the resource pools. This information
      ## is not provided in k8s so we provide it directly.
      ## Note: All resource pools defined here need to also be reflected in
      ## the .Values.resourcePools.
      # extraResourcePoolMetadata:
      #   A100:
      #     gpu_type: A100
      #     max_agents: 3
      #   V100:
      #     gpu_type: V100
      #     max_agents: 2
    • Version of GenAI to use: Under version, set the version to 0.2.4.
    • sharedPVCName: Under sharedPVCName, specify the name of the shared PVC in your cluster that is designated as the shared network drive for GenAI Studio. Otherwise, ensure that the storageClassName reflects a StorageClass with ReadWriteMany enabled.
    • Extra Resource Pool Metadata: For every GPU type present in the cluster, add an entry under extraResourcePoolMetadata. More specifically, you must manually specify the GPU type and max agents (physical nodes) for any of the GPU-based resource pools you are using.
  3. Update the values.yaml resourcePools section to include the resource pools you want to use, along with any appropriate taints and tolerations. The types of GPUs available depend on the hardware you have access to. (A sketch of creating a node pool whose taint matches these tolerations follows the example below.)
    resourcePools:
      - pool_name: A100
        task_container_defaults:
          kubernetes:
            max_slots_per_pod: 8
          gpu_pod_spec:
            apiVersion: v1
            kind: Pod
            spec:
              tolerations:
                - key: "accelerator"
                  operator: "Equal"
                  value: "NVIDIA-A100-ABCD-80GB"
                  effect: "NoSchedule"
      - pool_name: T4
        task_container_defaults:
          kubernetes:
            max_slots_per_pod: 6
          gpu_pod_spec:
            apiVersion: v1
            kind: Pod
            spec:
              tolerations:
                - key: "accelerator"
                  operator: "Equal"
                  value: "Tesla-T4"
                  effect: "NoSchedule"

Install

  1. Add the Determined Helm repository:
    helm repo add determined-ai https://helm.determined.ai/
  2. List repos:
    helm repo list
  3. Update the repo (Helm does not update it automatically):
    helm repo update
  4. Show current version of determined-ai in repo:
    helm search repo determined
  5. Install the latest version of the determined Helm chart with your modified values.yaml (verification commands follow this list):
    helm install -f values.yaml \
    --generate-name determined-ai/determined \
    --version "0.35.0" \
    --set maxSlotsPerPod=4
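
To confirm the deployment, the following quick checks can help; release names and the exposed service address will vary with your configuration.

# List Helm releases and confirm the chart installed.
helm list

# Wait for the Determined and GenAI pods to reach the Running state.
kubectl get pods

# Inspect the services to find the address of the master UI.
kubectl get services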

Configure Shared Filesystem and User Permissions

After setting up your Kubernetes cluster, you should next configure the shared filesystem and user permissions to ensure effective management of datasets. This configuration step prevents permission-related problems when accessing datasets. For detailed guidance and a sample script, see Enforcing Shared User Agent Group IDs.
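
If you set agentGroupID in values.yaml but cannot use the built-in initialization hook (shouldInitializeSharedFSGroupPermissions), the group ownership of the shared filesystem has to be set by other means. The one-off pod below is a minimal, hypothetical sketch; the PVC name, mount path, and group ID are assumptions and should match your values.yaml.

# One-off pod that sets group ownership on the shared PVC
# (PVC name "genai-shared-fs" and group ID 1100 are assumptions).
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: init-shared-fs-permissions
spec:
  restartPolicy: Never
  containers:
    - name: init
      image: busybox
      command: ["sh", "-c", "chgrp -R 1100 /shared_fs && chmod -R g+rwX /shared_fs"]
      volumeMounts:
        - name: shared-fs
          mountPath: /shared_fs
  volumes:
    - name: shared-fs
      persistentVolumeClaim:
        claimName: genai-shared-fs
EOF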

Setting Up Multiple Resource Pools