Helm Chart Values (HCVs) #

Before You Start #

Review the following dependencies:

Component	Minimum Version	Latest Version Validated	Dependency
Kubernetes	1.20	1.30	Core
Docker	2.6.0	2.6.0	Core
Helm	3.0	3.13.2	Core
KServe	0.11	0.14	Core
Istio	1.18	1.20.4	KServe
Istio Client		1.20.1	KServe
Istio Control Plane		1.20.4	KServe
Istio Data Plane		1.20.4	KServe
Knative	1.10	1.14.5	KServe
Knative Operator		1.14.5	KServe
Knative Serving		1.13.1	KServe
Cert Manager	1.9.0	1.15.1	KServe

Note

If you plan to use a remote PostgreSQL server instead of the default in-cluster PostgreSQL instance, ensure you have a PostgreSQL instance that supports SSL connections. See the Using a Remote PostgreSQL Server section for more details.

Quick Configuration Values #

The following frequently used configuration values are available from the Helm chart.

Parameter	Description	Default
`loadBalancerIP`	Static host/IP for accessing the controller via the aioli-proxy.	Assigned by Kubernetes
`loadBalancerProxyPort`	Port for accessing the controller via the aioli-proxy.	`80`
`logLevel`	For debugging, specify `debug` or `trace`.	`info`
`image.master`	Latest published Master.	Latest Master
`imageRegistry`	The HPE MSC registry for the MLIS SKU you have purchased (e.g., `hub.myenterpriselicense.hpe.com/hpe-mlis/<SKU>`).
`global.imagePullSecrets`	List of Kubernetes secrets with Docker credentials to enable installation from a non-public repository.
`global.env`	List of environment variables to set in the `aioli-master` pod and deployment pods.
`tlsSecret`	k8s secret providing TLS configuration for HTTPS.
`defaultPassword`	Admin account password.	Auto-generated if not set

Tip

You can view the full list of configurable values for the chart using the helm show values aioli-1.3.0.tgz command. In addition, each provided sub-chart (grafana, loki, prometheus, promtail, dex) also offers additional configurable values.
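
As an illustration, the quick configuration values above could be collected into a small override file and passed to Helm with the -f option. This is a minimal sketch, not a recommended configuration: the file name, IP address, Secret names, and password are placeholders chosen for the example.

```yaml
# my-values.yaml (hypothetical file name); every value below is a placeholder.
loadBalancerIP: "203.0.113.10"   # example static address, replace with your own
logLevel: debug                  # the chart default is info
# Replace <SKU> with the product SKU that was assigned to you.
imageRegistry: hub.myenterpriselicense.hpe.com/hpe-mlis/<SKU>
global:
  imagePullSecrets:
    - name: hpe-mlis-registry    # Secret holding the registry credentials
tlsSecret: mlis-tls              # hypothetical Secret of type tls; must already exist
defaultPassword: "<choose-a-password>"
```

Values that are not listed in the override file keep the chart defaults.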

Global Environment Variables #

Specifying environment variables during the Helm install allows you to inject environment variables directly into the aioli-master pod and deployment pods. These environment variables can be used to configure various aspects of your deployment, such as setting a proxy server.

For example, to set the http_proxy environment variable, you can update the values.yaml file as follows:

  global:
    env:
      - name: http_proxy
        value: "http://your-proxy-server:port"

You can also set these values directly from the Helm command line:

  helm install <release_name> <chart_name> --set "global.env[0].name=http_proxy" --set "global.env[0].value=http://your-proxy-server:port"

The environment variables specified during the Helm installation will be injected into the aioli-master pod and the inference service deployment pods. These environment variables can be overridden later in the packaged model or in specific deployments, which also allows additional environment variables to be specified.
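
The same mechanism can carry several variables at once, including values read from a Kubernetes Secret. The sketch below follows the commented example in the default values file; the https_proxy and no_proxy entries and the Secret name are assumptions added for illustration and should be adapted to your environment.

```yaml
global:
  env:
    - name: http_proxy
      value: "http://your-proxy-server:port"
    # https_proxy and no_proxy are conventional companions to http_proxy;
    # they are illustrative additions, not values required by the chart.
    - name: https_proxy
      value: "http://your-proxy-server:port"
    - name: no_proxy
      value: "localhost,127.0.0.1,.svc,.cluster.local"
    # Reads the value from an existing Secret (hypothetical name and key).
    - name: MY_SECRET_ENV
      valueFrom:
        secretKeyRef:
          name: my-secret
          key: my-secret-key
```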

Helm Chart #

Info

By default, the Helm install automatically provides a PostgreSQL instance in the cluster. If you want to use an existing remote PostgreSQL server, see the Using a Remote PostgreSQL Server section below.

# © Copyright 2023-2024 Hewlett Packard Enterprise Development LP
# HPE Machine Learning Inference Software (MLIS) Default Values

global:
  # imagePullSecrets allow you to pull images from private repositories.
  # This is required to access the licensed-MLIS containers, and is useful
  # to avoid potential throttling when accessing public Docker Hub repositories.
  # https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
  # Example:
  # imagePullSecrets:
  #   - name: hpe-mlis-registry
  #   - name: regcred
  #
  # To provide secrets as helm command line arguments, use:
  #   --set "global.imagePullSecrets[0].name=hpe-mlis-registry" --set "global.imagePullSecrets[1].name=regcred"
  imagePullSecrets: []

  # Environment variables to set in the "aioli-master" pod and deployment pods.
  env: []
  #env:
  #- name: MY_VARIABLE
  #  value: "my_value"
  #- name: ANOTHER_VARIABLE
  #  value: "another_value"
  #- name: MY_SECRET_ENV
  #  valueFrom:
  #    secretKeyRef:
  #      name: my-secret
  #      key: my-secret-key

# imageRegistry specifies the source image repository for MLIS-provided images.
imageRegistry: determinedai
# HPE Machine Learning Inference Software (MLIS) uses the HPE MSC as the image registry
#imageRegistry: hub.myenterpriselicense.hpe.com/hpe-mlis/<SKU>
# Replace <SKU> in the URL shown above with the product SKU that was assigned to you, 
# and configure `imagePullSecrets` to include the HPE MSC credentials Kubernetes Secret (e.g. `hpe-mlis-registry`)
#
# To get HPE MSC credentials, go to https://myenterpriselicense.hpe.com and, using the information provided with your order,
# create the HPE MSC credentials as a Kubernetes Secret (e.g. hpe-mlis-registry) with the following command:
# kubectl create secret docker-registry hpe-mlis-registry  \
# --docker-server=hub.myenterpriselicense.hpe.com/hpe-mlis/<SKU> \
# --docker-username=<HPE MSC user name>  \
# --docker-password=<HPE MSC MLIS license key> \
# --docker-email=<HPE MSC user email> \
# -n <MLIS deployment K8s namespace, if any>
#

# The image from imageRegistry to be used to pull the MLIS controller master image.
image:
  master: aioli-master:1.3.0

master:
  nodeSelector: {}
  # Configures the size of the PersistentVolumeClaim for the audit log.
  # Should be adjusted for scale.
  auditLogStorageSize: 1Gi
  # storageClassName configures the StorageClass used by the PersistentVolumeClaim for the
  # audit log. This can be left blank if a default storage class is specified in
  # the cluster. If dynamic provisioning of PersistentVolumes is disabled, users must manually
  # create a PersistentVolume that will match the PersistentVolumeClaim.
  # storageClassName:
  #
  # To improve product design, the controller and WebUI both collect anonymous information
  # about how MLIS is being used. This information includes various metrics and events such
  # as the number of registries, trained models, deployments, and more. Refer to the product
  # documentation for more information (search for telemetry). You can disable this data
  # collection at any time by setting telemetry-enabled to false.
  telemetry-enabled: true

  # Image pull policy for the master image.
  # Valid values are: 'Never', 'IfNotPresent' and 'Always'
  imagePullPolicy: IfNotPresent

# modelsCacheStorage enables model caching on shared network storage.
# Once a model is downloaded to the cache the first time it's used by a deployment,
# subsequent deployments will use the cached model instead of downloading it again.
# Cached models that have not been used by a deployment for the time period
# specified by "purgeUnusedCachedModelsAfter" will be automatically removed from
# the cache. Models are automatically removed from the cache when they are
# removed from the database.
#
# Enable and specify a storageClassName to use model caching.  Refer to the product
# documentation for detailed requirements and limitations.
modelsCacheStorage:
  enabled: false
  checkUnusedCachedModelsEvery: 1d
  purgeUnusedCachedModelsAfter: 1w
  storageSize: 100Gi
  # storageClassName:
  # pvcNameSuffix is text appended to the end of the name of the PVC that MLIS
  # creates for the model cache. When changing the PV used for model caching, supply
  # a new suffix so that the MLIS PVC gets a unique name.
  # pvcNameSuffix: appended-text
  # bypassStorageCheck enables use of storage that has not been validated by MLIS
  # for use with model caching.   The resulting PV must support access to the same
  # files when cloned between namespaces.
  # bypassStorageCheck: false

# Default images used during the deployment 
defaultImages:
  # PostgreSQL image
  postgreSQL: "postgres:16"
  proxy: envoyproxy/envoy:v1.29-latest
  # logger is a side car that supports logging inference request/response data from imageRegistry.
  logger: aioli-logger
  # openllm container with GPU + vllm support from imageRegistry.
  openllm: aioli-runtimes:v3-openllm-0.4.44-py-3.11-cuda-12
  # openllm-cpu container --backend pt for CPU support from imageRegistry.
  openllmCpu: aioli-runtimes:v3-openllm-0.4.44-py-3.11-cpu
  # utils support container to enable model download and certificate management.
  utils: aioli-runtimes:v3-utils
  # bentoml provides the default image that is used to run a bento from
  # S3 storage.
  bentoml: aioli-runtimes:v3-bentoml-py-3.9
  # Kserve storage-initializer container
  storageInitializer: kserve/storage-initializer:v0.11.2
  # PFS support storage-initializer container
  pfs: aioli-runtimes:v1-pfs-2.11.3

# Logger Level in master.yaml - Four severity levels: debug, info, warn, error
logLevel: info

# masterPort configures the port on which the controller listens for connections.
masterPort: 8080

# Request/Limits for Cpu/Memory for the master deployment
masterCpuRequest: 250m
masterCpuLimit: 2
masterMemRequest: 50Mi
masterMemLimit: 2Gi

# Configure a static IP address for the envoy proxy that provides the inbound load balancer.
loadBalancerIP: ""

# Configure the external port for the envoy proxy that provides the inbound load balancer.
# The default value is 80. When tlsSecret is set, the default value is 443. An explicit
# value is always honored.
# loadBalancerProxyPort: 80

# gpuSelector defines the configuration used when deploying a packaged model that
# specifies a gpuType value.
gpuSelector:
    # gke enables the Kserve GKE accelerator annotation when a gpuType is requested.
    # Specify it to override automatic detection of GKE.
    # gke: false
    # tolerationKey is the key name when generating a toleration to match the gpuType
    # value.  Set the value to "" to disable.  The default configuration generates:
    # tolerations:
    #  - effect: NoSchedule
    #    key: accelerator
    #    operator: Equal
    #    value: {gpuType}
    tolerationKey: "accelerator"
    # resourceName allows the mapping from MLIS GPU number to GPU vendors.
    # Default to "nvidia.com/gpu". Specify "amd.com/gpu" to allocate AMD GPUs.
    # resourceName: "nvidia.com/gpu"

# Enables the creation of non-namespaced objects - Default: true
# Non-namespaced objects are cluster-wide resources, such as the PriorityClasses.
# In multiple installations on a single cluster (using different namespaces),
# setting this flag to false avoids recreating non-namespaced objects. In some cases (e.g., GitOps w/ArgoCD)
# creating existing cluster-wide resources could stop/hang automatic deployments.
#
# WARNING 
# The first installation must run with the createNonNamespacedObjects flag set to true to ensure 
# the non-namespaced objects are created.
createNonNamespacedObjects: true

# External ca.crt injection certificate/s secret name
# Command to create the ca cert secret: 
#     kubectl create secret generic <external ca cert secret name, e.g., ext-ca-cert> --from-file=<ca.crt or ca bundle filename> -n <namespace>
#
# externalCaCertSecretName: <external ca cert secret name, e.g., ext-ca-cert>

# When useNodePortForMaster is set to false, a LoadBalancer service is deployed to make
# the controller reachable from outside the cluster. When useNodePortForMaster is set to
# true, the master will instead be exposed behind a NodePort service. When using a NodePort service
# users will typically have to configure an Ingress to make the controller reachable from
# outside the cluster. NodePort service is recommended when configuring TLS termination in a
# load-balancer.
useNodePortForMaster: true

# When useNodePortForMaster is set to true, nodePortForMaster can be set to a value between
# 30000-32767 that sets the NodePort's port number used to receive HTTP traffic.
#
# nodePortForMaster: 30080

# loggerPort provides a port that is reserved for use by Aioli within the inference
# service pod to enable request/response body logs.
loggerPort: "49160"

# loggerResources sets the resource requests and limits for the aioli-logger sidecar container
# that supports logging inference request/response data.
loggerResources:
    limits:
      cpu: 1
      memory: 1Gi
    requests:
      cpu: 10m
      memory: 20Mi

# Enable route support for OpenShift by setting enabled to true. Configure TLS termination (i.e., edge) if needed.
# openshiftRoute:
#   enabled:
#   host:
#   termination:

# tlsSecret enables TLS encryption for all communication made to the controller (TLS
# termination is performed in the controller). The specified Secret of type tls must already exist in
# the same namespace used for the helm install.
# tlsSecret:

security:
  authz:
    #   type: rbac
    jwt_keys_directory: "/etc/aioli/jwt-signing/"

integrations:
  # Integration with HPE MLDM/Pachyderm.
  pachyderm:
    # The full address/protocol of the pachd service.  For example:
    # grpc://pachd.<releaseNamespace>.svc.cluster.local:30650
    # When specified, and the OIDC configuration is specified to refer to the
    # pachd auth service, MLIS and MLDM will provide unified authentication tokens.
    address: ""

namespaces:
  # namespaces.exclude is a list of regex expressions used to filter out namespaces
  # that should not be used for deployment.  The default value excludes
  # KServe, Istio, Knative, GKE, Kubernetes, Cert Manager, and KinD namespaces.
  exclude:
    - "kube-.*"
    - "gke-.*"
    - "gmp-.*"
    - "cert-manager"
    - "istio-system"
    - "knative-serving"
    - "kserve"
    - "local-path-storage"
  # namespaces.include is a list of regex expressions used to allow only a limited
  # set of namespaces for deployment.  The default include expression allows any
  # namespace that is not prohibited by the exclude list.
  include:
    - ".*"

priorityClasses:
  # priorityClasses.exclude is a list of regex expressions used to filter out priority classes
  # that should not be used for deployment.  The default value excludes certain
  # reserved priority classes and some generated for use by MLDE.
  exclude:
    - "system-.*"
    - "aioli-system-.*"
    - "gmp-critical"
    - "[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}-priorityclass"
  # priorityClasses.include is a list of regex expressions used to allow only a limited
  # set of priority classes for deployment.  The default include expression allows any
  # priority class that is not prohibited by the exclude list.
  include:
    - ".*"


# db sets the configurations for the database.
db:
  nodeSelector: {}
  # To deploy your own Postgres DB, provide a hostAddress. If hostAddress is provided, no
  # Postgres DB will be deployed.
  # hostAddress:

  # Required parameters, whether you are using your own DB or a provided DB.
  #
  # If password is left blank, a random password will be generated. The helm
  # install will give instructions on how to retrieve the generated password.
  name: aioli
  user: postgres
  password:
  port: 5432

  # Image pull policy for the Postgres image
  # Valid values are: 'Never', 'IfNotPresent' and 'Always'
  imagePullPolicy: IfNotPresent

  # Only used for DB deployment. Configures the size of the PersistentVolumeClaim for the
  # deployed database, as well as the CPU and memory requirements. Should be adjusted for
  # scale.
  storageSize: 1Gi
  # Setting a CPU request breaks GKE deployment
  #  cpuRequest: 1
  memRequest: 1Gi

  # useNodePortForDB configures whether ClusterIP or NodePort service type is used for the
  # deployed DB. By default ClusterIP is used.
  useNodePortForDB: false

  # storageClassName configures the StorageClass used by the PersistentVolumeClaim for the
  # deployed database. This can be left blank if a default storage class is specified in
  # the cluster. If dynamic provisioning of PersistentVolumes is disabled, users must manually
  # create a PersistentVolume that will match the PersistentVolumeClaim.
  # storageClassName:

  # sslMode and sslRootCert configure the TLS connection to the database. Users must first
  # create a Kubernetes secret or configMap containing their certificate and specify its name in
  # certResourceName. For sslRootCert, specify the name of the file only (not the path).
  # sslMode: verify-ca
  # sslRootCert: <cert_name>
  # resourceType: <secret/configMap>
  # certResourceName: <secret/configMap name>

# Configuration for the envoy proxy
proxy:
  # The type of service to use for the proxy (NodePort, LoadBalancer, ClusterIP)
  # When LoadBalancer (the default), a LoadBalancer service is deployed to make
  # the controller & grafana reachable from outside the cluster via the proxy.
  # When NodePort, the proxy will instead be exposed behind a NodePort service.
  # When using a NodePort service users will typically have to configure an Ingress
  # to make the proxy reachable from outside the cluster.
  # NodePort service is recommended when configuring TLS termination in a load-balancer.
  # When ClusterIP, the proxy will be exposed behind a ClusterIP service.
  type: LoadBalancer
  annotations: {}
  labels: {}
  nodeSelector: {}

  # Proxy gateway timeout
  # timeout: 15s

  # Set of services made available via the proxy
  services:
    envoyAdmin: false
    grafana: true
    loki: false
    prometheus: false

  # Proxy resource requests/limits
  cpuRequest: "1"
  memRequest: "500Mi"
  #cpulimit:
  #memLimit:

################################################################################
# This chart provides subcharts for Promtail, Loki, Prometheus and Grafana that
# are installed by default. The subcharts can be disabled at deployment time.
#
# Configurable values for promtail can be found at
#  "https://github.com/grafana/helm-charts/tree/main/charts/promtail"
# Configurable values for loki can be found at
#  "https://github.com/grafana/helm-charts/tree/main/charts/loki-stack"
# Configurable values for prometheus can be found at
#  "https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus"
# Configurable values for grafana can be found at
#  "https://grafana.com/docs/loki/latest/setup/install/helm/"
#
# If installing on a Rancher Kubernetes Engine with the default DNS Provider set
# to use CoreDNS, then Loki's global DNS Service must be set:
#   --set loki.global.dnsService=rke2-coredns-rke2-coredns
# The values shown below are used to configure Promtail, Loki, Prometheus and
# Grafana.
################################################################################

promtail:
  enabled: true
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      effect: NoSchedule
    - effect: NoSchedule
      operator: Exists
  config:
    snippets:
      extraRelabelConfigs:
        # Keep all kubernetes labels containing "inference" to preserve Kserve
        # and aioli pod labels on the logs.
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.*inference.+)

loki:
  enabled: true
  singleBinary:
    replicas: 1
  loki:
    commonConfig:
        replication_factor: 1
    storage:
      type: 'filesystem'
    auth_enabled: false
    compactor:
      # Enable the compactor to cleanup old data
      retention_enabled: true
    #limits_config:
    #  Default loki retention period is 30 days.
    #  retention_period: 744h


prometheus:
  enabled: true
  server:
    # Enable wal compression to reduce disk usage
    extraFlags:
      - storage.tsdb.wal-compression
    # retentionSize should be less than persistentVolume.size (which is 8Gi by default)
    retentionSize: 7GB
    global:
      # Scrape interval to enable quicker dashboard updates (Prometheus default is 1m)
      # Values below 30s will prevent scale-to-zero of inference services.
      scrape_interval: 45s

# Configuration for grafana defaults
grafana:
  enabled: true
  # deployment_dashboard_baseurl is a reference to the provided deployment dashboard that is pre-configured when this chart is installed.
  deployment_dashboard_baseurl: /grafana/d/b6943e8f-4162-4c88-8912-1f8dbd67e0eb/7f12f4ce-2b5c-5fad-a418-60d3d04968e0?orgId=1
  # deployment_dashboard_user is the Grafana user account which the deployment observability UI uses for cross launch into Grafana
  # for users with the Admin role. Users without the Admin role are dynamically provisioned as Grafana viewers.
  deployment_dashboard_user: admin
  grafana.ini:
    server:
      root_url: /grafana/
      serve_from_sub_path: true
    auth.jwt:
      enabled: true
      header_name: X-JWT-Assertion
      username_claim: sub
      url_login: true
      key_file: /etc/aioli-public-key/jwt.pem
      # users without the Admin role are auto-created as Grafana viewers if they are not already matched.
      auto_sign_up: true
    #log:
    #  level: debug
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
      - name: Prometheus
        type: prometheus
        uid: EC961B58-1731-40BB-B7F1-28B5CD6FD6D5
        url: http://{{ .Release.Name }}-prometheus-server.{{ .Release.Namespace }}.svc.cluster.local
        # Set "readOnly" to false and "editable" to true to avoid this message in the UI:
        #
        # This data source was added by config and cannot be modified using the UI. Please contact your server admin to update this data source.
        readOnly: false
        editable: true
      - name: Loki
        type: loki
        uid: 0459878A-F358-4AC2-AB59-96BE56D0D65E
        url: http://loki-gateway.{{ .Release.Namespace }}.svc.cluster.local
        # Set "readOnly" to false and "editable" to true to avoid this message in the UI:
        #
        # This data source was added by config and cannot be modified using the UI. Please contact your server admin to update this data source.
        readOnly: false
        editable: true
  extraVolumes:
    - name: aioli-jwt-public-key
      emptyDir: {}
  extraVolumeMounts:  # Mounted into grafana container
    - name: aioli-jwt-public-key
      mountPath: /etc/aioli-public-key/
      readOnly: true
  extraContainerVolumes:
    - name: aioli-jwt-secrets
      secret:
        secretName: aioli-jwt-signing
  extraInitContainers:
    - name: aioli-jwt-secret-public-key-create
      # NOTE: This cannot be a template because it is in a values file, so we cannot reference defaultImages.utils defined above.
      image: determinedai/aioli-runtimes:v3-utils
      # Valid values are: 'Never', 'IfNotPresent' and 'Always'
      imagePullPolicy: IfNotPresent
      args:
          - -c
          -  openssl x509 -inform pem -in /mount/aioli/secrets/tls.crt -pubkey -noout >  /etc/aioli-public-key/jwt.pem
      volumeMounts:
          - name: aioli-jwt-public-key
            mountPath: /etc/aioli-public-key/
          - name: aioli-jwt-secrets
            mountPath:  /mount/aioli/secrets/
  dashboardProviders:
   dashboardproviders.yaml:
     apiVersion: 1
     providers:
     - name: 'default'
       orgId: 1
       folder: ''
       type: file
       disableDeletion: false
       editable: true
       options:
         path: /var/lib/grafana/dashboards/default
  dashboardsConfigMaps:
    default: grafana-dashboard-config-{{ .Release.Name }}



################################################################################
# Dex is an OpenID Connect identity hub. Dex can be used to expose a consistent
# OpenID Connect interface to your applications while allowing your users to
# authenticate using their existing credentials from various back-ends,
# including LDAP, SAML, and other OIDC providers.
#
# The sample connector configuration shown below is specific to authenticating
# with the "Auth0" identity provider.  Uncomment the "config -> connectors"
# section below and replace the sample connector with the connector that is
# appropriate for your identity provider.
#
# For information on how to configure connectors, such as Google, GitHub, LDAP,
# etc., see the Dex documentation at "https://dexidp.io/docs/connectors".
#
# If your connector requires a "redirectURI", you do not need to enter it, as
# it will be automatically generated for you.
#
# You cannot modify the configuration of the dex configSecret, or the
# DEX_STATIC_CLIENT_SECRET environment variable.
#
# Configurable values for dex can be found at
#  "https://github.com/dexidp/helm-charts/tree/master/charts/dex"
################################################################################

dex:
  logger:
    level: info
  configSecret:
    create: false
    name: aioli-dex-config
  envVars:
  - name: DEX_STATIC_CLIENT_SECRET
    valueFrom:
      secretKeyRef:
        name: aioli-oidc-secret-name
        key: aioli-oidc-secret-key
  config:
    logger:
      level: info
#   connectors:
#   - type: oidc
#     name: Auth0
#     id: auth0
#     config:
#       issuer: <Issuer URL>
#       clientID: <Client ID>
#       clientSecret: <Client Secret>

################################################################################
# oidc enables OpenID Connect Integration with Dex.  The values shown below
# are used to configure the controller as a Dex client.
################################################################################

oidc:
  enabled: false
  # autoProvisionUsers specifies if users are automatically added to the database
  # upon successful authentication by the identity provider and, therefore, there is
  # no need to manually add users to the database with the CLI or REST API.
  # Valid values are "true" or "false". Default value is true. If set to true,
  # users are automatically added to the MLIS database upon successful authentication.
  # If set to false, the administrator must explicitly create users in the MLIS database
  # and assign their roles.
  # autoProvisionUsers: true

  # When autoProvisionUsers is set to true, authenticationClaim specifies the user's
  # Username when added to the MLIS database. The default value is "email": MLIS sets
  # the username of the user to the email address that is used to sign in with
  # the identity provider. Valid values are "email", "name", or "preferred_username".
  # authenticationClaim: email

  # When autoProvisionUsers is set to true, displayNameAttributeName specifies the user's
  # Display Name when added to the MLIS database. If not specified, the user's display name
  # is empty. Valid values are "email", "name", or "preferred_username".
  # displayNameAttributeName: name

  # allowInsecureIssuerURLContext allows discovery to work when the issuer_url
  # reported by upstream is mismatched with the discovery URL. This is meant
  # for integration with off-spec providers. Valid values are "true" or "false".
  # allowInsecureIssuerURLContext: false

# Default password for the admin account for the controller. If defaultPassword is
# left blank, a random password will be generated. The helm install will give
# instructions on how to retrieve the generated password.  Once manually changed,
# this value is no longer relevant.
defaultPassword:

# Specify by name a ConfigMap that contains the trusted CAs to be injected into the
# environment of a deployment. See https://cert-manager.io/docs/trust/trust-manager/
# for details on how to create a ConfigMap with trusted CAs. 
trustedCAsConfigMap: ""

# Integration of MLIS with AI Essentials.
ezua:
  enabled: false

Additional Configuration Options #

Default Password #

If the admin password is not set during installation (--set defaultPassword=<password>), a random password is generated. You can retrieve the generated password by using the following command:

   kubectl get secrets aioli-master-config-<RELEASE_NAME> \
   --template='{{ index .data "aioli-master.yaml" | base64decode }}' | grep defaultPassword

Replace <RELEASE_NAME> with the name of the Helm release (e.g., mlis).

Using a Remote PostgreSQL Server #

This section describes how to use an existing remote PostgreSQL server instead of the default in-cluster PostgreSQL instance provided by the Helm install.

To configure MLIS to use a remote PostgreSQL server:

Ensure you have a PostgreSQL instance that supports SSL connections.
Obtain the SSL certificate authority (CA) certificate file for the server.

Create a Kubernetes secret or configMap with the certificate:

# Using a secret
kubectl create secret generic <secret-name> --from-file=server.pem

# Or using a configMap
kubectl create configmap <configmap-name> --from-file=server.pem

When installing MLIS with Helm, specify the following additional values:

Value	Description
`db.hostAddress`	The hostname or address of the database server
`db.port`	The port number the database is listening on
`db.sslMode`	The SSL connection mode (e.g., disable, require, verify-ca)
`db.sslRootCert`	The name of the CA certificate file (e.g., server.pem)
`db.resourceType`	Either ‘secret’ or ‘configMap’, depending on how you created the resource
`db.certResourceName`	The name of the secret or configMap you created
`db.password`	The database password

For more information on PostgreSQL connection strings and SSL modes, refer to the PostgreSQL documentation on Connection Strings.

Example Helm install command with remote database configuration using a secret:

helm install mlis determined-ai/mlis \
  --set db.hostAddress=your-db-host.example.com \
  --set db.port=5432 \
  --set db.sslMode=verify-ca \
  --set db.sslRootCert=server.pem \
  --set db.resourceType=secret \
  --set db.certResourceName=your-secret-name \
  --set db.password=your-db-password

Example Helm install command with remote database configuration using a configMap:

helm install mlis determined-ai/mlis \
  --set db.hostAddress=your-db-host.example.com \
  --set db.port=5432 \
  --set db.sslMode=verify-ca \
  --set db.sslRootCert=server.pem \
  --set db.resourceType=configMap \
  --set db.certResourceName=your-certificate-configmap-name \
  --set db.password=your-db-password

Either example configuration allows MLIS to securely connect to your existing PostgreSQL server.
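
The same settings can also be kept in a values file instead of a list of --set flags. The sketch below mirrors the secret-based example; the host key is written as hostAddress, the name used in the default values file, so verify the key against the output of helm show values for your chart version. All names and the password are placeholders.

```yaml
db:
  # Providing a host address tells the chart not to deploy its own PostgreSQL.
  hostAddress: your-db-host.example.com
  port: 5432
  sslMode: verify-ca
  # File name only (not a path) of the CA certificate stored in the resource below.
  sslRootCert: server.pem
  resourceType: secret              # or configMap
  certResourceName: your-secret-name
  password: your-db-password
```

Keeping the password in a values file means the file itself must be protected; passing it with --set at install time avoids writing it to disk.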

Observability Components #

Configuring Observability Components #

You can configure the observability components (Promtail, Loki, Prometheus, and Grafana) through the Helm subcharts bundled with the chart.

The subchart defaults are generally used, with the following items added by the default MLIS values.yaml file. You may need to tune all of these values for your particular deployment:

Loki: Configured for a single replica with the following settings:
- Log retention period set to 30 days.
- Log compaction is enabled.
- Storage defaults set to 10Gi.
Promtail: Configured to enable collection of the pod labels that contain the word inference to enable identification of model and deployment versions.
Grafana: Configured to include the MLIS dashboard, and to enable SSO using JWT from the MLIS UI. It also automatically adds the Prometheus & Loki datasources.

Prometheus #

MLIS has the following default configuration for Prometheus to enable more rapid reporting of metrics:

prometheus:
  server:
    extraFlags:
      - storage.tsdb.wal-compression
    retentionSize: 7GB
    global:
      scrape_interval: 45s

Some significant Prometheus helm chart defaults that you may want to configure are:

prometheus:
    server:
        retention: 15d
        persistentVolume:
            size: 8Gi

By default, Prometheus allocates only 8Gi of storage for metric history and retains metrics for 15 days (15d). If the disk requirements of those 15 days of metrics exceed 8Gi, the Prometheus server will fail.

Ensure retentionSize is less than persistentVolume.size (default is 8Gi). If you increase the prometheus.server.persistentVolume.size, adjust retentionSize accordingly. Note the different units (GB vs Gi).

See the Prometheus troubleshooting guide for more details.
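
For example, a deployment that needs more metric history might enlarge the volume and raise the retention size together. The numbers below are illustrative assumptions, not sizing guidance; the only rule taken from the text above is that retentionSize stays below persistentVolume.size.

```yaml
prometheus:
  server:
    retention: 15d           # time-based limit (chart default)
    retentionSize: 18GB      # size-based limit, kept below the volume size
    persistentVolume:
      size: 20Gi             # enlarged from the 8Gi default
```

Leaving some headroom between the two values gives Prometheus room for data that is not counted against the retention size.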

Disabling Observability Components #

To disable the observability subcharts, add the following flags to the helm install command:

--set grafana.enabled=false \
--set promtail.enabled=false \
--set prometheus.enabled=false \
--set loki.enabled=false

Warning

Disabling observability components renders the MLIS Deployment Dashboard link non-functional. You can substitute your own Grafana URI using grafana.deployment_dashboard_baseurl.

# Configuration for grafana defaults
grafana:
  enabled: true
  # deployment_dashboard_baseurl is a reference to the provided deployment dashboard that is pre-configured when this chart is installed.
  deployment_dashboard_baseurl: /grafana/d/b6943e8f-4162-4c88-8912-1f8dbd67e0eb/7f12f4ce-2b5c-5fad-a418-60d3d04968e0?orgId=1

If you do not have SSO enabled for Grafana, you can replicate the initialization provided in the MLIS default values.yaml to enable JWT access from MLIS.
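
A values-file version of this setup might look like the following sketch. It turns off all four subcharts and points the dashboard link at an external Grafana; the URL, dashboard UID, and slug are hypothetical placeholders.

```yaml
grafana:
  enabled: false
  # Hypothetical external Grafana dashboard URL; replace with your own.
  deployment_dashboard_baseurl: "https://grafana.example.com/d/<dashboard-uid>/<dashboard-slug>?orgId=1"
promtail:
  enabled: false
prometheus:
  enabled: false
loki:
  enabled: false
```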

Rancher Kubernetes Engine #

If you are installing on Rancher Kubernetes Engine with the default DNS Provider set to use CoreDNS, then Loki’s global DNS Service must be set (see RKE DNS Provider):

--set loki.global.dnsService=rke2-coredns-rke2-coredns
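
If you maintain a values file, the equivalent of this flag would be the nested form below (a direct translation of the --set path shown above).

```yaml
loki:
  global:
    dnsService: rke2-coredns-rke2-coredns
```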

Node Selectors #

You can use node labels to control which nodes the pods of this installation will run on by specifying a node selector during the install. The following example uses the node label of kubernetes.io/arch that has a value of amd64 as the node selector to run the pods:

--set master.nodeSelector."kubernetes\.io/arch"=amd64 \
--set db.nodeSelector."kubernetes\.io/arch"=amd64 \
--set proxy.nodeSelector."kubernetes\.io/arch"=amd64
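
The same node selectors can be expressed in a values file, which avoids the need to escape dots in label keys on the command line. This sketch uses the kubernetes.io/arch label from the example above.

```yaml
master:
  nodeSelector:
    kubernetes.io/arch: amd64
db:
  nodeSelector:
    kubernetes.io/arch: amd64
proxy:
  nodeSelector:
    kubernetes.io/arch: amd64
```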

Model Cache Storage #

MLIS supports model caching on shared network storage. This feature can be enabled and configured using the following parameters:

modelsCacheStorage:
  enabled: false
  checkUnusedCachedModelsEvery: 1d
  purgeUnusedCachedModelsAfter: 1w
  storageSize: 100Gi
  # storageClassName:
  # pvcNameSuffix: 
  # bypassStorageCheck: false

When enabled, this feature allows models to be cached on shared storage, reducing download times for subsequent deployments. See the Model Cache guide for more details.
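
As a sketch, enabling the cache could look like the following. The storage class name is a hypothetical placeholder for a class backed by shared network storage in your cluster; the remaining values are the chart defaults.

```yaml
modelsCacheStorage:
  enabled: true
  storageClassName: shared-nfs        # hypothetical storage class name
  storageSize: 100Gi
  checkUnusedCachedModelsEvery: 1d    # how often the cache is checked for unused models
  purgeUnusedCachedModelsAfter: 1w    # unused models older than this are removed
```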

GPU Selector #

MLIS provides configuration options for GPU selection when deploying models that specify a gpuType:

gpuSelector:
    # tolerationKey: "accelerator"
    # resourceName: "nvidia.com/gpu"

This allows you to control how GPUs are allocated and which types of GPUs are used for model deployments. See the GPU support guide for more details.
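
For instance, a cluster with AMD GPUs might override the resource name while keeping the default toleration key. This is a minimal sketch based on the comments in the default values file.

```yaml
gpuSelector:
  # With the default key, a packaged model that sets gpuType gets a toleration of:
  #   effect: NoSchedule, key: accelerator, operator: Equal, value: {gpuType}
  # Set tolerationKey to "" to disable the generated toleration.
  tolerationKey: "accelerator"
  # Allocate AMD GPUs instead of the default "nvidia.com/gpu".
  resourceName: "amd.com/gpu"
```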

Non-Namespaced Objects #

By default, MLIS creates non-namespaced (cluster-wide) objects such as PriorityClasses. This behavior can be controlled with the createNonNamespacedObjects parameter:

createNonNamespacedObjects: true
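
When you add a second installation to a cluster where the cluster-wide objects already exist, for example in another namespace, the override might look like this sketch. The first installation must still run with the flag set to true so that the objects are created.

```yaml
# Values for an additional installation on a cluster where a first
# installation has already created the cluster-wide objects.
createNonNamespacedObjects: false
```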

External CA Certificates #

You can inject external CA certificates into MLIS by creating a secret and specifying its name:

# externalCaCertSecretName: <external ca cert secret name, e.g., ext-ca-cert>

This can be useful when MLIS needs to trust additional certificate authorities. See the Configure HTTPS/TLS for External Repositories guide for more details.
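
As a sketch, with a secret named ext-ca-cert (the example name used in the default values file) already created in the installation namespace, the value would be set as follows.

```yaml
# The secret is assumed to have been created beforehand, for example with:
#   kubectl create secret generic ext-ca-cert --from-file=<ca.crt or ca bundle filename> -n <namespace>
externalCaCertSecretName: ext-ca-cert
```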

OpenShift Route #

For OpenShift deployments, MLIS supports configuring routes:

# openshiftRoute:
#   enabled:
#   host:
#   termination:

This allows you to expose MLIS services using OpenShift’s routing layer.
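
An uncommented version might look like the sketch below. The nesting of the keys under openshiftRoute and the host name are assumptions made for illustration; edge is the termination type mentioned in the default values file.

```yaml
openshiftRoute:
  enabled: true
  host: mlis.apps.example.com   # hypothetical route host name
  termination: edge             # TLS terminated at the OpenShift router
```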