Before You Start

Note
If you plan to use a remote PostgreSQL server instead of the default in-cluster PostgreSQL instance, ensure you have a PostgreSQL instance that supports SSL connections. See the Using a Remote PostgreSQL Server section for more details.

Quick Configuration Values

The following frequently used configuration values are available from the Helm chart.

Parameter              | Description                                                                                                        | Default
-----------------------|--------------------------------------------------------------------------------------------------------------------|-------------------------
loadBalancerIP         | Static host/IP for accessing the controller via the aioli-proxy.                                                   | Assigned by Kubernetes
loadBalancerProxyPort  | Port for accessing the controller via the aioli-proxy.                                                             | 80
logLevel               | For debugging, specify debug or trace.                                                                             | info
image.master           | Latest published Master.                                                                                           | Latest Master
imageRegistry          | The HPE MSC registry for the MLIS SKU you have purchased (e.g., hub.myenterpriselicense.hpe.com/hpe-mlis/<SKU>).   |
global.imagePullSecrets| List of k8s secrets with docker credentials to enable installation from a non-public repository.                   |
tlsSecret              | k8s secret providing TLS configuration for HTTPS.                                                                  |
defaultPassword        | Admin account password.                                                                                            | Auto-generated if not set

Tip
You can view the full list of configurable values for the chart using the helm show values aioli-1.1.0.tgz command. Each provided subchart (grafana, loki, prometheus, promtail, dex) also exposes its own configurable values.
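
As a minimal sketch, the frequently used values from the table can be collected into an override file instead of being passed one at a time. The file name, the IP address, and the tlsSecret name below are hypothetical placeholders; hpe-mlis-registry is the example secret name used in the chart comments, and <SKU> is left for you to fill in.

```yaml
# my-values.yaml (hypothetical file name)
# All values are illustrative placeholders, not recommended settings.
loadBalancerIP: "203.0.113.10"     # placeholder address; omit to let Kubernetes assign one
loadBalancerProxyPort: 80
logLevel: debug                    # default is info
imageRegistry: hub.myenterpriselicense.hpe.com/hpe-mlis/<SKU>
global:
  imagePullSecrets:
    - name: hpe-mlis-registry      # secret must already exist in the install namespace
tlsSecret: mlis-tls                # hypothetical name of an existing TLS secret
# defaultPassword is omitted here so that the admin password is auto-generated.
```

Such a file would typically be supplied at install time with the -f option, for example helm install <RELEASE_NAME> aioli-1.1.0.tgz -f my-values.yaml.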

Helm Chart

Info
By default, the Helm install automatically provides a PostgreSQL instance in the cluster. If you want to use an existing remote PostgreSQL server, see the Using a Remote PostgreSQL Server section below.
# © Copyright 2023-2024 Hewlett Packard Enterprise Development LP
# HPE Machine Learning Inference Software (MLIS) Default Values

global:
  # imagePullSecrets allow you to pull images from private repositories.
  # This is required to access the licensed-MLIS containers, and is useful
  # to avoid potential throttling when accessing public Docker Hub repositories.
  # https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
  # Example:
  # imagePullSecrets:
  #   - name: hpe-mlis-registry
  #   - name: regcred
  #
  # To provide secrets as helm command line arguments, use:
  #   --set "global.imagePullSecrets[0].name=hpe-mlis-registry" --set "global.imagePullSecrets[1].name=regcred"
  imagePullSecrets: []

# imageRegistry specifies the source image repository for MLIS-provided images.
imageRegistry: determinedai
# HPE Machine Learning Inference Software (MLIS) uses the HPE MSC as the image registry
#imageRegistry: hub.myenterpriselicense.hpe.com/hpe-mlis/<SKU>
# Replace <SKU> in the URL shown above with the product SKU that was assigned to you, 
# and configure `imagePullSecrets` to include the HPE MSC credentials Kubernetes Secret (e.g. `hpe-mlis-registry`)
#
# To get HPE MSC credentials, go to the https://myenterpriselicense.hpe.com website. Then, using the information provided with your order,
# create the HPE MSC credentials as a Kubernetes Secret (e.g., hpe-mlis-registry) with the following command:
# kubectl create secret docker-registry hpe-mlis-registry  \
# --docker-server=hub.myenterpriselicense.hpe.com/hpe-mlis/<SKU> \
# --docker-username=<HPE MSC user name>  \
# --docker-password=<HPE MSC MLIS license key> \
# --docker-email=<HPE MSC user email> \
# -n <MLIS deployment K8s namespace, if any>
#

# The image from imageRegistry to be used to pull the MLIS controller master image.
image:
  master: aioli-master:1.0.1-dev43

master:
  nodeSelector: {}
  # Configures the size of the PersistentVolumeClaim for the audit log.
  # Should be adjusted for scale.
  auditLogStorageSize: 1Gi
  # storageClassName configures the StorageClass used by the PersistentVolumeClaim for the
  # audit log. This can be left blank if a default storage class is specified in
  # the cluster. If dynamic provisioning of PersistentVolumes is disabled, users must manually
  # create a PersistentVolume that will match the PersistentVolumeClaim.
  # storageClassName:
  #
  # To improve product design, the controller and WebUI both collect anonymous information
  # about how MLIS is being used. This information includes various metrics and events such
  # as the number of registries, trained models, deployments, and more. Refer to the product
  # documentation for more information (search for telemetry). You can disable this data
  # collection at any time by setting telemetry-enabled to false.
  telemetry-enabled: true

# Default images used during the deployment 
defaultImages:
  # PostgreSQL image
  postgreSQL: "postgres:16"
  proxy: envoyproxy/envoy:v1.29-latest
  # logger is a side car that supports logging inference request/response data from imageRegistry.
  logger: aioli-logger
  # openllm container with GPU + vllm support from imageRegistry.
  openllm: aioli-runtimes:openllm-0.4.44-py-3.11-cuda-12
  # openllm-cpu container --backend pt for CPU support from imageRegistry.
  openllmCpu: aioli-runtimes:openllm-0.4.44-py-3.11-cpu
  # utils support container to enable model download and certificate management.
  utils: aioli-runtimes:utils.1
  # bentoml provides the default image that is used to run a bento from
  # S3 storage.
  bentoml: aioli-runtimes:bentoml-py-3.9
  # Kserve storage-initializer container
  storageInitializer: kserve/storage-initializer:v0.11.2

# Logger Level in master.yaml - Four severity levels: debug, info, warn, error
logLevel: info

# masterPort configures the port on which the controller listens for connections.
masterPort: 8080

# Configure a static IP address for the envoy proxy that provides the inbound load balancer.
loadBalancerIP: ""

# Configure the external port for the envoy proxy that provides the inbound load balancer.
loadBalancerProxyPort: 80

# gpuSelector defines the configuration used when deploying a packaged model that
# specifies a gpuType value.
gpuSelector:
    # gke enables the Kserve GKE accelerator annotation when a gpuType is requested.
    # Specify it to override automatic detection of GKE.
    # gke: false
    # tolerationKey is the key name when generating a toleration to match the gpuType
    # value.  Set the value to "" to disable.  The default configuration generates:
    # tolerations:
    #  - effect: NoSchedule
    #    key: accelerator
    #    operator: Equal
    #    value: {gpuType}
    tolerationKey: "accelerator"

# Enables the creation of non-namespaced objects - Default: true
# Non-namespaced objects are cluster-wide resources, such as PriorityClasses.
# When a single cluster hosts multiple installations (using different namespaces),
# setting this flag to false avoids recreating the non-namespaced objects. In some cases (e.g., GitOps w/ArgoCD),
# creating existing cluster-wide resources could stop/hang automatic deployments.
#
# WARNING 
# The first installation must run with the createNonNamespacedObjects flag set to true to ensure 
# the non-namespaced objects are created.
createNonNamespacedObjects: true

# External ca.crt injection certificate/s secret name
# Command to create the ca cert secret: 
#     kubectl create secret generic <external ca cert secret name, e.g., ext-ca-cert> --from-file=<ca.crt or ca bundle filename> -n <namespace>
#
# externalCaCertSecretName: <external ca cert secret name, e.g., ext-ca-cert>

# When useNodePortForMaster is set to false, a LoadBalancer service is deployed to make
# the controller reachable from outside the cluster. When useNodePortForMaster is set to
# true, the master will instead be exposed behind a NodePort service. When using a NodePort service
# users will typically have to configure an Ingress to make the controller reachable from
# outside the cluster. NodePort service is recommended when configuring TLS termination in a
# load-balancer.
useNodePortForMaster: true

# loggerPort provides a port that is reserved for use by Aioli within the inference
# service pod to enable request/response body logs.
loggerPort: "49160"

# Enable route support for OpenShift by setting enabled to true. Configure TLS termination (e.g., edge) if needed.
# openshiftRoute:
# enabled:
# host:
# termination:

# tlsSecret enables TLS encryption for all communication made to the controller (TLS
# termination is performed in the controller). This includes communication between the
# controller and the task containers it launches, but does not include communication between
# the task containers (distributed training). The specified Secret of type tls must already exist in
# the same namespace used for the helm install.
# tlsSecret:

security:
  authz:
    #   type: rbac
    jwt_keys_directory: "/etc/aioli/jwt-signing/"


namespaces:
  # namespaces.exclude is a list of regex expressions used to filter out namespaces
  # that should not be used for deployment.  The default value excludes
  # KServe, Istio, Knative, GKE, Kubernetes, Cert Manager, and KinD namespaces.
  exclude:
    - "kube-.*"
    - "gke-.*"
    - "gmp-.*"
    - "cert-manager"
    - "istio-system"
    - "knative-serving"
    - "kserve"
    - "local-path-storage"
  # namespaces.include is a list of regex expressions used to allow only a limited
  # set of namespaces for deployment.  The default include expression allows any
  # namespace that is not prohibited by the exclude list.
  include:
    - ".*"


# db sets the configurations for the database.
db:
  nodeSelector: {}
  # To deploy your own Postgres DB, provide a hostAddress. If hostAddress is provided, no
  # Postgres DB will be deployed.
  # hostAddress:

  # Required parameters, whether you are using your own DB or a provided DB.
  #
  # If password is left blank, a random password will be generated. The helm
  # install will give instructions on how to retrieve the generated password.
  name: aioli
  user: postgres
  password:
  port: 5432

  # Only used for DB deployment. Configures the size of the PersistentVolumeClaim for the
  # deployed database, as well as the CPU and memory requirements. Should be adjusted for
  # scale.
  storageSize: 1Gi
  # Setting a request breaks GKE deployment
  #  cpuRequest: 1
  memRequest: 1Gi

  # useNodePortForDB configures whether ClusterIP or NodePort service type is used for the
  # deployed DB. By default ClusterIP is used.
  useNodePortForDB: false

  # storageClassName configures the StorageClass used by the PersistentVolumeClaim for the
  # deployed database. This can be left blank if a default storage class is specified in
  # the cluster. If dynamic provisioning of PersistentVolumes is disabled, users must manually
  # create a PersistentVolume that will match the PersistentVolumeClaim.
  # storageClassName:

  # sslMode and sslRootCert configure the TLS connection to the database. Users must first
  # create a Kubernetes secret or configMap containing their certificate and specify its name in
  # certResourceName. For sslRootCert, specify the name of the file only (not the path).
  # sslMode: verify-ca
  # sslRootCert: <cert_name>
  # resourceType: <secret/configMap>
  # certResourceName: <secret/configMap name>

# Configuration for the envoy proxy
proxy:
  # The type of service to use for the proxy (NodePort, LoadBalancer, ClusterIP)
  # When LoadBalancer (the default), a LoadBalancer service is deployed to make
  # the controller & grafana reachable from outside the cluster via the proxy.
  # When NodePort, the proxy will instead be exposed behind a NodePort service.
  # When using a NodePort service users will typically have to configure an Ingress
  # to make the proxy reachable from outside the cluster.
  # NodePort service is recommended when configuring TLS termination in a load-balancer.
  # When ClusterIP, the proxy will be exposed behind a ClusterIP service.
  type: LoadBalancer
  annotations: {}
  labels: {}
  nodeSelector: {}

  # Set of services made available via the proxy
  services:
    envoyAdmin: false
    grafana: true
    loki: false
    prometheus: false

  # Proxy resource requests/limits
  cpuRequest: "1"
  memRequest: "500Mi"
  #cpuLimit:
  #memLimit:

################################################################################
# This chart provides subcharts for Promtail, Loki, Prometheus and Grafana that
# are installed by default. The subcharts can be disabled at deployment time.
#
# Configurable values for promtail can be found at
#  "https://github.com/grafana/helm-charts/tree/main/charts/promtail"
# Configurable values for loki can be found at
#  "https://github.com/grafana/helm-charts/tree/main/charts/loki-stack"
# Configurable values for prometheus can be found at
#  "https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus"
# Configurable values for grafana can be found at
#  "https://grafana.com/docs/loki/latest/setup/install/helm/"
#
# If installing on a Rancher Kubernetes Engine with the default DNS Provider set
# to use CoreDNS, then Loki's global DNS Service must be set:
#   --set loki.global.dnsService=rke2-coredns-rke2-coredns
# The values shown below are used to configure Promtail, Loki, Prometheus, and
# Grafana.
################################################################################

promtail:
  enabled: true
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      effect: NoSchedule
    - effect: NoSchedule
      operator: Exists
  config:
    snippets:
      extraRelabelConfigs:
        # Keep all kubernetes labels containing "inference" to preserve Kserve
        # and aioli pod labels on the logs.
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.*inference.+)

loki:
  enabled: true
  singleBinary:
    replicas: 1
  loki:
    commonConfig:
        replication_factor: 1
    storage:
      type: 'filesystem'
    auth_enabled: false
    compactor:
      # Enable the compactor to cleanup old data
      retention_enabled: true
    #limits_config:
    #  Default loki retention period is 30 days.
    #  retention_period: 744h


prometheus:
  enabled: true
  server:
    # Enable wal compression to reduce disk usage
    extraFlags:
      - storage.tsdb.wal-compression
    # retentionSize should be less than persistentVolume.size (which is 8Gi by default)
    retentionSize: 7GB
    global:
      # Increase scrape interval to enable quicker dashboard updates
      scrape_interval: 10s

# Configuration for grafana defaults
grafana:
  enabled: true
  # deployment_dashboard_baseurl is a reference to the provided deployment dashboard that is pre-configured when this chart is installed.
  deployment_dashboard_baseurl: /grafana/d/b6943e8f-4162-4c88-8912-1f8dbd67e0eb/7f12f4ce-2b5c-5fad-a418-60d3d04968e0?orgId=1
  # deployment_dashboard_user is the Grafana user account which the deployment observability UI uses for cross launch into Grafana.
  deployment_dashboard_user: admin
  grafana.ini:
    server:
      root_url: /grafana/
      serve_from_sub_path: true
    auth.jwt:
      enabled: true
      header_name: X-JWT-Assertion
      username_claim: sub
      url_login: true
      key_file: /etc/aioli-public-key/jwt.pem
    #log:
    #  level: debug
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
      - name: Prometheus
        type: prometheus
        uid: EC961B58-1731-40BB-B7F1-28B5CD6FD6D5
        url: http://aioli-prometheus-server.{{ .Release.Namespace }}.svc.cluster.local
        # Set "readOnly" to false and "editable" to true to avoid this message in the UI:
        #
        # This data source was added by config and cannot be modified using the UI. Please contact your server admin to update this data source.
        readOnly: false
        editable: true
      - name: Loki
        type: loki
        uid: 0459878A-F358-4AC2-AB59-96BE56D0D65E
        url: http://loki-gateway.{{ .Release.Namespace }}.svc.cluster.local
        # Set "readOnly" to false and "editable" to true to avoid this message in the UI:
        #
        # This data source was added by config and cannot be modified using the UI. Please contact your server admin to update this data source.
        readOnly: false
        editable: true
  extraVolumes:
    - name: aioli-jwt-public-key
      emptyDir: {}
  extraVolumeMounts:  # Mounted into grafana container
    - name: aioli-jwt-public-key
      mountPath: /etc/aioli-public-key/
      readOnly: true
  extraContainerVolumes:
    - name: aioli-jwt-secrets
      secret:
        secretName: aioli-jwt-signing
  extraInitContainers:
    - name: aioli-jwt-secret-public-key-create
      # NOTE: This cannot be a template because it is in a values file, so we cannot reference defaultImages.utils defined above.
      image: determinedai/aioli-runtimes:utils.1
      imagePullPolicy: Always
      args:
          - -c
          -  openssl x509 -inform pem -in /mount/aioli/secrets/tls.crt -pubkey -noout >  /etc/aioli-public-key/jwt.pem
      volumeMounts:
          - name: aioli-jwt-public-key
            mountPath: /etc/aioli-public-key/
          - name: aioli-jwt-secrets
            mountPath:  /mount/aioli/secrets/
  dashboardProviders:
   dashboardproviders.yaml:
     apiVersion: 1
     providers:
     - name: 'default'
       orgId: 1
       folder: ''
       type: file
       disableDeletion: false
       editable: true
       options:
         path: /var/lib/grafana/dashboards/default
  dashboardsConfigMaps:
    default: grafana-dashboard-config-{{ .Release.Name }}



################################################################################
# Dex is an OpenID Connect identity hub. Dex can be used to expose a consistent
# OpenID Connect interface to your applications while allowing your users to
# authenticate using their existing credentials from various back-ends,
# including LDAP, SAML, and other OIDC providers.
#
# The sample connector configuration shown below is specific to authenticating
# with the "Auth0" identity provider.  Uncomment the "config -> connectors"
# section below and replace the sample connector with the connector that is
# appropriate for your identity provider.
#
# For information on how to configure connectors, such as Google, GitHub, LDAP,
# etc., see the Dex documentation at "https://dexidp.io/docs/connectors".
#
# If your connector requires a "redirectURI", you do not need to enter it, as
# it will be automatically generated for you.
#
# You cannot modify the configuration of the dex configSecret, or the
# DEX_STATIC_CLIENT_SECRET environment variable.
#
# Configurable values for dex can be found at
#  "https://github.com/dexidp/helm-charts/tree/master/charts/dex"
################################################################################

dex:
  logger:
    level: info
  configSecret:
    create: false
    name: aioli-dex-config
  envVars:
  - name: DEX_STATIC_CLIENT_SECRET
    valueFrom:
      secretKeyRef:
        name: aioli-oidc-secret-name
        key: aioli-oidc-secret-key
  config:
    logger:
      level: info
#   connectors:
#   - type: oidc
#     name: Auth0
#     id: auth0
#     config:
#       issuer: <Issuer URL>
#       clientID: <Client ID>
#       clientSecret: <Client Secret>

################################################################################
# oidc enables OpenID Connect Integration with Dex.  The values shown below
# are used to configure the controller as a Dex client.
################################################################################

oidc:
  enabled: false

#oidc:
#  enabled: true
#  autoProvisionUsers: true
#  displayNameAttributeName: email

# Default password for the admin account for the controller. If defaultPassword is
# left blank, a random password will be generated. The helm install will give
# instructions on how to retrieve the generated password.
defaultPassword:

# Specify by name a ConfigMap that contains the trusted CAs to be injected into the
# environment of a deployment. See https://cert-manager.io/docs/trust/trust-manager/
# for details on how to create a ConfigMap with trusted CAs. 
trustedCAsConfigMap: ""

# Integration of MLIS with AI Essentials.
ezua:
  enabled: false

MISC

Default Password

The admin password is generated automatically if you do not set it during installation (--set defaultPassword). To retrieve the generated password, run the following command:

   kubectl get secrets aioli-master-config-<RELEASE_NAME> \
   --template='{{ index .data "aioli-master.yaml" | base64decode }}' | grep defaultPassword

Replace <RELEASE_NAME> with the name of the Helm release (e.g., mlis).

Using a Remote PostgreSQL Server

This section describes how to use an existing remote PostgreSQL server instead of the default in-cluster PostgreSQL instance provided by the Helm install.

To configure MLIS to use a remote PostgreSQL server:

  1. Ensure you have a PostgreSQL instance that supports SSL connections.

  2. Obtain the SSL certificate authority (CA) certificate file for the server.

  3. Create a Kubernetes secret or configMap with the certificate:

    # Using a secret
    kubectl create secret generic <secret-name> --from-file=server.pem
    
    # Or using a configMap
    kubectl create configmap <configmap-name> --from-file=server.pem

  4. When installing MLIS with Helm, specify the following additional values:

    Value               | Description
    --------------------|--------------------------------------------------------------------------
    db.hostAddress      | The hostname or address of the database server
    db.port             | The port number the database is listening on
    db.sslMode          | The SSL connection mode (e.g., disable, require, verify-ca)
    db.sslRootCert      | The name of the CA certificate file (e.g., server.pem)
    db.resourceType     | Either secret or configMap, depending on how you created the resource
    db.certResourceName | The name of the secret or configMap you created
    db.password         | The database password

    The host is set with db.hostAddress, the key defined in the db section of the default values file; when it is provided, no in-cluster PostgreSQL instance is deployed.

    For more information on PostgreSQL connection strings and SSL modes, refer to the PostgreSQL documentation on Connection Strings.

    Example Helm install command with remote database configuration using a secret:

    helm install mlis determined-ai/mlis \
      --set db.hostAddress=your-db-host.example.com \
      --set db.port=5432 \
      --set db.sslMode=verify-ca \
      --set db.sslRootCert=server.pem \
      --set db.resourceType=secret \
      --set db.certResourceName=your-secret-name \
      --set db.password=your-db-password

    Example Helm install command with remote database configuration using a configMap:

    helm install mlis determined-ai/mlis \
      --set db.hostAddress=your-db-host.example.com \
      --set db.port=5432 \
      --set db.sslMode=verify-ca \
      --set db.sslRootCert=server.pem \
      --set db.resourceType=configMap \
      --set db.certResourceName=your-certificate-configmap-name \
      --set db.password=your-db-password

Either example configuration allows MLIS to securely connect to your existing PostgreSQL server.
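
The same remote database settings can also be kept in a values file rather than on the command line. The following is a sketch only: the file name is hypothetical, the key names are taken from the db section of the default values file shown earlier (where the host is set through hostAddress), and the host, secret name, and password are the placeholders from the examples above.

```yaml
# remote-db-values.yaml (hypothetical file name)
db:
  # Providing hostAddress means no in-cluster PostgreSQL instance is deployed.
  hostAddress: your-db-host.example.com
  port: 5432
  sslMode: verify-ca
  # File name only (not a path), matching the file stored in the secret.
  sslRootCert: server.pem
  resourceType: secret
  certResourceName: your-secret-name
  password: your-db-password
```

If you prefer not to store the password in a file, db.password can still be passed with --set at install time alongside the -f option.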

Observability Components

Configuring Observability Components

You can configure the observability components through the Helm subcharts that this chart installs by default: Promtail, Loki, Prometheus, and Grafana.

The observability subcharts mostly run with their own defaults; the default MLIS values.yaml file adds the items below. You may need to tune any of these values for your particular deployment:

  • Loki: Configured for a single replica (singleBinary.replicas: 1, replication_factor: 1) with filesystem storage, authentication disabled, and compactor retention enabled.
  • Promtail: Configured to collect the pod labels that contain the word inference, so that model and deployment versions can be identified in the logs.
  • Grafana: Configured to include the MLIS dashboard and to enable SSO using JWT from the MLIS UI. It also automatically adds the Prometheus and Loki datasources.

Prometheus

MLIS has the following default configuration for Prometheus to enable more rapid reporting of metrics:

prometheus:
  server:
    extraFlags:
      - storage.tsdb.wal-compression
    retentionSize: 7GB
    global:
      scrape_interval: 10s

Some significant Prometheus helm chart defaults that you may want to configure are:

prometheus:
    server:
        retention: 15d
        persistentVolume:
            size: 8Gi

By default, Prometheus allocates only 8Gi of storage for metric history and retains metrics for 15 days (15d). If those 15 days of metrics require more than 8Gi of disk space, the Prometheus server will fail.

Ensure retentionSize is less than persistentVolume.size (default is 8Gi). If you increase prometheus.server.persistentVolume.size, adjust retentionSize accordingly. Note the different units (GB vs Gi).

See the Prometheus troubleshooting guide for more details.
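
As an illustration of adjusting both settings together, the fragment below enlarges the Prometheus volume and raises retentionSize while keeping it below the volume size. The figures 20Gi and 18GB are arbitrary example numbers, not sizing recommendations; choose values based on your own metric volume.

```yaml
prometheus:
  server:
    retention: 15d
    # Kept below persistentVolume.size, as required above.
    retentionSize: 18GB
    persistentVolume:
      size: 20Gi
```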

Disabling Observability Components

To disable the observability components, add the following flags to the helm install command:

--set grafana.enabled=false \
--set promtail.enabled=false \
--set prometheus.enabled=false \
--set loki.enabled=false

Warning

Disabling observability components renders the MLIS Deployment Dashboard link non-functional. You can substitute your own Grafana URI using grafana.deployment_dashboard_baseurl.

# Configuration for grafana defaults
grafana:
  enabled: true
  # deployment_dashboard_baseurl is a reference to the provided deployment dashboard that is pre-configured when this chart is installed.
  deployment_dashboard_baseurl: /grafana/d/b6943e8f-4162-4c88-8912-1f8dbd67e0eb/7f12f4ce-2b5c-5fad-a418-60d3d04968e0?orgId=1

If you do not have SSO enabled for Grafana, you can replicate the initialization provided in the MLIS default values.yaml to enable JWT access from MLIS.
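
Put together, a values fragment that disables the bundled observability stack and points the dashboard link at a separately managed Grafana might look like the sketch below. The URL is entirely hypothetical; substitute the address of your own dashboard.

```yaml
grafana:
  enabled: false
  # Hypothetical URL of a dashboard in an externally managed Grafana instance.
  deployment_dashboard_baseurl: "https://grafana.example.com/d/<dashboard-uid>/<dashboard-name>?orgId=1"
promtail:
  enabled: false
prometheus:
  enabled: false
loki:
  enabled: false
```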

Rancher Kubernetes Engine

If you are installing on Rancher Kubernetes Engine with the default DNS Provider set to use CoreDNS, then Loki’s global DNS Service must be set (see RKE DNS Provider):

--set loki.global.dnsService=rke2-coredns-rke2-coredns
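
If you keep your overrides in a values file, the equivalent of this flag is the fragment below. Helm merges it with the chart's default loki settings, so the other Loki defaults are retained.

```yaml
loki:
  global:
    dnsService: rke2-coredns-rke2-coredns
```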

Node Selectors

You can use node labels to control which nodes the pods of this installation will run on by specifying a node selector during the install. The following example uses the node label of kubernetes.io/arch that has a value of amd64 as the node selector to run the pods:

--set master.nodeSelector."kubernetes\.io/arch"=amd64 \
--set db.nodeSelector."kubernetes\.io/arch"=amd64 \
--set proxy.nodeSelector."kubernetes\.io/arch"=amd64
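
The same node selectors can be expressed in a values file, which avoids shell quoting of the label key altogether. This sketch uses the master, db, and proxy nodeSelector keys from the default values file and the kubernetes.io/arch label from the example above.

```yaml
master:
  nodeSelector:
    kubernetes.io/arch: amd64
db:
  nodeSelector:
    kubernetes.io/arch: amd64
proxy:
  nodeSelector:
    kubernetes.io/arch: amd64
```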