AWS Deployment

Before You Start #

This guide assumes that you have already tried HPE Machine Learning Data Management locally and have all of the following installed:

Kubernetes & Openshift Version Support

Kubernetes: HPE Machine Learning Data Management supports the three most recent minor release versions of Kubernetes. If your Kubernetes version is not among these, it is End of Life (EOL) and unsupported. This ensures HPE Machine Learning Data Management users access to the latest Kubernetes features and bug fixes.
Openshift: HPE Machine Learning Data Management is compatible with OpenShift versions within the “Full Support” window.

Hardened Security and Dependency Considerations

If you are deploying in a hardened security environment, such as within the DoD community or other regulated sectors, consider downloading and installing HPE Machine Learning Data Management from Iron Bank, a hardened container registry.

MLDM images may be pulled from Iron Bank by updating the global registry setting in the MLDM Helm chart values.yaml to use registry1.dso.mil/, e.g.

global:
  ...
  image:
    registry: registry1.dso.mil/

Additionally, note that the MLDM Helm chart relies on the Bitnami image and its associated sub-chart. If the Bitnami image is unavailable or if your available PostgreSQL image cannot be managed through the Bitnami sub-chart, you will need to install PostgreSQL separately. Refer to Global Helm Chart Values for details on specifying your separate PostgreSQL instance. Also, refer to Non-Bundled Database Setup for more detail on using your own PostgreSQL instance with MLDM.

If you have questions, please reach out to your Customer Support Engineer for assistance before proceeding.

1. Create an EKS Cluster #

Use the eksctl tool to deploy an EKS Cluster:

eksctl create cluster --name pachyderm-cluster --region <region> -profile <your named profile>

Verify deployment:
```
kubectl get all
```

2. Create an S3 Bucket #

Run the following command:

aws s3api create-bucket --bucket ${BUCKET_NAME} --region ${AWS_REGION}

Verify.
```
aws s3 ls
```

3. Enable Persistent Volumes Creation #

Create an IAM OIDC provider for your cluster.
Install the Amazon EBS Container Storage Interface (CSI) driver on your cluster.

Create a gp3 storage class manifest file (e.g., gp3-storageclass.yaml)

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp3
  fsType: ext4

Set gp3 to your default storage class.
```
kubectl apply -f gp3-storageclass.yaml
```
Verify that it has been set as your default.
```
kubectl get storageclass
```

4. Set up an RDS PostgreSQL Instance #

By default, HPE Machine Learning Data Management runs with a bundled version of PostgreSQL. For production environments, it is strongly recommended that you disable the bundled version and use an RDS PostgreSQL instance.

Warning

Aurora Serverless PostgreSQL is not supported.

In the RDS console, create a database in the region matching your HPE Machine Learning Data Management cluster.
Choose the PostgreSQL engine.
Select a PostgreSQL version >= 13.3.
Configure your DB instance as follows:

SETTING	Recommended value
DB instance identifier	Fill in with a unique name across all of your DB instances in the current region.
Master username	Choose your Admin username.
Master password	Choose your Admin password.
DB instance class	The standard default should work. You can change the instance type later on to optimize your performances and costs.
Storage type and Allocated storage	If you select io1, keep the 100 GiB default size. Read more information on Storage for RDS on Amazon’s website.
Storage autoscaling	If your workload is cyclical or unpredictable, enable storage autoscaling to allow RDS to scale up your storage when needed.
Standby instance	We highly recommend creating a standby instance for production environments.
VPC	Select the VPC of your Kubernetes cluster. Attention: After a database is created, you can’t change its VPC. Read more on VPCs and RDS on Amazon documentation.
Subnet group	Pick a Subnet group or Create a new one. Read more about DB Subnet Groups on Amazon documentation.
Public access	Set the Public access to `No` for production environments.
VPC security group	Create a new VPC security group and open the postgreSQL port or use an existing one.
Password authentication or Password and IAM database authentication	Choose one or the other.
Database name	In the Database options section, enter HPE Machine Learning Data Management’s Database name (We are using `Pachyderm`in this example.) and click Create database to create your PostgreSQL service. Your instance is running. Warning: If you do not specify a database name, Amazon RDS does not create a database.

Info

Standalone Clusters

If you are deploying a standalone cluster, you must create a second database named dex in your RDS instance for HPE Machine Learning Data Management’s authentication service. Read more about dex on PostgreSQL in Dex’s documentation.

Multi-cluster setups use Enterprise Server to handle authentication, so you do not need to create a dex database.

Create a new user account and grant it full CRUD permissions to both the Pachyderm and (when applicable) dex databases. Read about managing PostgreSQL users and roles in this blog. HPE Machine Learning Data Management will use the same username to connect to Pachydermas well as to dex.

5. Create a Values.yaml #

global:
  postgresql:
    postgresqlAuthType: "scram-sha-256" # use "md5" if using postgresql < 14
    postgresqlUsername: "username"
    postgresqlPassword: "password" 
    postgresqlDatabase: "databasename"
    postgresqlHost: "RDS CNAME"
    postgresqlPort: "5432"

postgresql:
  enabled: false

deployTarget: "AMAZON"

proxy:
  enabled: true
  service:
    type: LoadBalancer

pachd:
  storage:
    backend: "AMAZON"
    storageURL: "s3://bucket_name"
    amazon:
      id: "<ACCESS_KEY_HERE>"                      
      secret: "<SECRET_ACCESS_KEY_HERE>"
      token: "<YOUR_TOKEN_HERE>"

 console:
   enabled: true

Warning

Setting up Authentication?

Do not use mockIDP for clusters that will be deployed into production. If you do upgrade a cluster with mockIDP enabled, you must revoke the default mockIDP admin user by running the following command:

pachctl auth revoke --user kilgore@kilgore.trout

global:
  postgresql:
    postgresqlAuthType: "scram-sha-256" # use "md5" if using postgresql < 14
    postgresqlUsername: "username"
    postgresqlPassword: "password" 
    postgresqlDatabase: "databasename"
    postgresqlHost: "RDS CNAME"
    postgresqlPort: "5432"

postgresql:
  enabled: false

deployTarget: "AMAZON"

proxy:
  enabled: true
  service:
    type: LoadBalancer
  
pachd:
  storage:
    backend: "AMAZON"
    storageURL: "s3://bucket_name"
    amazon:
      id: "<ACCESS_KEY_HERE>"                      
      secret: "<SECRET_ACCESS_KEY_HERE>"
      token: "<YOUR_TOKEN_HERE>"
  externalService:
    enabled: true
  enterpriseLicenseKey: "YOUR_ENTERPRISE_TOKEN"

console:
  enabled: true

Note

If your Postgres deployment requires SSL, you may need to set up additional parameters in the global section of your Helm Chart Values (HCVs)

6. Configure Helm #

Run the following to add the HPE Machine Learning Data Management repo to Helm:

helm repo add pachyderm https://helm.pachyderm.com
helm repo update
helm install pachyderm pachyderm/pachyderm -f my_pachyderm_values.yaml

7. Verify Installation #

In a new terminal, run the following command to check the status of your pods:

kubectl get pods

NAME                                           READY   STATUS      RESTARTS   AGE
pod/console-5b67678df6-s4d8c                   1/1     Running     0          2m8s
pod/etcd-0                                     1/1     Running     0          2m8s
pod/pachd-c5848b5c7-zwb8p                      1/1     Running     0          2m8s
pod/pg-bouncer-7b855cb797-jqqpx                1/1     Running     0          2m8s
pod/postgres-0                                 1/1     Running     0          2m8s

Re-run this command after a few minutes if pachd is not ready.

8. Connect to Cluster #

You’ll need your organization’s cluster URL (proxy.host) value to connect.

Run the following command to get your cluster URL:

kubectl get services | grep pachyderm-proxy | awk '{print $4}'

Connect to your cluster: