Backups

This page walks you through the main steps required to manually back up the state of an HPE Machine Learning Data Management cluster in production. How you perform those steps may vary depending on your infrastructure and setup; refer to your provider’s documentation where applicable.

Before You Start

  • Make sure to retain a copy of the Helm values used to deploy your cluster
  • Suspend any state-mutating operations
  • Make sure that you have a bucket for backup use, separate from the object store used by your cluster

Downtime Considerations

  • Backups incur downtime until operations are resumed
  • Operational best practices include notifying HPE Machine Learning Data Management users of the outage and providing an estimated time when downtime will cease
  • Downtime duration is dependent on the size of the data to be backed up and the networks involved
  • Testing backups before going into production, and monitoring backup times on an ongoing basis, helps you make accurate predictions
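One lightweight way to monitor backup times is to wrap the backup command and append its duration to a log. A minimal sketch, assuming your backup runs as a single command (`run_backup` below is a hypothetical placeholder):

```shell
# Append each backup's duration and exit status to a log so trends
# are visible over time. Wraps any command passed as arguments.
log_duration() {
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  printf '%s %ss exit=%s cmd=%s\n' \
    "$(date -u +%FT%TZ)" "$((end - start))" "$status" "$*" >> backup-times.log
  return "$status"
}

# Example: log_duration run_backup
```

Reviewing the resulting log over several runs gives you a defensible downtime estimate to share with users.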

How to Create a Backup

HPE Machine Learning Data Management state is stored in two main places:

  • An object store holding HPE Machine Learning Data Management’s data.
  • A PostgreSQL instance made up of one or two databases:
    • pachyderm holding HPE Machine Learning Data Management’s metadata
    • dex holding authentication data

Backing up an HPE Machine Learning Data Management cluster involves snapshotting both the object store and the PostgreSQL database(s), in a consistent state, at a given point in time. Restoring a cluster involves re-populating the database(s) and the object store using those backups, then recreating an HPE Machine Learning Data Management cluster.

  1. Review any cloud-specific backup and restore procedures for your PostgreSQL instance.
  2. Retain a copy of the Helm values file used to deploy your cluster.
     helm get values <release-name> > /path/to/values.yaml
  3. Pause or queue/divert any external automated process ingressing data to Pachyderm input repos.
  4. Suspend all mutation of state by scaling pachd and the worker pods down.
    1. Ensure you are using the right context.
      kubectl config get-contexts
      kubectl config use-context <context-name>
    2. Scale down the pachd deployment and the worker pods.
      kubectl scale deployment pachd --replicas 0 
      kubectl scale rc --replicas 0 -l suite=pachyderm,component=worker
    3. Monitor the state of pachd and the worker pods.
      watch -n 5 kubectl get pods
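Rather than watching manually, the wait can be scripted. A minimal sketch, assuming the pod names contain `pachd` or `worker` as implied by the selectors above:

```shell
# Count pachd/worker lines in `kubectl get pods` output read from stdin.
count_pachyderm_pods() {
  grep -c -E 'pachd|worker' || true   # grep -c prints 0 (and exits 1) on no match
}

# Example polling loop (uncomment to run against a live cluster):
# while [ "$(kubectl get pods --no-headers | count_pachyderm_pods)" -gt 0 ]; do
#   sleep 5
# done
```

Only proceed to the database dump once the count reaches zero, so no writes are in flight.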
  5. Dump your PostgreSQL state using pg_dumpall (dumps the whole instance; suitable when it is used solely by Pachyderm) or pg_dump (dumps one database at a time; suitable when the instance is shared with other applications).
    pg_dumpall -U postgres > /path/to/backup.sql
    pg_dump -U postgres -d pachyderm > /path/to/backup.sql
    pg_dump -U postgres -d dex > /path/to/backup.sql
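To keep successive dumps from overwriting each other, a date-stamped filename helps. A small sketch; the backup directory below is a placeholder:

```shell
# Build a date-stamped dump path under a backup directory ($1).
backup_filename() {
  printf '%s/pachyderm-%s.sql' "$1" "$(date +%Y%m%d-%H%M%S)"
}

# Example, against a live instance (path is hypothetical):
# pg_dumpall -U postgres > "$(backup_filename /path/to/backups)"
```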
  6. Back up your object store. Refer to your cloud provider’s documentation for details.
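For S3-compatible stores, copying into the separate backup bucket might look like the sketch below. Both bucket names are assumptions, and the command is printed as a dry run rather than executed:

```shell
# Copy the cluster's object store into the dedicated backup bucket.
# SRC_BUCKET/DST_BUCKET are hypothetical names; override via the environment.
SRC_BUCKET=${SRC_BUCKET:-s3://pachyderm-data}
DST_BUCKET=${DST_BUCKET:-s3://pachyderm-backup}
sync_cmd="aws s3 sync $SRC_BUCKET $DST_BUCKET"
echo "Would run: $sync_cmd"   # drop the echo to perform the copy
```

Remember that the backup bucket should be separate from the object store your cluster uses, as noted in Before You Start.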
On-Premises
For on-premises Kubernetes deployments, check the vendor documentation for backup and restore procedures regarding both your PostgresSQL instance and object store.

How to Resume Operations

Once your backup is completed, resume your normal operations by scaling pachd back up. It will take care of restoring the worker pods:

  • Enterprise: pachctl enterprise unpause
  • CE: kubectl scale deployment pachd --replicas 1
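A small dry-run helper can print the right resume command for your edition before you run it against the cluster; the helper name is an illustration, not part of the product:

```shell
# Print the resume command for a given edition (enterprise|ce).
resume_cmd() {
  case "$1" in
    enterprise) echo "pachctl enterprise unpause" ;;
    ce)         echo "kubectl scale deployment pachd --replicas 1" ;;
    *)          echo "unknown edition: $1" >&2; return 1 ;;
  esac
}

# Example (CE): run the command, then wait for pachd to become ready:
# eval "$(resume_cmd ce)" && kubectl rollout status deployment pachd
```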