Environment Variables
You can define environment variables that supply required configuration. In HPE Machine Learning Data Management, you can define the following types of environment variables:
- pachd variables: Used for your HPE Machine Learning Data Management daemon container.
- HPE Machine Learning Data Management worker variables: Used by the Kubernetes pods that run your pipeline code.
For example, you can use the PACH_JOB_ID environment variable to refer to the current job ID.
pachd Environment Variables
You can find the list of pachd environment variables in the pachd manifest by running the following command:
kubectl get deploy pachd -o yaml
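If you only want the environment section rather than the full manifest, a jsonpath query such as the following narrows the output (a sketch that assumes the container in the deployment is named pachd):
kubectl get deploy pachd -o jsonpath='{.spec.template.spec.containers[?(@.name=="pachd")].env}'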
The following tables list all the pachd environment variables.
Global Configuration
Environment Variable | Default Value | Description |
---|---|---|
ETCD_SERVICE_HOST | N/A | The host on which the etcd service runs. |
ETCD_SERVICE_PORT | N/A | The etcd port number. |
PPS_WORKER_GRPC_PORT | 80 | The gRPC port number for workers. |
PORT | 650 | The pachd port number. |
HTTP_PORT | 652 | The HTTP port number. |
PEER_PORT | 653 | The port for pachd-to-pachd communication. |
NAMESPACE | default | The namespace in which HPE Machine Learning Data Management is deployed. |
pachd Configuration
Environment Variable | Default Value | Description |
---|---|---|
NUM_SHARDS | 32 | The maximum number of pachd pods that can run in a single cluster. |
STORAGE_BACKEND | "" | The storage backend defined for the HPE Machine Learning Data Management cluster. |
STORAGE_HOST_PATH | "" | The host path to storage. |
KUBERNETES_PORT_443_TCP_ADDR | none | An IP address that Kubernetes exports automatically so that your code can communicate with the Kubernetes API. Read access only. Most variables that follow the PORT_ADDRESS_TCP_ADDR pattern are Kubernetes environment variables. For more information, see Kubernetes environment variables. |
METRICS | true | Defines whether anonymous HPE Machine Learning Data Management metrics are collected. |
BLOCK_CACHE_BYTES | 1G | The size of the block cache in pachd. |
WORKER_IMAGE | "" | The base Docker image that is used to run your pipeline. |
WORKER_SIDECAR_IMAGE | "" | The pachd image that is used as a worker sidecar. |
WORKER_IMAGE_PULL_POLICY | IfNotPresent | The pull policy that defines how Docker images are pulled. You can set a Kubernetes image pull policy as needed. |
LOG_LEVEL | info | Verbosity of the log output. To disable logging, set this variable to 0. Viable options: debug, info, error. For more information, see Go logrus log levels. |
IAM_ROLE | "" | The role that defines permissions for HPE Machine Learning Data Management in AWS. |
IMAGE_PULL_SECRET | "" | The Kubernetes secret for image pull credentials. |
EXPOSE_OBJECT_API | false | Controls access to the internal HPE Machine Learning Data Management API. |
WORKER_USES_ROOT | true | Controls root access in the worker container. |
S3GATEWAY_PORT | 600 | The S3 gateway port number. |
DISABLE_COMMIT_PROGRESS_COUNTER | false | A feature flag that disables the commit propagation progress counter. If you have a large DAG, setting this parameter to true might help improve etcd performance. You only need to set this parameter on the pachd pod; HPE Machine Learning Data Management passes it to worker containers automatically. |
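To check which values a running deployment actually uses, you can read the environment of the pachd pod directly; the variable names in the grep pattern below are only examples:
kubectl exec deploy/pachd -- env | grep -E 'LOG_LEVEL|WORKER_IMAGE|S3GATEWAY_PORT'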
Storage Configuration
Environment Variable | Default Value | Description |
---|---|---|
STORAGE_MEMORY_THRESHOLD | N/A | Defines the storage memory threshold. |
STORAGE_SHARD_THRESHOLD | N/A | Defines the storage shard threshold. |
Pipeline Worker Environment Variables
HPE Machine Learning Data Management defines many environment variables for each HPE Machine Learning Data Management worker that runs your pipeline code. You can print the list of environment variables to your HPE Machine Learning Data Management logs by including the env command in your pipeline specification. For example, if you have an images repository, you can configure your pipeline specification like this:
{
  "pipeline": {
    "name": "env"
  },
  "input": {
    "pfs": {
      "glob": "/",
      "repo": "images"
    }
  },
  "transform": {
    "cmd": ["sh"],
    "stdin": ["env"],
    "image": "ubuntu:14.04"
  }
}
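Assuming you save this specification in a file named env.json (the file name is only an example), you can create the pipeline with pachctl:
pachctl create pipeline -f env.json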
After the pipeline runs to completion, you can view the log with the variables by running the following command:
pachctl logs --pipeline=env
PPS_WORKER_IP=172.17.0.7
DASH_PORT_8081_TCP_PROTO=tcp
PACHD_PORT_600_TCP_PORT=600
KUBERNETES_SERVICE_PORT=443
KUBERNETES_PORT=tcp://10.96.0.1:443
...
You should see a lengthy list of variables. Many of them define internal networking parameters that you most likely will not need.
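To narrow the output to the HPE Machine Learning Data Management-specific variables, you can filter the log output; the pattern below is only an example:
pachctl logs --pipeline=env | grep -E 'PACH_|PPS_'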
Most users find the following environment variables particularly useful:
Environment Variable | Description |
---|---|
AWS_ACCESS_KEY_ID | The ID of your AWS access key; requires pfs.s3: true or s3Out: true in your pipeline spec. |
AWS_SECRET_ACCESS_KEY | The name of the secret that contains your AWS access key; requires pfs.s3: true or s3Out: true in your pipeline spec. |
PACH_JOB_ID | The ID of the current job. For example, PACH_JOB_ID=8991d6e811554b2a8eccaff10ebfb341. |
PACH_DATUM_ID | The ID of the current datum. |
FILESET_ID | The ID of the file set that contains the input files for a given job. |
PACHD_PEER_SERVICE_HOST | The host on which a pachd peer service runs. Used by the Pachyderm SDK. |
PACHD_PEER_SERVICE_PORT | The port number of a pachd peer service. Used by the Pachyderm SDK. |
PACH_DATUM_<input.name>_JOIN_ON | Exposes the join_on match to the pipeline's job. |
PACH_DATUM_<input.name>_GROUP_BY | Exposes the group_by match to the pipeline's job. |
PACH_OUTPUT_COMMIT_ID | The ID of the commit in the output repo for the current job. For example, PACH_OUTPUT_COMMIT_ID=a974991ad44d4d37ba5cf33b9ff77394. |
PPS_NAMESPACE | The PPS namespace. For example, PPS_NAMESPACE=default. |
PPS_SPEC_COMMIT | The hash of the pipeline specification commit. This value is tied to the pipeline version; therefore, jobs that use the same version of the same pipeline have the same spec commit. For example, PPS_SPEC_COMMIT=3596627865b24c4caea9565fcde29e7d. |
PPS_POD_NAME | The name of the pipeline pod. For example, pipeline-env-v1-zbwm2. |
PPS_PIPELINE_NAME | The name of the pipeline that this pod runs. For example, env. |
PIPELINE_SERVICE_PORT_PROMETHEUS_METRICS | The port that you can use to expose metrics to Prometheus from within your pipeline. The default value is 9090. |
HOME | The path to the home directory. The default value is /root. |
<input-repo>=<path/to/input/repo> | The path to the filesystem that is defined in the input in your pipeline specification. HPE Machine Learning Data Management defines such a variable for each input. The path is defined by the glob pattern in the spec. For example, if you have an input images and a glob pattern of /, HPE Machine Learning Data Management defines the images=/pfs/images variable. If you have a glob pattern of /*, HPE Machine Learning Data Management matches the files in the images repository and, therefore, the path is images=/pfs/images/liberty.png. |
input_COMMIT | The ID of the commit that is used for the input. For example, images_COMMIT=fa765b5454e3475f902eadebf83eac34. |
S3_ENDPOINT | An HPE Machine Learning Data Management S3 gateway sidecar container endpoint. If you have an S3-enabled pipeline, this parameter specifies a URL that you can use to access the state of the pipeline's repositories when a particular job ran. The URL has the following format: http://<job-ID>-s3:600. An example of accessing the data by using the AWS CLI looks like this: `echo foo_data \| ...` |
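The following transform script is a minimal sketch of how pipeline code can read some of these variables; it assumes an input repo named images, as in the example specification above:
echo "job:    ${PACH_JOB_ID}"            # ID of the current job
echo "datum:  ${PACH_DATUM_ID}"          # ID of the current datum
echo "commit: ${PACH_OUTPUT_COMMIT_ID}"  # commit in the output repo for this job
for f in /pfs/images/*; do               # /pfs/images is the mounted input repo
  cp "$f" "/pfs/out/$(basename "$f")"    # /pfs/out is the mounted output repo
done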
In addition to these environment variables, Kubernetes injects others for the Services that run inside the cluster. These variables enable your pipeline code to connect to those outside services, which can be powerful, but keep in mind that HPE Machine Learning Data Management might retry your processing multiple times. For example, if your code writes a row to a database, that row might be written multiple times because of retries. Interaction with outside services must therefore be idempotent to prevent unexpected behavior. Furthermore, one of the running services that your code can connect to is HPE Machine Learning Data Management itself. This is generally not recommended because very little of the HPE Machine Learning Data Management API is idempotent, but in some specific cases it can be a viable approach.
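As an illustration of idempotent interaction with an outside service, the following sketch of a transform step upserts a row keyed by the job ID, so a retried datum overwrites its earlier write instead of adding a duplicate. The DATABASE_URL connection string and the results table (assumed to have a unique constraint on job_id) are hypothetical and are not provided by HPE Machine Learning Data Management:
psql "${DATABASE_URL}" -c "
  INSERT INTO results (job_id, status)
  VALUES ('${PACH_JOB_ID}', 'done')
  ON CONFLICT (job_id) DO UPDATE SET status = EXCLUDED.status;
"  # upsert keyed by the job ID, so running it again for the same job has no extra effect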