Create a Pipeline

To create a pipeline, you need to define a pipeline specification in YAML, JSON, or Jsonnet.

Before You Start

A basic pipeline must have all of the following:

  • pipeline.name: The name of your pipeline.
  • transform.cmd: The command that executes your user code.
  • transform.img: The image that contains your user code.
  • input.pfs.repo: The output repository for the transformed data.
  • input.pfs.glob: The glob pattern used to identify the shape of datums.

How to Create a Pipeline

Via Local File

  1. Define a pipeline specification in YAML, JSON, or Jsonnet.

  2. Pass the pipeline configuration to HPE Machine Learning Data Management:

    pachctl create pipeline -f <pipeline_spec>

Via URL

  1. Find a pipeline specification hosted in a public or internal repository.
  2. Pass the pipeline configuration to HPE Machine Learning Data Management:
    pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/2.10.x/examples/opencv/edges.json

Via Jsonnet

Jsonnet Pipeline specs let you create pipelines while passing a set of parameters dynamically, allowing you to reuse the baseline of a given pipeline while changing the values of chosen fields. You can, for example, create multiple pipelines out of the same jsonnet pipeline spec file while pointing each of them at different input repositories, parameterize a command line in the transform field of your pipelines, or dynamically pass various docker images to train different models on the same dataset.

For illustration purposes, in the following example, we are creating a pipeline named edges-1 and pointing its input repository at the repo ‘images’:

pachctl create pipeline --jsonnet jsonnet/edges.jsonnet --arg suffix=1 --arg src=images
Info
You can define multiple pipeline specifications in one file by separating the specs with the following separator: ---. This works in both JSON and YAML files.

Examples

JSON

{
  "pipeline": {
    "name": "edges"
  },
  "description": "A pipeline that performs image edge detection by using the OpenCV library.",
  "transform": {
    "cmd": [ "python3", "/edges.py" ],
    "image": "pachyderm/opencv"
  },
  "input": {
    "pfs": {
      "repo": "images",
      "glob": "/*"
    }
  }
}

YAML

pipeline:
  name: edges
description: A pipeline that performs image edge detection by using the OpenCV library.
transform:
  cmd:
  - python3
  - "/edges.py"
  image: pachyderm/opencv
input:
  pfs:
    repo: images
    glob: "/*"

Considerations

  • When you create a pipeline, HPE Machine Learning Data Management automatically creates an eponymous output repository. However, if such a repo already exists, your pipeline will take over the master branch. The files that were stored in the repo before will still be in the HEAD of the branch.