Create a Pipeline

To create a pipeline, you need to define a pipeline specification in YAML, JSON, or Jsonnet.

Before You Start

A basic pipeline must have all of the following:

pipeline.name: The name of your pipeline.
transform.cmd: The command that executes your user code.
transform.image: The image that contains your user code.
input.pfs.repo: The input repository that contains the data to transform.
input.pfs.glob: The glob pattern used to identify the shape of datums.
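As a quick pre-flight check, the required fields listed above can be verified before a spec is submitted. The following is a minimal Python sketch, not part of the product: the helper `missing_fields` and the constant `REQUIRED_FIELDS` are hypothetical names, and the field paths follow the JSON example on this page (which uses `transform.image`).

```python
# Dotted paths of the fields a basic pipeline spec needs
# (names taken from the JSON example on this page).
REQUIRED_FIELDS = [
    "pipeline.name",
    "transform.cmd",
    "transform.image",
    "input.pfs.repo",
    "input.pfs.glob",
]


def missing_fields(spec):
    """Return the required dotted paths that are absent or empty in spec."""
    missing = []
    for path in REQUIRED_FIELDS:
        node = spec
        for key in path.split("."):
            # Stop descending as soon as a level is missing.
            if not isinstance(node, dict) or key not in node:
                node = None
                break
            node = node[key]
        if not node:
            missing.append(path)
    return missing


spec = {
    "pipeline": {"name": "edges"},
    "transform": {"cmd": ["python3", "/edges.py"], "image": "pachyderm/opencv"},
    "input": {"pfs": {"repo": "images", "glob": "/*"}},
}

print(missing_fields(spec))
print(missing_fields({"pipeline": {"name": "edges"}}))
```

This check only confirms that the fields exist; it does not validate their values against the cluster.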

How to Create a Pipeline

Info

You can define multiple pipeline specifications in one file by separating the specs with the following separator: ---. This works in both JSON and YAML files.
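To illustrate the multi-spec layout, here is a small Python sketch that joins two JSON specs with the --- separator. The second pipeline (`edges-copy`, reading from `images-copy`) is hypothetical, and `split_specs` is only a round-trip check for this example, not a description of how pachctl parses the file.

```python
import json

edges = {
    "pipeline": {"name": "edges"},
    "transform": {"cmd": ["python3", "/edges.py"], "image": "pachyderm/opencv"},
    "input": {"pfs": {"repo": "images", "glob": "/*"}},
}

# Hypothetical second pipeline, included only to show the separator.
edges_copy = {
    "pipeline": {"name": "edges-copy"},
    "transform": {"cmd": ["python3", "/edges.py"], "image": "pachyderm/opencv"},
    "input": {"pfs": {"repo": "images-copy", "glob": "/*"}},
}


def join_specs(specs):
    """Serialize each spec as JSON and put a '---' line between them."""
    return "\n---\n".join(json.dumps(spec, indent=2) for spec in specs)


def split_specs(text):
    """Round-trip helper for this sketch: parse each '---'-separated chunk."""
    return [json.loads(chunk) for chunk in text.split("\n---\n")]


combined = join_specs([edges, edges_copy])
names = [spec["pipeline"]["name"] for spec in split_specs(combined)]
print(combined)
print(names)
```

The resulting text can be saved to a single file and passed to pachctl in one call.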

CLI

Define a pipeline specification in YAML, JSON, or Jsonnet.
Pass the pipeline configuration to HPE Machine Learning Data Management:
```
pachctl create pipeline -f <pipeline_spec>
```
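The two steps above can be sketched in Python: write the spec to a JSON file, then build the matching pachctl command. The helper names `write_spec` and `create_pipeline_argv` are hypothetical, and the sketch only assembles the command; it assumes you would run it yourself (for example with `subprocess.run`) on a machine where pachctl is installed and connected to a cluster.

```python
import json
import os
import tempfile


def write_spec(spec, directory):
    """Write the spec to <directory>/<pipeline name>.json and return the path."""
    path = os.path.join(directory, spec["pipeline"]["name"] + ".json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(spec, f, indent=2)
    return path


def create_pipeline_argv(spec_path):
    """Build the argument list for `pachctl create pipeline -f <pipeline_spec>`."""
    return ["pachctl", "create", "pipeline", "-f", spec_path]


spec = {
    "pipeline": {"name": "edges"},
    "transform": {"cmd": ["python3", "/edges.py"], "image": "pachyderm/opencv"},
    "input": {"pfs": {"repo": "images", "glob": "/*"}},
}

with tempfile.TemporaryDirectory() as tmp:
    spec_path = write_spec(spec, tmp)
    argv = create_pipeline_argv(spec_path)
    # Read the file back to confirm it holds the same spec.
    with open(spec_path, encoding="utf-8") as f:
        round_trip = json.load(f)

print(" ".join(argv))
```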

Find a pipeline specification hosted in a public or internal repository.

Pass the pipeline configuration to HPE Machine Learning Data Management:

```
pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/2.12.x/examples/opencv/edges.json
```

Jsonnet pipeline specs let you pass parameters dynamically when you create a pipeline, so you can reuse a baseline spec while changing the values of chosen fields. For example, you can create multiple pipelines from the same Jsonnet spec file and point each at a different input repository, parameterize the command line in the transform field, or pass different Docker images to train different models on the same dataset.

The following example creates a pipeline named edges-1 and points its input repository at the repo ‘images’:

```
pachctl create pipeline --jsonnet jsonnet/edges.jsonnet --arg suffix=1 --arg src=images
```
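The idea behind a parameterized spec can be shown with a Python analogy. This is not Jsonnet and not the contents of jsonnet/edges.jsonnet, which are not shown here. The function `edges_spec` is hypothetical, and the "edges-" + suffix naming is an assumption inferred from the statement that suffix=1 yields a pipeline named edges-1.

```python
def edges_spec(suffix, src):
    """Return a pipeline spec whose name and input repo depend on the arguments."""
    return {
        "pipeline": {"name": "edges-" + suffix},  # assumed naming scheme
        "transform": {"cmd": ["python3", "/edges.py"], "image": "pachyderm/opencv"},
        "input": {"pfs": {"repo": src, "glob": "/*"}},
    }


# Mirrors --arg suffix=1 --arg src=images from the command above.
spec = edges_spec(suffix="1", src="images")
print(spec["pipeline"]["name"], spec["input"]["pfs"]["repo"])

# The same template reused for a second, hypothetical input repo.
other = edges_spec(suffix="2", src="more-images")
print(other["pipeline"]["name"], other["input"]["pfs"]["repo"])
```

Only the parameterized fields differ between the two results; the transform stays the same, which is the reuse that the Jsonnet approach is meant to provide.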

Console

You can define a pipeline spec in JSON directly in the console UI.

Authenticate to HPE Machine Learning Data Management or access Console via Localhost.
Scroll through the project list to find a project you want to view.
Select View Project.
Select Create > Pipeline from the sidebar.
Define a pipeline spec in JSON and ensure it’s valid.
Review any cluster or project defaults that will be applied to the pipeline and overwrite them if necessary.
Select Create Pipeline.

You can create a pipeline by referencing a templated pipeline spec file in the console UI. This is a fast way to create pipelines that follow your organization's standards and best practices.

Pipeline templates support Jsonnet.

Authenticate to HPE Machine Learning Data Management or access Console via Localhost.
Scroll through the project list to find a project you want to view.
Select View Project.
Select Create > Pipeline from template from the sidebar.
Provide a valid path to the pipeline spec file.
Select Continue.
Fill out the fields populated from the pipeline spec file and verify that the default values are correct.
Select Create Pipeline.

Examples

JSON

```
{
  "pipeline": {
    "name": "edges"
  },
  "description": "A pipeline that performs image edge detection by using the OpenCV library.",
  "transform": {
    "cmd": [ "python3", "/edges.py" ],
    "image": "pachyderm/opencv"
  },
  "input": {
    "pfs": {
      "repo": "images",
      "glob": "/*"
    }
  }
}
```

YAML

```
pipeline:
  name: edges
description: A pipeline that performs image edge detection by using the OpenCV library.
transform:
  cmd:
  - python3
  - "/edges.py"
  image: pachyderm/opencv
input:
  pfs:
    repo: images
    glob: "/*"
```

Considerations

When you create a pipeline, HPE Machine Learning Data Management automatically creates an output repository with the same name as the pipeline. If a repo with that name already exists, the pipeline takes over its master branch. Files that were stored in the repo beforehand remain in the HEAD of that branch.