Beginner Tutorial: Create Pipelines

Part 4: Create the Pipelines

Pipelines make up the bulk of your data processing workflow. Together, they act as a series of stages that transform data from one state to another, based on the user code packaged in their Docker images. A pipeline’s overall behavior is defined by its Pipeline Specification (PPS).

For this tutorial, we’ll use pre-defined Docker images and pipeline specifications so that you can focus on learning the Pachyderm-specific concepts.


Video Converter Pipeline

We want to make sure that our DAG can handle videos in multiple formats, so first we’ll create a pipeline that will:

  • Skip images
  • Skip videos already in the correct format (.mp4)
  • Convert videos to .mp4 format

The converted videos are made available to the next pipeline in the DAG via the video_mp4_converter output repo: the user code saves every converted video to /pfs/out/, the standard location for output data so that the next pipeline in the DAG can access it.

Reference Assets

Info
Every pipeline, at minimum, needs a name, an input, and a transform. The input is the data that the pipeline will process, and the transform is the user code that will process it. transform.image is the Docker image, available in a container registry (such as Docker Hub), that is used to run the user code. transform.cmd is the command run inside the Docker container; it is the entrypoint for the user code to be executed against the input data.
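
To make those fields concrete, here is a minimal sketch of what the video converter’s spec could look like. The repo and pipeline names come from this tutorial, but the image tag, script name, and command are placeholders; the actual spec is in the Reference Assets above.

```yaml
# Minimal pipeline spec sketch; the image and cmd are placeholders,
# not the tutorial's published assets.
pipeline:
  name: video_mp4_converter
input:
  pfs:
    repo: raw_videos_and_images
    glob: "/*"                  # each top-level file or directory is one datum
transform:
  image: example/video_mp4_converter:1.0   # placeholder image reference
  cmd:
    - python3
    - /video_mp4_converter.py   # placeholder entrypoint for the user code
```

A spec file like this is typically applied with pachctl create pipeline -f video_mp4_converter.yaml, which creates both the pipeline and its output repo of the same name.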

Image Flattener Pipeline

Next, we’ll create a pipeline that will flatten the videos into individual .png image frames. Like the previous pipeline, the user code outputs the frames to /pfs/out so that the next pipeline in the DAG can access them in the image_flattener repo.
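
Under the same assumptions (placeholder image and command), the flattener’s spec might look like the sketch below: it reads the converted videos from the video_mp4_converter repo, and the user code writes each video’s frames to /pfs/out. The actual spec is in the Reference Assets below.

```yaml
# Hypothetical sketch; see the Reference Assets for the actual spec.
pipeline:
  name: image_flattener
input:
  pfs:
    repo: video_mp4_converter   # output repo of the previous pipeline
    glob: "/*"
transform:
  image: example/image_flattener:1.0   # placeholder image reference
  cmd:
    - python3
    - /image_flattener.py       # placeholder entrypoint for the user code
```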

Reference Assets


Image Tracing Pipeline

Up until this point, we’ve used a simple single input from the Pachyderm file system (input.pfs) and a basic glob pattern (/*) to specify the shape of our datums. This particular pattern treats each top-level file and directory as a single datum. However, this pipeline has some special requirements:

  • We want to process only the raw images from the raw_videos_and_images repo
  • We want to process all of the flattened video frame images from the image_flattener pipeline

To achieve this, we’re going to need to use a union input (input.union) to combine the two inputs into a single input for the pipeline.

  • For the raw_videos_and_images input, we can use a more powerful glob pattern to ensure that only image files are processed (/*.{png,jpg,jpeg})
  • For the image_flattener input, we can use the same glob pattern as before (/*) to ensure that each video’s collection of frames is processed together

Notice how we also update transform.cmd to accommodate having two inputs, as sketched below.
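
Here is a sketch of how the union input and the two glob patterns could be expressed; the image and command remain placeholders, and the actual spec is in the Reference Assets below.

```yaml
# Hypothetical sketch of the union input described above.
pipeline:
  name: image_tracer
input:
  union:
    - pfs:
        repo: raw_videos_and_images
        glob: "/*.{png,jpg,jpeg}"   # only raw image files become datums
    - pfs:
        repo: image_flattener
        glob: "/*"                  # each video's directory of frames is one datum
transform:
  image: example/image_tracer:1.0   # placeholder image reference
  cmd:
    - python3
    - /image_tracer.py              # placeholder entrypoint; the user code reads from
                                    # /pfs/raw_videos_and_images and /pfs/image_flattener
```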

Reference Assets

Info
Since this pipeline traces every image and video frame, it may take a few minutes to complete.

Gif Pipeline

Next, we’ll create a pipeline that will create two gifs:

  1. A gif of the original video’s flattened frames (from the image_flattener output repo)
  2. A gif of the video’s traced frames (from the image_tracer output repo)

To make a gif of both the original video frames and the traced frames, we again need a union input so that we can process both the image_flattener and image_tracer output repos.

Notice that the glob pattern has changed: here we want to treat each directory in an input as a single datum, so we use the glob pattern /*/. This works because the user code stores each video’s frames in a directory with the same name as the video file.
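
A hedged sketch of such a spec follows; the pipeline name, image, and command are assumptions, while the union input and the /*/ glob follow the description above. The actual spec is in the Reference Assets below.

```yaml
# Hypothetical sketch; the name "gif_maker" is assumed, see the Reference Assets.
pipeline:
  name: gif_maker
input:
  union:
    - pfs:
        repo: image_flattener
        glob: "/*/"               # one datum per directory of original frames
    - pfs:
        repo: image_tracer
        glob: "/*/"               # one datum per directory of traced frames
transform:
  image: example/gif_maker:1.0    # placeholder image reference
  cmd:
    - python3
    - /gif_maker.py               # placeholder entrypoint for the user code
```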

Reference Assets

Info
Since this pipeline is converting video frames to gifs, it may take a few minutes to complete.

Content Shuffler Pipeline

We have everything we need to make the comparison collage, but before we do that, we need to re-shuffle the content so that the original images and gifs are in one directory (originals) and the traced images and gifs are in another directory (edges). This makes the data easier for the collage user code to process. Pipelines like this, which only reorganize files, are a common pattern in HPE Machine Learning Data Management and are referred to as shuffle pipelines.
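
Structurally, a shuffle pipeline’s spec looks like any other; only its user code is limited to moving files into the originals and edges directories under /pfs/out. In the rough sketch below, the pipeline name, inputs, image, and command are all illustrative assumptions; the actual spec is in the Reference Assets below.

```yaml
# Hypothetical sketch of a shuffle pipeline; it reorganizes files rather than
# transforming their contents. The inputs shown are illustrative only.
pipeline:
  name: content_shuffler          # assumed name
input:
  union:
    - pfs:
        repo: image_tracer        # traced frames
        glob: "/*"
    - pfs:
        repo: gif_maker           # assumed name of the gif pipeline's output repo
        glob: "/*"
transform:
  image: example/content_shuffler:1.0   # placeholder image reference
  cmd:
    - python3
    - /content_shuffler.py        # writes to /pfs/out/originals and /pfs/out/edges
```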

Reference Assets


Content Collager Pipeline

Finally, we’ll create a pipeline that produces a static HTML page for viewing the original and traced content side by side.
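
As a final hedged sketch: assuming the collager reads the shuffled content as a single datum (so originals and edges can be paired), its user code would render the HTML page to /pfs/out. The names, glob, image, and command below are assumptions; the actual spec is in the Reference Assets below.

```yaml
# Hypothetical sketch; the user code emits a static HTML collage to /pfs/out.
pipeline:
  name: content_collager          # assumed name
input:
  pfs:
    repo: content_shuffler        # assumed name of the shuffler's output repo
    glob: "/"                     # assumption: the whole shuffled repo as one datum
transform:
  image: example/content_collager:1.0   # placeholder image reference
  cmd:
    - python3
    - /content_collager.py        # placeholder entrypoint for the user code
```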

Reference Assets