In this tutorial, we’ll build a scalable inference pipeline for breast cancer detection using data parallelism.
Before You Start #
- You must have an HPE Machine Learning Data Management cluster up and running
- You should have some basic familiarity with HPE Machine Learning Data Management pipeline specs – see the Transform, Cross Input, Resource Limits, Resource Requests, and Parallelism sections in particular
Tutorial #
The Docker image containing this tutorial’s user code is built on top of the pytorch/pytorch base image, which includes the necessary dependencies. The underlying code and pre-trained breast cancer detection model come from this repo, developed by the Center for Data Science and Department of Radiology at NYU. Their original paper can be found here.
1. Create a Project & Input Repos #
2. Create a Classification Pipeline #
We need a pipeline that classifies the breast cancer images. It uses a cross input to pair the sample data with the models.
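To make the cross input concrete, here is a minimal sketch of what such a pipeline spec could look like, written as a Python dict and serialized to JSON. The top-level fields (pipeline, transform, input.cross, pfs, parallelism_spec, resource_requests) are standard pipeline spec fields. The pipeline name, image tag, command, glob patterns, and resource values are placeholders assumed for illustration, not the values this tutorial uses. The small helper at the end only models how a cross input pairs datums from each input.

```python
import json
from itertools import product

# Hypothetical spec: the name, image, cmd, globs, and resource values are
# illustrative assumptions, not the tutorial's actual configuration.
pipeline_spec = {
    "pipeline": {"name": "bc-classification"},
    "transform": {
        "image": "example/bc-classification:latest",
        "cmd": ["/bin/bash", "run.sh"],
    },
    "input": {
        "cross": [
            {"pfs": {"repo": "sample_data", "glob": "/*"}},  # one datum per exam
            {"pfs": {"repo": "models", "glob": "/"}},        # all models as one datum
        ]
    },
    "parallelism_spec": {"constant": 4},
    "resource_requests": {"cpu": 1, "memory": "4G"},
}


def cross_datums(*inputs):
    """Model a cross input: every combination of one datum from each input."""
    return list(product(*inputs))


exam_datums = ["/exam_a", "/exam_b", "/exam_c"]  # hypothetical exam IDs
model_datums = ["/"]                             # the whole models repo
pairs = cross_datums(exam_datums, model_datums)

print(json.dumps(pipeline_spec, indent=2))
print(len(pairs), "datums:", pairs)
```

Under these assumed globs, each exam is paired with the full set of models, so the number of datums grows with the number of exams rather than with the number of model files.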
Datum Shape #
When you define a glob pattern in your pipeline, you are defining how HPE Machine Learning Data Management should split the data so that the code can execute as parallel jobs without having to modify the underlying implementation.
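As a rough mental model of how a glob pattern shapes datums, the sketch below groups a list of file paths the way patterns such as /, /* and /*/* would. It is a simplification written for this explanation, not the platform's implementation: the function name split_into_datums and the example paths are hypothetical, and it only handles patterns made entirely of * segments.

```python
from collections import defaultdict


def split_into_datums(paths, glob):
    """Group file paths into datums for globs built only from '*' segments.

    Simplified model: "/" -> one datum for the whole repo, "/*" -> one datum
    per top-level entry, "/*/*" -> one datum per second-level entry.
    """
    segments = [seg for seg in glob.split("/") if seg]
    if any(seg != "*" for seg in segments):
        raise ValueError("this sketch only handles '/', '/*', '/*/*', ...")
    depth = len(segments)

    datums = defaultdict(list)
    for path in paths:
        parts = path.strip("/").split("/")
        if len(parts) < depth:
            continue  # path is shallower than the glob, so it does not match
        datums["/" + "/".join(parts[:depth])].append(path)
    return dict(datums)


# Hypothetical repo contents
paths = [
    "/exam_a/L_CC.png",
    "/exam_a/R_CC.png",
    "/exam_b/L_CC.png",
    "/exam_b/R_CC.png",
]

for pattern in ("/", "/*", "/*/*"):
    print(pattern, "->", len(split_into_datums(paths, pattern)), "datum(s)")
```

With these four files, / yields one datum, /* yields two (one per top-level directory), and /*/* yields four (one per file). More datums means more units of work that can be distributed across workers.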
In this case, we are treating each exam (4 images and a list file) as a single datum. Each datum is processed individually, allowing parallelized computation for each exam that is added. The file structure for our sample_data is organized as follows:
sample_data/
├── <unique_exam_id>
│ ├── L_CC.png
│ ├── L_MLO.png
│ ├── R_CC.png
│ ├── R_MLO.png
│ └── gen_exam_list_before_cropping.pkl
├── <unique_exam_id>
│ ├── L_CC.png
│ ├── L_MLO.png
│ ├── R_CC.png
│ ├── R_MLO.png
│ └── gen_exam_list_before_cropping.pkl
...
The gen_exam_list_before_cropping.pkl is a pickled version of the image list, a requirement of the underlying library being used.
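For readers unfamiliar with pickle files, the following sketch writes a small list to gen_exam_list_before_cropping.pkl in a temporary directory and reads it back. The contents of exam_list are a hypothetical stand-in: the real schema is defined by the underlying library and may differ from what is shown here.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical exam list: the real schema is defined by the underlying
# library and may differ from this illustration.
exam_list = [
    {"L-CC": ["L_CC"], "L-MLO": ["L_MLO"], "R-CC": ["R_CC"], "R-MLO": ["R_MLO"]}
]

with tempfile.TemporaryDirectory() as exam_dir:
    pkl_path = Path(exam_dir) / "gen_exam_list_before_cropping.pkl"

    # Serialize the list to disk
    with open(pkl_path, "wb") as f:
        pickle.dump(exam_list, f)

    # Only unpickle files you trust: pickle.load can execute arbitrary code.
    with open(pkl_path, "rb") as f:
        loaded = pickle.load(f)

print(loaded)
```

Loading an existing exam's .pkl file the same way is a quick way to inspect which image names it references before running the pipeline.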
3. Upload Dataset #
- Open or download this GitHub repo.
  gh repo clone pachyderm/docs-content
- Navigate to this tutorial.
  cd content/products/mldm/latest/build-dags/tutorials/data-parallelism
- Upload the sample_data and models folders to your repos.
User Code Assets #
The Docker image used in this tutorial was built with the following assets: