Defer Processing via Staging Branch
When you want to load data into HPE Machine Learning Data Management without triggering a pipeline, you can upload it to a staging branch and then submit the accumulated changes in one batch by re-pointing the HEAD of your master branch to a commit in the staging branch. Let’s see how this works.
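In outline, the workflow comes down to two kinds of commands: uploads to the staging branch, followed by a single re-pointing of master. The sketch below wraps the two pachctl commands used on this page in small shell functions. The `run` helper, the `DRY_RUN` variable, and the file names `first.csv` and `second.csv` are hypothetical and exist only for illustration. With `DRY_RUN` left at its default of 1, the commands are printed rather than executed, so the sketch can be read and tried without a cluster.

```shell
#!/bin/sh
# Sketch of the staging-branch workflow (not an official script).
# DRY_RUN defaults to 1, so pachctl commands are printed, not executed.

# run: hypothetical helper; prints the command in dry-run mode, else runs it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# stage_file REPO FILE: upload FILE to the staging branch of REPO.
stage_file() {
  run pachctl put file "$1@staging" -f "$2"
}

# promote_staging REPO: re-point the HEAD of master at the staging branch.
promote_staging() {
  run pachctl create branch "$1@master" --head staging
}

# Accumulate data on staging, then submit it in one batch.
stage_file data first.csv    # hypothetical file name
stage_file data second.csv   # hypothetical file name
promote_staging data
```

Setting `DRY_RUN=0` would run the commands against a real cluster, assuming `pachctl` is installed and connected.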
How to Use a Staging Branch
-
Create a repository, for example, data:
pachctl create repo data
-
Create a master branch:
pachctl create branch data@master
-
View the created branch:
pachctl list commit data
REPO BRANCH COMMIT                           FINISHED           SIZE ORIGIN DESCRIPTION
data master 8090bfb4d4fe44158eac12199c37a591 About a minute ago 0B   AUTO
HPE Machine Learning Data Management automatically created an empty HEAD commit on the new branch, as you can see from the 0B (zero-byte) size and the AUTO commit origin.
-
Commit a file to a staging branch:
pachctl put file data@staging -f <file>
HPE Machine Learning Data Management automatically creates the staging branch. Your repo now has two branches, staging and master. This example uses the name staging, but you can name the branch anything you like and have as many staging branches as you need.
-
Verify that the branches were created:
pachctl list branch data
BRANCH  HEAD                             TRIGGER
staging f3506f0fab6e483e8338754081109e69 -
master  8090bfb4d4fe44158eac12199c37a591 -
The master branch still has the same HEAD commit. No jobs have started to process the new file, because no pipeline takes the staging branch as input. You can continue to commit to staging to add new data to the branch, and pipelines that read from master will not process any of it.
-
When you are ready to process the data, update the master branch to point to the head of the staging branch:
pachctl create branch data@master --head staging
-
List your branches to verify that the master branch’s HEAD commit has changed:
pachctl list branch data
staging f3506f0fab6e483e8338754081109e69
master  f3506f0fab6e483e8338754081109e69
The master and staging branches now have the same HEAD commit. This means that your pipeline has data to process.
-
Verify that the pipeline has new jobs:
pachctl list job data@f3506f0fab6e483e8338754081109e69
ID                               PIPELINE STARTED        DURATION           RESTART PROGRESS  DL   UL  STATE
f3506f0fab6e483e8338754081109e69 data     32 seconds ago Less than a second 0       6 + 0 / 6 108B 24B success
You should see one job that HPE Machine Learning Data Management created for all the changes you have submitted to the staging branch, with the same ID as the HEAD commit. Although the earlier commits to the staging branch are ancestors of the current HEAD in master, they were never the HEAD of master themselves, so they are not processed individually. This behavior works for most use cases because commits in HPE Machine Learning Data Management are generally additive, so processing the HEAD commit also processes the data from previous commits.
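The branch check in the steps above can also be scripted. The following is a minimal sketch, assuming the output of `pachctl list branch` keeps the layout shown on this page, with the branch name in the first column and the HEAD commit in the second. The function names `branch_head` and `heads_match` are hypothetical helpers, not part of pachctl.

```shell
#!/bin/sh
# Sketch only: assumes branch name in column 1 and HEAD commit in column 2,
# as in the sample `pachctl list branch` output on this page.

# branch_head BRANCH: read a branch listing on stdin and print the HEAD
# commit of BRANCH (prints nothing if the branch is not listed).
branch_head() {
  awk -v b="$1" '$1 == b { print $2; exit }'
}

# heads_match: read a branch listing on stdin and succeed only if master
# and staging are both listed and share the same HEAD commit.
heads_match() {
  listing=$(cat)
  m=$(printf '%s\n' "$listing" | branch_head master)
  s=$(printf '%s\n' "$listing" | branch_head staging)
  [ -n "$m" ] && [ "$m" = "$s" ]
}

# Sample rows copied from the listing shown after master was re-pointed.
sample='staging f3506f0fab6e483e8338754081109e69
master  f3506f0fab6e483e8338754081109e69'

if printf '%s\n' "$sample" | heads_match; then
  echo "master and staging share a HEAD"
fi
```

Against a live cluster, the same check would read `pachctl list branch data | heads_match`, again assuming the column layout above.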