Intro to Data Versioning
Introduction to Data Versioning #
This page gives a brief overview of how to use and interact with versioned data in HPE Machine Learning Data Management. Collectively, these capabilities are often referred to as the HPE Machine Learning Data Management File System (PFS).
Repositories #
Data versioning in HPE Machine Learning Data Management starts with creating a data repository. HPE Machine Learning Data Management data repos are similar to Git repositories in that they provide a place to track changes made to a set of files.
Using the HPE Machine Learning Data Management CLI (pachctl), we would create a repository called data with the create repo command.
pachctl create repo data
Once a repo is created, data can be added to, deleted from, or updated on a branch, and all changes are versioned with commits.
Commits #
In HPE Machine Learning Data Management, commits are made to branches of a repo. For example, if we add a file to our data repository in the following session, that file is captured in a commit.
$ pachctl put file data@master -f my_file.bin
$ pachctl list commit data@master
REPO BRANCH COMMIT FINISHED SIZE ORIGIN
data master 6806cce 4 seconds ago 57.27KiB USER
$ pachctl list file data@master
NAME TYPE SIZE
/my_file.bin file 57.27KiB
If we then delete that file, it is removed from the active state of the branch, but the commit still exists.
$ pachctl delete file data@master:/my_file.bin
$ pachctl list commit data@master
REPO BRANCH COMMIT FINISHED SIZE ORIGIN
data master ff1867a 3 seconds ago 0B USER
data master 6806cce 20 seconds ago 57.27KiB USER
$ pachctl list file data@master
NAME TYPE SIZE
Then if we add the file back, we’ll see a third commit.
$ pachctl put file data@master -f my_file.bin
$ pachctl list commit data@master
REPO BRANCH COMMIT FINISHED SIZE ORIGIN
data master 0ec029b 20 seconds ago 57.27KiB USER
data master ff1867a 3 seconds ago 0B USER
data master 6806cce 20 seconds ago 57.27KiB USER
$ pachctl list file data@master
NAME TYPE SIZE
/my_file.bin file 57.27KiB
Visualizing the commit history for the master branch looks like the following.
%%{init: { 'logLevel': 'debug', 'theme': 'base', 'gitGraph': {'showBranches': true, 'showCommitLabel':true,'mainBranchName': 'data'}} }%%
gitGraph
    commit id:"6806cce"
    commit id:"ff1867a"
    commit id:"0ec029b" tag: "master"
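The session above can be modelled in a few lines of Python. This is a minimal sketch for intuition only, not how PFS is implemented: the `Commit` and `Branch` classes and the `put_file`/`delete_file` helpers are hypothetical names, and each commit simply stores a full mapping of path to size in bytes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    """Hypothetical commit: a snapshot of path -> size in bytes."""
    files: Dict[str, int]
    parent: Optional["Commit"] = None


class Branch:
    """Hypothetical branch: remembers only its most recent commit (HEAD)."""

    def __init__(self) -> None:
        self.head: Optional[Commit] = None

    def _commit(self, files: Dict[str, int]) -> None:
        # Every change produces a new commit whose parent is the old HEAD.
        self.head = Commit(files=files, parent=self.head)

    def put_file(self, path: str, size: int) -> None:
        current = dict(self.head.files) if self.head else {}
        current[path] = size
        self._commit(current)

    def delete_file(self, path: str) -> None:
        current = dict(self.head.files) if self.head else {}
        current.pop(path, None)
        self._commit(current)

    def history(self) -> List[Commit]:
        # Walk parent links from HEAD back to the first commit (newest first).
        out: List[Commit] = []
        commit = self.head
        while commit is not None:
            out.append(commit)
            commit = commit.parent
        return out


master = Branch()
master.put_file("/my_file.bin", 58644)  # first commit (roughly 57.27KiB)
master.delete_file("/my_file.bin")      # second commit, empty snapshot
master.put_file("/my_file.bin", 58644)  # third commit

sizes = [sum(c.files.values()) for c in master.history()]
print(sizes)  # [58644, 0, 58644], newest first
```

Deleting the file does not remove the first commit: it adds a new, empty snapshot on top of it, which is why the history still has three entries.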
Branches are critical for tracking commits. A branch functions as a pointer to the most recent commit made to it. For instance, when we add a file to the master branch (pachctl put file data@master -f my_new_file), a new commit is created and the master branch points at it.
%%{init: { 'logLevel': 'debug', 'theme': 'base', 'gitGraph': {'showBranches': true, 'showCommitLabel':true,'mainBranchName': 'data'}} }%%
gitGraph
    commit id:"6806cce"
    commit id:"ff1867a"
    commit id:"0ec029b"
    commit id:"b69b3e3" tag: "master"
As we’ve already seen, we can reference the HEAD of the branch with the syntax data@master.
%%{init: { 'logLevel': 'debug', 'theme': 'base', 'gitGraph': {'showBranches': true, 'showCommitLabel':true,'mainBranchName': 'data@master'}} }%%
gitGraph
    commit id:"6806cce"
    commit id:"ff1867a"
    commit id:"0ec029b"
    commit id:"b69b3e3" type:HIGHLIGHT tag: "HEAD"
Navigating Commits #
Here we’ll introduce the basics of how to navigate commits. Navigating these commits is an important aspect of working with PFS, and allows you to easily manage the history and evolution of your data.
One useful feature for navigating commits in PFS is the ability to refer to a previous commit using ancestry syntax. This syntax lets you specify a commit relative to the current one, which makes it simple to switch between versions of your data and to perform operations like diffing, branching, and merging.
To refer to the commit immediately before the HEAD, append a caret (data@master^):
%%{init: { 'logLevel': 'debug', 'theme': 'base', 'gitGraph': {'showBranches': true, 'showCommitLabel':true,'mainBranchName': 'data@master^'}} }%%
gitGraph
    commit id:"6806cce"
    commit id:"ff1867a"
    commit id:"0ec029b" type:HIGHLIGHT
    commit id:"b69b3e3" tag: "HEAD"
To refer to the commit two before the HEAD, repeat the caret (data@master^^):
%%{init: { 'logLevel': 'debug', 'theme': 'base', 'gitGraph': {'showBranches': true, 'showCommitLabel':true,'mainBranchName': 'data@master^^'}} }%%
gitGraph
    commit id:"6806cce"
    commit id:"ff1867a" type:HIGHLIGHT
    commit id:"0ec029b"
    commit id:"b69b3e3" tag: "HEAD"
Similarly, we can abbreviate this by following the caret with a number (data@master^2):
%%{init: { 'logLevel': 'debug', 'theme': 'base', 'gitGraph': {'showBranches': true, 'showCommitLabel':true,'mainBranchName': 'data@master^2'}} }%%
gitGraph
    commit id:"6806cce"
    commit id:"ff1867a" type:HIGHLIGHT
    commit id:"0ec029b"
    commit id:"b69b3e3" tag: "HEAD"
We can also reference commits in numerical order using .n, where n is the commit number. For example, data@master.1 is the first commit on the branch.
%%{init: { 'logLevel': 'debug', 'theme': 'base', 'gitGraph': {'showBranches': true, 'showCommitLabel':true,'mainBranchName': 'data@master.1'}} }%%
gitGraph
    commit id:"6806cce" type:HIGHLIGHT
    commit id:"ff1867a"
    commit id:"0ec029b"
    commit id:"b69b3e3" tag: "HEAD"
%%{init: { 'logLevel': 'debug', 'theme': 'base', 'gitGraph': {'showBranches': true, 'showCommitLabel':true,'mainBranchName': 'data@master.-1'}} }%%
gitGraph
    commit id:"6806cce"
    commit id:"ff1867a"
    commit id:"0ec029b" type:HIGHLIGHT
    commit id:"b69b3e3" tag: "HEAD"
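To make the ancestry forms concrete, here is a small Python sketch that resolves a reference against the four-commit history used in the diagrams. The `resolve` function is a hypothetical helper, not part of pachctl or any SDK, and it deliberately covers only the caret forms and positive `.n` values.

```python
import re
from typing import List

# Commit IDs on data@master from the diagrams, oldest first.
HISTORY: List[str] = ["6806cce", "ff1867a", "0ec029b", "b69b3e3"]

# repo@branch followed by an optional suffix: ^, ^^, ^N or .N
REF_PATTERN = re.compile(
    r"(?:[\w-]+@)?(?P<branch>[\w-]+)(?P<suffix>\^+|\^\d+|\.\d+)?"
)


def resolve(ref: str, history: List[str]) -> str:
    """Resolve a reference such as 'data@master^2' to a commit ID.

    Hypothetical helper for illustration: negative '.n' values are
    not handled.
    """
    match = REF_PATTERN.fullmatch(ref)
    if match is None:
        raise ValueError(f"unsupported reference: {ref!r}")
    suffix = match.group("suffix") or ""
    head = len(history) - 1
    if suffix.startswith("."):
        index = int(suffix[1:]) - 1        # .1 is the first commit
    elif suffix[1:].isdigit():
        index = head - int(suffix[1:])     # ^N: N commits before HEAD
    else:
        index = head - len(suffix)         # one step back per caret
    if not 0 <= index < len(history):
        raise ValueError(f"{ref!r} is outside the branch history")
    return history[index]


for ref in ("data@master", "data@master^", "data@master^^",
            "data@master^2", "data@master.1"):
    print(ref, "->", resolve(ref, HISTORY))
```

With this history, `data@master^^` and `data@master^2` resolve to the same commit, matching the two diagrams that highlight `ff1867a`.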
Branches #
In HPE Machine Learning Data Management, branches are used to track changes in a repository. You can think of a branch as a tag on a specific commit. Branches are associated with a particular commit and are updated as new commits are made (moving the HEAD of that branch to its most recent commit). This also means that at any time, you can change the commit that a branch is associated with, affecting branch history.
Here’s an example of a repo with three branches, each with its own history of commits:
%%{init: { 'logLevel': 'debug', 'theme': 'base', 'gitGraph': {'showBranches': true, 'showCommitLabel':true,'mainBranchName': 'master'}} }%%
gitGraph
    commit
    commit
    branch v1.0
    commit
    commit
    commit
    branch v1.1
    commit
    commit
    commit tag:"v1.1:HEAD"
    checkout v1.0
    commit tag:"v1.0:HEAD"
    checkout master
    commit tag:"master:HEAD"
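The three-branch layout above can be sketched as a table of parent links plus one pointer per branch. The commit IDs `c0` to `c9` are hypothetical placeholders, since the diagram leaves its commits unnamed, and the `history` helper is illustrative rather than a PFS API.

```python
from typing import Dict, List, Optional

# Hypothetical commit IDs mirroring the diagram: commit -> parent.
PARENTS: Dict[str, Optional[str]] = {
    "c0": None, "c1": "c0",              # committed on master
    "c2": "c1", "c3": "c2", "c4": "c3",  # committed on v1.0
    "c5": "c4", "c6": "c5", "c7": "c6",  # committed on v1.1
    "c8": "c4",                          # later commit on v1.0
    "c9": "c1",                          # later commit on master
}

# A branch is just a name pointing at one commit (its HEAD).
BRANCHES: Dict[str, str] = {"master": "c9", "v1.0": "c8", "v1.1": "c7"}


def history(branch: str) -> List[str]:
    """Return the commits reachable from a branch HEAD, newest first."""
    out: List[str] = []
    commit: Optional[str] = BRANCHES[branch]
    while commit is not None:
        out.append(commit)
        commit = PARENTS[commit]
    return out


for name in BRANCHES:
    print(name, history(name))

# Re-pointing a branch changes the history seen through it.
BRANCHES["v1.0"] = "c4"
print("v1.0", history("v1.0"))
```

All three histories share `c1` and `c0`, and moving the `v1.0` pointer back to `c4` drops `c8` from that branch's history without touching the commit itself.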
“Merging” Branches #
The concept of merging binary data from different commits is complex. There are too many edge cases to do it reliably for every type of binary data, because computing a diff between two commits is meaningless unless you know how to compare the data. For example, we know that text files can be compared line by line and bitmap images pixel by pixel, but how would we compute a diff for, say, binary model files?
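The contrast can be illustrated with Python's standard library. In this sketch the sample data is made up: a line-based diff of two small text files yields a readable change, while for two opaque byte blobs the most a generic tool can say is whether they differ.

```python
import difflib
import hashlib

# Text data: a line-by-line diff is meaningful.
old_text = "id,label\n1,cat\n2,dog\n"
new_text = "id,label\n1,cat\n2,bird\n"
text_diff = list(
    difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="old",
        tofile="new",
        lineterm="",
    )
)
print("\n".join(text_diff))

# Opaque binary data (made-up bytes standing in for a model file):
# without knowing the format, we can only tell that something changed.
old_blob = bytes([0x00, 0x01, 0x02, 0x03])
new_blob = bytes([0x00, 0xFF, 0x02, 0x03])
changed = hashlib.sha256(old_blob).digest() != hashlib.sha256(new_blob).digest()
print("binary changed:", changed)
```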
Additionally, the output of a merge is usually a master copy: the official set of files desired. We rarely combine multiple pieces of image data to make one image, and when we do, we have usually developed a specific technique for doing so. In the end, some files will be deleted, some updated, and some added.
Instead, merging data means creating a new commit with the desired combination of files and pointing our branch at that commit. To maintain a proper history, we would also want to make sure that the parent of that commit is relevant to what we want.
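As a rough illustration of "the desired combination of files", the sketch below builds the file set for such a commit by hand. Everything here is hypothetical: the `combine` helper, the file paths, and the content hashes are invented for the example, and the caller decides explicitly which files to take and which to drop.

```python
from typing import Dict, Iterable


def combine(
    base: Dict[str, str],
    other: Dict[str, str],
    take: Iterable[str],
    drop: Iterable[str],
) -> Dict[str, str]:
    """Build a new path -> content-hash mapping from two snapshots.

    Hypothetical helper: 'take' lists paths to copy from 'other'
    (added or updated), 'drop' lists paths to delete from the result.
    """
    merged = dict(base)
    for path in take:
        merged[path] = other[path]
    for path in drop:
        merged.pop(path, None)
    return merged


master_files = {"/a.bin": "h1", "/b.bin": "h2", "/old.bin": "h3"}
dev_files = {"/a.bin": "h1", "/b.bin": "h2-new", "/c.bin": "h4"}

merged_files = combine(
    master_files,
    dev_files,
    take=["/b.bin", "/c.bin"],  # one updated file, one added file
    drop=["/old.bin"],          # one deleted file
)
print(merged_files)
```

No content is merged inside any file. The result is a choice of whole files, which would then be written as a new commit on the target branch.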
For example, in this situation, we have created a branch, dev, based on the 1-2833cd3 commit. We have committed multiple times to the dev branch, but nothing to master.
%%{init: { 'logLevel': 'debug', 'theme': 'base', 'gitGraph': {'showBranches': true, 'showCommitLabel':true,'mainBranchName': 'master'}} }%%
gitGraph
    commit id:"0-96e9b89"
    commit id:"1-2833cd3" tag:"master:HEAD"
    branch dev
    commit id:"2-25a8daf"
    commit id:"3-6413afc"
    commit id:"4-41a750b" tag:"dev:HEAD"
In this case, we can simply move the master branch to point at the most recent commit on dev, 4-41a750b.
pachctl create branch data@master --head 41a750b
The result would look like this:
%%{init: { 'logLevel': 'debug', 'theme': 'base', 'gitGraph': {'showBranches': true, 'showCommitLabel':true,'mainBranchName': 'master'}} }%%
gitGraph
    commit id:"0-96e9b89"
    commit id:"1-2833cd3"
    branch dev
    commit id:"2-25a8daf"
    commit id:"3-6413afc"
    commit id:"4-41a750b" tag:"master:HEAD, dev:HEAD"
Or from the history perspective of the respective branches:
%%{init: { 'logLevel': 'debug', 'theme': 'base', 'gitGraph': {'showBranches': true, 'showCommitLabel':true,'mainBranchName': 'master'}} }%%
gitGraph
    commit id:"0-96e9b89"
    commit id:"1-2833cd3"
    branch dev
    commit id:"2-25a8daf"
    commit id:"3-6413afc"
    checkout dev
    commit tag:"dev:HEAD" id:"4-41a750b"
    checkout master
    commit id:"2-25a8daf "
    commit id:"3-6413afc "
    commit tag:"master:HEAD" id:"4-41a750b "
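Moving master to the dev HEAD is safe here because the old master HEAD is an ancestor of the new one, so no master history is lost. The sketch below models that check in Python. The `lineage` and `fast_forward` helpers are hypothetical, the ancestry check is an illustration of keeping the parent "relevant" rather than documented pachctl behavior, and the commit IDs are the diagram IDs without their ordinal prefixes.

```python
from typing import Dict, List, Optional

# commit -> parent, taken from the diagrams above.
PARENTS: Dict[str, Optional[str]] = {
    "96e9b89": None,
    "2833cd3": "96e9b89",
    "25a8daf": "2833cd3",
    "6413afc": "25a8daf",
    "41a750b": "6413afc",
}
branches: Dict[str, str] = {"master": "2833cd3", "dev": "41a750b"}


def lineage(commit: str) -> List[str]:
    """Return the commit and all of its ancestors, newest first."""
    out: List[str] = []
    current: Optional[str] = commit
    while current is not None:
        out.append(current)
        current = PARENTS[current]
    return out


def fast_forward(name: str, new_head: str) -> None:
    """Move a branch only if its current HEAD is an ancestor of new_head."""
    if branches[name] not in lineage(new_head):
        raise ValueError(f"{new_head} does not descend from {branches[name]}")
    branches[name] = new_head


fast_forward("master", "41a750b")
print(lineage(branches["master"]) == lineage(branches["dev"]))  # True
```

After the move, both branches report the same five-commit history, which corresponds to the per-branch view in the last diagram.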
Branches are useful for many reasons, but in HPE Machine Learning Data Management they also form the foundation of the pipeline system. New commits on a branch can trigger pipelines to run, which enables one of the platform's key differentiators: data-driven pipelines.