Data Caching (CDRs)
Pachyderm’s Common Data Refs (CDRs) feature optimizes the handling of large, remote datasets in pipeline operations. Instead of downloading entire datasets for each pipeline run, CDRs enable local caching of data, significantly improving efficiency and performance. This approach offers several benefits:
- Reduces time spent on data transfer
- Minimizes network usage
- Enables faster pipeline execution
- Allows for efficient handling of version-controlled data
By leveraging CDRs, Pachyderm provides a solution that balances the need for up-to-date data with the performance advantages of local data access, making it ideal for workflows involving large, frequently used datasets.
Before You Start
Usage of the Common Data Refs (CDRs) feature requires the following:
- You must use Pachyderm version 2.11.0+.
- You must install the cdr extras package for the Pachyderm SDK (pachyderm_sdk[cdr], version 2.11.0+).
- You must be using a storage backend that is S3-compatible.
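The version requirement above can be checked from Python before building a pipeline. The following is a minimal sketch, not part of the Pachyderm SDK: the helper names (parse_version, meets_minimum, installed_sdk_version) are hypothetical, and it assumes the installed distribution is registered under the name pachyderm_sdk. It only compares the base package version and does not confirm that the cdr extras are installed.

```python
from importlib import metadata

# Minimum SDK version stated in the requirements list.
MINIMUM_VERSION = (2, 11, 0)


def parse_version(text):
    """Turn a string such as '2.11.4' into a tuple such as (2, 11, 4).

    Only the leading digits of each of the first three parts are used,
    so a suffix such as 'rc1' is ignored.
    """
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version_text, minimum=MINIMUM_VERSION):
    """Return True if version_text is at least the minimum version."""
    return parse_version(version_text) >= minimum


def installed_sdk_version(dist_name="pachyderm_sdk"):
    """Return the installed version string, or None if it is not installed.

    The distribution name is an assumption; adjust it if your
    environment registers the package under a different name.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_sdk_version()
    if found is None:
        print("pachyderm_sdk is not installed")
    elif meets_minimum(found):
        print(f"pachyderm_sdk {found} satisfies the 2.11.0+ requirement")
    else:
        print(f"pachyderm_sdk {found} is older than 2.11.0")
```

Running the script prints one of three messages depending on what is installed in the current environment.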
How to Cache Data via Common Data Refs (CDRs)
The following high-level walkthrough uses the JupyterLab Extension to create a pipeline and define your user code.
- Create an input repo with your files. For example, default/cdrs-demo-input.
- Add pachyderm_sdk[cdr]==2.11.4 to your requirements.txt file.
- Create a notebook. For example, notebook.ipynb.
- Add the following imports and define a cache location.
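    import os

    from pachyderm_sdk import Client
    from pachyderm_sdk.api import pfs, storage

    # Cache directory under the current working directory. The name has no
    # leading slash because os.path.join discards earlier components when a
    # later one is absolute.
    CACHE_LOCATION = os.path.join(os.getcwd(), "cache")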
- Obtain the pipeline worker environment variables FILESET_ID and PACH_DATUM_ID, which are needed to assemble the fileset.
    fileset_id = os.environ['FILESET_ID']
    datum_path = f"/pfs/{os.environ['PACH_DATUM_ID']}"
- Initialize the Pachyderm client.
    # Example host and port; replace them with your cluster's address.
    client = Client(
        host='192.168.64.3',
        port=80,
        auth_token=os.environ['PACH_TOKEN'],
    )
- Assemble the fileset.
    client.storage.assemble_fileset(
        fileset_id,
        path=datum_path,
        cache_location=CACHE_LOCATION,
        destination="/pfs/out/",
        fetch_missing_chunks=True,
    )
- Create an input spec with the following details:
    pfs:
      name: default_cdrs-demo-input_master
      repo: cdrs-demo-input
      glob: /*
      empty_files: true # required
- Create and run the pipeline with the specified input spec and the notebook you created.
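The cache location and environment variable steps in the walkthrough can be made a little more defensive. The sketch below is illustrative only and uses no Pachyderm calls; the helper names resolve_cache_location and read_worker_settings are hypothetical. It relies on one documented behavior of os.path.join: when a later component is an absolute path, the earlier components are discarded, so a cache directory name with a leading slash does not end up under the working directory. It also assumes that failing early with a clear message is preferable to a KeyError when FILESET_ID or PACH_DATUM_ID is not set.

```python
import os


def resolve_cache_location(base_dir, name="cache"):
    """Return a cache path located under base_dir.

    Leading separators are stripped from name so that os.path.join
    keeps base_dir instead of discarding it.
    """
    return os.path.join(base_dir, name.lstrip("/\\"))


def read_worker_settings(environ):
    """Collect the fileset ID and datum path from a mapping like os.environ.

    Raises RuntimeError naming every required variable that is missing
    or empty.
    """
    required = ("FILESET_ID", "PACH_DATUM_ID")
    missing = [key for key in required if not environ.get(key)]
    if missing:
        raise RuntimeError(
            "missing environment variables: " + ", ".join(missing)
        )
    return {
        "fileset_id": environ["FILESET_ID"],
        "datum_path": f"/pfs/{environ['PACH_DATUM_ID']}",
    }


if __name__ == "__main__":
    print(resolve_cache_location(os.getcwd()))
    try:
        print(read_worker_settings(os.environ))
    except RuntimeError as err:
        # Outside a pipeline worker these variables are normally unset.
        print(err)
```

Inside a notebook, the returned values could be passed to assemble_fileset in place of the module-level variables; outside a pipeline worker, the script reports which variables are missing.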