Data Caching (CDRs)

Pachyderm’s Common Data Refs (CDRs) feature optimizes the handling of large, remote datasets in pipeline operations. Instead of downloading entire datasets for each pipeline run, CDRs cache data locally so that repeated runs can reuse it. This approach offers several benefits:

Reduces time spent on data transfer
Minimizes network usage
Enables faster pipeline execution
Allows for efficient handling of version-controlled data

With CDRs, Pachyderm balances the need for up-to-date data against the performance advantages of local data access, which makes the feature well suited to workflows involving large, frequently used datasets.

Before You Start

Usage of the Common Data Refs (CDRs) feature requires the following:

You must use Pachyderm version 2.11.0+
You must install the cdr extras package for the Pachyderm SDK: pachyderm_sdk[cdr], version 2.11.0 or later.
You must be using a storage backend that is S3-compatible.
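
The SDK version requirement above can be checked from Python before building a pipeline. The following is a minimal sketch, not part of the Pachyderm SDK: `parse_version` and `sdk_meets_minimum` are hypothetical helper names, and the lookup assumes the distribution is registered under the name `pachyderm_sdk`. It checks only the SDK package, not the Pachyderm cluster version.

```python
import re
from importlib import metadata

MIN_SDK_VERSION = (2, 11, 0)  # minimum version stated in the requirements above


def parse_version(text):
    """Turn a string such as '2.12.0' into a tuple such as (2, 12, 0)."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def sdk_meets_minimum(installed=None):
    """Return True if the given (or installed) SDK version is new enough."""
    if installed is None:
        try:
            # Assumption: the package is installed as "pachyderm_sdk".
            installed = metadata.version("pachyderm_sdk")
        except metadata.PackageNotFoundError:
            return False
    return parse_version(installed) >= MIN_SDK_VERSION
```

Calling `sdk_meets_minimum()` with no argument inspects the current environment; passing a version string makes the comparison easy to try without the SDK installed.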

How to Cache Data via Common Data Refs (CDRs)

The following high-level walkthrough uses the JupyterLab Extension to create a pipeline and define your user code.

Create an input repo with your files. For example, default/cdrs-demo-input.
Add pachyderm_sdk[cdr]==2.12.0 to your requirements.txt file.
Create a notebook. For example, notebook.ipynb.

Add the following imports and define a cache location.

import os
from pachyderm_sdk import Client
from pachyderm_sdk.api import pfs, storage

# Use a relative final component so the cache lives under the working directory.
CACHE_LOCATION = os.path.join(os.getcwd(), "cache")
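
Where the cache directory ends up depends on whether the last component passed to `os.path.join` is relative or absolute. The short sketch below shows the difference using `posixpath`, which gives POSIX path semantics on any platform; the `/workspace` base directory is a hypothetical stand-in for the working directory.

```python
import posixpath

base = "/workspace"  # hypothetical working directory

# An absolute component discards everything that came before it.
absolute = posixpath.join(base, "/cache")

# A relative component is appended to the base directory.
relative = posixpath.join(base, "cache")

print(absolute)  # /cache
print(relative)  # /workspace/cache
```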

Read the pipeline worker environment variables FILESET_ID and PACH_DATUM_ID, which are needed to assemble the fileset.
```
fileset_id = os.environ['FILESET_ID']
datum_path = f"/pfs/{os.environ['PACH_DATUM_ID']}"
```
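
Reading `os.environ[...]` directly raises a bare `KeyError` when the code runs outside a pipeline worker, for example while testing the notebook locally. A minimal sketch of a more descriptive lookup follows; `require_env` is a hypothetical helper, not part of the Pachyderm SDK.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or fail with a clear message."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; this code expects to run inside a pipeline worker"
        )
    return value
```

Inside a worker, `require_env("FILESET_ID")` and `require_env("PACH_DATUM_ID")` would stand in for the direct lookups above; the optional `env` argument accepts any mapping, which is convenient for local testing.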

Initialize the Pachyderm client.

client = Client(
    host='192.168.64.3',
    port=80,
    auth_token=os.environ['PACH_TOKEN'],
)

Assemble the fileset.

client.storage.assemble_fileset(
    fileset_id,
    path=datum_path,
    cache_location=CACHE_LOCATION,
    destination="/pfs/out/",
    fetch_missing_chunks=True,
)
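
To build intuition for what a local cache combined with fetching missing chunks buys you, here is a generic, self-contained sketch of a cache-then-fetch lookup. It is an illustration only and does not reflect Pachyderm's actual implementation: `ChunkCache`, the SHA-256 naming scheme, and the in-memory `remote` store are all assumptions made for the example.

```python
import hashlib
import os
import tempfile


class ChunkCache:
    """Toy content-addressed cache: one file per chunk, named by its ID."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)
        self.fetch_count = 0  # number of times the remote store was contacted

    def get(self, chunk_id, fetch):
        path = os.path.join(self.root, chunk_id)
        if os.path.exists(path):
            # Cache hit: read from local disk, no network transfer.
            with open(path, "rb") as fh:
                return fh.read()
        # Cache miss: fetch once, then store locally for later runs.
        data = fetch(chunk_id)
        self.fetch_count += 1
        with open(path, "wb") as fh:
            fh.write(data)
        return data


# Hypothetical remote store keyed by the SHA-256 of each chunk.
remote = {hashlib.sha256(b"hello").hexdigest(): b"hello"}
cache = ChunkCache(tempfile.mkdtemp())
chunk_id = next(iter(remote))

first = cache.get(chunk_id, remote.__getitem__)   # miss: fetched from remote
second = cache.get(chunk_id, remote.__getitem__)  # hit: served from disk
print(cache.fetch_count)  # 1
```

The second lookup never touches the remote store, which is the effect the cache location is meant to provide across pipeline runs.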

Create an input spec with the following details:

pfs:
   name: default_cdrs-demo-input_master
   repo: cdrs-demo-input
   glob: /*
   empty_files: true # required
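
Because the spec above marks `empty_files` as required, a quick check before creating the pipeline can catch a spec that omits it. The sketch below mirrors the input spec as a plain Python dictionary; `empty_files_enabled` is a hypothetical helper, not an SDK function.

```python
def empty_files_enabled(pfs_input):
    """Return True only if the input spec explicitly sets empty_files to true."""
    return pfs_input.get("empty_files") is True


# The same values as the input spec above, expressed as a dictionary.
pfs_input = {
    "name": "default_cdrs-demo-input_master",
    "repo": "cdrs-demo-input",
    "glob": "/*",
    "empty_files": True,  # required
}

print(empty_files_enabled(pfs_input))  # True
```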

Create and run the pipeline with the specified input spec and the notebook you created.
