Datasets

Before You Start


Create/Delete Functions:
    lore.construct_dataset_from_hf
    lore.construct_dataset_from_local
    lore.commit_dataset
    lore.delete_dataset

Get Functions:
    lore.get_datasets
    lore.get_dataset
    lore.get_dataset_by_name

Write & Misc Functions:
    lore.sample_dataset
    lore.merge_and_resplit
    lore.concatenate_datasets
    lore.compute_metrics
    lore.materialize_dataset
    lore.apply_prompt_template

Create a Dataset

Via Hugging Face

The following code checks if a dataset exists in the current workspace. If it does not, it constructs a dataset from the Hugging Face Hub and commits it to the current workspace.

GENAI_DATASET_NAME = "customers-complaints"
HF_DATASET_FULLNAME = "hpe-ai/customer-complaints"

try:
    dataset = lore.get_dataset_by_name(GENAI_DATASET_NAME)
except Exception:
    # Dataset not found in the workspace: build it from the HF Hub and commit it
    dataset = lore.construct_dataset_from_hf(dataset_name=HF_DATASET_FULLNAME)
    dataset.name = GENAI_DATASET_NAME
    dataset = lore.commit_dataset(dataset)

print(
    f"Dataset: {dataset.name}\n"
    f"  -> examples={dataset.row_count}\n"
    f"  -> registered_in_genai={dataset.registered_in_genai}"
)
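This get-or-create pattern recurs throughout this page: look the dataset up by name, and fall back to constructing and committing it if the lookup fails. If you use it often, a small generic helper can keep the logic in one place. The helper below is not part of the lore API; `fetch` and `build` are stand-in callables you supply.

```python
def get_or_create_dataset(name, fetch, build):
    """Return the dataset called `name` via fetch(name); if that raises,
    fall back to build(name) to construct (and commit) it instead."""
    try:
        return fetch(name)
    except Exception:
        return build(name)
```

With lore, `fetch` would typically be `lore.get_dataset_by_name`, and `build` a function that constructs the dataset, sets its name, and commits it.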

Via CSV

The following code checks if a dataset exists in the current workspace. If it does not, it constructs a dataset from a local CSV file and commits it to the current workspace.

import pandas as pd
from datasets import Dataset, DatasetDict

GENAI_DATASET_NAME = "medical_specialty"

try:
    genai_dataset = lore.get_dataset_by_name(GENAI_DATASET_NAME)
except Exception:
    # Load the local CSV into a pandas DataFrame (replace the path with your own file)
    df = pd.read_csv("path/to/your_data.csv")
    hf_dataset = DatasetDict({"train": Dataset.from_pandas(df)})
    genai_dataset = lore.construct_dataset_from_local(hf_dataset, dataset_name=GENAI_DATASET_NAME)
    # Resplit the single "train" split into train/validation/test (70/15/15)
    genai_dataset = lore.merge_and_resplit(
        genai_dataset, 0.7, 0.15, 0.15, ["train"], shuffle=True, seed=1234
    )
    genai_dataset = lore.commit_dataset(genai_dataset)

print(
    f"Dataset: {genai_dataset.name}\n"
    f"  -> examples={genai_dataset.row_count}\n"
    f"  -> registered_in_genai={genai_dataset.registered_in_genai}"
)

Explore a Dataset

View Splits and Features

You can explore the dataset and get a list of its features by materializing it and printing the resulting object.

explore = lore.materialize_dataset(genai_dataset)
print(explore)
DatasetDict({
    train: Dataset({
        features: ['Date_received', 'Product', 'Sub_product', 'Issue', 'Sub_issue', 'Consumer_complaint_narrative', 'Company_public_response', 'Company', 'State', 'ZIP_code', 'Tags', 'Consumer_consent_provided?', 'Submitted_via', 'Date_sent_to_company', 'Company response to consumer', 'Timely_response?', 'Consumer_disputed?', 'Complaint_ID'],
        num_rows: 30000
    })
})

View a Row

You can view a given row of the dataset by indexing into the materialized dataset.

explore["train"][0]
{'Date_received': '2023-01-29',
 'Product': 'Credit reporting, credit repair services, or other personal consumer reports',
 'Sub_product': 'Credit reporting',
 'Issue': 'Improper use of your report',
 'Sub_issue': 'Reporting company used your report improperly',
 'Consumer_complaint_narrative': 'In accordance with the Fair Credit Reporting act. The List of accounts below has violated my federally protected consumer rights to privacy and confidentiality under 15 USC 1681. \n\nXXXX : # XXXX, XXXX : # XXXX, XXXX : # XXXX, XXXX XXXX  : # XXXX, XXXX XXXX XXXX XXXX  XXXX # XXXX : XXXX XXXX XXXX XXXX # XXXX : XXXX XXXX XXXX # XXXX : XXXX XXXX XXXX XXXX # XXXX has violated my rights. \n\n15 U.S.C 1681 section 602 A. States I have the right to privacy.\n\n15 U.S.C 1681 Section 604 A Section 2 : It also states a consumer reporting agency can not furnish a account without my written instructions 15 U.S.C 1681c. ( a ) ( 5 ) Section States : no consumer reporting agency may make any consumer report containing any of the following items of information Any other adverse item of information, other than records of convictions of crimes which antedates the report by more than seven years.\n\n15 U.S.C. 1681s-2 ( A ) ( 1 ) A person shall not furnish any information relating to a consumer to any consumer reporting agency if the person knows or has reasonable cause to believe that the information is inaccurate.',
 'Company_public_response': 'Company has responded to the consumer and the CFPB and chooses not to provide a public response',
 'Company': 'Experian Information Solutions Inc.',
 'State': 'TX',
 'ZIP_code': '76002',
 'Tags': None,
 'Consumer_consent_provided?': 'Consent provided',
 'Submitted_via': 'Web',
 'Date_sent_to_company': '2023-01-29',
 'Company response to consumer': 'Closed with non-monetary relief',
 'Timely_response?': 'Yes',
 'Consumer_disputed?': None,
 'Complaint_ID': 6502926}

Split a Dataset

The following code checks if an already split dataset exists in the current workspace. If it does not, it merges and resplits a dataset and commits it to the current workspace.

try:
    dataset = lore.get_dataset_by_name("some-dataset-split")
except Exception:
    dataset = lore.merge_and_resplit(
        dataset=dataset,
        train_ratio=0.8,
        validation_ratio=0.1,
        test_ratio=0.1,
        splits_to_resplit=["train"],
    )
    dataset.name = "some-dataset-split"
    dataset = lore.commit_dataset(dataset)

print(
    f"Dataset: {dataset.name}\n"
    f"  -> examples={dataset.row_count}\n"
    f"  -> registered_in_genai={dataset.registered_in_genai}"
)
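The three ratios passed to lore.merge_and_resplit must partition the data. A quick pre-flight check (this helper is not part of the lore API; it is a hedged convenience sketch) can catch typos before you commit a bad split:

```python
def validate_split_ratios(train_ratio, validation_ratio, test_ratio, tol=1e-9):
    """Raise ValueError unless the ratios are non-negative and sum to 1.0."""
    ratios = (train_ratio, validation_ratio, test_ratio)
    if any(r < 0 for r in ratios):
        raise ValueError(f"ratios must be non-negative, got {ratios}")
    if abs(sum(ratios) - 1.0) > tol:
        raise ValueError(f"ratios must sum to 1.0, got {sum(ratios)}")
    return True
```

Call it with the same values you pass to merge_and_resplit, e.g. `validate_split_ratios(0.8, 0.1, 0.1)`.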

Sample a Dataset

The following code samples a dataset and returns it as a new dataset or as a Pandas DataFrame.

As a Dataset

Set as_dataset=True to return the sample as a new dataset. This is useful when you want to keep working with the sample through other lore functions.

# Specify the dataset and sampling parameters
dataset_identifier = "example_dataset"
start_index = 0
number_of_samples = 100
splits = ["train"]
seed = 42
as_dataset = True

# Sample the dataset
sampled_dataset = lore.sample_dataset(
    dataset=dataset_identifier,
    start_index=start_index,
    number_of_samples=number_of_samples,
    splits=splits,
    seed=seed,
    as_dataset=as_dataset
)

As a Pandas DataFrame

Set as_dataset=False to return the sample as a Pandas DataFrame. This is convenient for quick inspection and analysis with pandas.

train_dataset = lore.sample_dataset(
    dataset, number_of_samples=10, as_dataset=False
)["train"]

test_dataset = lore.sample_dataset(
    dataset, number_of_samples=5, as_dataset=False
)["test"]

validation_dataset = lore.sample_dataset(
    dataset, number_of_samples=5, as_dataset=False
)["validation"]

columns_to_display = [
    "generated_text",
    "expected_output",
]

pretty_print_df(train_dataset.loc[:, columns_to_display].head(5))
pretty_print_df(test_dataset.loc[:, columns_to_display].head(5))
pretty_print_df(validation_dataset.loc[:, columns_to_display].head(5))
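The pretty_print_df helper above is assumed to be defined elsewhere in your environment; it is not part of the lore API. If you need a quick stand-in, a minimal version that prints aligned columns from plain rows (a list of dicts, e.g. `df.to_dict("records")`) might look like this:

```python
def pretty_print_rows(rows, columns):
    """Print selected columns of a list of row dicts as an aligned text
    table; also return the rendered string for convenience."""
    widths = {
        c: max([len(c)] + [len(str(r.get(c, ""))) for r in rows])
        for c in columns
    }
    lines = ["  ".join(c.ljust(widths[c]) for c in columns)]
    lines.append("  ".join("-" * widths[c] for c in columns))
    for r in rows:
        lines.append("  ".join(str(r.get(c, "")).ljust(widths[c]) for c in columns))
    text = "\n".join(lines)
    print(text)
    return text
```

For example, `pretty_print_rows(train_dataset.to_dict("records"), columns_to_display)` renders the same columns selected above.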

Merge Datasets

The following code concatenates two or more datasets and commits the merged dataset to the current workspace.

# Datasets retrieved earlier, e.g. via lore.get_dataset_by_name
datasets_to_concatenate = [dataset_a, dataset_b]
concatenated_dataset_name = "Combined_Dataset"

concatenated_dataset = lore.concatenate_datasets(
    datasets=datasets_to_concatenate,
    name=concatenated_dataset_name,
    register_dataset=True,  # register the merged dataset in the workspace
)

Compute Metrics

The following code computes metrics on a dataset and returns the computed metrics dataset and the metrics results.

ground_truth_column_name = "actual_labels"  # Replace with your actual ground truth column name
predictions_column_name = "predicted_labels"  # Replace with your predictions column name
metrics = ["exact_match", "another_metric"]  # Replace "another_metric" with actual metrics you need
substr_match = True  # True if substring matching is needed, otherwise False
split = "test"  # Specify if you want to compute metrics on a specific split, e.g., "test"
strip = False  # Set to True to strip whitespace from the beginning and end of strings
lower = False  # Set to True to convert strings to lowercase before comparison

computed_metrics_dataset, metrics_results = lore.compute_metrics(
    dataset=dataset,
    ground_truth_column_name=ground_truth_column_name,
    predictions_column_name=predictions_column_name,
    metrics=metrics,
    substr_match=substr_match,
    split=split,
    strip=strip,
    lower=lower,
)
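To make the strip, lower, and substr_match options concrete, here is an illustrative sketch of how an exact-match comparison with these flags typically behaves. This is not lore's internal implementation, only a hedged model of the semantics described above:

```python
def exact_match(prediction: str, ground_truth: str,
                substr_match: bool = False,
                strip: bool = False,
                lower: bool = False) -> bool:
    """Compare one prediction against its ground truth."""
    if strip:
        # Remove leading/trailing whitespace before comparing
        prediction, ground_truth = prediction.strip(), ground_truth.strip()
    if lower:
        # Case-insensitive comparison
        prediction, ground_truth = prediction.lower(), ground_truth.lower()
    if substr_match:
        # Count a match if the ground truth appears anywhere in the prediction
        return ground_truth in prediction
    return prediction == ground_truth
```

For example, with strip=True and lower=True, a prediction of " Yes " matches a ground truth of "yes"; with substr_match=True, "answer: yes" also matches "yes".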