Data Loader

Module for data loading and handling.

This module provides a registry for managing different data source loaders and utility functions for accessing them.

Authors
  • Alfan Dinda Rahmawan (alfan.d.rahmawan@gdplabs.id)
Reviewer
  • Muhammad Afif Al Hawari (muhammad.a.a.hawari@gdplabs.id)
References

NONE

DataLoaderRegistry

Registry for available data loaders.

get_loader(name) classmethod

Get an instantiated data loader by name.

Parameters:

Name | Type | Description | Default
name | str | Name of the data loader | required

Returns:

Type | Description
BaseDataLoader | An instance of the data loader

Raises:

Type | Description
ValueError | If the loader name is not recognized

list_available_loaders() classmethod

List all registered data loader names.

Returns:

Type | Description
list[str] | A list of data loader names.

register(name, loader_class) classmethod

Register a data loader class.

Parameters:

Name | Type | Description | Default
name | str | Identifier for the loader type | required
loader_class | Type[BaseDataLoader] | The data loader class | required
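
The three class methods above follow a common registry pattern. The following is a minimal, self-contained sketch of how such a registry could behave; it is not the module's actual implementation. The `BaseDataLoader` stand-in, its `load` method, the `_loaders` dictionary, and `InMemoryLoader` are all assumptions made for illustration. Only the method names, their parameters, and the `ValueError` on an unknown name come from the documentation above.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class BaseDataLoader(ABC):
    """Stand-in for the real base class; its actual interface is not shown here."""

    @abstractmethod
    def load(self) -> list:
        """Hypothetical method, present only so the sketch has something to call."""


class DataLoaderRegistry:
    """Minimal registry that maps a loader name to a loader class."""

    # Assumed storage: a class-level dict shared by all callers.
    _loaders: Dict[str, Type[BaseDataLoader]] = {}

    @classmethod
    def register(cls, name: str, loader_class: Type[BaseDataLoader]) -> None:
        # Store the class itself; instantiation is deferred to get_loader.
        cls._loaders[name] = loader_class

    @classmethod
    def list_available_loaders(cls) -> List[str]:
        return list(cls._loaders)

    @classmethod
    def get_loader(cls, name: str) -> BaseDataLoader:
        try:
            loader_class = cls._loaders[name]
        except KeyError:
            # Documented behavior: unknown names raise ValueError.
            raise ValueError(f"Unknown data loader: {name!r}") from None
        return loader_class()


class InMemoryLoader(BaseDataLoader):
    """Hypothetical loader that returns a fixed list of rows."""

    def load(self) -> list:
        return [{"text": "hello"}]


DataLoaderRegistry.register("in_memory", InMemoryLoader)
loader = DataLoaderRegistry.get_loader("in_memory")
rows = loader.load()
```

In this sketch, registering the class rather than an instance means each `get_loader` call returns a fresh object. Whether the real registry does the same is not stated in the documentation.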

get_data_loader(source_type)

Get a data loader instance by source type.

Parameters:

Name | Type | Description | Default
source_type | str | The type of data source | required

Returns:

Type | Description
BaseDataLoader | A data loader instance

load_data(experiment_args, cache_timeout=GeneralConstants.CACHE_TIMEOUT)

Load data based on experiment arguments.

This function automatically determines the appropriate data loader to use based on the provided experiment arguments and returns the training, validation, and prompt datasets.

Parameters:

Name | Type | Description | Default
experiment_args | ExperimentConfig | The experiment configuration arguments | required
cache_timeout | Optional[int] | The time in seconds for which data should be cached. Defaults to CACHE_TIMEOUT (5 minutes). Set to None or 0 to disable caching. | CACHE_TIMEOUT

Returns:

Type | Description
tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame] | A tuple of (training dataset, validation dataset, prompt dataset). The training and prompt datasets are required. The validation dataset is optional and is an empty DataFrame if not provided.

Raises:

Type | Description
DataSourceNotProvidedError | If no valid data source is specified