Data loader
Base classes and utilities for data sources.
This module provides the foundational components for data loading from various sources. It contains abstract base classes, common exceptions, and utility functions that are shared across different data loader implementations.
Reviewer
- Muhammad Afif Al Hawari (muhammad.a.a.hawari@gdplabs.id)
BaseDataLoader
Bases: Component, ABC
Base class for dataset loaders.
This class defines the common interface that all data sources must implement. Each concrete data source should implement loading from its own source (CSV files, Google Sheets, etc.).
The interface is designed to be consistent across different data source types, making it easy to switch between sources, add new ones, and maintain the factory pattern architecture.
The class uses a lazy initialization pattern: configuration is provided at load time rather than at initialization time, which keeps instantiation lightweight and lets the configuration change between loads.
All data sources must implement:
1. Lightweight initialization without configuration
2. A load method that accepts configuration and returns data
3. Proper error handling for configuration and connection issues
4. Optional caching mechanisms for performance optimization
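The requirements above can be illustrated with a minimal sketch. It does not import the real package: `SketchBaseLoader`, `SketchConfig`, and `InMemoryLoader` are hypothetical stand-ins for `BaseDataLoader`, `ExperimentConfig`, and a concrete loader, and plain lists stand in for the pandas DataFrames that the real interface returns. The default of 300 seconds mirrors the documented 5-minute `CACHE_TIMEOUT`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for ExperimentConfig."""
    rows: list  # records to serve, in place of a real source location


class SketchBaseLoader(ABC):
    """Hypothetical stand-in for BaseDataLoader (the real class also extends Component)."""

    @abstractmethod
    def load(self, experiment_args, cache_timeout=300):
        """Return (training, validation, prompt) data."""


class InMemoryLoader(SketchBaseLoader):
    """Hypothetical concrete loader used only for illustration."""

    def __init__(self):
        # Lightweight: no configuration and no I/O at construction time.
        pass

    def load(self, experiment_args, cache_timeout=300):
        # Configuration arrives here, at load time (lazy initialization).
        rows = experiment_args.rows
        split = len(rows) // 2
        # Plain lists stand in for the three pandas DataFrames.
        return rows[:split], rows[split:], []


loader = InMemoryLoader()
config = SketchConfig(rows=[{"x": 1}, {"x": 2}, {"x": 3}, {"x": 4}])
train, val, prompt = loader.load(config)
```

Because the loader holds no configuration, the same instance can be reused with a different `SketchConfig` on the next call.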
load(experiment_args, cache_timeout=GeneralConstants.CACHE_TIMEOUT)
abstractmethod
Load the dataset and return it as a pandas DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| experiment_args | ExperimentConfig | The experiment arguments containing data source configuration. | required |
| cache_timeout | Optional[int] | The time in seconds for which data should be cached. Defaults to CACHE_TIMEOUT (5 minutes). Set to None or 0 to disable caching. | CACHE_TIMEOUT |
Returns:
| Type | Description |
|---|---|
| tuple[DataFrame, DataFrame, DataFrame] | Training, validation, and prompt data. |
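One way the `cache_timeout` semantics could be honored is sketched below. This is an assumption about how an implementation might behave, not the library's actual caching code: `load_with_cache`, `_cache`, and the injectable `now` clock are hypothetical names introduced for illustration.

```python
import time

# Hypothetical module-level cache: key -> (stored_at, value)
_cache = {}


def load_with_cache(key, fetch, cache_timeout=300, now=time.monotonic):
    """Return fetch() output, reusing a cached value while it is still fresh."""
    if not cache_timeout:
        # None or 0 disables caching: always fetch, never store.
        return fetch()
    entry = _cache.get(key)
    if entry is not None and now() - entry[0] < cache_timeout:
        # Cached value is younger than cache_timeout seconds.
        return entry[1]
    value = fetch()
    _cache[key] = (now(), value)
    return value
```

Passing the clock as a parameter is a design choice made here so that expiry can be tested without sleeping.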
DataLoaderError
Bases: Exception
Base exception for all data source errors.
This exception serves as the base class for all data source-related errors. It provides a common interface for handling various types of failures that can occur during data source operations such as connection issues, authentication failures, or data retrieval problems.
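A shared base exception lets callers handle every data source failure with a single `except` clause. The sketch below redefines `DataLoaderError` locally so that it runs on its own; `SourceConnectionError`, `SourceAuthError`, `fetch_rows`, and `safe_fetch` are hypothetical names, and the subclasses the library actually ships may differ.

```python
class DataLoaderError(Exception):
    """Local stand-in that mirrors the documented base exception."""


class SourceConnectionError(DataLoaderError):
    """Hypothetical subclass for connection failures."""


class SourceAuthError(DataLoaderError):
    """Hypothetical subclass for authentication failures."""


def fetch_rows(fail_with=None):
    """Simulate a data source call that may raise a loader error."""
    if fail_with is not None:
        raise fail_with("simulated failure")
    return [{"x": 1}]


def safe_fetch(fail_with=None):
    """Catch any loader error through the shared base class."""
    try:
        return fetch_rows(fail_with)
    except DataLoaderError as exc:
        return f"{type(exc).__name__}: {exc}"
```

Callers that need finer control can still catch a specific subclass before falling back to the base class.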