
Dataset adapter

Dataset Adapter Module.

This module detects the dataset type from a dataset string and creates the matching dataset instance.

Authors

Christina Alexandra (christina.alexandra@gdplabs.id)

DatasetConfig(type, prefix=None, extensions=None, required_params=None, factory_func=None, name_extractor=None) dataclass

Configuration for dataset detection and creation.

DatasetRegistry()

Registry for dataset configurations.

Initialize the dataset registry. The constructor takes no arguments and sets up the following internal attributes.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| _configs | Dict[DatasetType, DatasetConfig] | Maps each dataset type to its configuration. |
| _prefix_map | Dict[str, DatasetType] | Maps each dataset string prefix to its dataset type. |
| _extension_map | Dict[str, DatasetType] | Maps each dataset file extension to its dataset type. |

detect_type(dataset)

Detect the dataset type from the dataset string.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataset | str | The dataset string to detect. | required |

Returns:

| Type | Description |
| --- | --- |
| Optional[DatasetType] | The detected dataset type, or None if not supported. |
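
As a rough illustration of how a prefix map and an extension map can drive detection, the sketch below is a minimal, self-contained stand-in and not the library's implementation. The enum member names, the `.jsonl` and `.csv` extensions, the prefix-before-extension precedence, and the `detect_type_sketch` helper are all assumptions; only the `hf/`, `langfuse/`, and `gs/` prefixes come from this reference.

```python
from enum import Enum
from typing import Dict, Optional


class DatasetType(Enum):
    # Hypothetical member names; the real enum's members are not listed here.
    HUGGINGFACE = "huggingface"
    LANGFUSE = "langfuse"
    GOOGLE_SHEETS = "google_sheets"
    JSONL = "jsonl"
    CSV = "csv"


# Prefixes documented for detect_dataset.
PREFIX_MAP: Dict[str, DatasetType] = {
    "hf/": DatasetType.HUGGINGFACE,
    "langfuse/": DatasetType.LANGFUSE,
    "gs/": DatasetType.GOOGLE_SHEETS,
}

# Assumed file extensions for the two file-based dataset types.
EXTENSION_MAP: Dict[str, DatasetType] = {
    ".jsonl": DatasetType.JSONL,
    ".csv": DatasetType.CSV,
}


def detect_type_sketch(dataset: str) -> Optional[DatasetType]:
    """Return the matching type, or None when nothing matches."""
    # Assumption: prefixes are checked before file extensions.
    for prefix, dataset_type in PREFIX_MAP.items():
        if dataset.startswith(prefix):
            return dataset_type
    lowered = dataset.lower()
    for extension, dataset_type in EXTENSION_MAP.items():
        if lowered.endswith(extension):
            return dataset_type
    return None
```

Returning None instead of raising keeps the "is this supported?" decision with the caller, which matches the Optional return type documented for this method.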

get_config(dataset_type)

Get configuration for a dataset type.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataset_type | DatasetType | The dataset type to get the configuration for. | required |

Returns:

| Type | Description |
| --- | --- |
| Optional[DatasetConfig] | The configuration for the dataset type, or None if not found. |

register_config(config)

Register a dataset configuration.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| config | DatasetConfig | The dataset configuration to register. | required |
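
To make the relationship between a configuration and the registry's three dictionaries concrete, here is a minimal sketch of what registration might look like. It is a stand-in written for this page, not the library's code: `RegistrySketch`, the enum members, and the field types of the stand-in `DatasetConfig` are assumptions; only the field names and the attribute names come from this reference.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional


class DatasetType(Enum):
    # Hypothetical members for illustration only.
    HUGGINGFACE = "huggingface"
    CSV = "csv"


@dataclass
class DatasetConfig:
    # Field names follow the documented signature; the types are assumptions.
    type: DatasetType
    prefix: Optional[str] = None
    extensions: Optional[List[str]] = None
    required_params: Optional[List[str]] = None
    factory_func: Optional[Callable] = None
    name_extractor: Optional[Callable] = None


class RegistrySketch:
    """Stand-in registry that indexes configs by type, prefix, and extension."""

    def __init__(self) -> None:
        self._configs: Dict[DatasetType, DatasetConfig] = {}
        self._prefix_map: Dict[str, DatasetType] = {}
        self._extension_map: Dict[str, DatasetType] = {}

    def register_config(self, config: DatasetConfig) -> None:
        # Store the config, then index its prefix and extensions for lookup.
        self._configs[config.type] = config
        if config.prefix:
            self._prefix_map[config.prefix] = config.type
        for extension in config.extensions or []:
            self._extension_map[extension] = config.type

    def get_config(self, dataset_type: DatasetType) -> Optional[DatasetConfig]:
        return self._configs.get(dataset_type)


registry = RegistrySketch()
registry.register_config(DatasetConfig(type=DatasetType.HUGGINGFACE, prefix="hf/"))
registry.register_config(DatasetConfig(type=DatasetType.CSV, extensions=[".csv"]))
```

In this sketch, registering a second configuration for the same type overwrites the first; whether the real registry behaves the same way is not stated in this reference.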

DatasetType

Bases: Enum

Enumeration of supported dataset types.

create_dataset(dataset, dataset_type, **kwargs)

Create a dataset instance based on the detected type.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataset | str | The dataset string. | required |
| dataset_type | DatasetType | The detected dataset type. | required |
| **kwargs | Any | Additional arguments for dataset creation. | {} |

Returns:

| Type | Description |
| --- | --- |
| Tuple[BaseDataset, str] | The created dataset and its name. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the dataset type is not supported or required parameters are missing. |
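
The two documented failure modes (unsupported type, missing required parameters) can be sketched as below. This is an illustrative stand-in under stated assumptions, not the library's implementation: dataset types are plain strings, `_fake_factory` returns a dict instead of a BaseDataset, and the dataset string itself is used as the name. The required parameter names are the ones listed for detect_dataset.

```python
from typing import Any, Callable, Dict, List, Tuple

# Required parameters per type, as listed for detect_dataset.
# The string keys are hypothetical stand-ins for DatasetType members.
REQUIRED_PARAMS: Dict[str, List[str]] = {
    "google_sheets": ["sheet_id", "client_email", "private_key"],
    "langfuse": ["langfuse_client"],
    "csv": [],
}


def _fake_factory(dataset: str, **kwargs: Any) -> Dict[str, Any]:
    # Hypothetical factory: a real one would build a BaseDataset subclass.
    return {"source": dataset, "options": kwargs}


FACTORIES: Dict[str, Callable[..., Any]] = {
    key: _fake_factory for key in REQUIRED_PARAMS
}


def create_dataset_sketch(
    dataset: str, dataset_type: str, **kwargs: Any
) -> Tuple[Any, str]:
    """Validate inputs, then build the dataset and return it with a name."""
    if dataset_type not in FACTORIES:
        raise ValueError(f"Unsupported dataset type: {dataset_type}")
    missing = [p for p in REQUIRED_PARAMS[dataset_type] if p not in kwargs]
    if missing:
        raise ValueError(
            f"Missing required parameters for {dataset_type}: {', '.join(missing)}"
        )
    instance = FACTORIES[dataset_type](dataset, **kwargs)
    # Assumption: the dataset string doubles as the name in this sketch.
    return instance, dataset
```

Validating before calling the factory means a caller sees one ValueError naming every missing parameter, rather than a failure deep inside dataset construction.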

detect_dataset(dataset, **kwargs)

Detect the dataset type and create an instance.

Supported dataset string formats:
  • hf/ prefix (Hugging Face dataset)
  • langfuse/ prefix (Langfuse dataset). Required parameters: langfuse_client
  • gs/ prefix (Google Sheets dataset). Required parameters: sheet_id, client_email, private_key
  • JSONL file, matched by file extension
  • CSV file, matched by file extension

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataset | str \| BaseDataset | The dataset to detect. | required |
| **kwargs | Any | Additional arguments to pass to the dataset constructor. | {} |

Returns:

| Type | Description |
| --- | --- |
| BaseDataset | The detected dataset. |
| str | The dataset name. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the dataset is not supported. |

detect_dataset_type(dataset)

Detect the dataset type from the dataset string.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataset | str | The dataset string to detect. | required |

Returns:

| Type | Description |
| --- | --- |
| Optional[DatasetType] | The detected dataset type, or None if not supported. |

register_dataset_type(config)

Register a new dataset type configuration.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| config | DatasetConfig | The configuration for the new dataset type. | required |