Experiment Tracker

Experiment tracker module.

Authors

Apri Dwi Rachmadi (apri.d.rachmadi@gdplabs.id)

References

NONE

BaseExperimentTracker(project_name, **kwargs)

Bases: ABC

Base class for all experiment trackers.

This class defines the core interface for experiment tracking across different backends. It provides methods for logging individual results and batch results using the observability pattern.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `project_name` | `str` | The name of the project. |

Initialize the experiment tracker.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `project_name` | `str` | The name of the project. | *required* |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
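
To make the interface concrete, here is a minimal sketch of a custom backend that implements the abstract methods. The import paths, the `InMemoryExperimentTracker` name, and the stored record shape are assumptions for illustration; only the method signatures come from this reference.

```python
# Minimal sketch of a custom tracker backend; module paths below are assumptions.
import uuid
from typing import Any, Dict, List

from gllm_evals.experiment_tracker import BaseExperimentTracker  # assumed path


class InMemoryExperimentTracker(BaseExperimentTracker):
    """Keeps every logged result in a list; useful for unit tests."""

    def __init__(self, project_name: str, **kwargs: Any) -> None:
        super().__init__(project_name, **kwargs)
        self.records: List[Dict[str, Any]] = []

    def log(self, evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs):
        run_id = run_id or str(uuid.uuid4())  # auto-generate a run ID when none is given
        self.records.append(
            {
                "run_id": run_id,
                "dataset_name": dataset_name,
                "result": evaluation_result,
                "data": data,
                "metadata": metadata or {},
            }
        )

    async def alog(self, evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs):
        # Delegate to the synchronous path; a real backend would await I/O here.
        self.log(evaluation_result, dataset_name, data, run_id=run_id, metadata=metadata, **kwargs)

    async def log_batch(self, evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs):
        run_id = run_id or str(uuid.uuid4())  # share one run ID across the batch
        for result, item in zip(evaluation_results, data):
            await self.alog(result, dataset_name, item, run_id=run_id, metadata=metadata, **kwargs)
```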

alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs) abstractmethod async

Log a single evaluation result (asynchronous).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `evaluation_result` | `EvaluationOutput` | The evaluation result to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `MetricInput` | The input data that was evaluated. | *required* |
| `run_id` | `Optional[str]` | ID of the experiment run. Can be auto-generated if None. | `None` |
| `metadata` | `Optional[Dict[str, Any]]` | Additional metadata to log. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
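
A usage sketch for awaiting `alog` inside an evaluation loop. The tracker instance, dataset name, and metadata values are illustrative placeholders; only the call signature documented above is taken from this reference.

```python
async def record_result(tracker, evaluation_result, sample) -> None:
    # `tracker` is any BaseExperimentTracker implementation; the dataset name,
    # run_id, and metadata values here are illustrative.
    await tracker.alog(
        evaluation_result,
        dataset_name="qa-benchmark",
        data=sample,
        run_id=None,  # let the tracker auto-generate a run ID
        metadata={"model": "my-model-v1"},
    )

# await record_result(tracker, evaluation_result, sample)  # from inside an event loop
```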

get_experiment_history(**kwargs)

Get all experiment runs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `List[Dict[str, Any]]` | List of experiment runs. |

get_run_results(run_id, **kwargs)

Get detailed results for a specific run.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `run_id` | `str` | ID of the experiment run. | *required* |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `list[dict[str, Any]]` | Detailed run results. |

log(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs) abstractmethod

Log a single evaluation result (synchronous).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `evaluation_result` | `EvaluationOutput` | The evaluation result to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `MetricInput` | The input data that was evaluated. | *required* |
| `run_id` | `Optional[str]` | ID of the experiment run. Can be auto-generated if None. | `None` |
| `metadata` | `Optional[Dict[str, Any]]` | Additional metadata to log. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
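
The synchronous variant mirrors `alog` without an event loop. A minimal sketch (the argument values are illustrative):

```python
def record_result_sync(tracker, evaluation_result, sample) -> None:
    # Same arguments as `alog`, but callable from plain synchronous code.
    tracker.log(
        evaluation_result,
        dataset_name="qa-benchmark",
        data=sample,
        metadata={"model": "my-model-v1"},
    )
```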

log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs) abstractmethod async

Log a batch of evaluation results (asynchronous).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `evaluation_results` | `List[EvaluationOutput]` | The evaluation results to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `List[MetricInput]` | List of input data that was evaluated. | *required* |
| `run_id` | `Optional[str]` | ID of the experiment run. Can be auto-generated if None. | `None` |
| `metadata` | `Optional[Dict[str, Any]]` | Additional metadata to log. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
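
A sketch of batch logging. The run ID and metadata are illustrative, and the assumption that the two lists are aligned by index is an inference from the parameter descriptions rather than something this reference states.

```python
async def record_batch(tracker, evaluation_results, samples) -> None:
    # `evaluation_results` and `samples` are presumably aligned by index.
    await tracker.log_batch(
        evaluation_results,
        dataset_name="qa-benchmark",
        data=samples,
        run_id="nightly-2024-01-01",  # illustrative run ID; pass None to auto-generate
        metadata={"suite": "regression"},
    )
```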

log_context(**kwargs) async

Log a context.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |

SimpleExperimentTracker(project_name, output_dir='./gllm_evals/experiments', score_key='score')

Bases: BaseExperimentTracker

Simple file-based experiment tracker for development and testing.

This class provides a simple local storage implementation for experiment tracking. It stores experiment data in CSV format with two files: experiment_results.csv and leaderboard.csv.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `project_name` | `str` | The name of the project. |
| `output_dir` | `Path` | Directory to store experiment results. |
| `experiment_results_file` | `Path` | CSV file for experiment results. |
| `leaderboard_file` | `Path` | CSV file for leaderboard data. |

Initialize simple tracker with project name and output directory.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `project_name` | `str` | The name of the project. | *required* |
| `output_dir` | `str` | Directory to store experiment results. | `'./gllm_evals/experiments'` |
| `score_key` | `Union[str, List[str]]` | The key to extract scores from evaluation results. If `str`: direct key access (e.g., `"score"`) or dot notation (e.g., `"metrics.accuracy"`). If `List[str]`: nested key path (e.g., `["generation", "score"]`). Defaults to `"score"`. | `'score'` |
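
The `score_key` parameter accepts the three forms described above. A sketch of each; the import path and the result shapes in the comments are assumptions for illustration.

```python
from gllm_evals.experiment_tracker import SimpleExperimentTracker  # assumed path

# Direct key: expects results shaped like {"score": 0.87}
tracker = SimpleExperimentTracker("my-project", score_key="score")

# Dot notation: expects results shaped like {"metrics": {"accuracy": 0.91}}
tracker = SimpleExperimentTracker("my-project", score_key="metrics.accuracy")

# Nested key path as a list: expects results shaped like {"generation": {"score": 0.78}}
tracker = SimpleExperimentTracker(
    "my-project",
    output_dir="./gllm_evals/experiments",  # default location for the two CSV files
    score_key=["generation", "score"],
)
```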

alog(evaluation_result, dataset_name, data, run_id=None, metadata=None) async

Log a single evaluation result (asynchronous).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `evaluation_result` | `EvaluationOutput` | The evaluation result to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `MetricInput` | The input data that was evaluated. | *required* |
| `run_id` | `Optional[str]` | ID of the experiment run. Can be auto-generated if None. | `None` |
| `metadata` | `Optional[Dict[str, Any]]` | Additional metadata to log. | `None` |

get_experiment_history()

Get all experiment runs from leaderboard.

Returns:

| Type | Description |
| --- | --- |
| `List[Dict[str, Any]]` | List of experiment runs. |

get_run_results(run_id)

Get detailed results for a specific run.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `run_id` | `str` | ID of the experiment run. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `list[dict[str, Any]]` | Detailed run results. |

log(evaluation_result, dataset_name, data, run_id=None, metadata=None)

Log a single evaluation result.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `evaluation_result` | `EvaluationOutput` | The evaluation result to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `MetricInput` | The input data that was evaluated. | *required* |
| `run_id` | `Optional[str]` | ID of the experiment run. Can be auto-generated if None. | `None` |
| `metadata` | `Optional[Dict[str, Any]]` | Additional metadata to log. | `None` |

log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None) async

Log a batch of evaluation results (asynchronous).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `evaluation_results` | `List[EvaluationOutput]` | The evaluation results to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `List[MetricInput]` | List of input data that was evaluated. | *required* |
| `run_id` | `Optional[str]` | ID of the experiment run. Can be auto-generated if None. | `None` |
| `metadata` | `Optional[Dict[str, Any]]` | Additional metadata to log. | `None` |
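
Putting the pieces together, an end-to-end sketch: log a batch with SimpleExperimentTracker, then read the leaderboard and per-run results back. Only the method names and signatures come from this reference; the import path, the shapes of the inputs, and the returned dictionary keys are assumptions.

```python
import asyncio

from gllm_evals.experiment_tracker import SimpleExperimentTracker  # assumed path


async def main() -> None:
    tracker = SimpleExperimentTracker("my-project", score_key="score")

    # Illustrative inputs; the MetricInput and EvaluationOutput shapes are assumptions.
    samples = [{"question": "What is 2 + 2?"}, {"question": "Capital of France?"}]
    results = [{"score": 1.0}, {"score": 0.0}]

    await tracker.log_batch(results, dataset_name="toy-dataset", data=samples, run_id="run-001")

    # Leaderboard view: one entry per experiment run.
    for run in tracker.get_experiment_history():
        print(run)

    # Detailed per-sample rows for a single run.
    for row in tracker.get_run_results("run-001"):
        print(row)


asyncio.run(main())
```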