Experiment Tracker
Experiment tracker module.
BaseExperimentTracker(project_name, **kwargs)
Bases: ABC
Base class for all experiment trackers.
This class defines the core interface for experiment tracking across different backends. It provides methods for logging individual results and batches of results, following the observability pattern.
Attributes:

Name | Type | Description |
---|---|---|
project_name | str | The name of the project. |
Initialize the experiment tracker.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
project_name | str | The name of the project. | required |
**kwargs | Any | Additional configuration parameters. | {} |
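The snippet below is a minimal sketch of a concrete tracker built on this interface. Only the class name and method signatures come from this page; the import path `gllm_evals.experiment_tracker` and the console-printing behaviour are assumptions for illustration.

```python
# Hypothetical subclass sketch: the import path is an assumption, and the
# console-printing behaviour is purely illustrative.
from gllm_evals.experiment_tracker import BaseExperimentTracker  # assumed path


class ConsoleExperimentTracker(BaseExperimentTracker):
    """Toy tracker that prints each result instead of persisting it."""

    def log(self, evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs):
        # Synchronous single-result logging.
        print(f"[{run_id or 'auto'}] {dataset_name}: {evaluation_result}")

    async def alog(self, evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs):
        # Asynchronous variant delegates to the synchronous path.
        self.log(evaluation_result, dataset_name, data, run_id=run_id, metadata=metadata, **kwargs)

    async def log_batch(self, evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs):
        # Log each (result, input) pair in the batch individually.
        for result, item in zip(evaluation_results, data):
            await self.alog(result, dataset_name, item, run_id=run_id, metadata=metadata, **kwargs)


tracker = ConsoleExperimentTracker(project_name="demo")
```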
alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs)
abstractmethod
async
Log a single evaluation result (asynchronous).
Parameters:

Name | Type | Description | Default |
---|---|---|---|
evaluation_result | EvaluationOutput | The evaluation result to log. | required |
dataset_name | str | Name of the dataset being evaluated. | required |
data | MetricInput | The input data that was evaluated. | required |
run_id | Optional[str] | ID of the experiment run. Can be auto-generated if None. | None |
metadata | Optional[Dict[str, Any]] | Additional metadata to log. | None |
**kwargs | Any | Additional configuration parameters. | {} |
get_experiment_history(**kwargs)
Get all experiment runs.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
**kwargs | Any | Additional configuration parameters. | {} |

Returns:

Type | Description |
---|---|
List[Dict[str, Any]] | List of experiment runs. |
get_run_results(run_id, **kwargs)
Get detailed results for a specific run.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
run_id | str | ID of the experiment run. | required |
**kwargs | Any | Additional configuration parameters. | {} |

Returns:

Type | Description |
---|---|
list[dict[str, Any]] | Detailed run results. |
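A short sketch of reading experiments back through these two accessors. The import path and the presence of a "run_id" field in each history row are assumptions; the concrete tracker used is the SimpleExperimentTracker documented below.

```python
# Reading history back from a concrete tracker. The import path and the
# "run_id" field name in each history row are assumptions.
from gllm_evals.experiment_tracker import SimpleExperimentTracker  # assumed path

tracker = SimpleExperimentTracker(project_name="demo")
for run in tracker.get_experiment_history():
    run_id = run.get("run_id")                 # assumed leaderboard column
    results = tracker.get_run_results(run_id)
    print(f"run {run_id}: {len(results)} logged result(s)")
```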
log(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs)
abstractmethod
Log a single evaluation result (synchronous).
Parameters:

Name | Type | Description | Default |
---|---|---|---|
evaluation_result | EvaluationOutput | The evaluation result to log. | required |
dataset_name | str | Name of the dataset being evaluated. | required |
data | MetricInput | The input data that was evaluated. | required |
run_id | Optional[str] | ID of the experiment run. Can be auto-generated if None. | None |
metadata | Optional[Dict[str, Any]] | Additional metadata to log. | None |
**kwargs | Any | Additional configuration parameters. | {} |
log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs)
abstractmethod
async
Log a batch of evaluation results (asynchronous).
Parameters:

Name | Type | Description | Default |
---|---|---|---|
evaluation_results | List[EvaluationOutput] | The evaluation results to log. | required |
dataset_name | str | Name of the dataset being evaluated. | required |
data | List[MetricInput] | List of input data that was evaluated. | required |
run_id | Optional[str] | ID of the experiment run. Can be auto-generated if None. | None |
metadata | Optional[Dict[str, Any]] | Additional metadata to log. | None |
**kwargs | Any | Additional configuration parameters. | {} |
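Calling the asynchronous batch API on any concrete tracker might look like the sketch below. Treating EvaluationOutput and MetricInput as plain dicts, and the field values shown, are assumptions for illustration.

```python
# Hedged usage sketch for log_batch on a concrete tracker instance.
# The dict stand-ins for EvaluationOutput / MetricInput are assumptions.
import asyncio


async def log_run(tracker):
    results = [{"score": 0.9}, {"score": 0.4}]      # stand-ins for EvaluationOutput
    inputs = [{"query": "q1"}, {"query": "q2"}]     # stand-ins for MetricInput
    await tracker.log_batch(
        evaluation_results=results,
        dataset_name="faq-v1",
        data=inputs,
        metadata={"model": "gpt-4o-mini"},
    )


# asyncio.run(log_run(tracker))  # with any concrete BaseExperimentTracker
```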
log_context(**kwargs)
async
Log a context.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
**kwargs | Any | Additional configuration parameters. | {} |
SimpleExperimentTracker(project_name, output_dir='./gllm_evals/experiments', score_key='score')
Bases: BaseExperimentTracker
Simple file-based experiment tracker for development and testing.
This class provides a simple local storage implementation for experiment tracking. It stores experiment data in CSV format with two files: experiment_results.csv and leaderboard.csv.
Attributes:

Name | Type | Description |
---|---|---|
project_name | str | The name of the project. |
output_dir | Path | Directory to store experiment results. |
experiment_results_file | Path | CSV file for experiment results. |
leaderboard_file | Path | CSV file for leaderboard data. |
Initialize simple tracker with project name and output directory.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
project_name | str | The name of the project. | required |
output_dir | str | Directory to store experiment results. | './gllm_evals/experiments' |
score_key | Union[str, List[str]] | The key used to extract scores from evaluation results. If str: direct key access (e.g., "score") or dot notation (e.g., "metrics.accuracy"). If List[str]: nested key path (e.g., ["generation", "score"]). Defaults to "score". | 'score' |
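The sketch below shows the three score_key forms described above. The import path is an assumption; everything else mirrors the parameter table.

```python
# Construction sketches for SimpleExperimentTracker; the import path is assumed.
from gllm_evals.experiment_tracker import SimpleExperimentTracker  # assumed path

# Direct key: read result["score"] for the leaderboard (the default).
tracker = SimpleExperimentTracker(project_name="rag-eval")

# Dot notation: read the nested value result["metrics"]["accuracy"].
tracker_dot = SimpleExperimentTracker(project_name="rag-eval", score_key="metrics.accuracy")

# Explicit key path: the same nested access expressed as a list of keys.
tracker_path = SimpleExperimentTracker(project_name="rag-eval", score_key=["generation", "score"])
```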
alog(evaluation_result, dataset_name, data, run_id=None, metadata=None)
async
Log a single evaluation result (asynchronous).
Parameters:

Name | Type | Description | Default |
---|---|---|---|
evaluation_result | EvaluationOutput | The evaluation result to log. | required |
dataset_name | str | Name of the dataset being evaluated. | required |
data | MetricInput | The input data that was evaluated. | required |
run_id | Optional[str] | ID of the experiment run. Can be auto-generated if None. | None |
metadata | Optional[Dict[str, Any]] | Additional metadata to log. | None |
get_experiment_history()
Get all experiment runs from leaderboard.
Returns:

Type | Description |
---|---|
List[Dict[str, Any]] | List of experiment runs. |
get_run_results(run_id)
Get detailed results for a specific run.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
run_id | str | ID of the experiment run. | required |

Returns:

Type | Description |
---|---|
list[dict[str, Any]] | Detailed run results. |
log(evaluation_result, dataset_name, data, run_id=None, metadata=None)
Log a single evaluation result (synchronous).
Parameters:

Name | Type | Description | Default |
---|---|---|---|
evaluation_result | EvaluationOutput | The evaluation result to log. | required |
dataset_name | str | Name of the dataset being evaluated. | required |
data | MetricInput | The input data that was evaluated. | required |
run_id | Optional[str] | ID of the experiment run. Can be auto-generated if None. | None |
metadata | Optional[Dict[str, Any]] | Additional metadata to log. | None |
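A hedged sketch of synchronous logging with this tracker. Representing EvaluationOutput and MetricInput as plain dicts, and the field names shown, are assumptions.

```python
# Synchronous single-result logging sketch; the import path and the dict
# stand-ins for EvaluationOutput / MetricInput are assumptions.
from gllm_evals.experiment_tracker import SimpleExperimentTracker  # assumed path

tracker = SimpleExperimentTracker(project_name="rag-eval")
result = {"score": 0.87}                                  # stand-in EvaluationOutput
sample = {"query": "What is RAG?", "response": "..."}     # stand-in MetricInput

tracker.log(
    evaluation_result=result,
    dataset_name="faq-v1",
    data=sample,
    metadata={"model": "gpt-4o-mini"},
)
# The result is written to experiment_results.csv and reflected in leaderboard.csv.
```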
log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None)
async
Log a batch of evaluation results (asynchronous).
Parameters:

Name | Type | Description | Default |
---|---|---|---|
evaluation_results | List[EvaluationOutput] | The evaluation results to log. | required |
dataset_name | str | Name of the dataset being evaluated. | required |
data | List[MetricInput] | List of input data that was evaluated. | required |
run_id | Optional[str] | ID of the experiment run. Can be auto-generated if None. | None |
metadata | Optional[Dict[str, Any]] | Additional metadata to log. | None |
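Finally, an end-to-end async batch sketch with this tracker, grouping results under an explicit run_id. The import path, the dict stand-ins, and the run_id value are assumptions.

```python
# End-to-end async batch logging with SimpleExperimentTracker, grouping results
# under an explicit run_id. Import path and dict stand-ins are assumptions.
import asyncio

from gllm_evals.experiment_tracker import SimpleExperimentTracker  # assumed path


async def main():
    tracker = SimpleExperimentTracker(project_name="rag-eval")
    results = [{"score": 0.9}, {"score": 0.4}]      # stand-ins for EvaluationOutput
    inputs = [{"query": "q1"}, {"query": "q2"}]     # stand-ins for MetricInput
    await tracker.log_batch(results, "faq-v1", inputs, run_id="baseline-001")
    print(tracker.get_run_results("baseline-001"))  # rows for just this run


asyncio.run(main())
```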