Experiment Tracker
Experiment tracker module.
`BaseExperimentTracker(project_name, **kwargs)`

Bases: `ABC`
Base class for all experiment trackers.
This class defines the core interface for experiment tracking across different backends. It provides methods for logging individual results and batch results using the observability pattern.
Attributes:

| Name | Type | Description |
|---|---|---|
| `project_name` | `str` | The name of the project. |
Initialize the experiment tracker.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `project_name` | `str` | The name of the project. | *required* |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
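To make the interface concrete, here is a minimal sketch of a custom backend that implements the abstract methods documented below. The import path and the in-memory storage are assumptions for illustration; `EvaluationOutput` and `MetricInput` are treated as opaque values.

```python
# Hypothetical sketch of a custom backend; the import path is an assumption.
from typing import Any

from gllm_evals.experiment_tracker import BaseExperimentTracker  # assumed path


class InMemoryExperimentTracker(BaseExperimentTracker):
    """Keeps results in a list; intended only for quick local testing."""

    def __init__(self, project_name: str, **kwargs: Any) -> None:
        super().__init__(project_name, **kwargs)
        self._records: list[dict[str, Any]] = []

    def log(self, evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs):
        # Store the result alongside its dataset and run identifiers.
        self._records.append({
            "run_id": run_id,
            "dataset_name": dataset_name,
            "data": data,
            "result": evaluation_result,
            "metadata": metadata or {},
        })

    async def alog(self, evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs):
        # The async variant simply delegates to the synchronous implementation.
        self.log(evaluation_result, dataset_name, data, run_id=run_id, metadata=metadata, **kwargs)

    async def log_batch(self, evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs):
        # Log each (result, input) pair under the same run.
        for result, item in zip(evaluation_results, data):
            self.log(result, dataset_name, item, run_id=run_id, metadata=metadata, **kwargs)
```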
`alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs)` *(abstractmethod, async)*
Log a single evaluation result (asynchronous).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_result` | `EvaluationOutput` | The evaluation result to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `MetricInput` | The input data that was evaluated. | *required* |
| `run_id` | `Optional[str]` | ID of the experiment run. May be auto-generated if `None`. | `None` |
| `metadata` | `Optional[Dict[str, Any]]` | Additional metadata to log. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
`get_experiment_history(**kwargs)`
Get all experiment runs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
Returns:

| Type | Description |
|---|---|
| `List[Dict[str, Any]]` | List of experiment runs. |
`get_run_results(run_id, **kwargs)`
Get detailed results for a specific run.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `run_id` | `str` | ID of the experiment run. | *required* |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | Detailed run results. |
`log(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs)` *(abstractmethod)*
Log a single evaluation result (synchronous).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_result` | `EvaluationOutput` | The evaluation result to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `MetricInput` | The input data that was evaluated. | *required* |
| `run_id` | `Optional[str]` | ID of the experiment run. May be auto-generated if `None`. | `None` |
| `metadata` | `Optional[Dict[str, Any]]` | Additional metadata to log. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
`log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs)` *(abstractmethod, async)*
Log a batch of evaluation results (asynchronous).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_results` | `List[EvaluationOutput]` | The evaluation results to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `List[MetricInput]` | List of input data that was evaluated. | *required* |
| `run_id` | `Optional[str]` | ID of the experiment run. May be auto-generated if `None`. | `None` |
| `metadata` | `Optional[Dict[str, Any]]` | Additional metadata to log. | `None` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
`log_context(**kwargs)` *(async)*
Log a context.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
`LangfuseExperimentTracker(score_key=DefaultValues.SCORE_KEY, project_name=DefaultValues.PROJECT_NAME, langfuse_client=None, expected_output_key=DefaultValues.EXPECTED_OUTPUT_KEY, mapping=None)`

Bases: `BaseExperimentTracker`
Experiment tracker for Langfuse.
Attributes:

| Name | Type | Description |
|---|---|---|
| `langfuse_client` | `Langfuse` | The Langfuse client. |
Initialize the LangfuseExperimentTracker class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `score_key` | `str \| list[str]` | The key(s) of the score(s) to log. Defaults to `"score"`. | `SCORE_KEY` |
| `project_name` | `str` | The name of the project. | `PROJECT_NAME` |
| `langfuse_client` | `Langfuse` | The Langfuse client. | `None` |
| `expected_output_key` | `str \| None` | The key to extract the expected output from the data. Defaults to `"expected_response"`. | `EXPECTED_OUTPUT_KEY` |
| `mapping` | `dict[str, Any] \| None` | Optional mapping for field keys. Defaults to `None`. | `None` |
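As an orientation, constructing the tracker could look like the sketch below. The `Langfuse` client and the environment-variable names follow the public `langfuse` SDK; the `gllm_evals` import path is an assumption.

```python
# Hypothetical setup sketch; the gllm_evals import path is an assumption.
import os

from langfuse import Langfuse

from gllm_evals.experiment_tracker import LangfuseExperimentTracker  # assumed path

# Credentials come from the standard Langfuse environment variables.
langfuse_client = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com"),
)

tracker = LangfuseExperimentTracker(
    project_name="my-rag-evals",
    langfuse_client=langfuse_client,
    score_key=["score", "faithfulness"],  # log multiple score keys per result
)
```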
`alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, dataset_item_id=None, reason_key=DefaultValues.REASON_KEY, **kwargs)` *(async)*
Log an evaluation result to Langfuse asynchronously.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_result` | `EvaluationOutput` | The evaluation result to log. | *required* |
| `dataset_name` | `str` | The name of the dataset. | *required* |
| `data` | `MetricInput` | The dataset data. | *required* |
| `run_id` | `str \| None` | The run ID. | `None` |
| `metadata` | `dict \| None` | Additional metadata. | `None` |
| `dataset_item_id` | `str \| None` | The ID of the dataset item. | `None` |
| `reason_key` | `str \| None` | The key to extract reasoning/explanation text from evaluation results. Defaults to `"explanation"`. | `REASON_KEY` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
`get_run_results(run_id, **kwargs)`
Get detailed results for a specific run.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `run_id` | `str` | ID of the experiment run. | *required* |
| `**kwargs` | `Any` | Additional configuration parameters, including `keys` for trace keys. | `{}` |
Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | Detailed run results (may be a single dict or a list of dicts, depending on the requested keys). |
`log(evaluation_result, dataset_name, data, run_id=None, metadata=None, dataset_item_id=None, reason_key=DefaultValues.REASON_KEY, **kwargs)`
Log a single evaluation result to Langfuse.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_result` | `EvaluationOutput` | The evaluation result to log. | *required* |
| `dataset_name` | `str` | The name of the dataset. | *required* |
| `data` | `MetricInput` | The dataset data. | *required* |
| `run_id` | `str \| None` | The run ID. | `None` |
| `metadata` | `dict \| None` | Additional metadata. | `None` |
| `dataset_item_id` | `str \| None` | The ID of the dataset item. | `None` |
| `reason_key` | `str \| None` | The key to extract reasoning/explanation text from evaluation results. This text is logged as comments alongside scores in Langfuse traces. Defaults to `"explanation"`. | `REASON_KEY` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
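Using the tracker constructed above, a single-result call might look like this; the dict-shaped `evaluation_result` and `data` are stand-ins for `EvaluationOutput` and `MetricInput`, whose real structures are assumptions here.

```python
# Hypothetical call sketch; the result/data shapes are assumptions.
evaluation_result = {"score": 0.87, "explanation": "Answer is grounded in the context."}
data = {"query": "What is RAG?", "expected_response": "Retrieval-augmented generation."}

tracker.log(
    evaluation_result,
    dataset_name="faq-v1",
    data=data,
    run_id="run-2024-06-01",
    metadata={"model": "gpt-4o"},
    reason_key="explanation",  # logged as a comment next to the score
)
```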
`log_batch(evaluation_results, dataset_name, data_list, run_id=None, metadata=None, reason_key=DefaultValues.REASON_KEY, **kwargs)` *(async)*
Log a batch of evaluation results to Langfuse.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_results` | `list[EvaluationOutput]` | The list of evaluation results to log. | *required* |
| `dataset_name` | `str` | The name of the dataset. | *required* |
| `data_list` | `list[MetricInput]` | The list of dataset data items. | *required* |
| `run_id` | `str \| None` | The run ID. If `None`, a unique ID will be generated. | `None` |
| `metadata` | `dict \| None` | Additional metadata. | `None` |
| `reason_key` | `str \| None` | The key to extract reasoning/explanation text from evaluation results. This text is logged as comments alongside scores in Langfuse traces. Defaults to `"explanation"`. | `REASON_KEY` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
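Because `log_batch` is asynchronous, it is typically driven from an event loop. A sketch, again with assumed dict shapes:

```python
# Hypothetical batch sketch; result/data shapes are assumptions.
import asyncio

results = [{"score": 0.9, "explanation": "Good."}, {"score": 0.4, "explanation": "Off-topic."}]
inputs = [{"query": "q1"}, {"query": "q2"}]

asyncio.run(
    tracker.log_batch(
        results,
        dataset_name="faq-v1",
        data_list=inputs,
        metadata={"model": "gpt-4o"},  # run_id omitted: a unique ID is generated
    )
)
```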
`log_context(dataset_name, data, evaluation_result=None, run_id=None, metadata=None, dataset_item_id=None, reason_key=DefaultValues.REASON_KEY, **kwargs)` *(async)*
Open a context manager for Langfuse experiment tracking.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_name` | `str` | The name of the dataset. | *required* |
| `data` | `MetricInput` | The dataset data. | *required* |
| `evaluation_result` | `EvaluationOutput \| None` | The evaluation result to log. | `None` |
| `run_id` | `str \| None` | The run ID. If `None`, a unique ID will be generated. | `None` |
| `metadata` | `dict \| None` | Additional metadata. | `None` |
| `dataset_item_id` | `str \| None` | The ID of the dataset item. | `None` |
| `reason_key` | `str \| None` | The key to extract reasoning/explanation text from evaluation results. This text is logged as comments alongside scores in Langfuse traces. Defaults to `"explanation"`. | `REASON_KEY` |
| `**kwargs` | `Any` | Additional configuration parameters. | `{}` |
Yields:

| Type | Description |
|---|---|
| `AsyncGenerator[tuple[LangfuseSpan, DatasetItemClient], None]` | A tuple containing the Langfuse span and the dataset item client. |
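Given the `AsyncGenerator` yield type, `log_context` is presumably consumed as an async context manager; that usage, and the `id` attribute on the yielded dataset item, are assumptions in the sketch below.

```python
# Hypothetical usage sketch, assuming log_context is an async context manager.
import asyncio

async def run_one_item() -> None:
    data = {"query": "What is RAG?"}  # stand-in for MetricInput
    async with tracker.log_context(dataset_name="faq-v1", data=data) as (span, dataset_item):
        # Run the system under test while the Langfuse span is open,
        # then log the evaluation result against the same dataset item.
        evaluation_result = {"score": 0.87, "explanation": "Grounded answer."}
        tracker.log(
            evaluation_result,
            dataset_name="faq-v1",
            data=data,
            dataset_item_id=dataset_item.id,  # attribute name is an assumption
        )

asyncio.run(run_one_item())
```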
`SimpleExperimentTracker(project_name, output_dir='./gllm_evals/experiments', score_key='score')`

Bases: `BaseExperimentTracker`
Simple file-based experiment tracker for development and testing.
This class provides a simple local storage implementation for experiment tracking. It stores experiment data in CSV format with two files: experiment_results.csv and leaderboard.csv.
Attributes:

| Name | Type | Description |
|---|---|---|
| `project_name` | `str` | The name of the project. |
| `output_dir` | `Path` | Directory to store experiment results. |
| `experiment_results_file` | `Path` | CSV file for experiment results. |
| `leaderboard_file` | `Path` | CSV file for leaderboard data. |
| `logger` | `Logger` | Logger for tracking errors and warnings. |
Initialize simple tracker with project name and output directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `project_name` | `str` | The name of the project. | *required* |
| `output_dir` | `str` | Directory to store experiment results. | `'./gllm_evals/experiments'` |
| `score_key` | `Union[str, List[str]]` | The key to extract scores from evaluation results. If `str`: direct key access (e.g., `"score"`) or dot notation (e.g., `"metrics.accuracy"`); if `List[str]`: nested key path (e.g., `["generation", "score"]`). Defaults to `"score"`. | `'score'` |
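As a quick illustration of the two `score_key` forms (the import path is an assumption):

```python
# Hypothetical setup sketch; the import path is an assumption.
from gllm_evals.experiment_tracker import SimpleExperimentTracker  # assumed path

# Dot notation digs into nested results such as {"metrics": {"accuracy": 0.9}}.
tracker = SimpleExperimentTracker(
    project_name="my-rag-evals",
    output_dir="./experiments",
    score_key="metrics.accuracy",
)

# The same nested lookup expressed as an explicit key path.
tracker_nested = SimpleExperimentTracker(
    project_name="my-rag-evals",
    score_key=["metrics", "accuracy"],
)
```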
`alog(evaluation_result, dataset_name, data, run_id=None, metadata=None)` *(async)*
Log a single evaluation result (asynchronous).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_result` | `EvaluationOutput` | The evaluation result to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `MetricInput` | The input data that was evaluated. | *required* |
| `run_id` | `str \| None` | ID of the experiment run. May be auto-generated if `None`. | `None` |
| `metadata` | `dict[str, Any] \| None` | Additional metadata to log. | `None` |
`get_experiment_history()`
Get all experiment runs from leaderboard.
Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of experiment runs. |
`get_run_results(run_id)`
Get detailed results for a specific run.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `run_id` | `str` | ID of the experiment run. | *required* |
Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | Detailed run results. |
`log(evaluation_result, dataset_name, data, run_id=None, metadata=None)`
Log a single evaluation result.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_result` | `EvaluationOutput` | The evaluation result to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `MetricInput` | The input data that was evaluated. | *required* |
| `run_id` | `str \| None` | ID of the experiment run. May be auto-generated if `None`. | `None` |
| `metadata` | `dict[str, Any] \| None` | Additional metadata to log. | `None` |
`log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None)` *(async)*
Log a batch of evaluation results (asynchronous).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `evaluation_results` | `List[EvaluationOutput]` | The evaluation results to log. | *required* |
| `dataset_name` | `str` | Name of the dataset being evaluated. | *required* |
| `data` | `list[MetricInput]` | List of input data that was evaluated. | *required* |
| `run_id` | `str \| None` | ID of the experiment run. May be auto-generated if `None`. | `None` |
| `metadata` | `dict[str, Any] \| None` | Additional metadata to log. | `None` |
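Putting the pieces together, a local development loop might log a result with the tracker from the sketch above and then read it back; the result and input shapes are assumptions.

```python
# Hypothetical end-to-end sketch; result/data shapes are assumptions.
tracker.log(
    {"metrics": {"accuracy": 0.9}},
    dataset_name="faq-v1",
    data={"query": "q1"},
    run_id="run-001",
)

# Rows are persisted to experiment_results.csv and leaderboard.csv under output_dir.
for run in tracker.get_experiment_history():
    print(run)

for row in tracker.get_run_results("run-001"):
    print(row)
```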
`detect_experiment_tracker(experiment_tracker, project_name, **kwargs)`

Detect the experiment tracker type.

Supported experiment tracker types:

- `SimpleExperimentTracker` (default; local CSV)
- `LangfuseExperimentTracker` (Langfuse). Required parameters: `langfuse_client`
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `experiment_tracker` | `BaseExperimentTracker \| type[BaseExperimentTracker]` | The experiment tracker to be detected. | *required* |
| `project_name` | `str` | The name of the project. | *required* |
| `**kwargs` | `Any` | Additional arguments to pass to the constructor. | `{}` |
Returns:

| Name | Type | Description |
|---|---|---|
| `BaseExperimentTracker` | `BaseExperimentTracker` | The experiment tracker. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If the experiment tracker is not supported. |
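For instance, resolving a tracker class into a configured instance might look like this (import path assumed, reusing the `langfuse_client` constructed earlier):

```python
# Hypothetical usage sketch; the import path is an assumption.
from gllm_evals.experiment_tracker import (  # assumed path
    LangfuseExperimentTracker,
    detect_experiment_tracker,
)

# Passing a class: the constructor is called with project_name and **kwargs.
tracker = detect_experiment_tracker(
    LangfuseExperimentTracker,
    project_name="my-rag-evals",
    langfuse_client=langfuse_client,  # required for the Langfuse backend
)
```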