Experiment Tracker

Experiment tracker init file.

BaseExperimentTracker(project_name, **kwargs)

Bases: ABC

Base class for all experiment trackers.

This class defines the core interface for experiment tracking across different backends. It provides methods for logging individual results and batch results using the observability pattern.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `project_name` | str | The name of the project. |

Initialize the experiment tracker.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `project_name` | str | The name of the project. | *required* |
| `**kwargs` | Any | Additional configuration parameters. | `{}` |

alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs) abstractmethod async

Log a single evaluation result (asynchronous).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `evaluation_result` | EvaluationOutput | The evaluation result to log. | *required* |
| `dataset_name` | str | Name of the dataset being evaluated. | *required* |
| `data` | MetricInput | The input data that was evaluated. | *required* |
| `run_id` | Optional[str] | ID of the experiment run. Auto-generated if None. | `None` |
| `metadata` | Optional[Dict[str, Any]] | Additional metadata to log. | `None` |
| `**kwargs` | Any | Additional configuration parameters. | `{}` |

get_experiment_history(**kwargs)

Get all experiment runs.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `**kwargs` | Any | Additional configuration parameters. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| List[Dict[str, Any]] | List of experiment runs. |

get_run_results(run_id, **kwargs)

Get detailed results for a specific run.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `run_id` | str | ID of the experiment run. | *required* |
| `**kwargs` | Any | Additional configuration parameters. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| list[dict[str, Any]] | Detailed run results. |

log(evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs) abstractmethod

Log a single evaluation result (synchronous).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `evaluation_result` | EvaluationOutput | The evaluation result to log. | *required* |
| `dataset_name` | str | Name of the dataset being evaluated. | *required* |
| `data` | MetricInput | The input data that was evaluated. | *required* |
| `run_id` | Optional[str] | ID of the experiment run. Auto-generated if None. | `None` |
| `metadata` | Optional[Dict[str, Any]] | Additional metadata to log. | `None` |
| `**kwargs` | Any | Additional configuration parameters. | `{}` |

log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs) abstractmethod async

Log a batch of evaluation results (asynchronous).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `evaluation_results` | List[EvaluationOutput] | The evaluation results to log. | *required* |
| `dataset_name` | str | Name of the dataset being evaluated. | *required* |
| `data` | List[MetricInput] | List of input data that was evaluated. | *required* |
| `run_id` | Optional[str] | ID of the experiment run. Auto-generated if None. | `None` |
| `metadata` | Optional[Dict[str, Any]] | Additional metadata to log. | `None` |
| `**kwargs` | Any | Additional configuration parameters. | `{}` |

log_context(**kwargs) async

Log a context.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `**kwargs` | Any | Additional configuration parameters. | `{}` |
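
To support a backend that is not provided out of the box, subclass BaseExperimentTracker and implement the abstract log, alog, and log_batch methods. Below is a minimal sketch; the import path and the print-based "storage" are assumptions for illustration, not part of the documented API.

```python
# A minimal sketch, assuming BaseExperimentTracker is importable from the
# package path below (the path is a guess based on this page, not documented).
from gllm_evals.experiment_tracker import BaseExperimentTracker


class PrintExperimentTracker(BaseExperimentTracker):
    """Toy tracker that prints results instead of persisting them."""

    def log(self, evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs):
        print(f"[{run_id}] {dataset_name}: {evaluation_result}")

    async def alog(self, evaluation_result, dataset_name, data, run_id=None, metadata=None, **kwargs):
        # Delegate to the synchronous implementation.
        self.log(evaluation_result, dataset_name, data, run_id=run_id, metadata=metadata, **kwargs)

    async def log_batch(self, evaluation_results, dataset_name, data, run_id=None, metadata=None, **kwargs):
        for result, item in zip(evaluation_results, data):
            await self.alog(result, dataset_name, item, run_id=run_id, metadata=metadata, **kwargs)


tracker = PrintExperimentTracker(project_name="demo-project")  # hypothetical project name
```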

LangfuseExperimentTracker(score_key=DefaultValues.SCORE_KEY, project_name=DefaultValues.PROJECT_NAME, langfuse_client=None, expected_output_key=DefaultValues.EXPECTED_OUTPUT_KEY, mapping=None)

Bases: BaseExperimentTracker

Experiment tracker for Langfuse.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `langfuse_client` | Langfuse | The Langfuse client. |

Initialize the LangfuseExperimentTracker class.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `score_key` | str \| list[str] | The key(s) of the score(s) to log. Defaults to "score". | `DefaultValues.SCORE_KEY` |
| `project_name` | str | The name of the project. | `DefaultValues.PROJECT_NAME` |
| `langfuse_client` | Langfuse | The Langfuse client. | `None` |
| `expected_output_key` | str \| None | The key to extract the expected output from the data. Defaults to "expected_response". | `DefaultValues.EXPECTED_OUTPUT_KEY` |
| `mapping` | dict[str, Any] \| None | Optional mapping for field keys. Defaults to None. | `None` |
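
A construction sketch, assuming the Langfuse client is configured through the standard LANGFUSE_* environment variables and that the tracker is importable from the path shown (the import path and project name are assumptions):

```python
from langfuse import Langfuse
from gllm_evals.experiment_tracker import LangfuseExperimentTracker  # assumed path

# Langfuse reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST
# from the environment when no arguments are passed.
langfuse_client = Langfuse()

tracker = LangfuseExperimentTracker(
    score_key=["score", "relevance"],        # one or several score keys to log
    project_name="my-rag-eval",              # hypothetical project name
    langfuse_client=langfuse_client,
    expected_output_key="expected_response", # documented default, shown explicitly
)
```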

alog(evaluation_result, dataset_name, data, run_id=None, metadata=None, dataset_item_id=None, reason_key=DefaultValues.REASON_KEY, **kwargs) async

Log an evaluation result to Langfuse asynchronously.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `evaluation_result` | EvaluationOutput | The evaluation result to log. | *required* |
| `dataset_name` | str | The name of the dataset. | *required* |
| `data` | MetricInput | The dataset data. | *required* |
| `run_id` | str \| None | The run ID. | `None` |
| `metadata` | dict | Additional metadata. | `None` |
| `dataset_item_id` | str \| None | The ID of the dataset item. | `None` |
| `reason_key` | str \| None | The key to extract reasoning/explanation text from evaluation results, logged as comments alongside scores in Langfuse traces. Defaults to "explanation". | `DefaultValues.REASON_KEY` |
| `**kwargs` | Any | Additional configuration parameters. | `{}` |

get_run_results(run_id, **kwargs)

Get detailed results for a specific run.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `run_id` | str | ID of the experiment run. | *required* |
| `**kwargs` | Any | Additional configuration parameters, including `keys` for trace keys. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| list[dict[str, Any]] | Detailed run results. |
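
A usage sketch. Only the existence of the `keys` kwarg is documented here, so the specific trace keys below are illustrative assumptions:

```python
# Fetch the traces recorded for one experiment run.
results = tracker.get_run_results(
    run_id="run-2024-05-01",             # hypothetical run ID
    keys=["input", "output", "scores"],  # assumed trace keys; adjust to your setup
)
for row in results:
    print(row)
```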

log(evaluation_result, dataset_name, data, run_id=None, metadata=None, dataset_item_id=None, reason_key=DefaultValues.REASON_KEY, **kwargs)

Log a single evaluation result to Langfuse.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `evaluation_result` | EvaluationOutput | The evaluation result to log. | *required* |
| `dataset_name` | str | The name of the dataset. | *required* |
| `data` | MetricInput | The dataset data. | *required* |
| `run_id` | str \| None | The run ID. | `None` |
| `metadata` | dict | Additional metadata. | `None` |
| `dataset_item_id` | str \| None | The ID of the dataset item. | `None` |
| `reason_key` | str \| None | The key to extract reasoning/explanation text from evaluation results, logged as comments alongside scores in Langfuse traces. Defaults to "explanation". | `DefaultValues.REASON_KEY` |
| `**kwargs` | Any | Additional configuration parameters. | `{}` |
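
A minimal synchronous logging sketch. The shapes of `evaluation_result` and `data` below (plain dicts with a score, an explanation, and an expected response) are assumptions; the real EvaluationOutput and MetricInput objects may differ in your setup.

```python
evaluation_result = {"score": 0.9, "explanation": "Answer matches the reference."}  # assumed shape
data = {
    "query": "What is the capital of France?",
    "response": "Paris",
    "expected_response": "Paris",  # matches the documented expected_output_key default
}  # assumed shape

tracker.log(
    evaluation_result,
    dataset_name="geography-qa",   # hypothetical dataset name
    data=data,
    run_id="run-2024-05-01",       # hypothetical run ID
    metadata={"model": "gpt-4o"},
    reason_key="explanation",      # where the score comment text is taken from
)
```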

log_batch(evaluation_results, dataset_name, data_list, run_id=None, metadata=None, reason_key=DefaultValues.REASON_KEY, **kwargs) async

Log a batch of evaluation results to Langfuse.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `evaluation_results` | list[EvaluationOutput] | The list of evaluation results to log. | *required* |
| `dataset_name` | str | The name of the dataset. | *required* |
| `data_list` | list[MetricInput] | The list of dataset data items. | *required* |
| `run_id` | str \| None | The run ID. If None, a unique ID will be generated. | `None` |
| `metadata` | dict | Additional metadata. | `None` |
| `reason_key` | str \| None | The key to extract reasoning/explanation text from evaluation results, logged as comments alongside scores in Langfuse traces. Defaults to "explanation". | `DefaultValues.REASON_KEY` |
| `**kwargs` | Any | Additional configuration parameters. | `{}` |
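
An asynchronous batch-logging sketch, under the same dict-shaped data assumptions as above:

```python
import asyncio


async def main():
    evaluation_results = [
        {"score": 1.0, "explanation": "Correct."},
        {"score": 0.0, "explanation": "Wrong city."},
    ]  # assumed shapes
    data_list = [
        {"query": "Capital of France?", "response": "Paris", "expected_response": "Paris"},
        {"query": "Capital of Japan?", "response": "Kyoto", "expected_response": "Tokyo"},
    ]  # assumed shapes

    await tracker.log_batch(
        evaluation_results,
        dataset_name="geography-qa",
        data_list=data_list,
        metadata={"model": "gpt-4o"},  # run_id omitted, so a unique ID is generated
    )


asyncio.run(main())
```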

log_context(dataset_name, data, evaluation_result=None, run_id=None, metadata=None, dataset_item_id=None, reason_key=DefaultValues.REASON_KEY, **kwargs) async

Open a context manager for Langfuse experiment tracking.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `dataset_name` | str | The name of the dataset. | *required* |
| `data` | MetricInput | The dataset data. | *required* |
| `evaluation_result` | EvaluationOutput \| None | The evaluation result to log. | `None` |
| `run_id` | str \| None | The run ID. If None, a unique ID will be generated. | `None` |
| `metadata` | dict | Additional metadata. | `None` |
| `dataset_item_id` | str \| None | The ID of the dataset item. | `None` |
| `reason_key` | str \| None | The key to extract reasoning/explanation text from evaluation results, logged as comments alongside scores in Langfuse traces. Defaults to "explanation". | `DefaultValues.REASON_KEY` |
| `**kwargs` | Any | Additional configuration parameters. | `{}` |

Yields:

| Type | Description |
|------|-------------|
| tuple[LangfuseSpan, DatasetItemClient] | A tuple containing the Langfuse span and the dataset item client. |
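
Because it yields a (LangfuseSpan, DatasetItemClient) pair, log_context is presumably meant to wrap the code under evaluation as an async context manager. The sketch below assumes exactly that usage, plus dict-shaped data as before and the standard LangfuseSpan.update call:

```python
import asyncio


async def main():
    data = {"query": "Capital of France?", "expected_response": "Paris"}  # assumed shape

    # Assumes log_context is exposed as an async context manager.
    async with tracker.log_context(
        dataset_name="geography-qa",
        data=data,
        run_id="run-2024-05-01",  # hypothetical run ID
    ) as (span, dataset_item):
        response = "Paris"            # run your system under test here
        span.update(output=response)  # assumed LangfuseSpan API for attaching the output


asyncio.run(main())
```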

SimpleExperimentTracker(project_name, output_dir='./gllm_evals/experiments', score_key='score')

Bases: BaseExperimentTracker

Simple file-based experiment tracker for development and testing.

This class provides a simple local storage implementation for experiment tracking. It stores experiment data in CSV format with two files: experiment_results.csv and leaderboard.csv.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `project_name` | str | The name of the project. |
| `output_dir` | Path | Directory to store experiment results. |
| `experiment_results_file` | Path | CSV file for experiment results. |
| `leaderboard_file` | Path | CSV file for leaderboard data. |
| `logger` | Logger | Logger for tracking errors and warnings. |

Initialize simple tracker with project name and output directory.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `project_name` | str | The name of the project. | *required* |
| `output_dir` | str | Directory to store experiment results. | `'./gllm_evals/experiments'` |
| `score_key` | Union[str, List[str]] | The key to extract scores from evaluation results. If str: direct key access (e.g., "score") or dot notation (e.g., "metrics.accuracy"). If List[str]: nested key path (e.g., ["generation", "score"]). Defaults to "score". | `'score'` |
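
A construction sketch showing both score_key styles; the import path and directory are assumptions:

```python
from gllm_evals.experiment_tracker import SimpleExperimentTracker  # assumed path

# Scores stored under a top-level "score" key (the default).
tracker = SimpleExperimentTracker(project_name="my-rag-eval")

# Scores nested under evaluation_result["generation"]["score"]; dot notation
# ("generation.score") works as well, per the score_key description above.
nested_tracker = SimpleExperimentTracker(
    project_name="my-rag-eval",
    output_dir="./experiments",         # hypothetical directory
    score_key=["generation", "score"],
)
```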

alog(evaluation_result, dataset_name, data, run_id=None, metadata=None) async

Log a single evaluation result (asynchronous).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `evaluation_result` | EvaluationOutput | The evaluation result to log. | *required* |
| `dataset_name` | str | Name of the dataset being evaluated. | *required* |
| `data` | MetricInput | The input data that was evaluated. | *required* |
| `run_id` | str \| None | ID of the experiment run. Auto-generated if None. | `None` |
| `metadata` | dict[str, Any] \| None | Additional metadata to log. | `None` |

get_experiment_history()

Get all experiment runs from leaderboard.

Returns:

| Type | Description |
|------|-------------|
| list[dict[str, Any]] | List of experiment runs. |

get_run_results(run_id)

Get detailed results for a specific run.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `run_id` | str | ID of the experiment run. | *required* |

Returns:

| Type | Description |
|------|-------------|
| list[dict[str, Any]] | Detailed run results. |

log(evaluation_result, dataset_name, data, run_id=None, metadata=None)

Log a single evaluation result.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `evaluation_result` | EvaluationOutput | The evaluation result to log. | *required* |
| `dataset_name` | str | Name of the dataset being evaluated. | *required* |
| `data` | MetricInput | The input data that was evaluated. | *required* |
| `run_id` | str \| None | ID of the experiment run. Auto-generated if None. | `None` |
| `metadata` | dict[str, Any] \| None | Additional metadata to log. | `None` |

log_batch(evaluation_results, dataset_name, data, run_id=None, metadata=None) async

Log a batch of evaluation results (asynchronous).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `evaluation_results` | List[EvaluationOutput] | The evaluation results to log. | *required* |
| `dataset_name` | str | Name of the dataset being evaluated. | *required* |
| `data` | list[MetricInput] | List of input data that was evaluated. | *required* |
| `run_id` | str \| None | ID of the experiment run. Auto-generated if None. | `None` |
| `metadata` | dict[str, Any] \| None | Additional metadata to log. | `None` |
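
A hedged end-to-end sketch for the simple tracker: log one result, then read the leaderboard and the per-run details back from the two CSV files. The dict shapes and IDs are assumptions.

```python
evaluation_result = {"score": 0.75, "explanation": "Partially correct."}  # assumed shape
data = {"query": "Capital of France?", "response": "Paris"}               # assumed shape

tracker.log(
    evaluation_result,
    dataset_name="geography-qa",
    data=data,
    run_id="run-2024-05-01",       # hypothetical run ID
    metadata={"model": "gpt-4o"},
)

# leaderboard.csv: one summary row per run.
for run in tracker.get_experiment_history():
    print(run)

# experiment_results.csv: detailed rows for the given run.
for row in tracker.get_run_results("run-2024-05-01"):
    print(row)
```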

detect_experiment_tracker(experiment_tracker, project_name, **kwargs)

Detect the experiment tracker type.

Supported experiment tracker types:
  • SimpleExperimentTracker (default, local CSV)
  • LangfuseExperimentTracker (Langfuse). Required parameters: langfuse_client

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `experiment_tracker` | BaseExperimentTracker \| type[BaseExperimentTracker] | The experiment tracker to be detected. | *required* |
| `project_name` | str | The name of the project. | *required* |
| `**kwargs` | Any | Additional arguments to pass to the constructor. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| BaseExperimentTracker | The experiment tracker. |

Raises:

| Type | Description |
|------|-------------|
| ValueError | If the experiment tracker is not supported. |
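
A usage sketch, again with assumed import paths. Pass either a tracker class (constructed with `project_name` and `**kwargs`) or an existing instance:

```python
from langfuse import Langfuse
from gllm_evals.experiment_tracker import (  # assumed paths
    LangfuseExperimentTracker,
    SimpleExperimentTracker,
    detect_experiment_tracker,
)

langfuse_client = Langfuse()  # reads LANGFUSE_* environment variables

# Passing a class: the tracker is constructed from project_name and **kwargs.
tracker = detect_experiment_tracker(
    LangfuseExperimentTracker,
    project_name="my-rag-eval",        # hypothetical project name
    langfuse_client=langfuse_client,   # required for LangfuseExperimentTracker
)

# Passing an instance is also accepted; presumably it is returned as-is.
simple = detect_experiment_tracker(
    SimpleExperimentTracker(project_name="my-rag-eval"),
    project_name="my-rag-eval",
)
```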