Evaluator

Evaluator init file.

`AgentEvaluator(model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, prompt=None, use_reference=True, continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)`

Bases: BaseEvaluator

Evaluator for agent tasks.

This evaluator uses the LangChain AgentEvals trajectory accuracy metric to evaluate the performance of AI agents based on their execution trajectories.

Default expected input

agent_trajectory (list[dict[str, Any]]): The agent trajectory containing the sequence of actions, tool calls, and responses.
expected_agent_trajectory (list[dict[str, Any]] | None, optional): The expected agent trajectory for reference-based evaluation.

Attributes:

Name	Type	Description
`name`	`str`	The name of the evaluator.
`trajectory_accuracy_metric`	`LangChainAgentTrajectoryAccuracyMetric`	The metric used to evaluate agent trajectory accuracy.

Initialize the AgentEvaluator.

Parameters:

Name	Type	Description	Default
`model`	`str \| ModelId \| BaseLMInvoker`	The model to use for the trajectory accuracy metric. Defaults to DefaultValues.AGENT_EVALS_MODEL.	`AGENT_EVALS_MODEL`
`model_credentials`	`str \| None`	The model credentials. Defaults to None. This is required for the metric to function properly.	`None`
`model_config`	`dict[str, Any] \| None`	The model configuration. Defaults to None.	`None`
`prompt`	`str \| None`	Custom prompt for evaluation. If None, uses the default prompt from the metric. Defaults to None.	`None`
`use_reference`	`bool`	Whether to use expected_agent_trajectory for reference-based evaluation. Defaults to True.	`True`
`continuous`	`bool`	If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.	`False`
`choices`	`list[float] \| None`	Optional list of specific float values the score must be chosen from. Defaults to None.	`None`
`use_reasoning`	`bool`	If True, includes explanation for the score in the output. Defaults to True.	`True`
`few_shot_examples`	`list[Any] \| None`	Optional list of example evaluations to append to the prompt. Defaults to None.	`None`

Raises:

Type	Description
`ValueError`	If `model_credentials` is not provided.

`required_fields: set[str]` `property`

Returns the required fields for the data.

Returns:

Type	Description
`set[str]`	set[str]: The required fields for the data.

`BaseEvaluator(name)`

Bases: ABC

Base class for all evaluators.

Attributes:

Name	Type	Description
`name`	`str`	The name of the evaluator.
`required_fields`	`set[str]`	The required fields for the evaluator.

Initialize the evaluator.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the evaluator.	required

`aggregate_required_fields(metrics, mode='any')` `staticmethod`

Aggregate required fields from multiple metrics.

Parameters:

Name	Type	Description	Default
`metrics`	`Iterable[BaseMetric]`	The metrics to aggregate from.	required
`mode`	`str`	The aggregation mode. Options: - "union": All fields required by any metric - "intersection": Only fields required by all metrics - "any": Empty set (no validation) Defaults to "any".	`'any'`

Returns:

Type	Description
`set[str]`	set[str]: The aggregated required fields.

Raises:

Type	Description
`ValueError`	If mode is not one of the supported options.

`can_evaluate_any(metrics, data)` `staticmethod`

Check if any of the metrics can evaluate the given data.

Parameters:

Name	Type	Description	Default
`metrics`	`Iterable[BaseMetric]`	The metrics to check.	required
`data`	`MetricInput`	The data to validate against.	required

Returns:

Name	Type	Description
`bool`	`bool`	True if any metric can evaluate the data, False otherwise.

`ensure_list_of_dicts(data, key)` `staticmethod`

Ensure that a field in the data is a list of dictionaries.

Parameters:

Name	Type	Description	Default
`data`	`MetricInput`	The data to validate.	required
`key`	`str`	The key to check.	required

Raises:

Type	Description
`ValueError`	If the field is not a list or contains non-dictionary elements.

`ensure_non_empty_list(data, key)` `staticmethod`

Ensure that a field in the data is a non-empty list.

Parameters:

Name	Type	Description	Default
`data`	`MetricInput`	The data to validate.	required
`key`	`str`	The key to check.	required

Raises:

Type	Description
`ValueError`	If the field is not a list or is empty.

`evaluate(data)` `async`

Evaluate the data.

Parameters:

Name	Type	Description	Default
`data`	`MetricInput`	The data to be evaluated.	required

Returns:

Name	Type	Description
`EvaluateOutput`	`EvaluationOutput`	The evaluation output.

`get_input_fields()` `classmethod`

Return declared input field names if input_type is provided; otherwise None.

Returns:

Type	Description
`list[str] \| None`	list[str] \| None: The input fields.

`get_input_spec()` `classmethod`

Return structured spec for input fields if input_type is provided; otherwise None.

Returns:

Type	Description
`list[dict[str, Any]] \| None`	list[dict[str, Any]] \| None: The input spec.

`ClassicalRetrievalEvaluator(metrics=None, k=20)`

Bases: BaseEvaluator

A class that evaluates the performance of a classical retrieval system.

Required fields: - retrieved_chunks: The retrieved chunks with their similarity score. - ground_truth_chunk_ids: The ground truth chunk ids.

Example:

data = RetrievalData(
    retrieved_chunks={
        "chunk1": 0.9,
        "chunk2": 0.8,
        "chunk3": 0.7,
    },
    ground_truth_chunk_ids=["chunk1", "chunk2", "chunk3"],
)

evaluator = ClassicalRetrievalEvaluator()
await evaluator.evaluate(data)

Attributes:

Name	Type	Description
`name`	`str`	The name of the evaluator.
`metrics`	`list[str \| ClassicalRetrievalMetric] \| None`	The metrics to evaluate.
`k`	`int`	The number of retrieved chunks to consider.

Initializes the evaluator.

Parameters:

Name	Type	Description	Default
`metrics`	`list[str \| ClassicalRetrievalMetric] \| None`	The metrics to evaluate. Defaults to all metrics.	`None`
`k`	`int \| list[int]`	The number of retrieved chunks to consider. Defaults to 20.	`20`

`required_fields: set[str]` `property`

Returns the required fields for the data.

Returns:

Type	Description
`set[str]`	set[str]: The required fields for the data.

`CustomEvaluator(metrics, name='custom', parallel=True)`

Bases: BaseEvaluator

Custom evaluator.

This evaluator is used to evaluate the performance of the model.

Attributes:

Name	Type	Description
`metrics`	`list[BaseMetric]`	The list of metrics to evaluate.
`name`	`str`	The name of the evaluator.
`parallel`	`bool`	Whether to evaluate the metrics in parallel.

Initialize the custom evaluator.

Parameters:

Name	Type	Description	Default
`metrics`	`list[BaseMetric]`	The list of metrics to evaluate.	required
`name`	`str`	The name of the evaluator.	`'custom'`
`parallel`	`bool`	Whether to evaluate the metrics in parallel.	`True`

`GEvalGenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None)`

Bases: GenerationEvaluator

GEval Generation Evaluator.

This evaluator is used to evaluate the generation of the model.

Default expected input

query (str): The query to evaluate the generation of the model's output.
retrieved_context (str): The retrieved context to evaluate the generation of the model's output.
expected_response (str): The expected response to evaluate the generation of the model's output.
generated_response (str): The generated response to evaluate the generation of the model's output.

Attributes:

Name	Type	Description
`name`	`str`	The name of the evaluator.
`metrics`	`List[BaseMetric]`	The list of metrics to evaluate.
`run_parallel`	`bool`	Whether to run the metrics in parallel.
`rule_book`	`RuleBook \| None`	The rule book.
`generation_rule_engine`	`GenerationRuleEngine \| None`	The generation rule engine.

Initialize the GEval Generation Evaluator.

Parameters:

Name	Type	Description	Default
`metrics`	`List[BaseMetric] \| None`	The list of metrics to evaluate.	`None`
`enabled_metrics`	`List[type[BaseMetric] \| str] \| None`	The list of enabled metrics.	`None`
`model`	`str \| ModelId \| BaseLMInvoker`	The model to use for the metrics.	`MODEL`
`model_credentials`	`str \| None`	The model credentials to use for the metrics.	`None`
`model_config`	`dict[str, Any] \| None`	The model config to use for the metrics.	`None`
`run_parallel`	`bool`	Whether to run the metrics in parallel.	`True`
`rule_book`	`RuleBook \| None`	The rule book.	`None`
`generation_rule_engine`	`GenerationRuleEngine \| None`	The generation rule engine.	`None`
`judge`	`MultipleLLMAsJudge \| None`	Optional multiple LLM judge for ensemble evaluation.	`None`

`GenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None)`

Bases: BaseEvaluator

Evaluator for generation tasks.

Default expected input

query (str): The query to evaluate the completeness of the model's output.
retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.
expected_response (str): The expected response to evaluate the completeness of the model's output.
generated_response (str): The generated response to evaluate the completeness of the model's output.

Attributes:

Name	Type	Description
`name`	`str`	The name of the evaluator.
`metrics`	`List[BaseMetric]`	The list of metrics to evaluate.
`run_parallel`	`bool`	Whether to run the metrics in parallel.
`rule_book`	`RuleBook \| None`	The rule book.
`generation_rule_engine`	`GenerationRuleEngine \| None`	The generation rule engine.
`judge`	`MultipleLLMAsJudge \| None`	Optional multiple LLM judge for ensemble evaluation.

Initialize the GenerationEvaluator.

Parameters:

Name	Type	Description	Default
`metrics`	`List[BaseMetric] \| None`	A list of metric instances to use as a base pool. If None, defaults to `[CompletenessMetric, RedundancyMetric, GroundednessMetric]`. Each custom metrics, must generate a `score` key in the output.	`None`
`enabled_metrics`	`List[type[BaseMetric] \| str] \| None`	A list of metric classes or names to enable from the pool. If None, all metrics in the pool are used.	`None`
`model`	`str \| ModelId \| BaseLMInvoker`	The model to use for the metrics.	`MODEL`
`model_config`	`dict[str, Any] \| None`	The model config to use for the metrics.	`None`
`model_credentials`	`str \| None`	The model credentials, used for initializing default metrics. Defaults to None. This is required if some of the default metrics are used.	`None`
`run_parallel`	`bool`	Whether to run the metrics in parallel. Defaults to False.	`True`
`rule_book`	`RuleBook \| None`	The rule book for evaluation. If not provided, a default one is generated, but only if all enabled metrics are from the default set.	`None`
`generation_rule_engine`	`GenerationRuleEngine \| None`	The generation rule engine. Defaults to a new instance with the determined rule book.	`None`
`judge`	`MultipleLLMAsJudge \| None`	Optional multiple LLM judge for ensemble evaluation. If provided, will use multiple judges instead of single model evaluation. Uses composition pattern for clean separation of concerns.	`None`

Raises:

Type	Description
`ValueError`	If `model_credentials` is not provided when using default metrics.
`ValueError`	If a custom `rule_book` is provided when using custom metrics or a mix of custom and default metrics without an explicit rule_book.

`QTEvaluator(completeness_metric=None, groundedness_metric=None, redundancy_metric=None, model=DefaultValues.MODEL, model_config=None, model_credentials=None, run_parallel=True, score_mapping=None, score_weights=None, judge=None)`

Bases: BaseEvaluator

Evaluator for query transformation tasks.

Default expected input: - query (str): The query to evaluate the completeness of the model's output. - expected_response (str): The expected response to evaluate the completeness of the model's output. - generated_response (str): The generated response to evaluate the completeness of the model's output.

Attributes:

Name	Type	Description
`completeness_metric`	`CompletenessMetric`	The completeness metric.
`hallucination_metric`	`GroundednessMetric`	The groundedness metric.
`redundancy_metric`	`RedundancyMetric`	The redundancy metric.
`run_parallel`	`bool`	Whether to run the metrics in parallel.
`score_mapping`	`dict[str, dict[int, float]]`	The score mapping.
`score_weights`	`dict[str, float]`	The score weights.

Initialize the QTEvaluator.

Parameters:

Name	Type	Description	Default
`completeness_metric`	`CompletenessMetric \| None`	The completeness metric. Defaults to built-in CompletenessMetric.	`None`
`groundedness_metric`	`GroundednessMetric \| None`	The groundedness metric. Defaults to built-in GroundednessMetric.	`None`
`redundancy_metric`	`RedundancyMetric \| None`	The redundancy metric. Defaults to built-in RedundancyMetric.	`None`
`model`	`str \| ModelId \| BaseLMInvoker`	The model to use for the metrics.	`MODEL`
`model_config`	`dict[str, Any] \| None`	The model config to use for the metrics.	`None`
`model_credentials`	`str \| None`	The model credentials. Defaults to None. This is required if some of the default metrics are used.	`None`
`run_parallel`	`bool`	Whether to run the metrics in parallel. Defaults to True.	`True`
`score_mapping`	`dict[str, dict[int, float]] \| None`	The score mapping. Defaults to None. This is required if some of the default metrics are used.	`None`
`score_weights`	`dict[str, float] \| None`	The score weights. Defaults to None.	`None`
`judge`	`MultipleLLMAsJudge \| None`	Optional multiple LLM judge for ensemble evaluation.	`None`

Raises:

Type	Description
`ValueError`	If `model_credentials` is not provided when using default metrics.

`required_fields: set[str]` `property`

Returns the required fields for the data.

Returns:

Type	Description
`set[str]`	set[str]: The required fields for the data.

Evaluator

AgentEvaluator(model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, prompt=None, use_reference=True, continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

required_fields: set[str] property

BaseEvaluator(name)

aggregate_required_fields(metrics, mode='any') staticmethod

can_evaluate_any(metrics, data) staticmethod

ensure_list_of_dicts(data, key) staticmethod

ensure_non_empty_list(data, key) staticmethod

evaluate(data) async

get_input_fields() classmethod

get_input_spec() classmethod

ClassicalRetrievalEvaluator(metrics=None, k=20)

required_fields: set[str] property

CustomEvaluator(metrics, name='custom', parallel=True)

GEvalGenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None)

GenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None)

QTEvaluator(completeness_metric=None, groundedness_metric=None, redundancy_metric=None, model=DefaultValues.MODEL, model_config=None, model_credentials=None, run_parallel=True, score_mapping=None, score_weights=None, judge=None)

required_fields: set[str] property

`AgentEvaluator(model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, prompt=None, use_reference=True, continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)`

`required_fields: set[str]` `property`

`BaseEvaluator(name)`

`aggregate_required_fields(metrics, mode='any')` `staticmethod`

`can_evaluate_any(metrics, data)` `staticmethod`

`ensure_list_of_dicts(data, key)` `staticmethod`

`ensure_non_empty_list(data, key)` `staticmethod`

`evaluate(data)` `async`

`get_input_fields()` `classmethod`

`get_input_spec()` `classmethod`

`ClassicalRetrievalEvaluator(metrics=None, k=20)`

`required_fields: set[str]` `property`

`CustomEvaluator(metrics, name='custom', parallel=True)`

`GEvalGenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None)`

`GenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None)`

`QTEvaluator(completeness_metric=None, groundedness_metric=None, redundancy_metric=None, model=DefaultValues.MODEL, model_config=None, model_credentials=None, run_parallel=True, score_mapping=None, score_weights=None, judge=None)`

`required_fields: set[str]` `property`