Evaluator

Evaluator init file.

AgentEvaluator(model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, prompt=None, use_reference=True, continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: BaseEvaluator

Evaluator for agent tasks.

This evaluator uses the LangChain AgentEvals trajectory accuracy metric to evaluate the performance of AI agents based on their execution trajectories.

Default expected input
  • agent_trajectory (list[dict[str, Any]]): The agent trajectory containing the sequence of actions, tool calls, and responses.
  • expected_agent_trajectory (list[dict[str, Any]] | None, optional): The expected agent trajectory for reference-based evaluation.
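
A minimal usage sketch, assuming the evaluator is importable from your package's evaluator module, that the trajectory uses OpenAI-style message dicts (as in LangChain AgentEvals), and that the input can be passed as a plain dict; adapt these assumptions to your installation.

import asyncio

from your_package.evaluator import AgentEvaluator  # hypothetical import path

# Credentials are required; a ValueError is raised if they are omitted.
evaluator = AgentEvaluator(
    model_credentials="YOUR_API_KEY",
    use_reference=False,  # skip reference-based comparison in this sketch
    continuous=True,      # score as a float in [0, 1] instead of a boolean
)

data = {
    "agent_trajectory": [
        {"role": "user", "content": "What is the weather in Jakarta?"},
        {"role": "assistant", "content": "", "tool_calls": [
            {"type": "function",
             "function": {"name": "get_weather", "arguments": '{"city": "Jakarta"}'}},
        ]},
        {"role": "tool", "content": "31°C, partly cloudy"},
        {"role": "assistant", "content": "It is 31°C and partly cloudy in Jakarta."},
    ],
}

result = asyncio.run(evaluator.evaluate(data))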

Attributes:

Name Type Description
name str

The name of the evaluator.

trajectory_accuracy_metric LangChainAgentTrajectoryAccuracyMetric

The metric used to evaluate agent trajectory accuracy.

Initialize the AgentEvaluator.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use for the trajectory accuracy metric. Defaults to DefaultValues.AGENT_EVALS_MODEL.

AGENT_EVALS_MODEL
model_credentials str | None

The model credentials. Defaults to None; must be provided for the metric to function properly.

None
model_config dict[str, Any] | None

The model configuration. Defaults to None.

None
prompt str | None

Custom prompt for evaluation. If None, uses the default prompt from the metric. Defaults to None.

None
use_reference bool

Whether to use expected_agent_trajectory for reference-based evaluation. Defaults to True.

True
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[Any] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

Raises:

Type Description
ValueError

If model_credentials is not provided.

required_fields: set[str] property

Returns the required fields for the data.

Returns:

Type Description
set[str]

set[str]: The required fields for the data.

BaseEvaluator(name)

Bases: ABC

Base class for all evaluators.

Attributes:

Name Type Description
name str

The name of the evaluator.

required_fields set[str]

The required fields for the evaluator.

Initialize the evaluator.

Parameters:

Name Type Description Default
name str

The name of the evaluator.

required

aggregate_required_fields(metrics, mode='any') staticmethod

Aggregate required fields from multiple metrics.

Parameters:

Name Type Description Default
metrics Iterable[BaseMetric]

The metrics to aggregate from.

required
mode str

The aggregation mode. Options:
  • "union": all fields required by any metric.
  • "intersection": only fields required by all metrics.
  • "any": empty set (no validation).
Defaults to "any".

'any'

Returns:

Type Description
set[str]

set[str]: The aggregated required fields.

Raises:

Type Description
ValueError

If mode is not one of the supported options.
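
The three modes can be summarized with plain set operations; the sketch below uses illustrative field-name sets in place of real BaseMetric instances.

# Semantics of the three aggregation modes, using plain sets instead of
# calling BaseEvaluator.aggregate_required_fields on real BaseMetric objects.
fields_a = {"query", "generated_response"}   # fields required by metric A (illustrative)
fields_b = {"query", "retrieved_context"}    # fields required by metric B (illustrative)

union = fields_a | fields_b          # mode="union": required by any metric
intersection = fields_a & fields_b   # mode="intersection": required by all metrics
no_validation = set()                # mode="any": empty set, no validation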

can_evaluate_any(metrics, data) staticmethod

Check if any of the metrics can evaluate the given data.

Parameters:

Name Type Description Default
metrics Iterable[BaseMetric]

The metrics to check.

required
data MetricInput

The data to validate against.

required

Returns:

Name Type Description
bool bool

True if any metric can evaluate the data, False otherwise.

ensure_list_of_dicts(data, key) staticmethod

Ensure that a field in the data is a list of dictionaries.

Parameters:

Name Type Description Default
data MetricInput

The data to validate.

required
key str

The key to check.

required

Raises:

Type Description
ValueError

If the field is not a list or contains non-dictionary elements.

ensure_non_empty_list(data, key) staticmethod

Ensure that a field in the data is a non-empty list.

Parameters:

Name Type Description Default
data MetricInput

The data to validate.

required
key str

The key to check.

required

Raises:

Type Description
ValueError

If the field is not a list or is empty.
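
Together, ensure_list_of_dicts and ensure_non_empty_list perform checks along the lines of the sketch below; a plain dict stands in for MetricInput here, which is an assumption.

# Approximation of the checks these helpers perform on a single field.
data = {"agent_trajectory": [{"role": "user", "content": "hi"}]}
value = data.get("agent_trajectory")

if not isinstance(value, list) or not all(isinstance(item, dict) for item in value):
    raise ValueError("agent_trajectory must be a list of dictionaries")
if not value:
    raise ValueError("agent_trajectory must be a non-empty list")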

evaluate(data) async

Evaluate the data.

Parameters:

Name Type Description Default
data MetricInput

The data to be evaluated.

required

Returns:

Name Type Description
EvaluateOutput EvaluationOutput

The evaluation output.

get_input_fields() classmethod

Return declared input field names if input_type is provided; otherwise None.

Returns:

Type Description
list[str] | None

list[str] | None: The input fields.

get_input_spec() classmethod

Return structured spec for input fields if input_type is provided; otherwise None.

Returns:

Type Description
list[dict[str, Any]] | None

list[dict[str, Any]] | None: The input spec.

ClassicalRetrievalEvaluator(metrics=None, k=20)

Bases: BaseEvaluator

A class that evaluates the performance of a classical retrieval system.

Required fields
  • retrieved_chunks: The retrieved chunks with their similarity score.
  • ground_truth_chunk_ids: The ground truth chunk ids.

Example:

data = RetrievalData(
    retrieved_chunks={
        # Retrieved chunk IDs mapped to their similarity scores
        "chunk1": 0.9,
        "chunk2": 0.8,
        "chunk3": 0.7,
    },
    ground_truth_chunk_ids=["chunk1", "chunk2", "chunk3"],
)

evaluator = ClassicalRetrievalEvaluator()
await evaluator.evaluate(data)  # call from within an async context

Attributes:

Name Type Description
name str

The name of the evaluator.

metrics list[str | ClassicalRetrievalMetric] | None

The metrics to evaluate.

k int

The number of retrieved chunks to consider.

Initializes the evaluator.

Parameters:

Name Type Description Default
metrics list[str | ClassicalRetrievalMetric] | None

The metrics to evaluate. Defaults to all metrics.

None
k int | list[int]

The number of retrieved chunks to consider. Defaults to 20.

20
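
Since k accepts either an int or a list of ints, a single evaluator can presumably report metrics at several cutoffs; a brief sketch using the default metric set:

# k as a list is assumed to evaluate at multiple cutoffs (k=5, 10, and 20).
evaluator = ClassicalRetrievalEvaluator(k=[5, 10, 20])
await evaluator.evaluate(data)  # reusing `data` from the example above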

required_fields: set[str] property

Returns the required fields for the data.

Returns:

Type Description
set[str]

set[str]: The required fields for the data.

CustomEvaluator(metrics, name='custom', parallel=True)

Bases: BaseEvaluator

Custom evaluator.

This evaluator evaluates the model's output using a user-supplied list of metrics.

Attributes:

Name Type Description
metrics list[BaseMetric]

The list of metrics to evaluate.

name str

The name of the evaluator.

parallel bool

Whether to evaluate the metrics in parallel.

Initialize the custom evaluator.

Parameters:

Name Type Description Default
metrics list[BaseMetric]

The list of metrics to evaluate.

required
name str

The name of the evaluator.

'custom'
parallel bool

Whether to evaluate the metrics in parallel.

True
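
A construction sketch is shown below; how the built-in metrics are constructed (and their import paths) is an assumption based on the GenerationEvaluator documentation, not a confirmed signature.

# Sketch only: the CompletenessMetric and GroundednessMetric constructor
# arguments and import paths are assumptions, not confirmed signatures.
metrics = [
    CompletenessMetric(model_credentials="YOUR_API_KEY"),
    GroundednessMetric(model_credentials="YOUR_API_KEY"),
]
evaluator = CustomEvaluator(metrics=metrics, name="generation-subset", parallel=True)
await evaluator.evaluate(data)  # `data` must contain the fields these metrics require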

GEvalGenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None)

Bases: GenerationEvaluator

GEval Generation Evaluator.

This evaluator is used to evaluate the model's generated output.

Default expected input
  • query (str): The input query.
  • retrieved_context (str): The context retrieved for the query.
  • expected_response (str): The reference (expected) response.
  • generated_response (str): The model's generated response to evaluate.

Attributes:

Name Type Description
name str

The name of the evaluator.

metrics List[BaseMetric]

The list of metrics to evaluate.

run_parallel bool

Whether to run the metrics in parallel.

rule_book RuleBook | None

The rule book.

generation_rule_engine GenerationRuleEngine | None

The generation rule engine.

Initialize the GEval Generation Evaluator.

Parameters:

Name Type Description Default
metrics List[BaseMetric] | None

The list of metrics to evaluate.

None
enabled_metrics List[type[BaseMetric] | str] | None

The list of enabled metrics.

None
model str | ModelId | BaseLMInvoker

The model to use for the metrics.

MODEL
model_credentials str | None

The model credentials to use for the metrics.

None
model_config dict[str, Any] | None

The model config to use for the metrics.

None
run_parallel bool

Whether to run the metrics in parallel.

True
rule_book RuleBook | None

The rule book.

None
generation_rule_engine GenerationRuleEngine | None

The generation rule engine.

None
judge MultipleLLMAsJudge | None

Optional multiple LLM judge for ensemble evaluation.

None
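
A usage sketch, assuming the input can be passed as a plain dict containing the default expected fields (the import path is omitted):

import asyncio

# Assumes a plain dict is an acceptable MetricInput; adapt as needed.
evaluator = GEvalGenerationEvaluator(model_credentials="YOUR_API_KEY")

data = {
    "query": "What is the capital of France?",
    "retrieved_context": "Paris is the capital and most populous city of France.",
    "expected_response": "Paris.",
    "generated_response": "The capital of France is Paris.",
}

result = asyncio.run(evaluator.evaluate(data))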

GenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None)

Bases: BaseEvaluator

Evaluator for generation tasks.

Default expected input
  • query (str): The input query, used when assessing the completeness of the model's output.
  • retrieved_context (str): The retrieved context, used to assess the groundedness of the model's output.
  • expected_response (str): The expected response, used to assess the completeness of the model's output.
  • generated_response (str): The model's generated response to evaluate.

Attributes:

Name Type Description
name str

The name of the evaluator.

metrics List[BaseMetric]

The list of metrics to evaluate.

run_parallel bool

Whether to run the metrics in parallel.

rule_book RuleBook | None

The rule book.

generation_rule_engine GenerationRuleEngine | None

The generation rule engine.

judge MultipleLLMAsJudge | None

Optional multiple LLM judge for ensemble evaluation.

Initialize the GenerationEvaluator.

Parameters:

Name Type Description Default
metrics List[BaseMetric] | None

A list of metric instances to use as a base pool. If None, defaults to [CompletenessMetric, RedundancyMetric, GroundednessMetric]. Each custom metric must generate a score key in its output.

None
enabled_metrics List[type[BaseMetric] | str] | None

A list of metric classes or names to enable from the pool. If None, all metrics in the pool are used.

None
model str | ModelId | BaseLMInvoker

The model to use for the metrics.

MODEL
model_config dict[str, Any] | None

The model config to use for the metrics.

None
model_credentials str | None

The model credentials, used for initializing default metrics. Defaults to None. Required if any of the default metrics are used.

None
run_parallel bool

Whether to run the metrics in parallel. Defaults to True.

True
rule_book RuleBook | None

The rule book for evaluation. If not provided, a default one is generated, but only if all enabled metrics are from the default set.

None
generation_rule_engine GenerationRuleEngine | None

The generation rule engine. Defaults to a new instance with the determined rule book.

None
judge MultipleLLMAsJudge | None

Optional multiple LLM judge for ensemble evaluation. If provided, multiple judges are used instead of single-model evaluation, following a composition pattern for clean separation of concerns.

None

Raises:

Type Description
ValueError

If model_credentials is not provided when using default metrics.

ValueError

If custom metrics, or a mix of custom and default metrics, are used without an explicit rule_book.
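
The sketch below narrows the default metric pool with enabled_metrics; whether the string form must match the metric class names exactly is an assumption (the metric classes themselves can be passed instead).

# enabled_metrics narrows the default pool (CompletenessMetric, RedundancyMetric,
# GroundednessMetric); the string names here are assumed to match the classes.
evaluator = GenerationEvaluator(
    model_credentials="YOUR_API_KEY",   # required when default metrics are used
    enabled_metrics=["CompletenessMetric", "GroundednessMetric"],
    run_parallel=True,
)
await evaluator.evaluate(data)  # `data` as in the GEvalGenerationEvaluator example above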

QTEvaluator(completeness_metric=None, groundedness_metric=None, redundancy_metric=None, model=DefaultValues.MODEL, model_config=None, model_credentials=None, run_parallel=True, score_mapping=None, score_weights=None, judge=None)

Bases: BaseEvaluator

Evaluator for query transformation tasks.

Default expected input
  • query (str): The input query, used when assessing the completeness of the model's output.
  • expected_response (str): The expected response, used to assess the completeness of the model's output.
  • generated_response (str): The model's generated response to evaluate.

Attributes:

Name Type Description
completeness_metric CompletenessMetric

The completeness metric.

hallucination_metric GroundednessMetric

The groundedness metric.

redundancy_metric RedundancyMetric

The redundancy metric.

run_parallel bool

Whether to run the metrics in parallel.

score_mapping dict[str, dict[int, float]]

The score mapping.

score_weights dict[str, float]

The score weights.

Initialize the QTEvaluator.

Parameters:

Name Type Description Default
completeness_metric CompletenessMetric | None

The completeness metric. Defaults to built-in CompletenessMetric.

None
groundedness_metric GroundednessMetric | None

The groundedness metric. Defaults to built-in GroundednessMetric.

None
redundancy_metric RedundancyMetric | None

The redundancy metric. Defaults to built-in RedundancyMetric.

None
model str | ModelId | BaseLMInvoker

The model to use for the metrics.

MODEL
model_config dict[str, Any] | None

The model config to use for the metrics.

None
model_credentials str | None

The model credentials. Defaults to None. Required if any of the default metrics are used.

None
run_parallel bool

Whether to run the metrics in parallel. Defaults to True.

True
score_mapping dict[str, dict[int, float]] | None

The score mapping. Defaults to None. Required if any of the default metrics are used.

None
score_weights dict[str, float] | None

The score weights. Defaults to None.

None
judge MultipleLLMAsJudge | None

Optional multiple LLM judge for ensemble evaluation.

None

Raises:

Type Description
ValueError

If model_credentials is not provided when using default metrics.
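
A construction sketch; the score_weights keys below are illustrative assumptions and may not match the names the evaluator actually expects for its score mapping and weights.

# Sketch only: the weight keys are illustrative assumptions.
evaluator = QTEvaluator(
    model_credentials="YOUR_API_KEY",
    run_parallel=True,
    score_weights={"completeness": 0.4, "groundedness": 0.4, "redundancy": 0.2},
)
await evaluator.evaluate(data)  # data with query, expected_response, generated_response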

required_fields: set[str] property

Returns the required fields for the data.

Returns:

Type Description
set[str]

set[str]: The required fields for the data.