Evaluator
Evaluator init file.
AgentEvaluator(model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, prompt=None, use_reference=True, continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: BaseEvaluator
Evaluator for agent tasks.
This evaluator uses the LangChain AgentEvals trajectory accuracy metric to evaluate the performance of AI agents based on their execution trajectories.
Default expected input
- agent_trajectory (list[dict[str, Any]]): The agent trajectory containing the sequence of actions, tool calls, and responses.
- expected_agent_trajectory (list[dict[str, Any]] | None, optional): The expected agent trajectory for reference-based evaluation.
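
For orientation, here is one plausible call pattern, mirroring the ClassicalRetrievalEvaluator example further down. The plain-dict input and the role/content message shape are assumptions rather than confirmed API details, so adapt them to your installation.

```python
# A minimal sketch, assuming `evaluate` accepts a dict-like MetricInput with
# the fields listed above; the message schema shown here is an assumption.
evaluator = AgentEvaluator(model_credentials="your-api-key")  # hypothetical credential value

data = {
    "agent_trajectory": [
        {"role": "user", "content": "What is the weather in Jakarta?"},
        {"role": "assistant", "content": "Calling the weather tool."},
        {"role": "tool", "content": "Sunny, 31 degrees Celsius."},
        {"role": "assistant", "content": "It is sunny and 31 degrees Celsius in Jakarta."},
    ],
    "expected_agent_trajectory": [
        {"role": "user", "content": "What is the weather in Jakarta?"},
        {"role": "assistant", "content": "It is sunny and 31 degrees Celsius in Jakarta."},
    ],
}
result = await evaluator.evaluate(data)
```
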
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the evaluator. |
| trajectory_accuracy_metric | LangChainAgentTrajectoryAccuracyMetric | The metric used to evaluate agent trajectory accuracy. |
Initialize the AgentEvaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use for the trajectory accuracy metric. Defaults to DefaultValues.AGENT_EVALS_MODEL. | AGENT_EVALS_MODEL |
| model_credentials | str \| None | The model credentials. Defaults to None. This is required for the metric to function properly. | None |
| model_config | dict[str, Any] \| None | The model configuration. Defaults to None. | None |
| prompt | str \| None | Custom prompt for evaluation. If None, uses the default prompt from the metric. Defaults to None. | None |
| use_reference | bool | Whether to use expected_agent_trajectory for reference-based evaluation. Defaults to True. | True |
| continuous | bool | If True, the score is a float between 0 and 1. If False, the score is a boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes an explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[Any] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
required_fields: set[str]
property
Returns the required fields for the data.
Returns:
| Type | Description |
|---|---|
| set[str] | set[str]: The required fields for the data. |
BaseEvaluator(name)
Bases: ABC
Base class for all evaluators.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the evaluator. |
| required_fields | set[str] | The required fields for the evaluator. |
Initialize the evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the evaluator. | required |
aggregate_required_fields(metrics, mode='any')
staticmethod
Aggregate required fields from multiple metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metrics | Iterable[BaseMetric] | The metrics to aggregate from. | required |
| mode | str | The aggregation mode. Options: "union" (all fields required by any metric), "intersection" (only fields required by all metrics), "any" (empty set, no validation). Defaults to "any". | 'any' |
Returns:
| Type | Description |
|---|---|
| set[str] | set[str]: The aggregated required fields. |
Raises:
| Type | Description |
|---|---|
| ValueError | If mode is not one of the supported options. |
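
The three modes reduce to simple set operations over each metric's required_fields. The sketch below re-implements the documented semantics for illustration only; it is not the library's code, and the metric objects are stand-ins that merely expose a required_fields attribute.

```python
from typing import Iterable


def aggregate(metrics: Iterable, mode: str = "any") -> set[str]:
    """Illustrative re-implementation of the documented aggregation modes."""
    field_sets = [set(m.required_fields) for m in metrics]
    if mode == "union":
        # Every field required by at least one metric.
        return set().union(*field_sets)
    if mode == "intersection":
        # Only fields required by all metrics.
        return set.intersection(*field_sets) if field_sets else set()
    if mode == "any":
        # No validation: an empty requirement set.
        return set()
    raise ValueError(f"Unsupported mode: {mode}")
```
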
can_evaluate_any(metrics, data)
staticmethod
Check if any of the metrics can evaluate the given data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metrics | Iterable[BaseMetric] | The metrics to check. | required |
| data | MetricInput | The data to validate against. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| bool | bool | True if any metric can evaluate the data, False otherwise. |
ensure_list_of_dicts(data, key)
staticmethod
Ensure that a field in the data is a list of dictionaries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput | The data to validate. | required |
| key | str | The key to check. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the field is not a list or contains non-dictionary elements. |
ensure_non_empty_list(data, key)
staticmethod
Ensure that a field in the data is a non-empty list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput | The data to validate. | required |
| key | str | The key to check. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the field is not a list or is empty. |
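
Both helpers enforce simple shape constraints on a single field. The standalone sketch below mirrors the documented behaviour under the assumption that data behaves like a mapping; it is not the library implementation.

```python
from typing import Any, Mapping


def check_list_of_dicts(data: Mapping[str, Any], key: str) -> None:
    # Mirrors ensure_list_of_dicts: the field must be a list of dictionaries.
    value = data.get(key)
    if not isinstance(value, list) or not all(isinstance(item, dict) for item in value):
        raise ValueError(f"'{key}' must be a list of dictionaries")


def check_non_empty_list(data: Mapping[str, Any], key: str) -> None:
    # Mirrors ensure_non_empty_list: the field must be a non-empty list.
    value = data.get(key)
    if not isinstance(value, list) or not value:
        raise ValueError(f"'{key}' must be a non-empty list")
```
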
evaluate(data)
async
Evaluate the data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput | The data to be evaluated. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| EvaluateOutput | EvaluationOutput | The evaluation output. |
get_input_fields()
classmethod
Return declared input field names if input_type is provided; otherwise None.
Returns:
| Type | Description |
|---|---|
| list[str] \| None | list[str] \| None: The input fields. |
get_input_spec()
classmethod
Return structured spec for input fields if input_type is provided; otherwise None.
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] \| None | list[dict[str, Any]] \| None: The input spec. |
ClassicalRetrievalEvaluator(metrics=None, k=20)
Bases: BaseEvaluator
A class that evaluates the performance of a classical retrieval system.
Required fields:
- retrieved_chunks: The retrieved chunks with their similarity score.
- ground_truth_chunk_ids: The ground truth chunk ids.

Example:

    data = RetrievalData(
        retrieved_chunks={
            "chunk1": 0.9,
            "chunk2": 0.8,
            "chunk3": 0.7,
        },
        ground_truth_chunk_ids=["chunk1", "chunk2", "chunk3"],
    )
    evaluator = ClassicalRetrievalEvaluator()
    await evaluator.evaluate(data)
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the evaluator. |
| metrics | list[str \| ClassicalRetrievalMetric] \| None | The metrics to evaluate. |
| k | int | The number of retrieved chunks to consider. |
Initializes the evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metrics | list[str \| ClassicalRetrievalMetric] \| None | The metrics to evaluate. Defaults to all metrics. | None |
| k | int \| list[int] | The number of retrieved chunks to consider. Defaults to 20. | 20 |
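
Because k accepts either a single cutoff or a list of cutoffs, an evaluator can be configured to score retrieval at several depths in one pass. A brief sketch follows; the shape of the multi-k output is not documented on this page.

```python
# Assumes a list-valued k evaluates each cutoff in turn; verify the output
# shape against your installed version.
evaluator_at_depths = ClassicalRetrievalEvaluator(k=[5, 10, 20])
result = await evaluator_at_depths.evaluate(data)  # data as in the example above
```
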
required_fields: set[str]
property
Returns the required fields for the data.
Returns:
| Type | Description |
|---|---|
| set[str] | set[str]: The required fields for the data. |
CustomEvaluator(metrics, name='custom', parallel=True)
Bases: BaseEvaluator
Custom evaluator.
This evaluator is used to evaluate the performance of the model.
Attributes:
| Name | Type | Description |
|---|---|---|
| metrics | list[BaseMetric] | The list of metrics to evaluate. |
| name | str | The name of the evaluator. |
| parallel | bool | Whether to evaluate the metrics in parallel. |
Initialize the custom evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metrics | list[BaseMetric] | The list of metrics to evaluate. | required |
| name | str | The name of the evaluator. | 'custom' |
| parallel | bool | Whether to evaluate the metrics in parallel. | True |
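
In practice a CustomEvaluator simply wraps whatever BaseMetric instances you already have. The sketch below reuses the built-in metric classes mentioned under QTEvaluator further down; their constructor arguments are assumptions based on the evaluator parameters documented on this page, not confirmed signatures.

```python
# Hypothetical construction; the CompletenessMetric/GroundednessMetric
# signatures are assumed, not taken from this page.
metrics = [
    CompletenessMetric(model=DefaultValues.MODEL, model_credentials="your-api-key"),
    GroundednessMetric(model=DefaultValues.MODEL, model_credentials="your-api-key"),
]
evaluator = CustomEvaluator(metrics=metrics, name="my-generation-checks", parallel=True)
result = await evaluator.evaluate(data)  # data must contain the fields the metrics require
```
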
GEvalGenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None)
Bases: GenerationEvaluator
GEval Generation Evaluator.
This evaluator is used to evaluate the generation of the model.
Default expected input
- query (str): The query to evaluate the generation of the model's output.
- retrieved_context (str): The retrieved context to evaluate the generation of the model's output.
- expected_response (str): The expected response to evaluate the generation of the model's output.
- generated_response (str): The generated response to evaluate the generation of the model's output.
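
GenerationEvaluator below expects the same four fields. A rough usage sketch, assuming evaluate accepts a plain dict for these fields (this page does not state the exact input container):

```python
evaluator = GEvalGenerationEvaluator(model_credentials="your-api-key")  # hypothetical credential value

data = {
    "query": "What is the capital of France?",
    "retrieved_context": "France is a country in Western Europe. Its capital is Paris.",
    "expected_response": "The capital of France is Paris.",
    "generated_response": "Paris is the capital of France.",
}
result = await evaluator.evaluate(data)
```
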
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the evaluator. |
| metrics | List[BaseMetric] | The list of metrics to evaluate. |
| run_parallel | bool | Whether to run the metrics in parallel. |
| rule_book | RuleBook \| None | The rule book. |
| generation_rule_engine | GenerationRuleEngine \| None | The generation rule engine. |
Initialize the GEval Generation Evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metrics | List[BaseMetric] \| None | The list of metrics to evaluate. | None |
| enabled_metrics | List[type[BaseMetric] \| str] \| None | The list of enabled metrics. | None |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metrics. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metrics. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metrics. | None |
| run_parallel | bool | Whether to run the metrics in parallel. | True |
| rule_book | RuleBook \| None | The rule book. | None |
| generation_rule_engine | GenerationRuleEngine \| None | The generation rule engine. | None |
| judge | MultipleLLMAsJudge \| None | Optional multiple LLM judge for ensemble evaluation. | None |
GenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None)
Bases: BaseEvaluator
Evaluator for generation tasks.
Default expected input
- query (str): The query to evaluate the completeness of the model's output.
- retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.
- expected_response (str): The expected response to evaluate the completeness of the model's output.
- generated_response (str): The generated response to evaluate the completeness of the model's output.
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | The name of the evaluator. |
| metrics | List[BaseMetric] | The list of metrics to evaluate. |
| run_parallel | bool | Whether to run the metrics in parallel. |
| rule_book | RuleBook \| None | The rule book. |
| generation_rule_engine | GenerationRuleEngine \| None | The generation rule engine. |
| judge | MultipleLLMAsJudge \| None | Optional multiple LLM judge for ensemble evaluation. |
Initialize the GenerationEvaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metrics | List[BaseMetric] \| None | A list of metric instances to use as a base pool. If None, a default set of metrics is used. | None |
| enabled_metrics | List[type[BaseMetric] \| str] \| None | A list of metric classes or names to enable from the pool. If None, all metrics in the pool are used. | None |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metrics. | MODEL |
| model_config | dict[str, Any] \| None | The model config to use for the metrics. | None |
| model_credentials | str \| None | The model credentials, used for initializing default metrics. Defaults to None. This is required if some of the default metrics are used. | None |
| run_parallel | bool | Whether to run the metrics in parallel. Defaults to True. | True |
| rule_book | RuleBook \| None | The rule book for evaluation. If not provided, a default one is generated, but only if all enabled metrics are from the default set. | None |
| generation_rule_engine | GenerationRuleEngine \| None | The generation rule engine. Defaults to a new instance with the determined rule book. | None |
| judge | MultipleLLMAsJudge \| None | Optional multiple LLM judge for ensemble evaluation. If provided, multiple judges are used instead of single-model evaluation. Uses a composition pattern for clean separation of concerns. | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| ValueError | If a custom |
QTEvaluator(completeness_metric=None, groundedness_metric=None, redundancy_metric=None, model=DefaultValues.MODEL, model_config=None, model_credentials=None, run_parallel=True, score_mapping=None, score_weights=None, judge=None)
Bases: BaseEvaluator
Evaluator for query transformation tasks.
Default expected input:
- query (str): The query to evaluate the completeness of the model's output.
- expected_response (str): The expected response to evaluate the completeness of the model's output.
- generated_response (str): The generated response to evaluate the completeness of the model's output.
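
A rough construction sketch, again assuming a plain-dict input; score_mapping and score_weights are left at their defaults because their expected keys are not documented on this page.

```python
evaluator = QTEvaluator(model_credentials="your-api-key")  # hypothetical credential value

data = {
    "query": "Summarize the refund policy.",
    "expected_response": "Refunds are available within 30 days of purchase.",
    "generated_response": "You can get a refund up to 30 days after buying.",
}
result = await evaluator.evaluate(data)
```
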
Attributes:
| Name | Type | Description |
|---|---|---|
| completeness_metric | CompletenessMetric | The completeness metric. |
| hallucination_metric | GroundednessMetric | The groundedness metric. |
| redundancy_metric | RedundancyMetric | The redundancy metric. |
| run_parallel | bool | Whether to run the metrics in parallel. |
| score_mapping | dict[str, dict[int, float]] | The score mapping. |
| score_weights | dict[str, float] | The score weights. |
Initialize the QTEvaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| completeness_metric | CompletenessMetric \| None | The completeness metric. Defaults to the built-in CompletenessMetric. | None |
| groundedness_metric | GroundednessMetric \| None | The groundedness metric. Defaults to the built-in GroundednessMetric. | None |
| redundancy_metric | RedundancyMetric \| None | The redundancy metric. Defaults to the built-in RedundancyMetric. | None |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metrics. | MODEL |
| model_config | dict[str, Any] \| None | The model config to use for the metrics. | None |
| model_credentials | str \| None | The model credentials. Defaults to None. This is required if some of the default metrics are used. | None |
| run_parallel | bool | Whether to run the metrics in parallel. Defaults to True. | True |
| score_mapping | dict[str, dict[int, float]] \| None | The score mapping. Defaults to None. This is required if some of the default metrics are used. | None |
| score_weights | dict[str, float] \| None | The score weights. Defaults to None. | None |
| judge | MultipleLLMAsJudge \| None | Optional multiple LLM judge for ensemble evaluation. | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
required_fields: set[str]
property
Returns the required fields for the data.
Returns:
| Type | Description |
|---|---|
| set[str] | set[str]: The required fields for the data. |