Agent evaluator
Agent Evaluator.
An evaluator for evaluating agent tasks using LangChain AgentEvals trajectory accuracy metric.
References
[1] https://github.com/langchain-ai/agentevals
AgentEvaluator(model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, prompt=None, use_reference=True, continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: BaseEvaluator
Evaluator for agent tasks.
This evaluator uses the LangChain AgentEvals trajectory accuracy metric to evaluate the performance of AI agents based on their execution trajectories.
Default expected input
- agent_trajectory (list[dict[str, Any]]): The agent trajectory containing the sequence of actions, tool calls, and responses.
- expected_agent_trajectory (list[dict[str, Any]] | None, optional): The expected agent trajectory for reference-based evaluation.
Attributes:
| Name | Type | Description |
|---|---|---|
name |
str
|
The name of the evaluator. |
trajectory_accuracy_metric |
LangChainAgentTrajectoryAccuracyMetric
|
The metric used to evaluate agent trajectory accuracy. |
Initialize the AgentEvaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
model
|
str | ModelId | BaseLMInvoker
|
The model to use for the trajectory accuracy metric. Defaults to DefaultValues.AGENT_EVALS_MODEL. |
AGENT_EVALS_MODEL
|
model_credentials
|
str | None
|
The model credentials. Defaults to None. This is required for the metric to function properly. |
None
|
model_config
|
dict[str, Any] | None
|
The model configuration. Defaults to None. |
None
|
prompt
|
str | None
|
Custom prompt for evaluation. If None, uses the default prompt from the metric. Defaults to None. |
None
|
use_reference
|
bool
|
Whether to use expected_agent_trajectory for reference-based evaluation. Defaults to True. |
True
|
continuous
|
bool
|
If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. |
False
|
choices
|
list[float] | None
|
Optional list of specific float values the score must be chosen from. Defaults to None. |
None
|
use_reasoning
|
bool
|
If True, includes explanation for the score in the output. Defaults to True. |
True
|
few_shot_examples
|
list[Any] | None
|
Optional list of example evaluations to append to the prompt. Defaults to None. |
None
|
batch_status_check_interval
|
float
|
Time between batch status checks in seconds. Defaults to 30.0. |
BATCH_STATUS_CHECK_INTERVAL
|
batch_max_iterations
|
int
|
Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval). |
BATCH_MAX_ITERATIONS
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If |
required_fields
property
Returns the required fields for the data.
Returns:
| Type | Description |
|---|---|
set[str]
|
set[str]: The required fields for the data. |