Agent evaluator

Agent Evaluator.

An evaluator for evaluating agent tasks using LangChain AgentEvals trajectory accuracy metric.

Authors

Apri Dwi Rachmadi (apri.d.rachmadi@gdplabs.id)

References

[1] https://github.com/langchain-ai/agentevals

`AgentEvaluator(model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, prompt=None, use_reference=True, continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)`

Bases: BaseEvaluator

Evaluator for agent tasks.

This evaluator uses the LangChain AgentEvals trajectory accuracy metric to evaluate the performance of AI agents based on their execution trajectories.

Default expected input

agent_trajectory (list[dict[str, Any]]): The agent trajectory containing the sequence of actions, tool calls, and responses.
expected_agent_trajectory (list[dict[str, Any]] | None, optional): The expected agent trajectory for reference-based evaluation.

Attributes:

Name	Type	Description
`name`	`str`	The name of the evaluator.
`trajectory_accuracy_metric`	`LangChainAgentTrajectoryAccuracyMetric`	The metric used to evaluate agent trajectory accuracy.

Initialize the AgentEvaluator.

Parameters:

Name	Type	Description	Default
`model`	`str \| ModelId \| BaseLMInvoker`	The model to use for the trajectory accuracy metric. Defaults to DefaultValues.AGENT_EVALS_MODEL.	`AGENT_EVALS_MODEL`
`model_credentials`	`str \| None`	The model credentials. Defaults to None. This is required for the metric to function properly.	`None`
`model_config`	`dict[str, Any] \| None`	The model configuration. Defaults to None.	`None`
`prompt`	`str \| None`	Custom prompt for evaluation. If None, uses the default prompt from the metric. Defaults to None.	`None`
`use_reference`	`bool`	Whether to use expected_agent_trajectory for reference-based evaluation. Defaults to True.	`True`
`continuous`	`bool`	If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.	`False`
`choices`	`list[float] \| None`	Optional list of specific float values the score must be chosen from. Defaults to None.	`None`
`use_reasoning`	`bool`	If True, includes explanation for the score in the output. Defaults to True.	`True`
`few_shot_examples`	`list[Any] \| None`	Optional list of example evaluations to append to the prompt. Defaults to None.	`None`
`batch_status_check_interval`	`float`	Time between batch status checks in seconds. Defaults to 30.0.	`BATCH_STATUS_CHECK_INTERVAL`
`batch_max_iterations`	`int`	Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval).	`BATCH_MAX_ITERATIONS`

Raises:

Type	Description
`ValueError`	If `model_credentials` is not provided.

`required_fields` `property`

Returns the required fields for the data.

Returns:

Type	Description
`set[str]`	set[str]: The required fields for the data.

Agent evaluator

required_fields property

`required_fields` `property`