Skip to content

Deepeval tool correctness

DeepEval Tool Correctness Metric Integration.

DeepEvalToolCorrectnessMetric(threshold=0.5, model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, include_reason=True, strict_mode=False, should_exact_match=False, should_consider_ordering=False, available_tools=None, evaluation_params=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalMetricFactory

DeepEval Tool Correctness Metric.

This metric evaluates whether an agent correctly selects and calls tools based on the user query and available tool definitions.

Available Fields
  • query (str): The input query.
  • generated_response (str, optional): The actual output/response.
  • expected_response (str, optional): The expected output/response.
  • tools_called (list[dict], optional): The tools actually called by the agent. Each dict should have 'name', 'args', and optionally 'output'. If not provided, will be extracted from agent_trajectory.
  • expected_tools (list[dict], optional): The expected tools to be called. Each dict should have 'name', 'args', and optionally 'output'. If not provided, will be extracted from expected_agent_trajectory.
  • agent_trajectory (list[dict], optional): Agent messages in OpenAI format (role-based with 'user', 'assistant', 'tool' roles). Tool calls are extracted from assistant messages with 'tool_calls' field.
  • expected_agent_trajectory (list[dict], optional): Expected agent messages in OpenAI format, same structure as agent_trajectory.
  • available_tools (list[dict] | None, optional): All available tools for context. Each tool definition should have 'name', 'description', and 'parameters'.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more correct tool usage, closer to 0.0 means less correct tool usage.
Cookbook Example

Please refer to example_deepeval_tool_correctness.py in the gen-ai-sdk-cookbook repository.

Initializes DeepEvalToolCorrectnessMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5. Also used as good_score for BaseEvaluator's global explanation generation.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

AGENT_EVALS_MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
include_reason bool

Include reasoning in output. Defaults to True.

True
strict_mode bool

Binary mode (0 or 1). Defaults to False.

False
should_exact_match bool

Require exact match of tools. Defaults to False.

False
should_consider_ordering bool

Consider order of tools called. Defaults to False.

False
available_tools list[dict[str, Any]] | None

List of available tool definitions for context. Each tool should have 'name', 'description', and 'parameters'. Defaults to None.

None
evaluation_params list[ToolCallParams] | None

List of strictness criteria for tool correctness. Available options are ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT. Defaults to [ToolCallParams.INPUT_PARAMETERS, ToolCallParams.OUTPUT] to validate both.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS

ToolCallParser()

ToolCallParser is used to parse tool call data into steps.

Initialize the ToolCallParser.

create_tool_call_object(steps, normalize_names=True, include_output=True)

Extract tool calls from agent steps.

This is a shared utility function used by all metric classes to extract tool calls from agent steps in a consistent manner.

Parameters:

Name Type Description Default
steps list[dict[str, Any]]

List of agent steps with 'name', 'args', and optional 'output'.

required
normalize_names bool

Whether to normalize tool names by removing dynamic suffixes (e.g., _xg7c). Defaults to True.

True
include_output bool

Whether to include the 'output' field from each step in the ToolCall object. Defaults to True.

True

Returns:

Type Description
list

List of ToolCall objects with 'name', 'input_parameters', and optionally 'output'.

normalize_tool_name(tool_name)

Normalize tool names by removing dynamic suffixes.

Tool names like 'delegate_to_sql_query_agent_xg7c' or 'delegate_to_sql_query_agent_nxle' become 'delegate_to_sql_query_agent' by removing the last underscore and 4-char suffix.

Removes suffixes that look like random IDs (4 chars alphanumeric mix with both letters and digits, e.g. _xg7c, _nxle, _zor2), not legitimate tool name parts like '_data' or '_name'.

Parameters:

Name Type Description Default
tool_name str

The original tool name with potential suffix.

required

Returns:

Type Description
str

Normalized tool name without dynamic suffix.

parse(tool_calls)

Parse the tool call data into steps.

TrajectoryParser()

TrajectoryParser is used to parse trajectory data into steps.

Initialize the TrajectoryParser.

convert_trajectory_to_steps(trajectory)

Convert trajectory data to expected steps format.

This function converts agent trajectory data into a flattened list of steps. When a tool call has both a step without output and a step with output, they share the same tool_call_id to indicate they represent the same execution.

Parameters:

Name Type Description Default
trajectory list[dict]

List of trajectory messages with roles (user/assistant/tool).

required

Returns:

Type Description
list[dict]

List of step dictionaries with tool_call_id, kind, name, args, and optional output.

list[dict]

Steps representing the same tool call (with and without output) share the same tool_call_id.

parse(trajectory)

Parse the trajectory data into steps.