Deepeval tool correctness
DeepEval Tool Correctness Metric Integration.
DeepEvalToolCorrectnessMetric(threshold=0.5, model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, include_reason=True, strict_mode=False, should_exact_match=False, should_consider_ordering=False, available_tools=None, evaluation_params=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: DeepEvalMetricFactory
DeepEval Tool Correctness Metric.
This metric evaluates whether an agent correctly selects and calls tools based on the user query and available tool definitions.
Available Fields
- query (str): The input query.
- generated_response (str, optional): The actual output/response.
- expected_response (str, optional): The expected output/response.
- tools_called (list[dict], optional): The tools actually called by the agent. Each dict should have 'name', 'args', and optionally 'output'. If not provided, will be extracted from agent_trajectory.
- expected_tools (list[dict], optional): The expected tools to be called. Each dict should have 'name', 'args', and optionally 'output'. If not provided, will be extracted from expected_agent_trajectory.
- agent_trajectory (list[dict], optional): Agent messages in OpenAI format (role-based with 'user', 'assistant', 'tool' roles). Tool calls are extracted from assistant messages with 'tool_calls' field.
- expected_agent_trajectory (list[dict], optional): Expected agent messages in OpenAI format, same structure as agent_trajectory.
- available_tools (list[dict] | None, optional): All available tools for context. Each tool definition should have 'name', 'description', and 'parameters'.
Scoring
- 0.0-1.0 (Continuous): Scale where closer to 1.0 means more correct tool usage, closer to 0.0 means less correct tool usage.
Cookbook Example
Please refer to example_deepeval_tool_correctness.py in the gen-ai-sdk-cookbook repository.
Initializes DeepEvalToolCorrectnessMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
threshold
|
float
|
The threshold to use for the metric. Defaults to 0.5. Also used as good_score for BaseEvaluator's global explanation generation. |
0.5
|
model
|
str | ModelId | BaseLMInvoker
|
The model to use for the metric. Defaults to "openai/gpt-4.1". |
AGENT_EVALS_MODEL
|
model_credentials
|
str | None
|
The model credentials to use for the metric. Defaults to None. Required when model is a string. |
None
|
model_config
|
dict[str, Any] | None
|
The model config to use for the metric. Defaults to None. |
None
|
include_reason
|
bool
|
Include reasoning in output. Defaults to True. |
True
|
strict_mode
|
bool
|
Binary mode (0 or 1). Defaults to False. |
False
|
should_exact_match
|
bool
|
Require exact match of tools. Defaults to False. |
False
|
should_consider_ordering
|
bool
|
Consider order of tools called. Defaults to False. |
False
|
available_tools
|
list[dict[str, Any]] | None
|
List of available tool definitions for context. Each tool should have 'name', 'description', and 'parameters'. Defaults to None. |
None
|
evaluation_params
|
list[ToolCallParams] | None
|
List of strictness criteria for tool correctness. Available options are ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT. Defaults to [ToolCallParams.INPUT_PARAMETERS, ToolCallParams.OUTPUT] to validate both. |
None
|
batch_status_check_interval
|
float
|
Interval in seconds between batch status checks. Defaults to 30.0. |
BATCH_STATUS_CHECK_INTERVAL
|
batch_max_iterations
|
int
|
Maximum number of batch status check iterations. Defaults to 120. |
BATCH_MAX_ITERATIONS
|
ToolCallParser()
ToolCallParser is used to parse tool call data into steps.
Initialize the ToolCallParser.
create_tool_call_object(steps, normalize_names=True, include_output=True)
Extract tool calls from agent steps.
This is a shared utility function used by all metric classes to extract tool calls from agent steps in a consistent manner.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
steps
|
list[dict[str, Any]]
|
List of agent steps with 'name', 'args', and optional 'output'. |
required |
normalize_names
|
bool
|
Whether to normalize tool names by removing dynamic suffixes (e.g., _xg7c). Defaults to True. |
True
|
include_output
|
bool
|
Whether to include the 'output' field from each step in the ToolCall object. Defaults to True. |
True
|
Returns:
| Type | Description |
|---|---|
list
|
List of ToolCall objects with 'name', 'input_parameters', and optionally 'output'. |
normalize_tool_name(tool_name)
Normalize tool names by removing dynamic suffixes.
Tool names like 'delegate_to_sql_query_agent_xg7c' or 'delegate_to_sql_query_agent_nxle' become 'delegate_to_sql_query_agent' by removing the last underscore and 4-char suffix.
Removes suffixes that look like random IDs (4 chars alphanumeric mix with both letters and digits, e.g. _xg7c, _nxle, _zor2), not legitimate tool name parts like '_data' or '_name'.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
tool_name
|
str
|
The original tool name with potential suffix. |
required |
Returns:
| Type | Description |
|---|---|
str
|
Normalized tool name without dynamic suffix. |
parse(tool_calls)
Parse the tool call data into steps.
TrajectoryParser()
TrajectoryParser is used to parse trajectory data into steps.
Initialize the TrajectoryParser.
convert_trajectory_to_steps(trajectory)
Convert trajectory data to expected steps format.
This function converts agent trajectory data into a flattened list of steps. When a tool call has both a step without output and a step with output, they share the same tool_call_id to indicate they represent the same execution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
trajectory
|
list[dict]
|
List of trajectory messages with roles (user/assistant/tool). |
required |
Returns:
| Type | Description |
|---|---|
list[dict]
|
List of step dictionaries with tool_call_id, kind, name, args, and optional output. |
list[dict]
|
Steps representing the same tool call (with and without output) share the same tool_call_id. |
parse(trajectory)
Parse the trajectory data into steps.