DeepEval Tool Correctness
DeepEval Tool Correctness Metric Integration.
DeepEvalToolCorrectnessMetric(threshold=0.5, model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, include_reason=True, strict_mode=False, should_exact_match=False, should_consider_ordering=False, available_tools=None, evaluation_params=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: DeepEvalMetricFactory
This metric evaluates whether an agent correctly selects and calls tools based on the user query and available tool definitions.
Available Fields:
- query (str): The input query.
- generated_response (str, optional): The actual output/response.
- expected_response (str, optional): The expected output/response.
- tools_called (list[dict], optional): The tools actually called by the agent. Each dict should have 'name', 'args', and optionally 'output'. If not provided, they are extracted from agent_trajectory.
- expected_tools (list[dict], optional): The tools expected to be called. Each dict should have 'name', 'args', and optionally 'output'. If not provided, they are extracted from expected_agent_trajectory.
- agent_trajectory (list[dict], optional): Agent messages in OpenAI format (role-based, with 'user', 'assistant', and 'tool' roles). Tool calls are extracted from assistant messages that have a 'tool_calls' field.
- expected_agent_trajectory (list[dict], optional): Expected agent messages in OpenAI format, with the same structure as agent_trajectory.
- available_tools (list[dict] | None, optional): All available tools, for context. Each tool definition should have 'name', 'description', and 'parameters'.
Scoring (Continuous):
- 0.0-1.0: A scale where values closer to 1.0 mean more correct tool usage and values closer to 0.0 mean less correct tool usage.
Example Usage (with agent_trajectory):
metric = DeepEvalToolCorrectnessMetric(
    model_credentials="your-api-key",
    available_tools=[...]
)
result = await metric.measure(
    query="What is the average sales amount in the orders table?",
    agent_trajectory=[
        {"role": "user", "content": "What is the average sales amount..."},
        {
            "role": "assistant",
            "content": "",
            "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "data_checker",
                    "arguments": '{"query": "SELECT AVG(amount) as avg_sales FROM orders LIMIT 1"}'
                }
            }]
        },
        {"role": "tool", "tool_call_id": "call_1", "content": '[{"avg_sales": 250.50}]'},
        {"role": "assistant", "content": "Based on the data, the average sales amount is $250.50."}
    ],
    expected_agent_trajectory=[...],  # Same structure
    generated_response="Based on the data, the average sales amount is $250.50.",
    expected_response="The average sales amount in the orders table is $250.50."
)
print(result.score, result.reason)
Example Usage (with tools_called):
result = await metric.measure(
    query="What is 15 plus 27?",
    tools_called=[
        {"name": "calculator", "args": {"expression": "15 + 27"}, "output": "42"}
    ],
    expected_tools=[
        {"name": "calculator", "args": {"expression": "15 + 27"}, "output": "42"}
    ],
    generated_response="15 plus 27 equals 42.",
    expected_response="15 plus 27 equals 42."
)
Initializes the DeepEvalToolCorrectnessMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| threshold | `float` | The threshold to use for the metric. Defaults to 0.5. Also used as good_score for BaseEvaluator's global explanation generation. | `0.5` |
| model | `str \| ModelId \| BaseLMInvoker` | The model to use for the metric. Defaults to "openai/gpt-4.1". | `AGENT_EVALS_MODEL` |
| model_credentials | `str \| None` | The model credentials to use for the metric. Defaults to None. Required when model is a string. | `None` |
| model_config | `dict[str, Any] \| None` | The model config to use for the metric. Defaults to None. | `None` |
| include_reason | `bool` | Include reasoning in output. Defaults to True. | `True` |
| strict_mode | `bool` | Binary mode (0 or 1). Defaults to False. | `False` |
| should_exact_match | `bool` | Require exact match of tools. Defaults to False. | `False` |
| should_consider_ordering | `bool` | Consider order of tools called. Defaults to False. | `False` |
| available_tools | `list[dict[str, Any]] \| None` | List of available tool definitions for context. Each tool should have 'name', 'description', and 'parameters'. Defaults to None. | `None` |
| evaluation_params | `list[ToolCallParams] \| None` | List of strictness criteria for tool correctness. Available options are ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT. Defaults to [ToolCallParams.INPUT_PARAMETERS, ToolCallParams.OUTPUT] to validate both. | `None` |
| batch_status_check_interval | `float` | Interval in seconds between batch status checks. Defaults to 30.0. | `BATCH_STATUS_CHECK_INTERVAL` |
| batch_max_iterations | `int` | Maximum number of batch status check iterations. Defaults to 120. | `BATCH_MAX_ITERATIONS` |
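To make the effect of should_exact_match, should_consider_ordering, and strict_mode more concrete, here is a simplified, self-contained sketch of name-based tool matching. It is not the library's scoring code: the function toy_tool_score and its matching rules are assumptions made purely for illustration, and it ignores evaluation_params, available_tools, and any model-based judgement.

```python
def _lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two name lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def toy_tool_score(tools_called, expected_tools, *, should_exact_match=False,
                   should_consider_ordering=False, strict_mode=False) -> float:
    """Hypothetical illustration only; compares tool names, nothing else."""
    called = [tool["name"] for tool in tools_called]
    expected = [tool["name"] for tool in expected_tools]

    if not expected:
        # Nothing was expected: only an empty call list counts as correct.
        score = 1.0 if not called else 0.0
    elif should_exact_match:
        # Same tools, same order, same count.
        score = 1.0 if called == expected else 0.0
    elif should_consider_ordering:
        # Credit only expected tools that appear in the expected order.
        score = _lcs_length(called, expected) / len(expected)
    else:
        # Order-insensitive: each called tool can satisfy one expected tool.
        remaining = list(called)
        hits = 0
        for name in expected:
            if name in remaining:
                remaining.remove(name)
                hits += 1
        score = hits / len(expected)

    if strict_mode:
        # Binary mode: anything short of a perfect score becomes 0.
        score = 1.0 if score == 1.0 else 0.0
    return score
```

Under these assumed rules, calling `search` then `calculator` when `calculator` then `search` was expected scores 1.0 by default, 0.5 with should_consider_ordering=True, and 0.0 with should_exact_match=True.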
ToolCallParser()
ToolCallParser is used to parse tool call data into steps.
Initialize the ToolCallParser.
create_tool_call_object(steps, normalize_names=True, include_output=True)
Extract tool calls from agent steps.
This is a shared utility function used by all metric classes to extract tool calls from agent steps in a consistent manner.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| steps | `list[dict[str, Any]]` | List of agent steps with 'name', 'args', and optional 'output'. | required |
| normalize_names | `bool` | Whether to normalize tool names by removing dynamic suffixes (e.g., _xg7c). Defaults to True. | `True` |
| include_output | `bool` | Whether to include the 'output' field from each step in the ToolCall object. Defaults to True. | `True` |
Returns:
| Type | Description |
|---|---|
| `list` | List of ToolCall objects with 'name', 'input_parameters', and optionally 'output'. |
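The mapping from step dictionaries to objects with 'name', 'input_parameters', and optional 'output' can be sketched as follows. ToolCallSketch and steps_to_tool_calls are hypothetical stand-ins invented for this example, not the library's ToolCall class or method, and name normalization is left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCallSketch:
    """Hypothetical stand-in for the ToolCall object described above."""
    name: str
    input_parameters: dict[str, Any] = field(default_factory=dict)
    output: Any = None


def steps_to_tool_calls(steps: list[dict[str, Any]],
                        include_output: bool = True) -> list[ToolCallSketch]:
    """Map each step's 'args' to 'input_parameters'; keep 'output' on request."""
    calls = []
    for step in steps:
        calls.append(ToolCallSketch(
            name=step["name"],
            input_parameters=step.get("args") or {},
            output=step.get("output") if include_output else None,
        ))
    return calls


steps = [{"name": "calculator", "args": {"expression": "15 + 27"}, "output": "42"}]
with_output = steps_to_tool_calls(steps)
without_output = steps_to_tool_calls(steps, include_output=False)
```

Dropping the output this way is one plausible reason to expose include_output: it lets a comparison focus on which tool was called and with which arguments.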
normalize_tool_name(tool_name)
Normalize tool names by removing dynamic suffixes.
Tool names like 'delegate_to_sql_query_agent_xg7c' or 'delegate_to_sql_query_agent_nxle' become 'delegate_to_sql_query_agent' by removing the last underscore and 4-char suffix.
Only suffixes that look like random IDs are removed (a 4-character alphanumeric mix of letters and digits, e.g. _xg7c, _nxle, _zor2); legitimate tool name parts such as '_data' or '_name' are kept.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tool_name | `str` | The original tool name with potential suffix. | required |
Returns:
| Type | Description |
|---|---|
| `str` | Normalized tool name without dynamic suffix. |
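A minimal sketch of the suffix rule, read literally: strip a trailing underscore plus four alphanumeric characters only when those characters contain at least one letter and at least one digit. The function strip_dynamic_suffix and its regular expression are assumptions for illustration, not the library's implementation. In particular, an all-letter suffix such as _nxle is left untouched by this literal rule, so the real method presumably applies a further heuristic that is not attempted here.

```python
import re

# Trailing "_" followed by exactly four alphanumeric characters.
_SUFFIX = re.compile(r"^(?P<base>.+)_(?P<suffix>[A-Za-z0-9]{4})$")


def strip_dynamic_suffix(tool_name: str) -> str:
    """Hypothetical helper: drop a 4-char suffix that mixes letters and digits."""
    match = _SUFFIX.match(tool_name)
    if match is None:
        return tool_name
    suffix = match.group("suffix")
    has_letter = any(ch.isalpha() for ch in suffix)
    has_digit = any(ch.isdigit() for ch in suffix)
    # Keep word-like parts such as "_data" or "_name" (letters only).
    return match.group("base") if has_letter and has_digit else tool_name
```

Requiring both character classes is what keeps ordinary name parts like '_data' intact while still catching IDs like '_xg7c' and '_zor2'.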
parse(tool_calls)
Parse the tool call data into steps.
TrajectoryParser()
TrajectoryParser is used to parse trajectory data into steps.
Initialize the TrajectoryParser.
convert_trajectory_to_steps(trajectory)
Convert trajectory data to expected steps format.
This function converts agent trajectory data into a flattened list of steps. When a tool call has both a step without output and a step with output, they share the same tool_call_id to indicate they represent the same execution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| trajectory | `list[dict]` | List of trajectory messages with roles (user/assistant/tool). | required |
Returns:
| Type | Description |
|---|---|
| `list[dict]` | List of step dictionaries with tool_call_id, kind, name, args, and optional output. Steps representing the same tool call (with and without output) share the same tool_call_id. |
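The flattening described above can be sketched for an OpenAI-format trajectory as follows. This is an illustration under assumptions, not the library's code: the function name trajectory_to_steps_sketch and the 'kind' values "tool_call" and "tool_result" are invented here, since the source does not list the actual values.

```python
import json
from typing import Any


def trajectory_to_steps_sketch(trajectory: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Flatten assistant tool calls and tool results into step dicts."""
    steps: list[dict[str, Any]] = []
    pending: dict[str, tuple[str, Any]] = {}  # tool_call_id -> (name, args)

    for message in trajectory:
        role = message.get("role")
        if role == "assistant":
            for call in message.get("tool_calls") or []:
                function = call["function"]
                raw_args = function.get("arguments") or "{}"
                # OpenAI-format arguments arrive as a JSON string.
                args = json.loads(raw_args) if isinstance(raw_args, str) else raw_args
                pending[call["id"]] = (function["name"], args)
                steps.append({
                    "tool_call_id": call["id"],
                    "kind": "tool_call",  # hypothetical value
                    "name": function["name"],
                    "args": args,
                })
        elif role == "tool" and message.get("tool_call_id") in pending:
            call_id = message["tool_call_id"]
            name, args = pending[call_id]
            # Same tool_call_id as the request step, now with the output.
            steps.append({
                "tool_call_id": call_id,
                "kind": "tool_result",  # hypothetical value
                "name": name,
                "args": args,
                "output": message.get("content"),
            })
    return steps
```

Sharing the tool_call_id between the two steps lets a consumer pair each request with its result, or keep only one of the two when outputs are not needed.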
parse(trajectory)
Parse the trajectory data into steps.