Deepeval tool correctness

DeepEval Tool Correctness Metric Integration.

DeepEvalToolCorrectnessMetric(threshold=0.5, model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, include_reason=True, strict_mode=False, should_exact_match=False, should_consider_ordering=False, available_tools=None, evaluation_params=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalMetricFactory

This metric evaluates whether an agent correctly selects and calls tools based on the user query and available tool definitions.

Available Fields:

- query (str): The input query.
- generated_response (str, optional): The actual output/response.
- expected_response (str, optional): The expected output/response.
- tools_called (list[dict], optional): The tools actually called by the agent. Each dict should have 'name', 'args', and optionally 'output'. If not provided, will be extracted from agent_trajectory.
- expected_tools (list[dict], optional): The expected tools to be called. Each dict should have 'name', 'args', and optionally 'output'. If not provided, will be extracted from expected_agent_trajectory.
- agent_trajectory (list[dict], optional): Agent messages in OpenAI format (role-based with 'user', 'assistant', 'tool' roles). Tool calls are extracted from assistant messages with a 'tool_calls' field.
- expected_agent_trajectory (list[dict], optional): Expected agent messages in OpenAI format, same structure as agent_trajectory.
- available_tools (list[dict] | None, optional): All available tools for context. Each tool definition should have 'name', 'description', and 'parameters'.
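The dict shapes named in the field list can be illustrated with plain Python data. This is a minimal sketch: the `calculator` tool, the JSON-Schema layout of 'parameters', and the `missing_keys` helper are hypothetical and only serve to show which keys each entry is expected to carry.

```python
# Hypothetical tool definition. The field list asks for 'name', 'description',
# and 'parameters'; the JSON-Schema layout used for 'parameters' is an assumption.
available_tools = [
    {
        "name": "calculator",
        "description": "Evaluate an arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    }
]

# A tools_called / expected_tools entry: 'name' and 'args', plus optional 'output'.
tools_called = [
    {"name": "calculator", "args": {"expression": "15 + 27"}, "output": "42"}
]


def missing_keys(entry, required):
    """Return the required keys that are absent from a dict (hypothetical helper)."""
    return [key for key in required if key not in entry]


# Check the shapes before handing the data to a metric.
for tool in available_tools:
    print(tool["name"], missing_keys(tool, ("name", "description", "parameters")))
for call in tools_called:
    print(call["name"], missing_keys(call, ("name", "args")))
```

A check like this is cheap to run before an evaluation and catches malformed entries early.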

Scoring (Continuous):

- 0.0-1.0: Scores closer to 1.0 indicate more correct tool usage; scores closer to 0.0 indicate less correct tool usage.

Example Usage (with agent_trajectory):

    metric = DeepEvalToolCorrectnessMetric(
        model_credentials="your-api-key",
        available_tools=[...]
    )
    result = await metric.measure(
        query="What is the average sales amount in the orders table?",
        agent_trajectory=[
            {"role": "user", "content": "What is the average sales amount..."},
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [{
                    "id": "call_1",
                    "type": "function",
                    "function": {
                        "name": "data_checker",
                        "arguments": '{"query": "SELECT AVG(amount) as avg_sales FROM orders LIMIT 1"}'
                    }
                }]
            },
            {"role": "tool", "tool_call_id": "call_1", "content": '[{"avg_sales": 250.50}]'},
            {"role": "assistant", "content": "Based on the data, the average sales amount is $250.50."}
        ],
        expected_agent_trajectory=[...],  # Same structure
        generated_response="Based on the data, the average sales amount is $250.50.",
        expected_response="The average sales amount in the orders table is $250.50."
    )
    print(result.score, result.reason)

Example Usage (with tools_called):

    result = await metric.measure(
        query="What is 15 plus 27?",
        tools_called=[
            {"name": "calculator", "args": {"expression": "15 + 27"}, "output": "42"}
        ],
        expected_tools=[
            {"name": "calculator", "args": {"expression": "15 + 27"}, "output": "42"}
        ],
        generated_response="15 plus 27 equals 42.",
        expected_response="15 plus 27 equals 42."
    )

Initializes DeepEvalToolCorrectnessMetric class.

Parameters:

- threshold (float): The threshold to use for the metric. Also used as good_score for BaseEvaluator's global explanation generation. Defaults to 0.5.
- model (str | ModelId | BaseLMInvoker): The model to use for the metric. Defaults to DefaultValues.AGENT_EVALS_MODEL ("openai/gpt-4.1").
- model_credentials (str | None): The model credentials to use for the metric. Required when model is a string. Defaults to None.
- model_config (dict[str, Any] | None): The model config to use for the metric. Defaults to None.
- include_reason (bool): Include reasoning in the output. Defaults to True.
- strict_mode (bool): Binary mode (the score is 0 or 1). Defaults to False.
- should_exact_match (bool): Require an exact match of tools. Defaults to False.
- should_consider_ordering (bool): Consider the order in which tools are called. Defaults to False.
- available_tools (list[dict[str, Any]] | None): List of available tool definitions for context. Each tool should have 'name', 'description', and 'parameters'. Defaults to None.
- evaluation_params (list[ToolCallParams] | None): List of strictness criteria for tool correctness. Available options are ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT. Defaults to None, in which case [ToolCallParams.INPUT_PARAMETERS, ToolCallParams.OUTPUT] is used to validate both.
- batch_status_check_interval (float): Interval in seconds between batch status checks. Defaults to DefaultValues.BATCH_STATUS_CHECK_INTERVAL (30.0).
- batch_max_iterations (int): Maximum number of batch status check iterations. Defaults to DefaultValues.BATCH_MAX_ITERATIONS (120).
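How should_exact_match, should_consider_ordering, threshold, and strict_mode interact can be illustrated at the level of tool names. The sketch below is not DeepEval's scoring formula: `name_score`, `finalize`, and the overlap and longest-common-subsequence ratios are hypothetical stand-ins, and the strict-mode behaviour shown (binary score with the threshold raised to 1.0) is an assumption.

```python
from collections import Counter


def _lcs_len(a, b):
    """Length of the longest common subsequence of two lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def name_score(called, expected, exact=False, ordered=False):
    """Illustrative name-level score in [0, 1]; NOT the library's formula."""
    if exact:
        # Exact match: same names, same order, same count.
        return 1.0 if called == expected else 0.0
    if not expected:
        return 1.0 if not called else 0.0
    if ordered:
        # Credit only expected tools that appear in the expected order.
        return _lcs_len(called, expected) / len(expected)
    # Order ignored: multiset overlap between called and expected names.
    overlap = sum((Counter(called) & Counter(expected)).values())
    return overlap / len(expected)


def finalize(score, threshold=0.5, strict_mode=False):
    """Apply threshold and (assumed) strict-mode semantics to a raw score."""
    if strict_mode:
        score = 1.0 if score >= 1.0 else 0.0
        threshold = 1.0
    return score, score >= threshold


called = ["search", "calculator"]
expected = ["calculator", "search"]
for flags in ({}, {"ordered": True}, {"exact": True}):
    raw = name_score(called, expected, **flags)
    print(flags, raw, finalize(raw), finalize(raw, strict_mode=True))
```

In this sketch the same pair of tool lists scores differently depending on the flags, which is the practical effect to keep in mind when choosing between lenient and strict configurations.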

ToolCallParser()

ToolCallParser is used to parse tool call data into steps.

Initialize the ToolCallParser.

create_tool_call_object(steps, normalize_names=True, include_output=True)

Extract tool calls from agent steps.

This is a shared utility function used by all metric classes to extract tool calls from agent steps in a consistent manner.

Parameters:

- steps (list[dict[str, Any]]): List of agent steps with 'name', 'args', and optional 'output'. Required.
- normalize_names (bool): Whether to normalize tool names by removing dynamic suffixes (e.g., _xg7c). Defaults to True.
- include_output (bool): Whether to include the 'output' field from each step in the ToolCall object. Defaults to True.

Returns:

- list: List of ToolCall objects with 'name', 'input_parameters', and optionally 'output'.
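The step-to-ToolCall mapping described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `ToolCallStandIn` is a hypothetical dataclass standing in for the real ToolCall type, `steps_to_tool_calls` is a hypothetical name, and name normalization is left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCallStandIn:
    """Hypothetical stand-in for a ToolCall object."""
    name: str
    input_parameters: dict[str, Any] = field(default_factory=dict)
    output: Any = None


def steps_to_tool_calls(steps, include_output=True):
    """Map step dicts ('name', 'args', optional 'output') to tool-call objects."""
    calls = []
    for step in steps:
        calls.append(
            ToolCallStandIn(
                name=step["name"],
                # 'args' in the step becomes 'input_parameters' on the object.
                input_parameters=step.get("args") or {},
                # Drop the output entirely when include_output is False.
                output=step.get("output") if include_output else None,
            )
        )
    return calls


steps = [{"name": "calculator", "args": {"expression": "15 + 27"}, "output": "42"}]
print(steps_to_tool_calls(steps))
print(steps_to_tool_calls(steps, include_output=False))
```

Dropping the output is useful when only the tool selection and its arguments should be compared.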

normalize_tool_name(tool_name)

Normalize tool names by removing dynamic suffixes.

Tool names like 'delegate_to_sql_query_agent_xg7c' or 'delegate_to_sql_query_agent_nxle' become 'delegate_to_sql_query_agent' by removing the last underscore and 4-char suffix.

Only suffixes that look like random IDs are removed (four alphanumeric characters mixing letters and digits, e.g. _xg7c, _nxle, _zor2); legitimate tool name parts such as '_data' or '_name' are kept.

Parameters:

- tool_name (str): The original tool name with potential suffix. Required.

Returns:

- str: Normalized tool name without dynamic suffix.
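The letters-and-digits rule can be sketched in a few lines. This is a minimal illustration, not the library's code: `strip_dynamic_suffix` is a hypothetical name, and the sketch only strips a trailing four-character segment that contains at least one letter and at least one digit. How the real implementation tells an all-letter ID such as _nxle apart from a word such as _data is not stated here, so the sketch leaves all-letter suffixes untouched.

```python
def strip_dynamic_suffix(tool_name: str) -> str:
    """Drop a trailing '_xxxx' segment that mixes letters and digits (sketch)."""
    head, sep, tail = tool_name.rpartition("_")
    looks_random = (
        len(tail) == 4
        and tail.isalnum()
        and any(ch.isalpha() for ch in tail)
        and any(ch.isdigit() for ch in tail)
    )
    # Require a non-empty head so a name like '_ab12' is not reduced to ''.
    if sep and head and looks_random:
        return head
    return tool_name


for name in ("delegate_to_sql_query_agent_xg7c", "fetch_zor2", "load_data", "get_name"):
    print(name, "->", strip_dynamic_suffix(name))
```

Normalizing names this way lets two runs of the same agent be compared even when the framework appends a fresh suffix to a tool name on each run.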

parse(tool_calls)

Parse the tool call data into steps.

TrajectoryParser()

TrajectoryParser is used to parse trajectory data into steps.

Initialize the TrajectoryParser.

convert_trajectory_to_steps(trajectory)

Convert trajectory data to expected steps format.

This function converts agent trajectory data into a flattened list of steps. When a tool call has both a step without output and a step with output, they share the same tool_call_id to indicate they represent the same execution.

Parameters:

- trajectory (list[dict]): List of trajectory messages with roles (user/assistant/tool). Required.

Returns:

- list[dict]: List of step dictionaries with tool_call_id, kind, name, args, and optional output. Steps representing the same tool call (with and without output) share the same tool_call_id.
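The flattening described above can be sketched for OpenAI-format messages. This is an illustration under stated assumptions rather than the library's implementation: `trajectory_to_steps` is a hypothetical name, and the 'kind' values "tool_call" and "tool_result" are invented for the example, since the actual values are not documented here.

```python
import json


def trajectory_to_steps(trajectory):
    """Flatten role-based messages into step dicts (sketch, assumed 'kind' values)."""
    steps = []
    pending = {}  # tool_call_id -> (name, args) for calls awaiting a result
    for message in trajectory:
        role = message.get("role")
        if role == "assistant":
            for call in message.get("tool_calls") or []:
                function = call["function"]
                raw_args = function.get("arguments") or "{}"
                # OpenAI-format arguments arrive as a JSON string.
                args = json.loads(raw_args) if isinstance(raw_args, str) else raw_args
                pending[call["id"]] = (function["name"], args)
                steps.append({
                    "tool_call_id": call["id"],
                    "kind": "tool_call",
                    "name": function["name"],
                    "args": args,
                })
        elif role == "tool" and message.get("tool_call_id") in pending:
            name, args = pending[message["tool_call_id"]]
            # Same tool_call_id as the call step, now carrying the output.
            steps.append({
                "tool_call_id": message["tool_call_id"],
                "kind": "tool_result",
                "name": name,
                "args": args,
                "output": message.get("content"),
            })
    return steps


trajectory = [
    {"role": "user", "content": "What is 15 plus 27?"},
    {"role": "assistant", "content": "", "tool_calls": [{
        "id": "call_1", "type": "function",
        "function": {"name": "calculator", "arguments": '{"expression": "15 + 27"}'},
    }]},
    {"role": "tool", "tool_call_id": "call_1", "content": "42"},
    {"role": "assistant", "content": "15 plus 27 equals 42."},
]
for step in trajectory_to_steps(trajectory):
    print(step)
```

Sharing the tool_call_id between the two steps lets a consumer pair a call with its result without relying on message order.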

parse(trajectory)

Parse the trajectory data into steps.