DeepEval Tool Correctness

DeepEval Tool Correctness Metric Integration.

`DeepEvalToolCorrectnessMetric(threshold=0.5, model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, include_reason=True, strict_mode=False, should_exact_match=False, should_consider_ordering=False, available_tools=None, evaluation_params=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)`

Bases: DeepEvalMetricFactory

This metric evaluates whether an agent correctly selects and calls tools based on the user query and available tool definitions.

Available Fields:

- query (str): The input query.
- generated_response (str, optional): The actual output/response.
- expected_response (str, optional): The expected output/response.
- tools_called (list[dict], optional): The tools actually called by the agent. Each dict should have 'name', 'args', and optionally 'output'. If not provided, they are extracted from agent_trajectory.
- expected_tools (list[dict], optional): The tools expected to be called. Each dict should have 'name', 'args', and optionally 'output'. If not provided, they are extracted from expected_agent_trajectory.
- agent_trajectory (list[dict], optional): Agent messages in OpenAI format (role-based, with 'user', 'assistant', and 'tool' roles). Tool calls are extracted from assistant messages that have a 'tool_calls' field.
- expected_agent_trajectory (list[dict], optional): Expected agent messages in OpenAI format, with the same structure as agent_trajectory.
- available_tools (list[dict] | None, optional): All available tools, provided for context. Each tool definition should have 'name', 'description', and 'parameters'.

Scoring (Continuous):

- 0.0-1.0: The closer the score is to 1.0, the more correct the tool usage; the closer it is to 0.0, the less correct.

Example Usage (with agent_trajectory):

    metric = DeepEvalToolCorrectnessMetric(
        model_credentials="your-api-key",
        available_tools=[...]
    )
    result = await metric.measure(
        query="What is the average sales amount in the orders table?",
        agent_trajectory=[
            {"role": "user", "content": "What is the average sales amount..."},
            {
                "role": "assistant",
                "content": "",
                "tool_calls": [{
                    "id": "call_1",
                    "type": "function",
                    "function": {
                        "name": "data_checker",
                        "arguments": '{"query": "SELECT AVG(amount) as avg_sales FROM orders LIMIT 1"}'
                    }
                }]
            },
            {"role": "tool", "tool_call_id": "call_1", "content": '[{"avg_sales": 250.50}]'},
            {"role": "assistant", "content": "Based on the data, the average sales amount is $250.50."}
        ],
        expected_agent_trajectory=[...],  # Same structure
        generated_response="Based on the data, the average sales amount is $250.50.",
        expected_response="The average sales amount in the orders table is $250.50."
    )
    print(result.score, result.reason)

Example Usage (with tools_called):

    result = await metric.measure(
        query="What is 15 plus 27?",
        tools_called=[
            {"name": "calculator", "args": {"expression": "15 + 27"}, "output": "42"}
        ],
        expected_tools=[
            {"name": "calculator", "args": {"expression": "15 + 27"}, "output": "42"}
        ],
        generated_response="15 plus 27 equals 42.",
        expected_response="15 plus 27 equals 42."
    )

Initializes the DeepEvalToolCorrectnessMetric class.

Parameters:

Name	Type	Description	Default
`threshold`	`float`	The threshold to use for the metric. Defaults to 0.5. Also used as good_score for BaseEvaluator's global explanation generation.	`0.5`
`model`	`str \| ModelId \| BaseLMInvoker`	The model to use for the metric. Defaults to "openai/gpt-4.1".	`AGENT_EVALS_MODEL`
`model_credentials`	`str \| None`	The model credentials to use for the metric. Defaults to None. Required when model is a string.	`None`
`model_config`	`dict[str, Any] \| None`	The model config to use for the metric. Defaults to None.	`None`
`include_reason`	`bool`	Include reasoning in output. Defaults to True.	`True`
`strict_mode`	`bool`	Binary mode: the score is either 0 or 1. Defaults to False.	`False`
`should_exact_match`	`bool`	Require the called tools to match the expected tools exactly. Defaults to False.	`False`
`should_consider_ordering`	`bool`	Take the order in which tools were called into account. Defaults to False.	`False`
`available_tools`	`list[dict[str, Any]] \| None`	List of available tool definitions for context. Each tool should have 'name', 'description', and 'parameters'. Defaults to None.	`None`
`evaluation_params`	`list[ToolCallParams] \| None`	List of strictness criteria for tool correctness. Available options are ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT. Defaults to None, in which case [ToolCallParams.INPUT_PARAMETERS, ToolCallParams.OUTPUT] is used so that both are validated.	`None`
`batch_status_check_interval`	`float`	Interval in seconds between batch status checks. Defaults to 30.0.	`BATCH_STATUS_CHECK_INTERVAL`
`batch_max_iterations`	`int`	Maximum number of batch status check iterations. Defaults to 120.	`BATCH_MAX_ITERATIONS`

`ToolCallParser()`

ToolCallParser is used to parse tool call data into steps.

Initialize the ToolCallParser.

`create_tool_call_object(steps, normalize_names=True, include_output=True)`

Extract tool calls from agent steps.

This is a shared utility function used by all metric classes to extract tool calls from agent steps in a consistent manner.

Parameters:

Name	Type	Description	Default
`steps`	`list[dict[str, Any]]`	List of agent steps with 'name', 'args', and optional 'output'.	required
`normalize_names`	`bool`	Whether to normalize tool names by removing dynamic suffixes (e.g., _xg7c). Defaults to True.	`True`
`include_output`	`bool`	Whether to include the 'output' field from each step in the ToolCall object. Defaults to True.	`True`

Returns:

Type	Description
`list`	List of ToolCall objects with 'name', 'input_parameters', and optionally 'output'.
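
The mapping described above, from step dicts with 'name', 'args', and optional 'output' to objects with 'name', 'input_parameters', and optional 'output', can be sketched roughly as follows. This is a minimal stand-in, not the library's implementation: `ToolCallSketch` and `steps_to_tool_calls` are hypothetical names, the dataclass only imitates the shape of a ToolCall object, and name normalization is left out.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCallSketch:
    """Hypothetical stand-in with the fields the documentation lists."""
    name: str
    input_parameters: dict[str, Any] = field(default_factory=dict)
    output: Any = None


def steps_to_tool_calls(
    steps: list[dict[str, Any]], include_output: bool = True
) -> list[ToolCallSketch]:
    """Turn step dicts into tool-call objects, optionally dropping outputs."""
    calls = []
    for step in steps:
        calls.append(
            ToolCallSketch(
                name=step["name"],
                # 'args' on the step becomes 'input_parameters' on the object.
                input_parameters=step.get("args") or {},
                output=step.get("output") if include_output else None,
            )
        )
    return calls


steps = [{"name": "calculator", "args": {"expression": "15 + 27"}, "output": "42"}]
with_output = steps_to_tool_calls(steps)
without_output = steps_to_tool_calls(steps, include_output=False)
```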

`normalize_tool_name(tool_name)`

Normalize tool names by removing dynamic suffixes.

Tool names like 'delegate_to_sql_query_agent_xg7c' or 'delegate_to_sql_query_agent_nxle' become 'delegate_to_sql_query_agent' by removing the last underscore and 4-char suffix.

Removes suffixes that look like random IDs (4 chars alphanumeric mix with both letters and digits, e.g. _xg7c, _nxle, _zor2), not legitimate tool name parts like '_data' or '_name'.

Parameters:

Name	Type	Description	Default
`tool_name`	`str`	The original tool name with potential suffix.	required

Returns:

Type	Description
`str`	Normalized tool name without dynamic suffix.
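
One way such a suffix heuristic could look is sketched below. It assumes the stricter reading of the rule, where a suffix is stripped only if it is exactly four ASCII alphanumeric characters containing at least one letter and at least one digit; under that assumption a letters-only suffix such as `_nxle` is left in place. `strip_dynamic_suffix` is a hypothetical name, and the actual `normalize_tool_name` may apply a different test.

```python
def strip_dynamic_suffix(tool_name: str) -> str:
    """Remove a trailing '_xxxx' that looks like a random ID (assumed rule)."""
    base, sep, suffix = tool_name.rpartition("_")
    if not sep or not base:
        # No underscore, or nothing before it: nothing to strip.
        return tool_name
    if len(suffix) != 4 or not (suffix.isascii() and suffix.isalnum()):
        return tool_name
    has_digit = any(ch.isdigit() for ch in suffix)
    has_letter = any(ch.isalpha() for ch in suffix)
    # Only a mix of letters and digits is treated as a dynamic suffix here,
    # so ordinary name parts like '_data' or '_name' are preserved.
    return base if has_digit and has_letter else tool_name


normalized = strip_dynamic_suffix("delegate_to_sql_query_agent_xg7c")
unchanged = strip_dynamic_suffix("load_data")
```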

`parse(tool_calls)`

Parse the tool call data into steps.

`TrajectoryParser()`

TrajectoryParser is used to parse trajectory data into steps.

Initialize the TrajectoryParser.

`convert_trajectory_to_steps(trajectory)`

Convert trajectory data to expected steps format.

This function converts agent trajectory data into a flattened list of steps. When a tool call has both a step without output and a step with output, they share the same tool_call_id to indicate they represent the same execution.

Parameters:

Name	Type	Description	Default
`trajectory`	`list[dict]`	List of trajectory messages with roles (user/assistant/tool).	required

Returns:

Type	Description
`list[dict]`	List of step dictionaries with tool_call_id, kind, name, args, and optional output. Steps representing the same tool call (with and without output) share the same tool_call_id.
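
As a rough illustration of the conversion described above, the sketch below walks an OpenAI-format trajectory, reads tool calls from assistant messages, and attaches the matching tool message content as the output. It is a simplified assumption-based version, not the library's code: `trajectory_to_steps` is a hypothetical name, the `"tool_call"` value for `kind` is invented for illustration, and call and result are merged into a single step rather than kept as two steps sharing a tool_call_id.

```python
import json
from typing import Any


def trajectory_to_steps(trajectory: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Flatten role-based messages into one step per tool call (assumed shape)."""
    steps: list[dict[str, Any]] = []
    by_id: dict[Any, dict[str, Any]] = {}
    for message in trajectory:
        role = message.get("role")
        if role == "assistant":
            for call in message.get("tool_calls") or []:
                function = call.get("function", {})
                raw_args = function.get("arguments") or "{}"
                # OpenAI-format arguments arrive as a JSON-encoded string.
                args = json.loads(raw_args) if isinstance(raw_args, str) else raw_args
                step = {
                    "tool_call_id": call.get("id"),
                    "kind": "tool_call",  # hypothetical value
                    "name": function.get("name"),
                    "args": args,
                }
                steps.append(step)
                by_id[step["tool_call_id"]] = step
        elif role == "tool":
            # Pair the tool result with the call that produced it.
            step = by_id.get(message.get("tool_call_id"))
            if step is not None:
                step["output"] = message.get("content")
    return steps


example_trajectory = [
    {"role": "user", "content": "What is the average sales amount..."},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {
                "name": "data_checker",
                "arguments": '{"query": "SELECT AVG(amount) as avg_sales FROM orders LIMIT 1"}',
            },
        }],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": '[{"avg_sales": 250.50}]'},
    {"role": "assistant", "content": "Based on the data, the average sales amount is $250.50."},
]
example_steps = trajectory_to_steps(example_trajectory)
```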

`parse(trajectory)`

Parse the trajectory data into steps.