Metrics

Metrics module for evaluating AI model outputs.

This module provides a comprehensive collection of evaluation metrics for assessing the quality of generated content, retrieval systems, and AI agent responses. It includes both traditional metrics and LLM-based metrics, as well as integrations with popular evaluation frameworks.

Metric categories:
  • Generation metrics: Evaluate quality of generated text (completeness, groundedness, redundancy, language consistency, refusal alignment)
  • Retrieval metrics: Assess retrieval system performance (precision, recall, accuracy)
  • Agent metrics: Evaluate AI agent behavior and responses
  • Open-source integrations: Wrappers for RAGAS, DeepEval, and LangChain evaluators

BaseMetric

Bases: ABC

Abstract class for metrics.

This class defines the interface for all metrics.

Attributes:

Name Type Description
name str

The name of the metric.

required_fields set[str]

The required fields for this metric to evaluate data.

input_type type | None

The type of the input data.

Example

Adding custom prompts to existing evaluator metrics:

import os

from gllm_evals import load_simple_qa_dataset
from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.utils.shared_functionality import inference_fn


async def main():
    # Main function with custom prompts

    # Load your dataset
    dataset = load_simple_qa_dataset()

    # Create evaluator with default metrics
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Add custom prompts polymorphically (works for any metric)
    for metric in evaluator.metrics:
        if hasattr(metric, 'name'):  # Ensure metric has name attribute
            # Add custom prompts based on metric name
            if metric.name == "geval_completeness":
                # Add domain-specific few-shot examples
                metric.additional_context += "\n\nCUSTOM MEDICAL EXAMPLES: ..."
            elif metric.name == "geval_groundedness":
                # Add grounding examples
                metric.additional_context += "\n\nMEDICAL GROUNDING EXAMPLES: ..."

    # Evaluate with custom prompts applied automatically
    results = await evaluate(
        data=dataset,
        inference_fn=inference_fn,
        evaluators=[evaluator],  # ← Custom prompts applied to metrics
    )

can_evaluate(data)

Check if this metric can evaluate the given data.

Parameters:

Name Type Description Default
data MetricInput

The input data to check.

required

Returns:

Name Type Description
bool bool

True if the metric can evaluate the data, False otherwise.
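A plausible implementation of this check, expressed in terms of the metric's required_fields attribute (a minimal sketch; the actual logic may differ, e.g. by also validating types against input_type):

```python
def can_evaluate(required_fields, data):
    # The metric can run only when every required field is present
    # and non-None in the input mapping.
    return all(data.get(field) is not None for field in required_fields)

# A completeness-style metric needs query, generated and expected responses:
can_evaluate(
    {"query", "generated_response", "expected_response"},
    {"query": "Q?", "generated_response": "A.", "expected_response": "A!"},
)  # True
```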

evaluate(data) async

Evaluate the metric on the given dataset (single item or batch).

Automatically handles batch processing by default. Subclasses can override _evaluate to accept lists for optimized batch processing.

Parameters:

Name Type Description Default
data MetricInput | list[MetricInput]

The data to evaluate the metric on. Can be a single item or a list for batch processing.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

MetricOutput | list[MetricOutput]: A dictionary mapping each metric's namespace to its scores. Returns a list if the input is a list.
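The single-vs-batch dispatch described above can be sketched with a minimal stand-in metric (SketchMetric and its length-based score are illustrative, not part of the library; real metrics typically call an LLM inside _evaluate):

```python
import asyncio

class SketchMetric:
    """Minimal stand-in showing the single-vs-batch dispatch."""
    name = "sketch"

    async def _evaluate(self, item):
        # Score one item; here just the response length for illustration.
        return {self.name: {"score": len(item["generated_response"])}}

    async def evaluate(self, data):
        # A list fans out to concurrent per-item calls and returns a list;
        # a single item is evaluated directly.
        if isinstance(data, list):
            return await asyncio.gather(*(self._evaluate(item) for item in data))
        return await self._evaluate(data)

batch_results = asyncio.run(
    SketchMetric().evaluate([{"generated_response": "abc"}, {"generated_response": "de"}])
)
```

Subclasses that can score many items in one model call would instead override _evaluate to accept the whole list.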

get_input_fields() classmethod

Return declared input field names if input_type is provided; otherwise None.

Returns:

Type Description
list[str] | None

list[str] | None: The input fields.

get_input_spec() classmethod

Return structured spec for input fields if input_type is provided; otherwise None.

Returns:

Type Description
list[dict[str, Any]] | None

list[dict[str, Any]] | None: The input spec.

get_normalized_score(raw_score)

Normalize raw score to 0-1 range based on metric's good_score and bad_score.

This method handles both:
  • Different scales (e.g., 1-3 for completeness, 0-1 for language_consistency)
  • Inverted scales (e.g., redundancy where lower is better)

Parameters:

Name Type Description Default
raw_score float

The raw score value from the metric evaluation.

required

Returns:

Name Type Description
float float

Normalized score between 0 and 1, where 1 is best and 0 is worst.

Examples:

>>> # Completeness: good=3, bad=1 (higher is better)
>>> metric.get_normalized_score(2)  # Returns 0.5
>>> # Redundancy: good=1, bad=3 (lower is better)
>>> metric.get_normalized_score(2)  # Returns 0.5 (inverted)
>>> # Language Consistency: good=1, bad=0 (already 0-1)
>>> metric.get_normalized_score(0.5)  # Returns 0.5
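The examples above are consistent with a simple linear interpolation between bad_score (mapped to 0.0) and good_score (mapped to 1.0); a sketch of that likely formula (the library's implementation may add clamping or other handling):

```python
def normalize(raw_score, good_score, bad_score):
    # Linear interpolation: bad_score -> 0.0, good_score -> 1.0.
    # Inverted scales (good < bad) work automatically because the sign
    # of (good_score - bad_score) flips.
    return (raw_score - bad_score) / (good_score - bad_score)

normalize(2, good_score=3, bad_score=1)    # completeness
normalize(2, good_score=1, bad_score=3)    # redundancy (inverted)
normalize(0.5, good_score=1, bad_score=0)  # language consistency
```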

CompletenessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LMBasedMetric

Completeness metric.

This metric is used to evaluate the completeness of the model's output compared to the expected output.

Available Fields
  • query (str): The query.
  • generated_response (str): The generated response.
  • expected_response (str): The expected response.
Scoring
  • 1-3 (Continuous): Scale where 1 is not complete, 2 is partially complete, and 3 is complete.
Cookbook Example

Please refer to example_completeness.py in the gen-ai-sdk-cookbook repository.

Initialize the CompletenessMetric class.

Parameters:

Name Type Description Default
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
prompt_builder PromptBuilder | None

The prompt builder to use for the metric. Defaults to default prompt builder.

None
response_schema ResponseSchema | None

The response schema to use for the metric. Defaults to CompletenessResponseSchema.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS

DeepEvalAnswerRelevancyMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Answer Relevancy Metric Integration.

This metric uses LLM-as-a-judge to assess whether the output is relevant to the given input.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0-1 (Continuous): A higher score indicates better answer relevancy.
Cookbook Example

Please refer to example_deepeval_answer_relevancy.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalAnswerRelevancyMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalBiasMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Bias Metric Integration.

This metric uses LLM-as-a-judge to assess whether the LLM application's output contains racial, political, or other forms of offensive bias.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more biased, closer to 0.0 means less biased.
Cookbook Example

Please refer to example_deepeval_bias.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalBiasMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalContextualPrecisionMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, evaluation_template=ContextualPrecisionTemplate)

Bases: DeepEvalMetricFactory

DeepEval Contextual Precision Metric.

Evaluates whether the retrieved context chunks that are relevant to the given query are ranked higher than irrelevant ones. A higher score indicates better contextual precision, meaning relevant context chunks appear earlier in the retrieved results.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • expected_response (str | list[str]): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better contextual precision.
Cookbook Example

Please refer to example_deepeval_contextual_precision.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalContextualPrecisionMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS
evaluation_template Type[ContextualPrecisionTemplate]

The evaluation template to use for the metric. Defaults to ContextualPrecisionTemplate. It is used to generate the reason for the metric.

ContextualPrecisionTemplate
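The rank-sensitivity described above can be illustrated with the weighted cumulative precision formula commonly used for contextual precision (a sketch of the scoring math only; the LLM judge decides which chunks count as relevant):

```python
def contextual_precision(relevance):
    # relevance: booleans per retrieved chunk, in rank order, True when the
    # judge deems that chunk relevant to the query. Each relevant chunk
    # contributes precision@k at its rank, so earlier hits weigh more.
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

contextual_precision([True, False, True])  # relevant chunks at ranks 1 and 3
contextual_precision([False, True, True])  # same chunks ranked lower scores worse
```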

DeepEvalContextualRecallMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, evaluation_template=ContextualRecallTemplate)

Bases: DeepEvalMetricFactory

DeepEval Contextual Recall Metric.

Evaluates the extent to which the retrieved context aligns with the expected output. A higher score indicates better contextual recall, meaning the retrieval system successfully found the information needed to generate the expected response.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • expected_response (str | list[str]): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better contextual recall.
Cookbook Example

Please refer to example_deepeval_contextual_recall.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalContextualRecallMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS
evaluation_template Type[ContextualRecallTemplate]

The evaluation template to use for the metric. Defaults to ContextualRecallTemplate.

ContextualRecallTemplate

DeepEvalContextualRelevancyMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Contextual Relevancy Metric.

Evaluates the overall relevance of the information presented in the retrieved context for a given query. A higher score indicates better contextual relevancy, meaning the retrieved context chunks contain less irrelevant or tangential information.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better contextual relevancy.
Cookbook Example

Please refer to example_deepeval_contextual_relevancy.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalContextualRelevancyMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalFaithfulnessMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Faithfulness Metric Integration.

This metric uses LLM-as-a-judge to assess whether the answers rely solely on the retrieved context, without hallucinating or providing misinformation.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
  • 0-1 (Continuous): A higher score indicates better faithfulness.
Cookbook Example

Please refer to example_deepeval_faithfulness.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalFaithfulnessMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalGEvalMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalMetricFactory, PromptExtractionMixin

DeepEval GEval Metric Integration.

This class wraps DeepEval's GEval class and provides a unified interface for the DeepEval library.

GEval is a versatile evaluation metric framework that can be used to create custom evaluation metrics.

Available Fields
  • query (str, optional): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated to a single string.
  • expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated to a single string.
  • expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to context in LLMTestCaseParams. If the expected retrieved context is a str, it will be converted to a list with a single element.
  • retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): Or Boolean depending on the DeepEval GEval configuration.

Initializes the DeepEvalGEvalMetric class.

Parameters:

Name Type Description Default
name str | None

The name of the metric. Defaults to None. Required if not provided via _defaults.

None
evaluation_params list[LLMTestCaseParams] | None

The evaluation parameters. Defaults to None. Required if not provided via _defaults.

None
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
criteria str | None

The criteria to use for the metric. Defaults to None.

None
evaluation_steps list[str] | None

The evaluation steps to use for the metric. Defaults to None.

None
rubric list[Rubric] | None

The rubric to use for the metric. Defaults to None.

None
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
additional_context str | None

Additional context like few-shot examples. Defaults to None.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, checks for custom prompts and processes accordingly. Items are currently processed individually; batch optimization may be added in the future.

Parameters:

Name Type Description Default
data MetricInput | list[MetricInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.

get_custom_prompt_base_name()

Get the base name for custom prompt column lookup.

For GEval metrics, removes the 'geval_' prefix to align with CSV column conventions. This fixes Issue #3 by providing polymorphic naming for GEval metrics.

Returns:

Name Type Description
str str

The base name without 'geval_' prefix (e.g., "completeness" instead of "geval_completeness").

Example

metric.name = "geval_completeness"
metric.get_custom_prompt_base_name() -> "completeness"

CSV columns expected:
  • fewshot_completeness
  • fewshot_completeness_mode
  • evaluation_step_completeness
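The prefix stripping this method performs can be sketched as follows (a minimal stand-alone version; the real method operates on the metric instance's name attribute):

```python
def custom_prompt_base_name(metric_name, prefix="geval_"):
    # Drop the 'geval_' prefix so metric names line up with CSV columns
    # such as fewshot_<base> and evaluation_step_<base>.
    if metric_name.startswith(prefix):
        return metric_name[len(prefix):]
    return metric_name

custom_prompt_base_name("geval_completeness")  # "completeness"
```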

get_full_prompt(data)

Get the full prompt that DeepEval generates for this metric.

Parameters:

Name Type Description Default
data MetricInput

The metric input.

required

Returns:

Name Type Description
str str

The complete prompt (system + user) as a string.

DeepEvalHallucinationMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Hallucination Metric Integration.

This metric uses LLM-as-a-judge to determine whether the output contains hallucinated or incorrect information based on the retrieved context.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
  • expected_retrieved_context (str | list[str]): The expected context to evaluate the metric. Similar to context in LLMTestCaseParams.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more hallucinated, closer to 0.0 means less hallucinated.
Cookbook Example

Please refer to example_deepeval_hallucination.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalHallucinationMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalJsonCorrectnessMetric(expected_schema, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval JSON Correctness Metric Integration.

This metric evaluates whether a response conforms to a specified JSON schema. It helps ensure that AI responses follow the expected JSON structure.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0.0-1.0 (Categorical): 0.0 if the response does not conform to the schema, 1.0 if it does.
Cookbook Example

Please refer to example_deepeval_json_correctness.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalJsonCorrectnessMetric class.

Parameters:

Name Type Description Default
expected_schema Type[BaseModel]

The expected schema class (not instance) for the response. Example: ExampleSchema (the class, not an instance).

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If expected_schema is not a valid BaseModel class.
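The kind of structural check this metric applies can be approximated without the library (a rough stand-in using only the standard library; the real metric validates against the full Pydantic BaseModel schema, including types):

```python
import json

def json_correctness_score(response_text, required_fields):
    # Parse the response and check that every expected top-level field
    # is present. Unparseable or non-object responses score 0.0.
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(payload, dict):
        return 0.0
    return 1.0 if all(field in payload for field in required_fields) else 0.0

json_correctness_score('{"name": "Ada", "age": 36}', ["name", "age"])  # 1.0
```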

DeepEvalMetric(metric, name)

Bases: BaseMetric

DeepEval Metric.

A wrapper for DeepEval metrics.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
  • expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
  • expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to context in LLMTestCaseParams. If the expected retrieved context is a str, it will be converted into a list with a single element.
  • retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): Or Boolean depending on the DeepEval metric.

Initializes the DeepEvalMetric class.

Parameters:

Name Type Description Default
metric BaseMetric

The DeepEval metric to wrap.

required
name str

The name of the metric.

required

DeepEvalMetricFactory(name, model, model_credentials, model_config, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, **kwargs)

Bases: DeepEvalMetric, ABC

DeepEval Metric Factory.

Abstract base class for creating DeepEval metrics with a shared model invoker.

Available Fields
  • (Dynamic): Depends on the specific DeepEval metric being created.
Scoring
  • (Dynamic): Depends on the specific DeepEval metric.

Initializes the metric, handling common model invoker creation.

Parameters:

Name Type Description Default
name str

The name for the metric.

required
model Union[str, ModelId, BaseLMInvoker]

The model identifier or an existing LM invoker instance.

required
model_credentials Optional[str]

Credentials for the model, required if model is a string.

required
model_config Optional[Dict[str, Any]]

Configuration for the model.

required
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS
**kwargs

Additional arguments for the specific DeepEval metric constructor.

{}

DeepEvalMisuseMetric(domain, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Misuse Metric Integration.

This metric evaluates whether a response constitutes inappropriate misuse of the model for the given domain. It helps ensure that AI responses do not facilitate harmful or out-of-domain use.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more misuse, closer to 0.0 means less misuse.
Cookbook Example

Please refer to example_deepeval_misuse.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalMisuseMetric class.

Parameters:

Name Type Description Default
domain str

The domain to evaluate the metric. Common domains include: "finance", "health", "legal", "personal", "investment".

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If domain is empty or contains invalid values.

DeepEvalNonAdviceMetric(advice_types, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Non-Advice Metric Integration.

This metric evaluates whether a response contains inappropriate advice types. It helps ensure that AI responses don't provide harmful or inappropriate advice.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 0.0 means more inappropriate advice, closer to 1.0 means less inappropriate advice.
Cookbook Example

Please refer to example_deepeval_non_advice.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalNonAdviceMetric class.

Parameters:

Name Type Description Default
advice_types List[str]

List of advice types to detect as inappropriate. Common types include: ["financial", "medical", "legal", "personal", "investment"].

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If advice_types is empty or contains invalid values.

DeepEvalPIILeakageMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval PII Leakage Metric Integration.

This metric uses LLM-as-a-judge to assess whether the LLM application's output contains leaked PII.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more privacy violations, closer to 0.0 means fewer privacy violations.
Cookbook Example

Please refer to example_deepeval_pii_leakage.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalPIILeakageMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalPromptAlignmentMetric(prompt_instructions, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Prompt Alignment Metric Integration.

This metric evaluates whether a response follows the instructions specified in the prompt template.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more aligned with the prompt instructions, closer to 0.0 means less aligned with the prompt instructions.
Cookbook Example

Please refer to example_deepeval_prompt_alignment.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalPromptAlignmentMetric class.

Parameters:

Name Type Description Default
prompt_instructions List[str]

A list of strings specifying the instructions you want followed in your prompt template.

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If prompt_instructions is empty or contains invalid values.

DeepEvalRoleViolationMetric(role, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Role Violation Metric Integration.

This metric evaluates whether a response contains role violations. It helps ensure that AI responses stay within their assigned role and do not behave inappropriately.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 0.0 means more role violations, closer to 1.0 means fewer role violations.
Cookbook Example

Please refer to example_deepeval_role_violation.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalRoleViolationMetric class.

Parameters:

Name Type Description Default
role str

The role to evaluate the metric. Common roles include: "helpful customer assistant", "medical insurance agent".

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If role is empty or contains invalid values.

DeepEvalToolCorrectnessMetric(threshold=0.5, model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, include_reason=True, strict_mode=False, should_exact_match=False, should_consider_ordering=False, available_tools=None, evaluation_params=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalMetricFactory

DeepEval Tool Correctness Metric.

This metric evaluates whether an agent correctly selects and calls tools based on the user query and available tool definitions.

Available Fields
  • query (str): The input query.
  • generated_response (str, optional): The actual output/response.
  • expected_response (str, optional): The expected output/response.
  • tools_called (list[dict], optional): The tools actually called by the agent. Each dict should have 'name', 'args', and optionally 'output'. If not provided, will be extracted from agent_trajectory.
  • expected_tools (list[dict], optional): The expected tools to be called. Each dict should have 'name', 'args', and optionally 'output'. If not provided, will be extracted from expected_agent_trajectory.
  • agent_trajectory (list[dict], optional): Agent messages in OpenAI format (role-based with 'user', 'assistant', 'tool' roles). Tool calls are extracted from assistant messages with 'tool_calls' field.
  • expected_agent_trajectory (list[dict], optional): Expected agent messages in OpenAI format, same structure as agent_trajectory.
  • available_tools (list[dict] | None, optional): All available tools for context. Each tool definition should have 'name', 'description', and 'parameters'.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more correct tool usage, closer to 0.0 means less correct tool usage.
Cookbook Example

Please refer to example_deepeval_tool_correctness.py in the gen-ai-sdk-cookbook repository.
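When tools_called is not provided, the metric extracts tool calls from assistant messages in the OpenAI-format agent_trajectory described above. A simplified sketch of that extraction (a hypothetical helper, not the SDK's internal implementation; the real trajectory format may carry extra fields such as call ids):

```python
from typing import Any


def extract_tools_called(agent_trajectory: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collect tool calls from assistant messages in an OpenAI-format trajectory."""
    tools_called = []
    for message in agent_trajectory:
        if message.get("role") != "assistant":
            continue
        # Tool calls live in the assistant message's 'tool_calls' field.
        for call in message.get("tool_calls", []):
            tools_called.append(
                {
                    "name": call["function"]["name"],
                    "args": call["function"]["arguments"],
                }
            )
    return tools_called


trajectory = [
    {"role": "user", "content": "What's the weather in Jakarta?"},
    {
        "role": "assistant",
        "tool_calls": [{"function": {"name": "get_weather", "arguments": {"city": "Jakarta"}}}],
    },
    {"role": "tool", "content": "31 degrees C"},
]
print(extract_tools_called(trajectory))
```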

Initializes the DeepEvalToolCorrectnessMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5. Also used as good_score for BaseEvaluator's global explanation generation.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

AGENT_EVALS_MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
include_reason bool

Include reasoning in output. Defaults to True.

True
strict_mode bool

Binary mode (0 or 1). Defaults to False.

False
should_exact_match bool

Require exact match of tools. Defaults to False.

False
should_consider_ordering bool

Consider order of tools called. Defaults to False.

False
available_tools list[dict[str, Any]] | None

List of available tool definitions for context. Each tool should have 'name', 'description', and 'parameters'. Defaults to None.

None
evaluation_params list[ToolCallParams] | None

List of strictness criteria for tool correctness. Available options are ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT. Defaults to [ToolCallParams.INPUT_PARAMETERS, ToolCallParams.OUTPUT] to validate both.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS

DeepEvalToxicityMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Toxicity Metric Integration.

This metric uses LLM-as-a-judge to assess whether a response contains toxic content.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more toxic, closer to 0.0 means less toxic.
Cookbook Example

Please refer to example_deepeval_toxicity.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalToxicityMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

GEvalCompletenessMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalGEvalMetric

GEval Completeness Metric.

This metric is used to evaluate the completeness of the generated output.

Available Fields
  • query (str): The query to evaluate the completeness of the model's output.
  • generated_response (str): The generated response to evaluate the completeness of the model's output.
  • expected_response (str): The expected response to evaluate the completeness of the model's output.
Scoring
  • 1-3 (Continuous): Scale where 1 means not complete, 2 means partially complete, and 3 means complete.
Cookbook Example

Please refer to example_geval_completeness.py in the gen-ai-sdk-cookbook repository.

GEvalContextSufficiencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalGEvalMetric

GEval Context Sufficiency Metric.

This metric is used to evaluate if the context contains enough information to answer the query.

Available Fields
  • query (str): The query to evaluate.
  • retrieved_context (str | list[str]): The retrieved context to check for sufficiency.
Scoring
  • 0-1 (Boolean): Where 0 means insufficient context and 1 means sufficient context.
Cookbook Example

Please refer to example_geval_context_sufficiency.py in the gen-ai-sdk-cookbook repository.

GEvalGroundednessMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalGEvalMetric

GEval Groundedness Metric.

This metric is used to evaluate the groundedness of the generated output.

Available Fields
  • query (str): The query to evaluate the groundedness of the model's output.
  • generated_response (str): The generated response to evaluate the groundedness of the model's output.
  • retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.
Scoring
  • 1-3 (Continuous): Scale where 1 means not grounded, 2 means partially grounded (at least one claim is grounded), and 3 means fully grounded.
Cookbook Example

Please refer to example_geval_groundedness.py in the gen-ai-sdk-cookbook repository.

GEvalLanguageConsistencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalGEvalMetric

GEval Language Consistency Metric.

This metric is used to predict whether the generated response uses the same language as the query.

Available Fields
  • query (str): The query whose language is compared against the generated response.
  • generated_response (str): The generated response whose language is compared against the query.
Scoring
  • 0-1 (Categorical): 0 means not consistent, 1 means fully consistent.
Cookbook Example

Please refer to example_geval_language_consistency.py in the gen-ai-sdk-cookbook repository.

GEvalRedundancyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalGEvalMetric

GEval Redundancy Metric.

This metric is used to evaluate the redundancy of the generated output.

Available Fields
  • query (str): The query to evaluate the redundancy of the model's output.
  • generated_response (str): The generated response to evaluate the redundancy of the model's output.
Scoring
  • 1-3 (Continuous): Scale where 1 means no redundancy, 2 means at least one redundant element, and 3 means high redundancy.
Cookbook Example

Please refer to example_geval_redundancy.py in the gen-ai-sdk-cookbook repository.

get_normalized_score(raw_score)

Normalize raw score to 0-1 range.

For redundancy
  • Score ≤ 2: Good (normalized to 1.0)
  • Score ≥ 3: Bad (normalized to 0.0)

This override handles scores outside the [good_score, bad_score] range that would otherwise produce values outside 0-1.

Parameters:

Name Type Description Default
raw_score float

The raw score value from the metric evaluation (1-3).

required

Returns:

Name Type Description
float float

Normalized score between 0 and 1, where 1 is best and 0 is worst.
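The clamping behavior described above can be sketched as a plain function. The linear formula below is an assumption inferred from the documented endpoints (good_score=2 maps to 1.0, bad_score=3 maps to 0.0), not the SDK's verbatim implementation:

```python
def normalize_redundancy_score(
    raw_score: float, good_score: float = 2.0, bad_score: float = 3.0
) -> float:
    """Map a 1-3 redundancy score to [0, 1], where 1.0 is best (no redundancy)."""
    normalized = (bad_score - raw_score) / (bad_score - good_score)
    # Clamp so raw scores outside [good_score, bad_score] stay within [0, 1].
    return max(0.0, min(1.0, normalized))


print(normalize_redundancy_score(1.0))  # 1.0 (below good_score, clamped)
print(normalize_redundancy_score(3.0))  # 0.0
```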

GEvalRefusalAlignmentMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalGEvalMetric

GEval Refusal Alignment Metric.

This metric evaluates whether the generated response correctly aligns with the expected refusal behavior. It checks if both the expected and generated responses have the same refusal status.

Available Fields
  • query (str): The query to evaluate the metric.
  • expected_response (str): The expected response to evaluate the metric.
  • generated_response (str): The generated response to evaluate the metric.
  • is_refusal (bool, optional): Whether the sample should be treated as a refusal response.
Scoring
  • 0-1 (Categorical): 0 indicates incorrect alignment, 1 indicates correct alignment.
Cookbook Example

Please refer to example_geval_refusal_alignment.py in the gen-ai-sdk-cookbook repository.
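Conceptually, the alignment check reduces to comparing the refusal status of the two responses. A toy sketch with a hypothetical keyword-based refusal detector (the actual metric uses an LLM judge, not keyword matching):

```python
def looks_like_refusal(response: str) -> bool:
    """Toy refusal detector; the real metric delegates this judgment to an LLM."""
    markers = ("i can't", "i cannot", "i'm unable", "i am unable")
    return response.lower().startswith(markers)


def refusal_alignment_score(expected_response: str, generated_response: str) -> int:
    """1 when both responses share the same refusal status, else 0."""
    return int(looks_like_refusal(expected_response) == looks_like_refusal(generated_response))


# Both refuse: aligned.
print(refusal_alignment_score("I cannot help with that.", "I can't assist with this request."))  # 1
# Expected refusal but the model answered anyway: misaligned.
print(refusal_alignment_score("I cannot help with that.", "Sure, here is how you do it."))  # 0
```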

GEvalRefusalMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalGEvalMetric

GEval Refusal Metric.

This metric is used to predict whether the expected response to the query is a refusal.

Available Fields
  • query (str): The query used to predict whether the response is a refusal.
  • expected_response (str): The expected response to classify as refusal or non-refusal.
Scoring
  • 0-1 (Categorical): 0 means not refusal, 1 means refusal.
Cookbook Example

Please refer to example_geval_refusal.py in the gen-ai-sdk-cookbook repository.

GEvalSummarizationCoherenceMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: GEvalSummarizationBaseMetric

GEval Summarization Coherence metric.

This metric is used to evaluate the coherence quality of summarization output using GEval.

Available Fields
  • input (str): Source text or transcript.
  • summary (str): Generated summary.
Scoring
  • 1-3 (Continuous): A higher score indicates better coherence.
Cookbook Example

Please refer to example_geval_summarization_coherence.py in the gen-ai-sdk-cookbook repository.

GEvalSummarizationConsistencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: GEvalSummarizationBaseMetric

GEval Summarization Consistency metric.

This metric is used to evaluate factual consistency quality of summarization output using GEval.

Available Fields
  • input (str): Source text or transcript.
  • summary (str): Generated summary.
Scoring
  • 1-3 (Continuous): A higher score indicates better consistency.
Cookbook Example

Please refer to example_geval_summarization_consistency.py in the gen-ai-sdk-cookbook repository.

GEvalSummarizationFluencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: GEvalSummarizationBaseMetric

GEval Summarization Fluency metric.

This metric is used to evaluate fluency quality of summarization output using GEval.

Available Fields
  • input (str): Source text or transcript.
  • summary (str): Generated summary.
Scoring
  • 1-3 (Continuous): A higher score indicates better fluency.
Cookbook Example

Please refer to example_geval_summarization_fluency.py in the gen-ai-sdk-cookbook repository.

GEvalSummarizationRelevanceMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: GEvalSummarizationBaseMetric

GEval Summarization Relevance metric.

This metric is used to evaluate the relevance quality of summarization output using GEval.

Available Fields
  • input (str): Source text or transcript.
  • summary (str): Generated summary.
Scoring
  • 1-3 (Continuous): A higher score indicates better relevance.
Cookbook Example

Please refer to example_geval_summarization_relevance.py in the gen-ai-sdk-cookbook repository.

GroundednessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LMBasedMetric

Groundedness metric.

This metric is used to evaluate how grounded the generated response is based on the retrieved context.

Available Fields
  • query (str): The query to evaluate.
  • generated_response (str): The generated response to evaluate.
  • retrieved_context (str): The retrieved context to evaluate.
Scoring
  • 1-3 (Continuous): Scale where 1 means not grounded, 2 means partially grounded (at least one claim is grounded), and 3 means fully grounded.
Cookbook Example

Please refer to example_groundedness.py in the gen-ai-sdk-cookbook repository.

Initialize the GroundednessMetric class.

Default expected input:
  • query (str): The query to evaluate the groundedness of the model's output.
  • retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.
  • generated_response (str): The generated response to evaluate the groundedness of the model's output.

Parameters:

Name Type Description Default
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
prompt_builder PromptBuilder | None

The prompt builder to use for the metric. Defaults to default prompt builder.

None
response_schema ResponseSchema | None

The response schema to use for the metric. Defaults to GroundednessResponseSchema.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS

LMBasedMetric(name, response_schema, prompt_builder, model=DefaultValues.MODEL, model_credentials=None, model_config=None, parse_response_fn=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: BaseMetric

A multi-purpose LM-based metric class.

This class provides a general-purpose LM-based metric. It evaluates data by combining a response schema, a prompt builder, a model, and model credentials.

Available Fields
  • (Dynamic): Depends on the prompt_builder and specific metric implementation.
Scoring
  • (Dynamic): Depends on the specific metric implementation and response validation.

Initialize the LMBasedMetric class.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
response_schema ResponseSchema

The response schema to use for the metric.

required
prompt_builder PromptBuilder

The prompt builder to use for the metric.

required
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
parse_response_fn Callable[[str | LMOutput], MetricOutput] | None

The function used to parse the LM response into a metric output. Defaults to the built-in parser.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval).

BATCH_MAX_ITERATIONS

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, uses efficient batch API when all items have the same custom prompts. Falls back to per-item processing when items have different custom prompts.

Parameters:

Name Type Description Default
data MetricInput | list[MetricInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.
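The batch fallback described above hinges on whether every item carries the same custom prompt. A simplified sketch of that check (the custom_prompt field name is illustrative, not necessarily the SDK's actual key):

```python
from typing import Any


def can_use_batch_api(items: list[dict[str, Any]]) -> bool:
    """Batch API is usable only when every item shares one custom prompt (or none)."""
    prompts = {item.get("custom_prompt") for item in items}
    return len(prompts) <= 1


items_same = [{"query": "a", "custom_prompt": "p"}, {"query": "b", "custom_prompt": "p"}]
items_diff = [{"query": "a", "custom_prompt": "p1"}, {"query": "b", "custom_prompt": "p2"}]
print(can_use_batch_api(items_same))  # True  -> efficient batch path
print(can_use_batch_api(items_diff))  # False -> per-item processing
```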

LangChainAgentEvalsLLMAsAJudgeMetric(name, prompt, model, credentials=None, config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LangChainAgentEvalsMetric

LangChain AgentEvals LLM as a Judge Metric.

A metric that uses LangChain AgentEvals to evaluate Agent as a judge.

Available Fields
  • agent_trajectory (list[dict[str, Any]]): The agent trajectory.
  • expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.
Scoring
  • 0.0-1.0 (Continuous): An evaluation score assigned by the LLM judge based on the trajectory.

Initialize the LangChainAgentEvalsLLMAsAJudgeMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
prompt str

The evaluation prompt, can be a string template, LangChain prompt template, or callable that returns a list of chat messages. Note that the default prompt allows a rubric in addition to the typical "inputs", "outputs", and "reference_outputs" parameters.

required
model str | ModelId | BaseLMInvoker

The model to use.

required
credentials str | None

The credentials to use for the model. Defaults to None.

None
config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "trajectory_accuracy".

'trajectory_accuracy'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, uses efficient batch API when all items have the same custom prompts. Falls back to per-item processing when items have different custom prompts.

Parameters:

Name Type Description Default
data MetricInput | list[MetricInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.

LangChainAgentEvalsMetric(name, evaluator)

Bases: BaseMetric

LangChain AgentEvals Metric.

A metric that uses LangChain AgentEvals to evaluate Agent.

Available Fields
  • agent_trajectory (list[dict[str, Any]]): The agent trajectory.
  • expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.
Scoring
  • 0.0-1.0 (Continuous): An evaluation score based on the trajectory.

Initialize the LangChainAgentEvalsMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
evaluator SimpleAsyncEvaluator

The evaluator to use.

required

LangChainAgentTrajectoryAccuracyMetric(model, prompt=None, model_credentials=None, model_config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, use_reference=True, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LangChainAgentEvalsLLMAsAJudgeMetric

LangChain Agent Trajectory Accuracy Metric.

A metric that uses LangChain AgentEvals to evaluate the trajectory accuracy of the agent.

Available Fields
  • agent_trajectory (list[dict[str, Any]]): The agent trajectory.
  • expected_agent_trajectory (list[dict[str, Any]] | None, optional): The expected agent trajectory.
Scoring
  • 0-1 (Continuous/Boolean): A higher score indicates better trajectory accuracy.
Cookbook Example

Please refer to example_langchain_agent_trajectory_accuracy.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainAgentTrajectoryAccuracyMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

required
prompt str | None

The prompt to use. Defaults to None.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "trajectory_accuracy".

'trajectory_accuracy'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
use_reference bool

If True, uses the expected agent trajectory to evaluate trajectory accuracy. If False, the TRAJECTORY_ACCURACY_CUSTOM_PROMPT is used instead. Defaults to True. If a custom prompt is provided, this parameter is ignored.

True
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS

LangChainConcisenessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Conciseness Metric.

A metric that uses LangChain and OpenEvals to evaluate the conciseness of the LLM.

Available Fields
  • query (str): The query to evaluate the conciseness of.
  • generated_response (str): The generated response to evaluate the conciseness of.
Scoring
  • 0-1 (Continuous/Boolean): A higher score indicates better conciseness.
Cookbook Example

Please refer to example_langchain_conciseness.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainConcisenessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

LangChainCorrectnessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Correctness Metric.

A metric that uses LangChain and OpenEvals to evaluate the correctness of the LLM.

Available Fields
  • query (str): The query that the response answers.
  • generated_response (str): The generated response whose correctness is evaluated.
  • expected_response (str): The reference response to judge correctness against.
Scoring
  • 0-1 (Continuous/Boolean): A higher score indicates better correctness.
Cookbook Example

Please refer to example_langchain_correctness.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainCorrectnessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

LangChainGroundednessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Groundedness Metric.

A metric that uses LangChain and OpenEvals to evaluate how well the LLM's response is grounded in the retrieved context.

Available Fields
  • generated_response (str | list[str]): The generated response whose groundedness is evaluated.
  • retrieved_context (str | list[str]): The retrieved context the response should be grounded in.
Scoring
  • 0-1 (Continuous/Boolean): A higher score indicates better groundedness.
Cookbook Example

Please refer to example_langchain_groundedness.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainGroundednessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

LangChainHallucinationMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Hallucination Metric.

A metric that uses LangChain and OpenEvals to detect hallucination in the LLM's response.

Available Fields
  • query (str): The query that the response answers.
  • generated_response (str): The generated response to check for hallucination.
  • expected_retrieved_context (str): The reference context to check the response against.
  • expected_response (str, optional): Additional reference information to help the judge detect hallucination.
Scoring
  • 0-1 (Continuous/Boolean): 0 indicates no hallucination, 1 indicates hallucination.
Cookbook Example

Please refer to example_langchain_hallucination.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainHallucinationMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

LangChainHelpfulnessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Helpfulness Metric.

A metric that uses LangChain and OpenEvals to evaluate the helpfulness of the LLM's response.

Available Fields
  • query (str): The query that the response answers.
  • generated_response (str): The generated response whose helpfulness is evaluated.
Scoring
  • 0-1 (Continuous/Boolean): A higher score indicates better helpfulness.
Cookbook Example

Please refer to example_langchain_helpfulness.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainHelpfulnessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

LangChainOpenEvalsLLMAsAJudgeMetric(name, prompt, model, system=None, credentials=None, config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsMetric

LangChain OpenEvals LLM as a Judge Metric.

A metric that uses LangChain and OpenEvals to evaluate the LLM as a judge.

Available Fields
  • query (str | None, optional): The query / inputs to evaluate.
  • generated_response (str | list[str] | None, optional): The generated response / outputs to evaluate.
  • expected_response (str | list[str] | None, optional): The expected response / reference outputs to evaluate.
  • expected_retrieved_context (str | list[str] | None, optional): The expected retrieved context / reference context.
  • retrieved_context (str | list[str] | None, optional): The list of retrieved contexts.
Scoring
  • 0.0-1.0 (Continuous): An evaluation score assigned by the LLM judge.

Initialize the LangChainOpenEvalsLLMAsAJudgeMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
prompt str

The evaluation prompt, can be a string template, LangChain prompt template, or callable that returns a list of chat messages.

required
model str | ModelId | BaseLMInvoker

The model to use.

required
system str | None

Optional system message to prepend to the prompt.

None
credentials str | None

The credentials to use for the model. Defaults to None.

None
config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, checks for custom prompts and processes accordingly. Currently processes items individually; batch optimization can be added in the future.

Parameters:

Name Type Description Default
data MetricInput | list[MetricInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.
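The apply/evaluate/restore lifecycle described above can be sketched in plain Python. Everything below (SketchMetric, _evaluate_single, the custom_prompt key) is invented for illustration and is not the SDK's actual internals:

```python
import asyncio


class SketchMetric:
    """Toy metric illustrating the custom-prompt lifecycle: apply the
    item's custom prompt, evaluate, then restore the original prompt so
    later evaluations are unaffected."""

    def __init__(self, name, prompt):
        self.name = name
        self.prompt = prompt

    async def _evaluate_single(self, item):
        # Stand-in for the real LLM-as-a-judge call.
        return {
            f"{self.name}_score": 1.0 if item["generated_response"] else 0.0,
            "prompt_used": self.prompt,
        }

    async def evaluate(self, data):
        items = data if isinstance(data, list) else [data]
        results = []
        for item in items:
            original_prompt = self.prompt
            try:
                # Apply a per-item custom prompt if one is provided.
                if "custom_prompt" in item:
                    self.prompt = item["custom_prompt"]
                results.append(await self._evaluate_single(item))
            finally:
                # Restore state even if evaluation raises.
                self.prompt = original_prompt
        return results if isinstance(data, list) else results[0]


metric = SketchMetric("conciseness", "Default judge prompt")
out = asyncio.run(metric.evaluate(
    {"generated_response": "Paris.", "custom_prompt": "Be strict."}))
```

Note how the `finally` clause guarantees the restore step, which is what makes per-item custom prompts safe to mix in one batch.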

LangChainOpenEvalsMetric(name, evaluator)

Bases: BaseMetric

LangChain OpenEvals Metric.

A metric that uses LangChain and OpenEvals.

Available Fields
  • query (str | None, optional): The query / inputs to evaluate.
  • generated_response (str | list[str] | None, optional): The generated response / outputs to evaluate.
  • expected_response (str | list[str] | None, optional): The expected response / reference outputs to evaluate.
  • expected_retrieved_context (str | list[str] | None, optional): The expected retrieved context / reference context.
  • retrieved_context (str | list[str] | None, optional): The list of retrieved contexts.
Scoring
  • 0.0-1.0 (Continuous): Depending on the specific OpenEval metric.

Initialize the LangChainOpenEvalsMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
evaluator Union[SimpleAsyncEvaluator, Callable[..., Awaitable[Any]]]

The evaluator to use.

required

LanguageConsistencyMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LMBasedMetric

Language Consistency Metric.

This metric is used to evaluate whether the language of the generated response is consistent with the query.

Available Fields
  • query (str): The query.
  • generated_response (list[str]): The generated response.
Scoring
  • 0-1 (Categorical): 0 means not consistent, 1 means fully consistent.
Cookbook Example

Please refer to example_language_consistency.py in the gen-ai-sdk-cookbook repository.

Initialize the LanguageConsistencyMetric class.

Default expected input:
  • query (str): The query to evaluate the language consistency of the model's output.
  • generated_response (str): The generated response to evaluate the language consistency of the model's output.

Parameters:

Name Type Description Default
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
prompt_builder PromptBuilder | None

The prompt builder to use for the metric. Defaults to default prompt builder.

None
response_schema ResponseSchema | None

The response schema to use for the metric. Defaults to LanguageConsistencyResponseSchema.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS
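As a rough intuition for what this metric checks (the SDK delegates the judgment to an LLM), here is a toy stopword-based sketch; EN_WORDS, ES_WORDS, and guess_language are invented for illustration and are far cruder than a real language-identification step:

```python
# Toy sketch of the language-consistency idea: score 1 if the query and
# the generated response appear to be in the same language, else 0.
EN_WORDS = {"the", "is", "what", "of", "and", "a", "to", "in"}
ES_WORDS = {"el", "la", "es", "que", "de", "y", "un", "en"}


def guess_language(text: str) -> str:
    # Crude stopword-overlap heuristic, English vs. Spanish only.
    tokens = set(text.lower().split())
    en, es = len(tokens & EN_WORDS), len(tokens & ES_WORDS)
    return "en" if en >= es else "es"


def language_consistency(query: str, generated_response: str) -> int:
    # Categorical 0/1 score, mirroring the metric's scoring contract.
    return int(guess_language(query) == guess_language(generated_response))
```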

PyTrecMetric(metrics=None, k=20)

Bases: BaseMetric

PyTrec Metric.

A wrapper for pytrec_eval to evaluate common Information Retrieval (IR) metrics. This metric allows you to compute various standard IR scores like NDCG, MAP, MRR, Reciprocal Rank, etc., based on retrieved chunks and ground truth chunk IDs.

Available Fields
  • retrieved_chunks (dict[str, float]): The retrieved chunk ids with their similarity score.
  • ground_truth_chunk_ids (list[str]): The ground truth chunk ids.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better retrieval performance.
Cookbook Example

Please refer to example_pytrec_metric.py in the gen-ai-sdk-cookbook repository.

Initializes the PyTrecMetric.

Parameters:

Name Type Description Default
metrics list[PyTrecEvalMetric | str] | set[PyTrecEvalMetric | str] | None

The metrics to evaluate. Defaults to all metrics.

None
k int | list[int]

The number of retrieved chunks to consider. Defaults to 20.

20
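For intuition, one of the scores PyTrecMetric can report, reciprocal rank, can be computed directly from the documented input fields. This is a plain re-implementation for illustration, not a call into pytrec_eval:

```python
def reciprocal_rank(retrieved_chunks: dict[str, float],
                    ground_truth_chunk_ids: list[str],
                    k: int = 20) -> float:
    """Reciprocal rank of the first relevant chunk within the top k."""
    # Rank chunks by similarity score, highest first.
    ranked = sorted(retrieved_chunks, key=retrieved_chunks.get, reverse=True)
    for rank, chunk_id in enumerate(ranked[:k], start=1):
        if chunk_id in ground_truth_chunk_ids:
            return 1.0 / rank
    return 0.0


# c2 (0.9) ranks first, the relevant c3 (0.7) second -> 1/2.
score = reciprocal_rank({"c1": 0.4, "c2": 0.9, "c3": 0.7}, ["c3"])
```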

RAGASMetric(metric, name=None, callbacks=None, timeout=None)

Bases: BaseMetric

RAGAS Metric.

RAGAS is a metric for evaluating the quality of RAG systems.

Available Fields
  • query (str): The query to evaluate the metric. Similar to user_input in SingleTurnSample.
  • generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to response in SingleTurnSample. If a list is provided, the responses are concatenated into a single string.
  • expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to reference in SingleTurnSample. If the expected response is a list, the responses are concatenated into a single string.
  • expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to reference_contexts in SingleTurnSample. If the expected retrieved context is a str, it will be converted into a list with a single element.
  • retrieved_context (str | list[str], optional): The retrieved context to evaluate the metric. Similar to retrieved_contexts in SingleTurnSample. If the retrieved context is a str, it will be converted into a list with a single element.
  • rubrics (dict[str, str], optional): The rubrics to evaluate the metric. Similar to rubrics in SingleTurnSample.
Scoring
  • 0.0-1.0 (Continuous): A score evaluating the RAG aspect being tested.

Initialize the RAGASMetric.

Parameters:

Name Type Description Default
metric SingleTurnMetric

The Ragas metric to use.

required
name str

The name of the metric. Defaults to the name of the wrapped Ragas metric.

None
callbacks Callbacks

The callbacks to use. Default is None.

None
timeout int

The timeout for the metric. Default is None.

None

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, uses efficient parallel processing when all items have the same custom prompts. Falls back to per-item processing when items have different custom prompts.

Parameters:

Name Type Description Default
data MetricInput | list[MetricInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.
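The batching strategy described above can be sketched as follows; evaluate_batch and judge are hypothetical helpers, not SDK functions:

```python
import asyncio


async def evaluate_batch(items, evaluate_one):
    """Parallelize when all items share one custom prompt, else fall back."""
    prompts = {item.get("custom_prompt") for item in items}
    if len(prompts) <= 1:
        # Uniform prompts: safe to run all items concurrently.
        return await asyncio.gather(*(evaluate_one(i) for i in items))
    # Mixed prompts: process sequentially so each item's prompt can be
    # applied and restored around its own evaluation.
    return [await evaluate_one(i) for i in items]


async def judge(item):
    # Stand-in for a real metric call.
    return {"score": float(len(item["generated_response"]) > 0)}


scores = asyncio.run(evaluate_batch(
    [{"generated_response": "a"}, {"generated_response": ""}], judge))
```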

RagasContextPrecisionWithoutReference(lm_model, lm_model_credentials=None, lm_model_config=None, **kwargs)

Bases: RAGASMetric

RAGAS Context Precision Metric.

Measures the proportion of relevant chunks in the retrieved contexts without requiring a ground truth reference. It evaluates whether the retrieved context chunks are actually useful for generating the provided response to the user's query.

Available Fields
  • query (str): The user query.
  • generated_response (str): The generated response.
  • retrieved_contexts (list[str]): The retrieved contexts whose precision is evaluated.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better context precision.
Cookbook Example

Please refer to example_ragas_context_precision.py in the gen-ai-sdk-cookbook repository.

Initialize the RagasContextPrecisionWithoutReference metric.

Parameters:

Name Type Description Default
lm_model str | ModelId | BaseLMInvoker

The language model to use.

required
lm_model_credentials str | None

The credentials to use for the language model. Default is None.

None
lm_model_config dict[str, Any] | None

The configuration to use for the language model. Default is None.

None
**kwargs

Additional keyword arguments to pass to the underlying Ragas context precision metric.

{}
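A simplified sketch of the idea: Ragas asks an LLM for each relevance judgment and averages precision@k over the relevant positions; the unweighted fraction and word-overlap judge below are only illustrative stand-ins:

```python
from typing import Callable


def context_precision(query: str,
                      generated_response: str,
                      retrieved_contexts: list[str],
                      is_relevant: Callable[[str, str, str], bool]) -> float:
    """Fraction of retrieved chunks judged useful for the response."""
    if not retrieved_contexts:
        return 0.0
    hits = sum(is_relevant(query, generated_response, c)
               for c in retrieved_contexts)
    return hits / len(retrieved_contexts)


# Toy judge: a chunk is "relevant" if it shares a word with the response.
judge = lambda q, r, c: bool(set(r.lower().split()) & set(c.lower().split()))
score = context_precision("Capital of France?", "paris is the capital",
                          ["paris france", "berlin germany"], judge)
```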

RagasContextRecall(lm_model, lm_model_credentials=None, lm_model_config=None, **kwargs)

Bases: RAGASMetric

RAGAS Context Recall Metric.

Measures how many of the relevant documents (or pieces of information) needed to answer the query were successfully retrieved. It evaluates the retrieval system's ability to find all the necessary context based on the generated response and the expected response.

Available Fields
  • query (str): The query the context was retrieved for.
  • generated_response (str): The generated response.
  • expected_response (str): The expected response used as the recall reference.
  • retrieved_contexts (list[str]): The retrieved contexts whose recall is evaluated.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better context recall.
Cookbook Example

Please refer to example_ragas_context_recall.py in the gen-ai-sdk-cookbook repository.

Initialize the RagasContextRecall metric.

Parameters:

Name Type Description Default
lm_model str | ModelId | BaseLMInvoker

The language model to use.

required
lm_model_credentials str | None

The credentials to use for the language model. Default is None.

None
lm_model_config dict[str, Any] | None

The configuration to use for the language model. Default is None.

None
**kwargs

Additional keyword arguments to pass to the RagasContextRecall metric.

{}
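A simplified sketch of the idea: Ragas decomposes the reference into claims with an LLM and checks attribution per claim; the sentence split and word-overlap attribution below are toy stand-ins:

```python
def context_recall(expected_response: str,
                   retrieved_contexts: list[str]) -> float:
    """Fraction of reference 'claims' supported by some retrieved context."""
    # Sentences stand in for LLM-extracted claims.
    claims = [s.strip() for s in expected_response.split(".") if s.strip()]
    if not claims:
        return 0.0

    def supported(claim: str) -> bool:
        # Word overlap stands in for LLM-judged attribution.
        words = set(claim.lower().split())
        return any(len(words & set(c.lower().split())) >= 2
                   for c in retrieved_contexts)

    return sum(supported(c) for c in claims) / len(claims)


# One of the two reference claims is supported by the context -> 0.5.
score = context_recall(
    "Paris is the capital. It hosted the 1900 Olympics.",
    ["Paris is the capital of France."])
```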

RagasFactualCorrectness(lm_model=DefaultValues.MODEL, lm_model_credentials=None, lm_model_config=None, **kwargs)

Bases: RAGASMetric

RAGAS Factual Correctness metric.

This metric evaluates the factual accuracy of the generated response against the reference.

Available Fields
  • query (str): The query.
  • generated_response (str): The generated response.
  • expected_response (str): The reference response to check factual accuracy against.
Scoring
  • 0-1 (Continuous): A higher score indicates better factual correctness.
Cookbook Example

Please refer to example_ragas_factual_correctness.py in the gen-ai-sdk-cookbook repository.

Initialize the RagasFactualCorrectness metric.

Parameters:

Name Type Description Default
lm_model str | ModelId | BaseLMInvoker

The language model to use.

MODEL
lm_model_credentials str | None

The credentials to use for the language model. Default is None.

None
lm_model_config dict[str, Any] | None

The configuration to use for the language model. Default is None.

None
**kwargs

Additional keyword arguments to pass to the RagasFactualCorrectness metric.

{}

RedundancyMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LMBasedMetric

Redundancy metric.

This metric is used to evaluate the redundancy of the model's output.

Available Fields
  • query (str): The query.
  • generated_response (str): The generated response.
Scoring
  • 1-3 (Continuous): Scale where 1 means no redundancy, 2 means at least one redundancy, and 3 means high redundancy.
Cookbook Example

Please refer to example_redundancy.py in the gen-ai-sdk-cookbook repository.

Initialize the RedundancyMetric class.

Default expected input:
  • query (str): The query to evaluate the redundancy of the model's output.
  • generated_response (str): The generated response to evaluate the redundancy of the model's output.

Parameters:

Name Type Description Default
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
prompt_builder PromptBuilder | None

The prompt builder to use for the metric. Defaults to default prompt builder.

None
response_schema ResponseSchema | None

The response schema to use for the metric. Defaults to RedundancyResponseSchema.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS

RefusalAlignmentMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LMBasedMetric

Refusal Alignment metric.

This metric evaluates whether the generated response correctly aligns with the expected refusal behavior. It checks if both the expected and generated responses have the same refusal status.

Available Fields
  • query (str): The query to evaluate the metric.
  • expected_response (str): The expected response to evaluate the metric.
  • generated_response (str): The generated response to evaluate the metric.
  • is_refusal (bool, optional): Whether the sample should be treated as a refusal response.
Scoring
  • 0-1 (Categorical): 0 indicates incorrect alignment, 1 indicates correct alignment.
Cookbook Example

Please refer to example_refusal_alignment.py in the gen-ai-sdk-cookbook repository.

Initialize the RefusalAlignmentMetric class.

Parameters:

Name Type Description Default
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
prompt_builder PromptBuilder | None

The prompt builder to use for the metric. Defaults to default prompt builder.

None
response_schema ResponseSchema | None

The response schema to use for the metric. Defaults to RefusalAlignmentResponseSchema.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS
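The alignment check itself reduces to comparing two refusal labels. A sketch of that comparison, with a keyword classifier standing in for the SDK's LLM-based refusal judgment (REFUSAL_MARKERS and looks_like_refusal are invented for illustration):

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't")


def looks_like_refusal(text: str) -> bool:
    # Toy stand-in for the LLM's refusal classification.
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_alignment(expected_response: str, generated_response: str,
                      is_refusal=None) -> int:
    """Score 1 when both responses share the same refusal status, else 0."""
    # If the dataset labels the sample, trust the label for the expected side.
    expected_refuses = (is_refusal if is_refusal is not None
                        else looks_like_refusal(expected_response))
    return int(expected_refuses == looks_like_refusal(generated_response))
```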

RefusalMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LMBasedMetric

Refusal metric.

This metric is used to evaluate the refusal of the model's output.

Available Fields
  • query (str): The query.
  • expected_response (str): The expected response.
Scoring
  • 0-1 (Categorical): 0 means not refusal, 1 means refusal.
Cookbook Example

Please refer to example_refusal.py in the gen-ai-sdk-cookbook repository.

Initialize the RefusalMetric class.

Default expected input:
  • query (str): The query to evaluate the refusal of the model's output.
  • expected_response (str): The expected response to evaluate the refusal of the model's output.

Parameters:

Name Type Description Default
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
prompt_builder PromptBuilder | None

The prompt builder to use for the metric. Defaults to default prompt builder.

None
response_schema ResponseSchema | None

The response schema to use for the metric. Defaults to RefusalResponseSchema.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS

TopKAccuracy(k=20)

Bases: BaseMetric

Top-K Accuracy Metric.

Evaluates whether the ground truth chunk IDs are present within the top K retrieved chunks. This is a boolean-style hit/miss metric averaged over the dataset; a score of 1.0 means the relevant document was always found in the top K results.

Available Fields
  • retrieved_chunks (dict[str, float]): The retrieved chunk ids with their similarity score.
  • ground_truth_chunk_ids (list[str]): The ground truth chunk ids.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better top-k accuracy.
Cookbook Example

Please refer to example_top_k_accuracy.py in the gen-ai-sdk-cookbook repository.

Initializes the TopKAccuracy.

Parameters:

Name Type Description Default
k list[int] | int

The number of retrieved chunks to consider. Defaults to 20.

20

top_k_accuracy(qrels, results)

Evaluates the top k accuracy.

Parameters:

Name Type Description Default
qrels dict[str, dict[str, int]]

The ground truth relevance of the retrieved chunks: 1 means the chunk is relevant to the query, 0 means it is not.

required
results dict[str, dict[str, float]]

The retrieved chunks with their similarity score.

required

Returns:

Type Description
dict[str, float]

dict[str, float]: The top k accuracy.

Example
qrels = {
    "q1": {"chunk1": 1, "chunk2": 1},
}
results = {"q1": {"chunk1": 0.9, "chunk2": 0.8, "chunk3": 0.7}}
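The hit/miss logic can be re-implemented in a few lines for intuition, reusing the example data above (this is an illustrative sketch, not the SDK's code):

```python
def top_k_accuracy(qrels: dict[str, dict[str, int]],
                   results: dict[str, dict[str, float]],
                   k: int = 20) -> dict[str, float]:
    """Average, over queries, whether any relevant chunk appears in the
    top-k results ranked by similarity score."""
    hits = []
    for query_id, relevance in qrels.items():
        relevant_ids = {cid for cid, label in relevance.items() if label == 1}
        # Rank retrieved chunks by similarity score, highest first.
        ranked = sorted(results.get(query_id, {}),
                        key=results.get(query_id, {}).get, reverse=True)
        hits.append(float(bool(relevant_ids & set(ranked[:k]))))
    return {f"top_{k}_accuracy": sum(hits) / len(hits) if hits else 0.0}


qrels = {"q1": {"chunk1": 1, "chunk2": 1}}
results = {"q1": {"chunk1": 0.9, "chunk2": 0.8, "chunk3": 0.7}}
# The top 20 results contain chunk1, so q1 is a hit.
```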