
Metrics

Metrics module.

BaseMetric

Bases: ABC

Abstract class for metrics.

This class defines the interface for all metrics.

Attributes:

Name Type Description
name str

The name of the metric.

required_fields set[str]

The required fields for this metric to evaluate data.

can_evaluate(data)

Check if this metric can evaluate the given data.

Parameters:

Name Type Description Default
data MetricInput

The input data to check.

required

Returns:

Name Type Description
bool bool

True if the metric can evaluate the data, False otherwise.

evaluate(data) async

Evaluate the metric on the given dataset.

Parameters:

Name Type Description Default
data MetricInput

The data to evaluate the metric on.

required

Returns:

Name Type Description
MetricOutput MetricOutput

A dictionary where the keys are the namespaces and the values are the scores.

get_input_fields() classmethod

Return declared input field names if input_type is provided; otherwise None.

Returns:

Type Description
list[str] | None

list[str] | None: The input fields.

get_input_spec() classmethod

Return structured spec for input fields if input_type is provided; otherwise None.

Returns:

Type Description
list[dict[str, Any]] | None

list[dict[str, Any]] | None: The input spec.
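
Example (illustrative usage pattern only; metric and data stand in for any concrete BaseMetric subclass and a MetricInput built from its required fields):

import asyncio

async def run_metric(metric, data):
    # Skip metrics whose required fields are missing from the input.
    if not metric.can_evaluate(data):
        return None
    # evaluate() is async and returns a MetricOutput mapping namespaces to scores.
    return await metric.evaluate(data)

# scores = asyncio.run(run_metric(metric, data))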

CompletenessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None)

Bases: LMBasedMetric

Completeness metric.

Attributes:

Name Type Description
name str

The name of the metric.

response_schema ResponseSchema

The response schema to use for the metric.

prompt_builder PromptBuilder

The prompt builder to use for the metric.

model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

model_credentials str

The model credentials to use for the metric.

Initialize the CompletenessMetric class.

Default expected input:

- query (str): The query to evaluate the completeness of the model's output.
- expected_response (str): The expected response to evaluate the completeness of the model's output.
- generated_response (str): The generated response to evaluate the completeness of the model's output.

Parameters:

Name Type Description Default
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
prompt_builder PromptBuilder | None

The prompt builder to use for the metric. Defaults to default prompt builder.

None
response_schema ResponseSchema | None

The response schema to use for the metric. Defaults to CompletenessResponseSchema.

None
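
Example (illustrative sketch; the import path is omitted, and the placeholder credentials and plain-dict input are assumptions to adapt to your MetricInput type):

metric = CompletenessMetric(
    model="openai/gpt-4.1",
    model_credentials="<api-key>",   # placeholder credential
)
data = {
    "query": "What is the capital of France?",
    "expected_response": "Paris is the capital of France.",
    "generated_response": "Paris.",
}  # assumed field names per the default expected input above
await metric.evaluate(data)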

DeepEvalAnswerRelevancyMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Answer Relevancy Metric Integration.

Available Fields: - query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams. - generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.

Initializes the DeepEvalAnswerRelevancyMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
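
Example (illustrative sketch; import path omitted, and the credentials and input values are assumptions):

metric = DeepEvalAnswerRelevancyMetric(
    threshold=0.7,
    model="openai/gpt-4.1",
    model_credentials="<api-key>",   # required because model is given as a string
)
data = {
    "query": "How do I reset my password?",
    "generated_response": "Open Settings > Security and choose Reset password.",
}
await metric.evaluate(data)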

DeepEvalBiasMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Bias Metric Integration.

Available Fields: - query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams. - generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.

Scoring (Continuous): - 0.0-1.0: Scale where closer to 1.0 means more biased, closer to 0.0 means less biased.

Initializes the DeepEvalBiasMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalContextualPrecisionMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval ContextualPrecision Metric Integration.

Required Fields:

- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- expected_response (str | list[str]): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.

Initializes the DeepEvalContextualPrecisionMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalContextualRecallMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval ContextualRecall Metric Integration.

Required Fields:

- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- expected_response (str | list[str]): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.

Initializes the DeepEvalContextualRecallMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalContextualRelevancyMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval ContextualRelevancy Metric Integration.

Required Fields:

- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.

Initializes the DeepEvalContextualRelevancyMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalFaithfulnessMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Faithfulness Metric Integration.

Available Fields:

- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.

Initializes the DeepEvalFaithfulnessMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalGEvalMetric(name, evaluation_params, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None)

Bases: DeepEvalMetricFactory, PromptExtractionMixin

DeepEval GEval Metric Integration.

This class is a wrapper around DeepEval's GEval class, providing a unified interface for the DeepEval library.

GEval is a versatile evaluation metric framework that can be used to create custom evaluation metrics.

Available Fields:

- query (str, optional): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
- expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to context in LLMTestCaseParams. If the expected retrieved context is a str, it will be converted into a list with a single element.
- retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.

Initializes the DeepEvalGEvalMetric class.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
evaluation_params list[LLMTestCaseParams]

The evaluation parameters.

required
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
criteria str | None

The criteria to use for the metric. Defaults to None.

None
evaluation_steps list[str] | None

The evaluation steps to use for the metric. Defaults to None.

None
rubric list[Rubric] | None

The rubric to use for the metric. Defaults to None.

None
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
additional_context str | None

Additional context like few-shot examples. Defaults to None.

None

get_full_prompt(data)

Get the full prompt that DeepEval generates for this metric.

Parameters:

Name Type Description Default
data MetricInput

The metric input.

required

Returns:

Name Type Description
str str

The complete prompt (system + user) as a string.
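
Example (illustrative sketch of a custom GEval metric; the import path for DeepEvalGEvalMetric is omitted, and the metric name, criteria, credentials, and plain-dict input are assumptions):

from deepeval.test_case import LLMTestCaseParams

metric = DeepEvalGEvalMetric(
    name="politeness",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    criteria="Judge whether the response is polite and professional.",
    model="openai/gpt-4.1",
    model_credentials="<api-key>",
    threshold=0.5,
)
data = {
    "query": "My order is late!",
    "generated_response": "I'm sorry about the delay; let me check the status for you.",
}
print(metric.get_full_prompt(data))   # inspect the prompt DeepEval will use
await metric.evaluate(data)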

DeepEvalHallucinationMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Hallucination Metric Integration.

Available Fields:

- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- expected_retrieved_context (str | list[str]): The expected context to evaluate the metric. Similar to context in LLMTestCaseParams.

Scoring (Continuous): - 0.0-1.0: Scale where closer to 1.0 means more hallucinated, closer to 0.0 means less hallucinated.

Initializes the DeepEvalHallucinationMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalJsonCorrectnessMetric(expected_schema, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval JSON Correctness Metric Integration.

This metric evaluates whether a response is valid JSON that conforms to a specified schema. It helps ensure that AI responses follow the expected JSON structure.

Available Fields: - query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams. - generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.

Scoring (Categorical): - 0: The response does not conform to the schema. - 1: The response conforms to the schema.

Initializes the DeepEvalJsonCorrectnessMetric class.

Parameters:

Name Type Description Default
expected_schema Type[BaseModel]

The expected schema class (not instance) for the response. Example: ExampleSchema (the class, not an instance).

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If expected_schema is not a valid BaseModel class.
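
Example (illustrative sketch; the schema, credentials, and input values are assumptions, and the import path for the metric is omitted):

from pydantic import BaseModel

class AnswerSchema(BaseModel):   # hypothetical schema used only for illustration
    answer: str
    confidence: float

metric = DeepEvalJsonCorrectnessMetric(
    expected_schema=AnswerSchema,    # pass the class itself, not an instance
    model="openai/gpt-4.1",
    model_credentials="<api-key>",
)
data = {
    "query": "Return the answer as JSON.",
    "generated_response": '{"answer": "Paris", "confidence": 0.98}',
}
await metric.evaluate(data)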

DeepEvalMetric(metric, name)

Bases: BaseMetric

DeepEval Metric Integration.

Attributes:

Name Type Description
metric BaseMetric

The DeepEval metric to wrap.

name str

The name of the metric.

Available Fields:

- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
- expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to context in LLMTestCaseParams. If the expected retrieved context is a str, it will be converted into a list with a single element.
- retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.

Initializes the DeepEvalMetric class.

Parameters:

Name Type Description Default
metric BaseMetric

The DeepEval metric to wrap.

required
name str

The name of the metric.

required

DeepEvalMetricFactory(name, model, model_credentials, model_config, **kwargs)

Bases: DeepEvalMetric, ABC

Abstract base class for creating DeepEval metrics with a shared model invoker.

Initializes the metric, handling common model invoker creation.

Parameters:

Name Type Description Default
name str

The name for the metric.

required
model Union[str, ModelId, BaseLMInvoker]

The model identifier or an existing LM invoker instance.

required
model_credentials Optional[str]

Credentials for the model, required if model is a string.

required
model_config Optional[Dict[str, Any]]

Configuration for the model.

required
**kwargs

Additional arguments for the specific DeepEval metric constructor.

{}

DeepEvalMisuseMetric(domain, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Misuse Metric Integration.

This metric evaluates whether a response reflects misuse of the model outside the specified domain. It helps ensure that AI responses do not enable harmful or inappropriate use of the model.

Available Fields: - query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams. - generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.

Scoring (Continuous): - 0.0-1.0: Scale where closer to 1.0 means more misuse, closer to 0.0 means less misuse.

Initializes the DeepEvalMisuseMetric class.

Parameters:

Name Type Description Default
domain str

The domain to evaluate the metric. Common domains include: "finance", "health", "legal", "personal", "investment".

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If domain is empty or contains invalid values.

DeepEvalNonAdviceMetric(advice_types, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Non-Advice Metric Integration.

This metric evaluates whether a response contains inappropriate advice types. It helps ensure that AI responses don't provide harmful or inappropriate advice.

Available Fields: - query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams. - generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.

Scoring (Continuous): - 0.0-1.0: Scale where closer to 0.0 means more inappropriate advice, closer to 1.0 means less inappropriate advice.

Initializes the DeepEvalNonAdviceMetric class.

Parameters:

Name Type Description Default
advice_types List[str]

List of advice types to detect as inappropriate. Common types include: ["financial", "medical", "legal", "personal", "investment"].

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If advice_types is empty or contains invalid values.
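
Example (illustrative sketch; import path omitted, and the advice types, credentials, and input values are assumptions):

metric = DeepEvalNonAdviceMetric(
    advice_types=["financial", "medical"],
    threshold=0.5,
    model="openai/gpt-4.1",
    model_credentials="<api-key>",
)
data = {
    "query": "Should I invest my savings in a single stock?",
    "generated_response": "I can't give financial advice, but diversification is generally considered prudent.",
}
await metric.evaluate(data)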

DeepEvalPIILeakageMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval PII Leakage Metric Integration.

Available Fields: - query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams. - generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.

Scoring (Continuous): - 0.0-1.0: Scale where closer to 1.0 means more privacy violations, closer to 0.0 means fewer privacy violations.

Initializes the DeepEvalPIILeakageMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalPromptAlignmentMetric(prompt_instructions, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Prompt Alignment Metric Integration.

This metric evaluates whether a response follows the given prompt instructions. It helps ensure that AI responses adhere to the instructions specified in the prompt template.

Available Fields: - query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams. - generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.

Scoring (Continuous): - 0.0-1.0: Scale where closer to 1.0 means more aligned with the prompt instructions, closer to 0.0 means less aligned with the prompt instructions.

Initializes the DeepEvalPromptAlignmentMetric class.

Parameters:

Name Type Description Default
prompt_instructions List[str]

A list of strings specifying the instructions that should be followed in your prompt template.

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If prompt_instructions is empty or contains invalid values.
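
Example (illustrative sketch; import path omitted, and the instructions, credentials, and input values are assumptions):

metric = DeepEvalPromptAlignmentMetric(
    prompt_instructions=[
        "Answer in English.",
        "Keep the answer under three sentences.",
    ],
    model="openai/gpt-4.1",
    model_credentials="<api-key>",
)
data = {
    "query": "Summarize the refund policy.",
    "generated_response": "Refunds are available within 30 days of purchase with a valid receipt.",
}
await metric.evaluate(data)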

DeepEvalRoleViolationMetric(role, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Role Violation Metric Integration.

This metric evaluates whether a response violates its assigned role. It helps ensure that AI responses stay within the boundaries of the role they are given.

Available Fields: - query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams. - generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.

Scoring (Continuous): - 0.0-1.0: Scale where closer to 0.0 means more role violations, closer to 1.0 means fewer role violations.

Initializes the DeepEvalRoleViolationMetric class.

Parameters:

Name Type Description Default
role str

The role to evaluate the metric. Common roles include: "helpful customer assistant", "medical insurance agent".

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If role is empty or contains invalid values.

DeepEvalToxicityMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Toxicity Metric Integration.

Available Fields: - query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams. - generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.

Scoring (Continuous): - 0.0-1.0: Scale where closer to 1.0 means more toxic, closer to 0.0 means less toxic.

Initializes the DeepEvalToxicityMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

GEvalCompletenessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, criteria=None, evaluation_steps=None, rubric=None, threshold=0.5, evaluation_params=None, additional_context=None)

Bases: DeepEvalGEvalMetric

GEval Completeness Metric.

This metric is used to evaluate the completeness of the generated output.

Required Fields:

- query (str): The query to evaluate the completeness of the model's output.
- generated_response (str): The generated response to evaluate the completeness of the model's output.
- expected_response (str): The expected response to evaluate the completeness of the model's output.

Attributes:

Name Type Description
name str

The name of the metric.

model str | ModelId | BaseLMInvoker

The model to use for the metric.

model_credentials str | None

The model credentials to use for the metric.

model_config dict[str, Any] | None

The model config to use for the metric.

Initialize the GEval Completeness Metric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric.

None
model_config dict[str, Any] | None

The model config to use for the metric.

None
criteria str | None

The criteria to use for the metric. default is DEFAULT_CRITERIA

None
evaluation_steps list[str] | None

The evaluation steps to use for the metric. default is DEFAULT_EVALUATION_STEPS

None
rubric list[Rubric] | None

The rubric to use for the metric. default is DEFAULT_RUBRIC

None
threshold float

The threshold to use for the metric. default is 0.5

0.5
evaluation_params list[LLMTestCaseParams] | None

The evaluation parameters to use for the metric. default is [LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT]

None
additional_context str | None

Additional context like few-shot examples. Defaults to None.

None
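
Example (illustrative sketch using the metric's defaults; import path omitted, and the credentials and input values are assumptions):

metric = GEvalCompletenessMetric(
    model="openai/gpt-4.1",
    model_credentials="<api-key>",
)
data = {
    "query": "List the three primary colors.",
    "generated_response": "Red and blue.",
    "expected_response": "Red, blue, and yellow.",
}
await metric.evaluate(data)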

GEvalGroundednessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, criteria=None, evaluation_steps=None, rubric=None, threshold=0.5, evaluation_params=None, additional_context=None)

Bases: DeepEvalGEvalMetric

GEval Groundedness Metric.

This metric is used to evaluate the groundedness of the generated output.

Required Fields:

- query (str): The query to evaluate the groundedness of the model's output.
- generated_response (str): The generated response to evaluate the groundedness of the model's output.
- retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.

Attributes:

Name Type Description
name str

The name of the metric.

model str | ModelId | BaseLMInvoker

The model to use for the metric.

model_credentials str | None

The model credentials to use for the metric.

model_config dict[str, Any] | None

The model config to use for the metric.

Initialize the GEval Groundedness Metric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric.

None
model_config dict[str, Any] | None

The model config to use for the metric.

None
criteria str | None

The criteria to use for the metric. default is DEFAULT_CRITERIA

None
evaluation_steps list[str] | None

The evaluation steps to use for the metric. default is DEFAULT_EVALUATION_STEPS

None
rubric list[Rubric] | None

The rubric to use for the metric. default is DEFAULT_RUBRIC

None
threshold float

The threshold to use for the metric. default is 0.5

0.5
evaluation_params list[LLMTestCaseParams] | None

The evaluation parameters to use for the metric. default is [LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.CONTEXT]

None
additional_context str | None

Additional context like few-shot examples. Defaults to None.

None

GEvalRedundancyMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, criteria=None, evaluation_steps=None, rubric=None, threshold=0.5, evaluation_params=None, additional_context=None)

Bases: DeepEvalGEvalMetric

GEval Redundancy Metric.

This metric is used to evaluate the redundancy of the generated output.

Required Fields: - query (str): The query to evaluate the redundancy of the model's output. - generated_response (str): The generated response to evaluate the redundancy of the model's output.

Attributes:

Name Type Description
name str

The name of the metric.

model str | ModelId | BaseLMInvoker

The model to use for the metric.

model_credentials str | None

The model credentials to use for the metric.

model_config dict[str, Any] | None

The model config to use for the metric.

Initialize the GEval Redundancy Metric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric.

None
model_config dict[str, Any] | None

The model config to use for the metric.

None
criteria str | None

The criteria to use for the metric. default is REDUNDANCY_CRITERIA

None
evaluation_steps list[str] | None

The evaluation steps to use for the metric. default is REDUNDANCY_EVALUATION_STEPS

None
rubric list[Rubric] | None

The rubric to use for the metric. default is REDUNDANCY_RUBRIC

None
threshold float

The threshold to use for the metric. default is 0.5

0.5
evaluation_params list[LLMTestCaseParams] | None

The evaluation parameters to use for the metric. default is [LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]

None
additional_context str | None

Additional context like few-shot examples. Defaults to None.

None

GroundednessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None)

Bases: LMBasedMetric

Groundedness metric.

Attributes:

Name Type Description
name str

The name of the metric.

response_schema ResponseSchema

The response schema to use for the metric.

prompt_builder PromptBuilder

The prompt builder to use for the metric.

model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

model_credentials str

The model credentials to use for the metric.

model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

Initialize the GroundednessMetric class.

Default expected input:

- query (str): The query to evaluate the groundedness of the model's output.
- retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.
- generated_response (str): The generated response to evaluate the groundedness of the model's output.

Parameters:

Name Type Description Default
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
prompt_builder PromptBuilder | None

The prompt builder to use for the metric. Defaults to default prompt builder.

None
response_schema ResponseSchema | None

The response schema to use for the metric. Defaults to GroundednessResponseSchema.

None
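
Example (illustrative sketch; import path omitted, and the credentials and input values are assumptions):

metric = GroundednessMetric(
    model="openai/gpt-4.1",
    model_credentials="<api-key>",
)
data = {
    "query": "When was the company founded?",
    "retrieved_context": "The company was founded in 2012 in Jakarta.",
    "generated_response": "It was founded in 2012.",
}
await metric.evaluate(data)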

LMBasedMetric(name, response_schema, prompt_builder, model=DefaultValues.MODEL, model_credentials=None, model_config=None, parse_response_fn=None)

Bases: BaseMetric

A multi-purpose LM-based metric class.

This class evaluates model outputs using a language model as the judge. Its behavior is configured through a response schema, a prompt builder, a model identifier, and model credentials.

Attributes:

Name Type Description
name str

The name of the metric.

response_schema ResponseSchema

The response schema to use for the metric.

prompt_builder PromptBuilder

The prompt builder to use for the metric.

model_credentials str

The model credentials to use for the metric.

model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

Initialize the LMBasedMetric class.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
response_schema ResponseSchema

The response schema to use for the metric.

required
prompt_builder PromptBuilder

The prompt builder to use for the metric.

required
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
parse_response_fn Callable[[str | LMOutput], MetricOutput] | None

The function used to parse the LM response into a MetricOutput. Defaults to the metric's built-in parser.

None

LangChainAgentEvalsLLMAsAJudgeMetric(name, prompt, model, credentials=None, config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainAgentEvalsMetric

A metric that uses LangChain AgentEvals to evaluate Agent as a judge.

Available Fields: - agent_trajectory (list[dict[str, Any]]): The agent trajectory. - expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.

Attributes:

Name Type Description
name str

The name of the metric.

evaluator SimpleAsyncEvaluator

The evaluator to use.

Initialize the LangChainAgentEvalsLLMAsAJudgeMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
prompt str

The evaluation prompt, can be a string template, LangChain prompt template, or callable that returns a list of chat messages. Note that the default prompt allows a rubric in addition to the typical "inputs", "outputs", and "reference_outputs" parameters.

required
model str | ModelId | BaseLMInvoker

The model to use.

required
credentials str | None

The credentials to use for the model. Defaults to None.

None
config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "trajectory_accuracy".

'trajectory_accuracy'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

LangChainAgentEvalsMetric(name, evaluator)

Bases: BaseMetric

A metric that uses LangChain AgentEvals to evaluate Agent.

Available Fields: - agent_trajectory (list[dict[str, Any]]): The agent trajectory. - expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.

Attributes:

Name Type Description
name str

The name of the metric.

evaluator SimpleAsyncEvaluator

The evaluator to use.

Initialize the LangChainAgentEvalsMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
evaluator SimpleAsyncEvaluator

The evaluator to use.

required

LangChainAgentTrajectoryAccuracyMetric(model, prompt=None, model_credentials=None, model_config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, use_reference=True)

Bases: LangChainAgentEvalsLLMAsAJudgeMetric

A metric that uses LangChain AgentEvals to evaluate the trajectory accuracy of the agent.

Available Fields: - agent_trajectory (list[dict[str, Any]]): The agent trajectory. - expected_agent_trajectory (list[dict[str, Any]] | None, optional): The expected agent trajectory.

Attributes:

Name Type Description
name str

The name of the metric.

evaluator SimpleAsyncEvaluator

The evaluator to use.

Initialize the LangChainAgentTrajectoryAccuracyMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

required
prompt str | None

The prompt to use. Defaults to None.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "trajectory_accuracy".

'trajectory_accuracy'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
use_reference bool

If True, uses the expected agent trajectory to evaluate the trajectory accuracy. Defaults to True. If False, the TRAJECTORY_ACCURACY_CUSTOM_PROMPT is used to evaluate the trajectory accuracy. If a custom prompt is provided, this parameter is ignored.

True
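
Example (illustrative sketch; import path omitted, the credentials are placeholders, and the OpenAI-style message dicts are an assumption about the trajectory format):

metric = LangChainAgentTrajectoryAccuracyMetric(
    model="openai/gpt-4.1",
    model_credentials="<api-key>",
    continuous=True,        # score as a float in [0, 1] instead of a boolean
    use_reference=True,     # judge against expected_agent_trajectory
)
data = {
    "agent_trajectory": [
        {"role": "user", "content": "Book a table for two tonight."},
        {"role": "assistant", "content": "Done: a table for two is reserved at 7pm."},
    ],
    "expected_agent_trajectory": [
        {"role": "user", "content": "Book a table for two tonight."},
        {"role": "assistant", "content": "I reserved a table for two at 7pm."},
    ],
}
await metric.evaluate(data)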

LangChainConcisenessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

A metric that uses LangChain and OpenEvals to evaluate the conciseness of the LLM.

Required Fields: - query (str): The query to evaluate the conciseness of. - generated_response (str): The generated response to evaluate the conciseness of.

Attributes:

Name Type Description
name str

The name of the metric.

prompt str

The prompt to use.

model str | ModelId | BaseLMInvoker

The model to use.

Initialize the LangChainConcisenessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

LangChainCorrectnessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

A metric that uses LangChain and OpenEvals to evaluate the correctness of the LLM.

Required Fields:

- query (str): The query to evaluate the correctness of.
- generated_response (str): The generated response to evaluate the correctness of.
- expected_response (str): The expected response to evaluate the correctness of.

Attributes:

Name Type Description
name str

The name of the metric.

prompt str

The prompt to use.

model str | ModelId | BaseLMInvoker

The model to use.

Initialize the LangChainCorrectnessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
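
Example (illustrative sketch; import path omitted, and the credentials and input values are assumptions):

metric = LangChainCorrectnessMetric(
    model="openai/gpt-4.1",
    model_credentials="<api-key>",
    continuous=True,        # return a float score instead of a boolean
)
data = {
    "query": "What is the capital of France?",
    "generated_response": "Paris.",
    "expected_response": "Paris",
}
await metric.evaluate(data)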

LangChainGroundednessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

A metric that uses LangChain and OpenEvals to evaluate the groundedness of the LLM.

Required Fields: - generated_response (str | list[str]): The generated response to evaluate the groundedness of. - retrieved_context (str | list[str]): The retrieved context to evaluate the groundedness of.

Attributes:

Name Type Description
name str

The name of the metric.

prompt str

The prompt to use.

model str | ModelId | BaseLMInvoker

The model to use.

Initialize the LangChainGroundednessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

LangChainHallucinationMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

A metric that uses LangChain and OpenEvals to evaluate the hallucination of the LLM.

Required Fields:

- query (str): The query to evaluate the hallucination of.
- generated_response (str): The generated response to evaluate the hallucination of.
- expected_retrieved_context (str): The expected retrieved context to evaluate the hallucination of.
- expected_response (str, optional): Additional information to help the model evaluate the hallucination.

Attributes:

Name Type Description
name str

The name of the metric.

prompt str

The prompt to use.

model str | ModelId | BaseLMInvoker

The model to use.

Initialize the LangChainHallucinationMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

LangChainHelpfulnessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

A metric that uses LangChain and OpenEvals to evaluate the helpfulness of the LLM.

Required Fields: - query (str): The query to evaluate the helpfulness of. - generated_response (str): The generated response to evaluate the helpfulness of.

Attributes:

Name Type Description
name str

The name of the metric.

prompt str

The prompt to use.

model str | ModelId | BaseLMInvoker

The model to use.

Initialize the LangChainHelpfulnessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

LangChainOpenEvalsLLMAsAJudgeMetric(name, prompt, model, system=None, credentials=None, config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsMetric

A metric that uses LangChain and OpenEvals to evaluate the LLM as a judge.

Attributes:

Name Type Description
name str

The name of the metric.

prompt str

The prompt to use.

model str | ModelId | BaseLMInvoker

The model to use.

Initialize the LangChainOpenEvalsLLMAsAJudgeMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
prompt str

The evaluation prompt, can be a string template, LangChain prompt template, or callable that returns a list of chat messages.

required
model str | ModelId | BaseLMInvoker

The model to use.

required
system str | None

Optional system message to prepend to the prompt.

None
credentials str | None

The credentials to use for the model. Defaults to None.

None
config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

LangChainOpenEvalsMetric(name, evaluator)

Bases: BaseMetric

A metric that uses LangChain and OpenEvals.

Attributes:

Name Type Description
name str

The name of the metric.

evaluator Union[SimpleAsyncEvaluator, Callable[..., Awaitable[Any]]]

The evaluator to use.

Initialize the LangChainOpenEvalsMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
evaluator Union[SimpleAsyncEvaluator, Callable[..., Awaitable[Any]]]

The evaluator to use.

required

PyTrecMetric(metrics=None, k=20)

Bases: BaseMetric

Pytrec_eval metric.

Required fields: - retrieved_chunks: The retrieved chunk ids with their similarity score. - ground_truth_chunk_ids: The ground truth chunk ids.

Example:

data = RetrievalData(
    retrieved_chunks={
        "chunk1": 0.9,
        "chunk2": 0.8,
        "chunk3": 0.7,
    },
    ground_truth_chunk_ids=["chunk1", "chunk2", "chunk3"],
)
metric = PyTrecMetric()
await metric.evaluate(data)

Attributes:

Name Type Description
name str

The name of the metric.

metrics list[PyTrecEvalMetric | str] | set[PyTrecEvalMetric | str] | None

The metrics to evaluate.

k_values int | list[int]

The number of retrieved chunks to consider.

Initializes the PyTrecMetric.

Parameters:

Name Type Description Default
metrics list[PyTrecEvalMetric | str] | set[PyTrecEvalMetric | str] | None

The metrics to evaluate. Defaults to all metrics.

None
k int | list[int]

The number of retrieved chunks to consider. Defaults to 20.

20
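
Example (illustrative sketch; the measure names passed as strings are an assumption based on pytrec_eval's naming, so check them against PyTrecEvalMetric):

metric = PyTrecMetric(metrics=["map", "ndcg"], k=[5, 10, 20])
data = RetrievalData(
    retrieved_chunks={"chunk1": 0.9, "chunk2": 0.8, "chunk3": 0.7},
    ground_truth_chunk_ids=["chunk1", "chunk3"],
)
await metric.evaluate(data)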

RAGASMetric(metric, name=None, callbacks=None, timeout=None)

Bases: BaseMetric

RAGAS metric.

RAGAS is an evaluation framework for assessing the quality of RAG systems.

Attributes:

Name Type Description
metric SingleTurnMetric

The Ragas metric to use.

name str

The name of the metric.

callbacks Callbacks

The callbacks to use.

timeout int

The timeout for the metric.

Available Fields:

- query (str): The query to evaluate the metric. Similar to user_input in SingleTurnSample.
- generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to response in SingleTurnSample. If the generated response is a list, the responses are concatenated into a single string. For multiple responses, use list[str].
- expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to reference in SingleTurnSample. If the expected response is a list, the responses are concatenated into a single string.
- expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to reference_contexts in SingleTurnSample. If the expected retrieved context is a str, it will be converted into a list with a single element.
- retrieved_context (str | list[str], optional): The retrieved context to evaluate the metric. Similar to retrieved_contexts in SingleTurnSample. If the retrieved context is a str, it will be converted into a list with a single element.
- rubrics (dict[str, str], optional): The rubrics to evaluate the metric. Similar to rubrics in SingleTurnSample.

Initialize the RAGASMetric.

Parameters:

Name Type Description Default
metric SingleTurnMetric

The Ragas metric to use.

required
name str

The name of the metric. Defaults to the name of the wrapped Ragas metric.

None
callbacks Callbacks

The callbacks to use. Default is None.

None
timeout int

The timeout for the metric. Default is None.

None

RagasContextPrecisionWithoutReference(lm_model, lm_model_credentials=None, lm_model_config=None, **kwargs)

Bases: RAGASMetric

RAGAS Context Precision (without reference) metric.

Required Fields:

- query (str): The query to evaluate context precision for.
- generated_response (str): The generated response to evaluate context precision for.
- retrieved_contexts (list[str]): The retrieved contexts to evaluate context precision for.

Initialize the RagasContextPrecisionWithoutReference metric.

Parameters:

Name Type Description Default
lm_model str | ModelId | BaseLMInvoker

The language model to use.

required
lm_model_credentials str | None

The credentials to use for the language model. Default is None.

None
lm_model_config dict[str, Any] | None

The configuration to use for the language model. Default is None.

None
**kwargs

Additional keyword arguments to pass to the underlying Ragas context precision metric.

{}

RagasContextRecall(lm_model, lm_model_credentials=None, lm_model_config=None, **kwargs)

Bases: RAGASMetric

RAGAS Context Recall metric.

Required Fields:

- query (str): The query to recall the context for.
- generated_response (str): The generated response to recall the context for.
- expected_response (str): The expected response to recall the context for.
- retrieved_contexts (list[str]): The retrieved contexts to recall the context for.

Initialize the RagasContextRecall metric.

Parameters:

Name Type Description Default
lm_model str | ModelId | BaseLMInvoker

The language model to use.

required
lm_model_credentials str | None

The credentials to use for the language model. Default is None.

None
lm_model_config dict[str, Any] | None

The configuration to use for the language model. Default is None.

None
**kwargs

Additional keyword arguments to pass to the RagasContextRecall metric.

{}
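
Example (illustrative sketch; import path omitted, the credentials and input values are assumptions, and the field names follow the Required Fields listed above):

metric = RagasContextRecall(
    lm_model="openai/gpt-4.1",
    lm_model_credentials="<api-key>",
)
data = {
    "query": "When was the company founded?",
    "generated_response": "It was founded in 2012.",
    "expected_response": "The company was founded in 2012.",
    "retrieved_contexts": ["The company was founded in 2012 in Jakarta."],
}
await metric.evaluate(data)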

RagasFactualCorrectness(lm_model=DefaultValues.MODEL, lm_model_credentials=None, lm_model_config=None, **kwargs)

Bases: RAGASMetric

RAGAS Factual Correctness metric.

Required Fields: - query (str): The query to evaluate factual correctness for. - generated_response (str): The generated response to evaluate factual correctness for.

Initialize the RagasFactualCorrectness metric.

Parameters:

Name Type Description Default
lm_model str | ModelId | BaseLMInvoker

The language model to use.

MODEL
lm_model_credentials str | None

The credentials to use for the language model. Default is None.

None
lm_model_config dict[str, Any] | None

The configuration to use for the language model. Default is None.

None
**kwargs

Additional keyword arguments to pass to the RagasFactualCorrectness metric.

{}

RedundancyMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None)

Bases: LMBasedMetric

Redundancy metric.

Attributes:

Name Type Description
name str

The name of the metric.

response_schema ResponseSchema

The response schema to use for the metric.

prompt_builder PromptBuilder

The prompt builder to use for the metric.

model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

model_credentials str

The model credentials to use for the metric.

Initialize the RedundancyMetric class.

Default expected input: - query (str): The query to evaluate the redundancy of the model's output. - generated_response (str): The generated response to evaluate the redundancy of the model's output.

Parameters:

Name Type Description Default
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
prompt_builder PromptBuilder | None

The prompt builder to use for the metric. Defaults to default prompt builder.

None
response_schema ResponseSchema | None

The response schema to use for the metric. Defaults to RedundancyResponseSchema.

None
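
Example (illustrative sketch; import path omitted, and the credentials and input values are assumptions):

metric = RedundancyMetric(
    model="openai/gpt-4.1",
    model_credentials="<api-key>",
)
data = {
    "query": "What are your opening hours?",
    "generated_response": "We open at 9am. Our opening time is 9am, which is when we open.",
}
await metric.evaluate(data)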

TopKAccuracy(k=20)

Bases: BaseMetric

Top K Accuracy metric.

Required fields: - retrieved_chunks: The retrieved chunk ids with their similarity score. - ground_truth_chunk_ids: The ground truth chunk ids.

Example:

data = RetrievalData(
    retrieved_chunks={
        "chunk1": 0.9,
        "chunk2": 0.8,
        "chunk3": 0.7,
    },
    ground_truth_chunk_ids=["chunk1", "chunk2", "chunk3"],
)
metric = TopKAccuracy()
await metric.evaluate(data)

Attributes:

Name Type Description
name str

The name of the metric.

k_values list[int]

The number of retrieved chunks to consider.

Initializes the TopKAccuracy.

Parameters:

Name Type Description Default
k list[int] | int

The number of retrieved chunks to consider. Defaults to 20.

20

top_k_accuracy(qrels, results)

Evaluates the top k accuracy.

Parameters:

Name Type Description Default
qrels dict[str, dict[str, int]]

The ground truth of the retrieved chunks. There are two possible values: 1 means the chunk is relevant to the query, 0 means the chunk is not relevant to the query.

required
results dict[str, dict[str, float]]

The retrieved chunks with their similarity score.

required

Returns:

Type Description
dict[str, float]

dict[str, float]: The top k accuracy.

Example:

qrels = {
    "q1": {"chunk1": 1, "chunk2": 1},
}
results = {"q1": {"chunk1": 0.9, "chunk2": 0.8, "chunk3": 0.7}}
metric = TopKAccuracy()
scores = metric.top_k_accuracy(qrels, results)