Metrics
Metrics module.
BaseMetric
Bases: ABC
Abstract class for metrics.
This class defines the interface for all metrics.
Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| required_fields | set[str] | The required fields for this metric to evaluate data. |
can_evaluate(data)
Check if this metric can evaluate the given data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput | The input data to check. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| bool | bool | True if the metric can evaluate the data, False otherwise. |
evaluate(data)
async
Evaluate the metric on the given dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput | The data to evaluate the metric on. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| MetricOutput | MetricOutput | A dictionary mapping each namespace to its scores. |
get_input_fields()
classmethod
Return declared input field names if input_type is provided; otherwise None.
Returns:

| Type | Description |
|---|---|
| list[str] \| None | The input fields. |
get_input_spec()
classmethod
Return structured spec for input fields if input_type is provided; otherwise None.
Returns:

| Type | Description |
|---|---|
| list[dict[str, Any]] \| None | The input spec. |
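To make the interface concrete, here is a minimal sketch of a custom metric built on BaseMetric. It is illustrative only: the import path and the exact runtime shapes of MetricInput and MetricOutput are assumptions (the input is treated as a mapping of the documented fields), so check the package source for the real types before reusing it.

```python
# Minimal sketch of a BaseMetric subclass -- not part of the library.
# The import path below is an assumption; point it at wherever BaseMetric lives.
from your_evals_library.metrics import BaseMetric  # hypothetical import path


class ExactMatchMetric(BaseMetric):
    """Scores 1.0 when the generated response equals the expected response."""

    name = "exact_match"
    required_fields = {"generated_response", "expected_response"}

    async def evaluate(self, data):
        # `data` is assumed to behave like a mapping of the documented fields.
        generated = data["generated_response"]
        expected = data["expected_response"]
        # MetricOutput is documented as a dict mapping namespaces to scores.
        # can_evaluate() may already be provided by the base class via
        # required_fields; override it here if your version requires it.
        return {self.name: 1.0 if generated == expected else 0.0}
```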
CompletenessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None)
Bases: LMBasedMetric
Completeness metric.
Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| response_schema | ResponseSchema | The response schema to use for the metric. |
| prompt_builder | PromptBuilder | The prompt builder to use for the metric. |
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. |
| model_credentials | str | The model credentials to use for the metric. |

Initialize the CompletenessMetric class.

Default expected input:

- query (str): The query to evaluate the completeness of the model's output.
- expected_response (str): The expected response to evaluate the completeness of the model's output.
- generated_response (str): The generated response to evaluate the completeness of the model's output.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| prompt_builder | PromptBuilder \| None | The prompt builder to use for the metric. Defaults to default prompt builder. | None |
| response_schema | ResponseSchema \| None | The response schema to use for the metric. Defaults to CompletenessResponseSchema. | None |
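A brief usage sketch follows. The import path, the credentials placeholder, and the assumption that the evaluation input can be passed as a plain dict of the documented fields are guesses rather than confirmed API details.

```python
import asyncio

# Hypothetical import path -- replace with the real one for your installation.
from your_evals_library.metrics import CompletenessMetric


async def main() -> None:
    metric = CompletenessMetric(
        model="openai/gpt-4.1",            # model id string, as documented above
        model_credentials="YOUR_API_KEY",  # placeholder credential
    )
    # The default expected input fields are query, expected_response, generated_response.
    scores = await metric.evaluate({
        "query": "What is the capital of France?",
        "expected_response": "Paris is the capital of France.",
        "generated_response": "Paris.",
    })
    print(scores)  # MetricOutput: a dict mapping namespaces to scores


asyncio.run(main())
```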
DeepEvalAnswerRelevancyMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Answer Relevancy Metric Integration.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Initializes the DeepEvalAnswerRelevancyMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
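The DeepEval-backed metrics are constructed and awaited the same way. The sketch below is an assumed usage pattern: the import path, the credential value, and the dict-shaped input are placeholders, not confirmed API.

```python
# Hypothetical import path for the wrapper documented above.
from your_evals_library.metrics import DeepEvalAnswerRelevancyMetric


async def score_relevancy() -> None:
    metric = DeepEvalAnswerRelevancyMetric(
        threshold=0.7,                     # stricter than the 0.5 default
        model="openai/gpt-4.1",
        model_credentials="YOUR_API_KEY",  # required because `model` is a string
    )
    result = await metric.evaluate({
        "query": "How do I reset my password?",
        "generated_response": "Open Settings, choose Security, then click Reset password.",
    })
    print(result)
```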
DeepEvalBiasMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Bias Metric Integration.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring (Continuous):

- 0.0-1.0: Scale where closer to 1.0 means more biased, closer to 0.0 means less biased.

Initializes the DeepEvalBiasMetric class.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalContextualPrecisionMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval ContextualPrecision Metric Integration.
Required Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- expected_response (str | list[str]): The expected response to evaluate the metric. Similar to expected_output in
LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to
retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list
with a single element.
Initializes the DeepEvalContextualPrecisionMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalContextualRecallMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval ContextualRecall Metric Integration.
Required Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- expected_response (str | list[str]): The expected response to evaluate the metric. Similar to expected_output in
LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to
retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list
with a single element.
Initializes the DeepEvalContextualRecallMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalContextualRelevancyMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval ContextualRelevancy Metric Integration.
Required Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to
retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list
with a single element.
Initializes the DeepEvalContextualRelevancyMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalFaithfulnessMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Faithfulness Metric Integration.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to
retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list
with a single element.
Initializes the DeepEvalFaithfulnessMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalGEvalMetric(name, evaluation_params, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None)
Bases: DeepEvalMetricFactory, PromptExtractionMixin
DeepEval GEval Metric Integration.
This class is a wrapper for the DeepEvalGEval class. It is used to wrap the GEval class and provide a unified interface for the DeepEval library.
GEval is a versatile evaluation metric framework that can be used to create custom evaluation metrics.
Available Fields:
- query (str, optional): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to
actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated
into a single string.
- expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to
expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated
into a single string.
- expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric.
Similar to context in LLMTestCaseParams. If the expected retrieved context is a str, it will be converted
into a list with a single element.
- retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. Similar to
retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list
with a single element.
Initializes the DeepEvalGEvalMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| evaluation_params | list[LLMTestCaseParams] | The evaluation parameters. | required |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| criteria | str \| None | The criteria to use for the metric. Defaults to None. | None |
| evaluation_steps | list[str] \| None | The evaluation steps to use for the metric. Defaults to None. | None |
| rubric | list[Rubric] \| None | The rubric to use for the metric. Defaults to None. | None |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| additional_context | str \| None | Additional context like few-shot examples. Defaults to None. | None |
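Because GEval is criteria-driven, a custom metric is mostly configuration. The sketch below is an assumed usage: the wrapper's import path and the dict-shaped input are placeholders, while LLMTestCaseParams comes from DeepEval itself.

```python
from deepeval.test_case import LLMTestCaseParams

# Hypothetical import path for the wrapper documented above.
from your_evals_library.metrics import DeepEvalGEvalMetric


async def check_politeness() -> None:
    metric = DeepEvalGEvalMetric(
        name="politeness",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        criteria="Determine whether the response is polite and professional in tone.",
        model="openai/gpt-4.1",
        model_credentials="YOUR_API_KEY",
        threshold=0.7,
    )
    result = await metric.evaluate({
        "query": "Cancel my subscription.",
        "generated_response": "Of course, I have cancelled your subscription. Anything else I can help with?",
    })
    print(result)
```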
get_full_prompt(data)
Get the full prompt that DeepEval generates for this metric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput | The metric input. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The complete prompt (system + user) as a string. |
DeepEvalHallucinationMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Hallucination Metric Integration.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- expected_retrieved_context (str | list[str]): The expected context to evaluate the metric.
Similar to context in LLMTestCaseParams.
Scoring (Continuous):

- 0.0-1.0: Scale where closer to 1.0 means more hallucinated, closer to 0.0 means less hallucinated.

Initializes the DeepEvalHallucinationMetric class.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalJsonCorrectnessMetric(expected_schema, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval JSON Correctness Metric Integration.
This metric evaluates whether a response is valid JSON that conforms to a specified schema. It helps ensure that AI responses follow the expected JSON structure.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring (Categorical):

- 0: The response does not conform to the schema.
- 1: The response conforms to the schema.

Initializes the DeepEvalJsonCorrectnessMetric class.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| expected_schema | Type[BaseModel] | The expected schema class (not instance) for the response. | required |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If expected_schema is not a valid BaseModel class. |
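Since expected_schema is a Pydantic model class, a usage sketch might look like the following. The wrapper's import path and the dict-shaped input are assumptions; pydantic.BaseModel is the real dependency.

```python
from pydantic import BaseModel

# Hypothetical import path for the wrapper documented above.
from your_evals_library.metrics import DeepEvalJsonCorrectnessMetric


class SupportTicket(BaseModel):
    title: str
    priority: int


async def check_json() -> None:
    metric = DeepEvalJsonCorrectnessMetric(
        expected_schema=SupportTicket,     # pass the class, not an instance
        model="openai/gpt-4.1",
        model_credentials="YOUR_API_KEY",
    )
    result = await metric.evaluate({
        "query": "File a ticket about the broken login page.",
        "generated_response": '{"title": "Broken login page", "priority": 2}',
    })
    print(result)
```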
DeepEvalMetric(metric, name)
Bases: BaseMetric
DeepEval Metric Integration.
Attributes:

| Name | Type | Description |
|---|---|---|
| metric | BaseMetric | The DeepEval metric to wrap. |
| name | str | The name of the metric. |
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to
actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated
into a single string.
- expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to
expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated
into a single string.
- expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric.
Similar to context in LLMTestCaseParams. If the expected retrieved context is a str, it will be converted
into a list with a single element.
- retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. Similar to
retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list
with a single element.
Initializes the DeepEvalMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| metric | BaseMetric | The DeepEval metric to wrap. | required |
| name | str | The name of the metric. | required |
DeepEvalMetricFactory(name, model, model_credentials, model_config, **kwargs)
Bases: DeepEvalMetric, ABC
Abstract base class for creating DeepEval metrics with a shared model invoker.
Initializes the metric, handling common model invoker creation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name for the metric. | required |
| model | Union[str, ModelId, BaseLMInvoker] | The model identifier or an existing LM invoker instance. | required |
| model_credentials | Optional[str] | Credentials for the model, required if model is a string. | required |
| model_config | Optional[Dict[str, Any]] | Configuration for the model. | required |
| **kwargs | | Additional arguments for the specific DeepEval metric constructor. | {} |
DeepEvalMisuseMetric(domain, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Misuse Metric Integration.
This metric evaluates whether a response constitutes misuse of the model within the given domain. It helps ensure that AI responses are not used in harmful or inappropriate ways.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring (Continuous):

- 0.0-1.0: Scale where closer to 1.0 means more misuse, closer to 0.0 means less misuse.

Initializes the DeepEvalMisuseMetric class.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| domain | str | The domain to evaluate the metric. Common domains include: "finance", "health", "legal", "personal", "investment". | required |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If domain is empty or contains invalid values. |
DeepEvalNonAdviceMetric(advice_types, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Non-Advice Metric Integration.
This metric evaluates whether a response contains inappropriate advice types. It helps ensure that AI responses don't provide harmful or inappropriate advice.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring (Continuous):

- 0.0-1.0: Scale where closer to 0.0 means more inappropriate advice, closer to 1.0 means less inappropriate advice.

Initializes the DeepEvalNonAdviceMetric class.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| advice_types | List[str] | List of advice types to detect as inappropriate. Common types include: ["financial", "medical", "legal", "personal", "investment"]. | required |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If advice_types is empty or contains invalid values. |
DeepEvalPIILeakageMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval PII Leakage Metric Integration.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring (Continuous):

- 0.0-1.0: Scale where closer to 1.0 means more privacy violations, closer to 0.0 means fewer privacy violations.

Initializes the DeepEvalPIILeakageMetric class.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalPromptAlignmentMetric(prompt_instructions, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Prompt Alignment Metric Integration.
This metric evaluates whether a response follows the given prompt instructions, helping ensure that AI outputs stay aligned with them.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric.
Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring (Continuous):

- 0.0-1.0: Scale where closer to 1.0 means more aligned with the prompt instructions, closer to 0.0 means less aligned.

Initializes the DeepEvalPromptAlignmentMetric class.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| prompt_instructions | List[str] | A list of strings specifying the instructions you want followed in your prompt template. | required |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If prompt_instructions is empty or contains invalid values. |
DeepEvalRoleViolationMetric(role, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Role Violation Metric Integration.
This metric evaluates whether a response violates the assigned role, helping ensure that AI responses stay within their intended persona.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring (Continuous):

- 0.0-1.0: Scale where closer to 0.0 means more role violations, closer to 1.0 means fewer role violations.

Initializes the DeepEvalRoleViolationMetric class.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| role | str | The role to evaluate the metric against. Common roles include: "helpful customer assistant", "medical insurance agent". | required |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If role is empty or contains invalid values. |
DeepEvalToxicityMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Toxicity Metric Integration.
Available Fields:
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in
LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring (Continuous):

- 0.0-1.0: Scale where closer to 1.0 means more toxic, closer to 0.0 means less toxic.

Initializes the DeepEvalToxicityMetric class.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
GEvalCompletenessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, criteria=None, evaluation_steps=None, rubric=None, threshold=0.5, evaluation_params=None, additional_context=None)
Bases: DeepEvalGEvalMetric
GEval Completeness Metric.
This metric is used to evaluate the completeness of the generated output.
Required Fields:

- query (str): The query to evaluate the completeness of the model's output.
- generated_response (str): The generated response to evaluate the completeness of the model's output.
- expected_response (str): The expected response to evaluate the completeness of the model's output.

Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. |
| model_credentials | str \| None | The model credentials to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. |
Initialize the GEval Completeness Metric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. | None |
| criteria | str \| None | The criteria to use for the metric. Defaults to DEFAULT_CRITERIA. | None |
| evaluation_steps | list[str] \| None | The evaluation steps to use for the metric. Defaults to DEFAULT_EVALUATION_STEPS. | None |
| rubric | list[Rubric] \| None | The rubric to use for the metric. Defaults to DEFAULT_RUBRIC. | None |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| evaluation_params | list[LLMTestCaseParams] \| None | The evaluation parameters to use for the metric. Defaults to [LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT]. | None |
| additional_context | str \| None | Additional context like few-shot examples. Defaults to None. | None |
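The prebuilt GEval metrics only need a model. The sketch below is an assumed usage pattern (hypothetical import path, placeholder credentials, dict-shaped input).

```python
# Hypothetical import path for the metric documented above.
from your_evals_library.metrics import GEvalCompletenessMetric


async def score_completeness() -> None:
    metric = GEvalCompletenessMetric(
        model="openai/gpt-4.1",
        model_credentials="YOUR_API_KEY",
    )
    result = await metric.evaluate({
        "query": "List the three primary colors.",
        "generated_response": "Red and blue.",
        "expected_response": "Red, blue, and yellow.",
    })
    print(result)  # expected to score low because the answer is incomplete
```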
GEvalGroundednessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, criteria=None, evaluation_steps=None, rubric=None, threshold=0.5, evaluation_params=None, additional_context=None)
Bases: DeepEvalGEvalMetric
GEval Groundedness Metric.
This metric is used to evaluate the groundedness of the generated output.
Required Fields:

- query (str): The query to evaluate the groundedness of the model's output.
- generated_response (str): The generated response to evaluate the groundedness of the model's output.
- retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.

Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. |
| model_credentials | str \| None | The model credentials to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. |
Initialize the GEval Groundedness Metric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. | None |
| criteria | str \| None | The criteria to use for the metric. Defaults to DEFAULT_CRITERIA. | None |
| evaluation_steps | list[str] \| None | The evaluation steps to use for the metric. Defaults to DEFAULT_EVALUATION_STEPS. | None |
| rubric | list[Rubric] \| None | The rubric to use for the metric. Defaults to DEFAULT_RUBRIC. | None |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| evaluation_params | list[LLMTestCaseParams] \| None | The evaluation parameters to use for the metric. Defaults to [LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.CONTEXT]. | None |
| additional_context | str \| None | Additional context like few-shot examples. Defaults to None. | None |
GEvalRedundancyMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, criteria=None, evaluation_steps=None, rubric=None, threshold=0.5, evaluation_params=None, additional_context=None)
Bases: DeepEvalGEvalMetric
GEval Redundancy Metric.
This metric is used to evaluate the redundancy of the generated output.
Required Fields:

- query (str): The query to evaluate the redundancy of the model's output.
- generated_response (str): The generated response to evaluate the redundancy of the model's output.

Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. |
| model_credentials | str \| None | The model credentials to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. |
Initialize the GEval Redundancy Metric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. | None |
| criteria | str \| None | The criteria to use for the metric. Defaults to REDUNDANCY_CRITERIA. | None |
| evaluation_steps | list[str] \| None | The evaluation steps to use for the metric. Defaults to REDUNDANCY_EVALUATION_STEPS. | None |
| rubric | list[Rubric] \| None | The rubric to use for the metric. Defaults to REDUNDANCY_RUBRIC. | None |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| evaluation_params | list[LLMTestCaseParams] \| None | The evaluation parameters to use for the metric. Defaults to [LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]. | None |
| additional_context | str \| None | Additional context like few-shot examples. Defaults to None. | None |
GroundednessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None)
Bases: LMBasedMetric
Groundedness metric.
Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| response_schema | ResponseSchema | The response schema to use for the metric. |
| prompt_builder | PromptBuilder | The prompt builder to use for the metric. |
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. |
| model_credentials | str | The model credentials to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. |

Initialize the GroundednessMetric class.

Default expected input:

- query (str): The query to evaluate the groundedness of the model's output.
- retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.
- generated_response (str): The generated response to evaluate the groundedness of the model's output.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| prompt_builder | PromptBuilder \| None | The prompt builder to use for the metric. Defaults to default prompt builder. | None |
| response_schema | ResponseSchema \| None | The response schema to use for the metric. Defaults to GroundednessResponseSchema. | None |
LMBasedMetric(name, response_schema, prompt_builder, model=DefaultValues.MODEL, model_credentials=None, model_config=None, parse_response_fn=None)
Bases: BaseMetric
A general-purpose LM-based metric class.
This class evaluates data with a language model and can be adapted to many evaluation tasks by supplying a response schema, a prompt builder, a model, and model credentials.
Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| response_schema | ResponseSchema | The response schema to use for the metric. |
| prompt_builder | PromptBuilder | The prompt builder to use for the metric. |
| model_credentials | str | The model credentials to use for the metric. |
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. |
Initialize the LMBasedMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| response_schema | ResponseSchema | The response schema to use for the metric. | required |
| prompt_builder | PromptBuilder | The prompt builder to use for the metric. | required |
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| parse_response_fn | Callable[[str \| LMOutput], MetricOutput] \| None | The function used to parse the response from the LM. Defaults to the standard parsing function. | None |
LangChainAgentEvalsLLMAsAJudgeMetric(name, prompt, model, credentials=None, config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainAgentEvalsMetric
A metric that uses LangChain AgentEvals to evaluate Agent as a judge.
Available Fields:

- agent_trajectory (list[dict[str, Any]]): The agent trajectory.
- expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.

Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| evaluator | SimpleAsyncEvaluator | The evaluator to use. |
Initialize the LangChainAgentEvalsLLMAsAJudgeMetric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| prompt | str | The evaluation prompt; can be a string template, a LangChain prompt template, or a callable that returns a list of chat messages. Note that the default prompt allows a rubric in addition to the typical "inputs", "outputs", and "reference_outputs" parameters. | required |
| model | str \| ModelId \| BaseLMInvoker | The model to use. | required |
| credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result, defaults to "trajectory_accuracy". | 'trajectory_accuracy' |
| continuous | bool | If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
LangChainAgentEvalsMetric(name, evaluator)
Bases: BaseMetric
A metric that uses LangChain AgentEvals to evaluate Agent.
Available Fields:

- agent_trajectory (list[dict[str, Any]]): The agent trajectory.
- expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.

Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| evaluator | SimpleAsyncEvaluator | The evaluator to use. |
Initialize the LangChainAgentEvalsMetric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| evaluator | SimpleAsyncEvaluator | The evaluator to use. | required |
LangChainAgentTrajectoryAccuracyMetric(model, prompt=None, model_credentials=None, model_config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, use_reference=True)
Bases: LangChainAgentEvalsLLMAsAJudgeMetric
A metric that uses LangChain AgentEvals to evaluate the trajectory accuracy of the agent.
Available Fields:

- agent_trajectory (list[dict[str, Any]]): The agent trajectory.
- expected_agent_trajectory (list[dict[str, Any]] | None, optional): The expected agent trajectory.

Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| evaluator | SimpleAsyncEvaluator | The evaluator to use. |
Initialize the LangChainAgentTrajectoryAccuracyMetric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | required |
| prompt | str \| None | The prompt to use. Defaults to None. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result, defaults to "trajectory_accuracy". | 'trajectory_accuracy' |
| continuous | bool | If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
| use_reference | bool | If True, uses the expected agent trajectory to evaluate the trajectory accuracy. Defaults to True. If False, the TRAJECTORY_ACCURACY_CUSTOM_PROMPT is used instead. If a custom prompt is provided, this parameter is ignored. | True |
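A usage sketch under assumptions: the import path is hypothetical, the dict-shaped input is a guess, and the trajectory format is assumed to be OpenAI-style message dicts (the format LangChain's agentevals package typically expects).

```python
# Hypothetical import path for the metric documented above.
from your_evals_library.metrics import LangChainAgentTrajectoryAccuracyMetric


async def score_trajectory() -> None:
    metric = LangChainAgentTrajectoryAccuracyMetric(
        model="openai/gpt-4.1",
        model_credentials="YOUR_API_KEY",
        continuous=True,  # return a float in [0, 1] instead of a boolean
    )
    trajectory = [
        {"role": "user", "content": "What is the weather in Jakarta?"},
        {"role": "assistant", "content": "Let me check the forecast."},
        {"role": "assistant", "content": "It is 31C and sunny in Jakarta."},
    ]
    result = await metric.evaluate({
        "agent_trajectory": trajectory,
        "expected_agent_trajectory": trajectory,  # reference used because use_reference=True
    })
    print(result)
```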
LangChainConcisenessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsLLMAsAJudgeMetric
A metric that uses LangChain and OpenEvals to evaluate the conciseness of the LLM.
Required Fields:

- query (str): The query to evaluate the conciseness of.
- generated_response (str): The generated response to evaluate the conciseness of.

Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| prompt | str | The prompt to use. |
| model | str \| ModelId \| BaseLMInvoker | The model to use. |
Initialize the LangChainConcisenessMetric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | MODEL |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result, defaults to "score". | 'score' |
| continuous | bool | If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
LangChainCorrectnessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsLLMAsAJudgeMetric
A metric that uses LangChain and OpenEvals to evaluate the correctness of the LLM.
Required Fields:

- query (str): The query to evaluate the correctness of.
- generated_response (str): The generated response to evaluate the correctness of.
- expected_response (str): The expected response to evaluate the correctness of.

Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| prompt | str | The prompt to use. |
| model | str \| ModelId \| BaseLMInvoker | The model to use. |
Initialize the LangChainCorrectnessMetric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | MODEL |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result, defaults to "score". | 'score' |
| continuous | bool | If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
LangChainGroundednessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsLLMAsAJudgeMetric
A metric that uses LangChain and OpenEvals to evaluate the groundedness of the LLM.
Required Fields:

- generated_response (str | list[str]): The generated response to evaluate the groundedness of.
- retrieved_context (str | list[str]): The retrieved context to evaluate the groundedness of.

Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| prompt | str | The prompt to use. |
| model | str \| ModelId \| BaseLMInvoker | The model to use. |
Initialize the LangChainGroundednessMetric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | MODEL |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result, defaults to "score". | 'score' |
| continuous | bool | If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
LangChainHallucinationMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsLLMAsAJudgeMetric
A metric that uses LangChain and OpenEvals to evaluate the hallucination of the LLM.
Required Fields:

- query (str): The query to evaluate the hallucination of.
- generated_response (str): The generated response to evaluate the hallucination of.
- expected_retrieved_context (str): The expected retrieved context to evaluate the hallucination of.
- expected_response (str, optional): Additional information to help the model evaluate the hallucination.

Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| prompt | str | The prompt to use. |
| model | str \| ModelId \| BaseLMInvoker | The model to use. |
Initialize the LangChainHallucinationMetric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | MODEL |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result, defaults to "score". | 'score' |
| continuous | bool | If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
LangChainHelpfulnessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsLLMAsAJudgeMetric
A metric that uses LangChain and OpenEvals to evaluate the helpfulness of the LLM.
Required Fields:

- query (str): The query to evaluate the helpfulness of.
- generated_response (str): The generated response to evaluate the helpfulness of.

Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| prompt | str | The prompt to use. |
| model | str \| ModelId \| BaseLMInvoker | The model to use. |
Initialize the LangChainHelpfulnessMetric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | MODEL |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result, defaults to "score". | 'score' |
| continuous | bool | If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
LangChainOpenEvalsLLMAsAJudgeMetric(name, prompt, model, system=None, credentials=None, config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsMetric
A metric that uses LangChain and OpenEvals to evaluate the LLM as a judge.
Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| prompt | str | The prompt to use. |
| model | str \| ModelId \| BaseLMInvoker | The model to use. |
Initialize the LangChainOpenEvalsLLMAsAJudgeMetric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| prompt | str | The evaluation prompt; can be a string template, a LangChain prompt template, or a callable that returns a list of chat messages. | required |
| model | str \| ModelId \| BaseLMInvoker | The model to use. | required |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result, defaults to "score". | 'score' |
| continuous | bool | If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
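A sketch of a custom judge built on this wrapper. Everything not stated above is an assumption: the import path, the dict-shaped input, and the {inputs}/{outputs} placeholders (the convention OpenEvals prompts usually use).

```python
# Hypothetical import path for the wrapper documented above.
from your_evals_library.metrics import LangChainOpenEvalsLLMAsAJudgeMetric

TONE_PROMPT = """Rate whether the assistant's answer keeps a friendly, professional tone.
Question: {inputs}
Answer: {outputs}"""


async def score_tone() -> None:
    metric = LangChainOpenEvalsLLMAsAJudgeMetric(
        name="tone",
        prompt=TONE_PROMPT,
        model="openai/gpt-4.1",
        credentials="YOUR_API_KEY",
        continuous=True,            # float score in [0, 1]
        feedback_key="tone_score",
    )
    result = await metric.evaluate({
        "query": "Where is my order?",
        "generated_response": "It shipped yesterday and should arrive tomorrow.",
    })
    print(result)
```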
LangChainOpenEvalsMetric(name, evaluator)
Bases: BaseMetric
A metric that uses LangChain and OpenEvals.
Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| evaluator | Union[SimpleAsyncEvaluator, Callable[..., Awaitable[Any]]] | The evaluator to use. |
Initialize the LangChainOpenEvalsMetric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| evaluator | Union[SimpleAsyncEvaluator, Callable[..., Awaitable[Any]]] | The evaluator to use. | required |
PyTrecMetric(metrics=None, k=20)
Bases: BaseMetric
Pytrec_eval metric.
Required fields:

- retrieved_chunks: The retrieved chunk ids with their similarity score.
- ground_truth_chunk_ids: The ground truth chunk ids.

Example:

    data = RetrievalData(
        retrieved_chunks={
            "chunk1": 0.9,
            "chunk2": 0.8,
            "chunk3": 0.7,
        },
        ground_truth_chunk_ids=["chunk1", "chunk2", "chunk3"],
    )
    metric = PyTrecMetric()
    await metric.evaluate(data)
Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| metrics | list[PyTrecEvalMetric \| str] \| set[PyTrecEvalMetric \| str] \| None | The metrics to evaluate. |
| k_values | int \| list[int] | The number of retrieved chunks to consider. |
Initializes the PyTrecMetric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| metrics | list[PyTrecEvalMetric \| str] \| set[PyTrecEvalMetric \| str] \| None | The metrics to evaluate. Defaults to all metrics. | None |
| k | int \| list[int] | The number of retrieved chunks to consider. Defaults to 20. | 20 |
RAGASMetric(metric, name=None, callbacks=None, timeout=None)
Bases: BaseMetric
RAGAS metric.
RAGAS is a metric for evaluating the quality of RAG systems.
Attributes:

| Name | Type | Description |
|---|---|---|
| metric | SingleTurnMetric | The Ragas metric to use. |
| name | str | The name of the metric. |
| callbacks | Callbacks | The callbacks to use. |
| timeout | int | The timeout for the metric. |
Available Fields:
- query (str): The query to evaluate the metric. Similar to user_input in SingleTurnSample.
- generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to
response in SingleTurnSample. If the generated response is a list, the responses are concatenated into
a single string. For multiple responses, use list[str].
- expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to
reference in SingleTurnSample. If the expected response is a list, the responses are concatenated
into a single string.
- expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric.
Similar to reference_contexts in SingleTurnSample. If the expected retrieved context is a str, it will be
converted into a list with a single element.
- retrieved_context (str | list[str], optional): The retrieved context to evaluate the metric. Similar to
retrieved_contexts in SingleTurnSample. If the retrieved context is a str, it will be converted into a
list with a single element.
- rubrics (dict[str, str], optional): The rubrics to evaluate the metric. Similar to rubrics in
SingleTurnSample.
Initialize the RAGASMetric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| metric | SingleTurnMetric | The Ragas metric to use. | required |
| name | str | The name of the metric. Defaults to the name of the wrapped Ragas metric. | None |
| callbacks | Callbacks | The callbacks to use. Defaults to None. | None |
| timeout | int | The timeout for the metric. Defaults to None. | None |
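The wrapper takes any configured Ragas SingleTurnMetric. The sketch below assumes a ragas version that exposes BleuScore (a non-LLM metric) and that the evaluation input is a plain dict of the documented fields; the wrapper's import path is also hypothetical.

```python
# BleuScore is a non-LLM Ragas metric; availability depends on your ragas version.
from ragas.metrics import BleuScore

# Hypothetical import path for the wrapper documented above.
from your_evals_library.metrics import RAGASMetric


async def score_bleu() -> None:
    metric = RAGASMetric(metric=BleuScore(), name="bleu")
    result = await metric.evaluate({
        "query": "Summarize the ticket.",
        "generated_response": "The user cannot log in after the latest update.",
        "expected_response": "The customer reports login failures since the update.",
    })
    print(result)
```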
RagasContextPrecisionWithoutReference(lm_model, lm_model_credentials=None, lm_model_config=None, **kwargs)
Bases: RAGASMetric
RAGAS Context Precision (without reference) metric.
Required Fields:

- query (str): The query to evaluate the context precision for.
- generated_response (str): The generated response to evaluate the context precision for.
- retrieved_contexts (list[str]): The retrieved contexts to evaluate the context precision for.
Initialize the RagasContextPrecisionWithoutReference metric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| lm_model | str \| ModelId \| BaseLMInvoker | The language model to use. | required |
| lm_model_credentials | str \| None | The credentials to use for the language model. Default is None. | None |
| lm_model_config | dict[str, Any] \| None | The configuration to use for the language model. Default is None. | None |
| **kwargs | | Additional keyword arguments to pass to the underlying Ragas metric. | {} |
RagasContextRecall(lm_model, lm_model_credentials=None, lm_model_config=None, **kwargs)
Bases: RAGASMetric
RAGAS Context Recall metric.
Required Fields:

- query (str): The query to recall the context for.
- generated_response (str): The generated response to recall the context for.
- expected_response (str): The expected response to recall the context for.
- retrieved_contexts (list[str]): The retrieved contexts to recall the context for.
Initialize the RagasContextRecall metric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| lm_model | str \| ModelId \| BaseLMInvoker | The language model to use. | required |
| lm_model_credentials | str \| None | The credentials to use for the language model. Default is None. | None |
| lm_model_config | dict[str, Any] \| None | The configuration to use for the language model. Default is None. | None |
| **kwargs | | Additional keyword arguments to pass to the RagasContextRecall metric. | {} |
RagasFactualCorrectness(lm_model=DefaultValues.MODEL, lm_model_credentials=None, lm_model_config=None, **kwargs)
Bases: RAGASMetric
RAGAS Factual Correctness metric.
Required Fields:

- query (str): The query to evaluate the factual correctness for.
- generated_response (str): The generated response to evaluate the factual correctness for.
Initialize the RagasFactualCorrectness metric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| lm_model | str \| ModelId \| BaseLMInvoker | The language model to use. | MODEL |
| lm_model_credentials | str \| None | The credentials to use for the language model. Default is None. | None |
| lm_model_config | dict[str, Any] \| None | The configuration to use for the language model. Default is None. | None |
| **kwargs | | Additional keyword arguments to pass to the RagasFactualCorrectness metric. | {} |
RedundancyMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None)
Bases: LMBasedMetric
Redundancy metric.
Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| response_schema | ResponseSchema | The response schema to use for the metric. |
| prompt_builder | PromptBuilder | The prompt builder to use for the metric. |
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. |
| model_credentials | str | The model credentials to use for the metric. |
Initialize the RedundancyMetric class.
Default expected input:

- query (str): The query to evaluate the redundancy of the model's output.
- generated_response (str): The generated response to evaluate the redundancy of the model's output.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| prompt_builder | PromptBuilder \| None | The prompt builder to use for the metric. Defaults to default prompt builder. | None |
| response_schema | ResponseSchema \| None | The response schema to use for the metric. Defaults to RedundancyResponseSchema. | None |
TopKAccuracy(k=20)
Bases: BaseMetric
Top K Accuracy metric.
Required fields:

- retrieved_chunks: The retrieved chunk ids with their similarity score.
- ground_truth_chunk_ids: The ground truth chunk ids.

Example:

    data = RetrievalData(
        retrieved_chunks={
            "chunk1": 0.9,
            "chunk2": 0.8,
            "chunk3": 0.7,
        },
        ground_truth_chunk_ids=["chunk1", "chunk2", "chunk3"],
    )
    metric = TopKAccuracy()
    await metric.evaluate(data)
Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| k_values | list[int] | The number of retrieved chunks to consider. |
Initializes the TopKAccuracy.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| k | list[int] \| int | The number of retrieved chunks to consider. Defaults to 20. | 20 |
top_k_accuracy(qrels, results)
Evaluates the top k accuracy.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| qrels | dict[str, dict[str, int]] | The ground truth of the retrieved chunks. There are two possible values: 1 if the chunk is relevant to the query, 0 if it is not. | required |
| results | dict[str, dict[str, float]] | The retrieved chunks with their similarity score. | required |
Returns:

| Type | Description |
|---|---|
| dict[str, float] | The top k accuracy. |
Example:

    qrels = {
        "q1": {"chunk1": 1, "chunk2": 1},
    }
    results = {"q1": {"chunk1": 0.9, "chunk2": 0.8, "chunk3": 0.7}}
    metric = TopKAccuracy(k=[1, 2])
    scores = metric.top_k_accuracy(qrels, results)