Metrics
Metrics module for evaluating AI model outputs.
This module provides a comprehensive collection of evaluation metrics for assessing the quality of generated content, retrieval systems, and AI agent responses. It includes both traditional metrics and LLM-based metrics, as well as integrations with popular evaluation frameworks.
Metric categories:

- Generation metrics: Evaluate quality of generated text (completeness, groundedness, redundancy, language consistency, refusal alignment)
- Retrieval metrics: Assess retrieval system performance (precision, recall, accuracy)
- Agent metrics: Evaluate AI agent behavior and responses
- Open-source integrations: Wrappers for RAGAS, DeepEval, and LangChain evaluators
BaseMetric
Bases: ABC
Abstract class for metrics.
This class defines the interface for all metrics.
Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the metric. |
| required_fields | set[str] | The required fields for this metric to evaluate data. |
| input_type | type \| None | The type of the input data. |
Example
Adding custom prompts to existing evaluator metrics:

```python
import os

from gllm_evals import load_simple_qa_dataset
from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.utils.shared_functionality import inference_fn


async def main():
    # Load your dataset
    dataset = load_simple_qa_dataset()

    # Create evaluator with default metrics
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Add custom prompts polymorphically (works for any metric)
    for metric in evaluator.metrics:
        if hasattr(metric, "name"):  # Ensure metric has a name attribute
            # Add custom prompts based on metric name
            if metric.name == "geval_completeness":
                # Add domain-specific few-shot examples
                metric.additional_context += "\n\nCUSTOM MEDICAL EXAMPLES: ..."
            elif metric.name == "geval_groundedness":
                # Add grounding examples
                metric.additional_context += "\n\nMEDICAL GROUNDING EXAMPLES: ..."

    # Evaluate with custom prompts applied automatically
    results = await evaluate(
        data=dataset,
        inference_fn=inference_fn,
        evaluators=[evaluator],  # Custom prompts applied to metrics
    )
```
can_evaluate(data)
Check if this metric can evaluate the given data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput | The input data to check. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| bool | bool | True if the metric can evaluate the data, False otherwise. |
evaluate(data)
async
Evaluate the metric on the given dataset (single item or batch).
Automatically handles batch processing by default. Subclasses can override
_evaluate to accept lists for optimized batch processing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput \| list[MetricInput] | The data to evaluate the metric on. Can be a single item or a list for batch processing. | required |

Returns:

| Type | Description |
|---|---|
| MetricOutput \| list[MetricOutput] | A dictionary where the keys are the namespaces and the values are the scores. Returns a list if the input is a list. |
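For illustration, a minimal sketch of single-item versus batch evaluation with a concrete subclass. The dict-style input and its keys are assumptions; the actual keys must match the metric's required_fields.

```python
import asyncio


async def run(metric, items):
    # `items` is a list of dict-like MetricInput objects whose keys match the
    # metric's required_fields (the exact input shape is an assumption here).
    if all(metric.can_evaluate(item) for item in items):
        single_result = await metric.evaluate(items[0])  # MetricOutput
        batch_results = await metric.evaluate(items)     # list[MetricOutput]
        return single_result, batch_results


# asyncio.run(run(some_concrete_metric, [{"query": "...", "generated_response": "..."}]))
```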
get_input_fields()
classmethod
Return declared input field names if input_type is provided; otherwise None.
Returns:

| Type | Description |
|---|---|
| list[str] \| None | The input fields. |
get_input_spec()
classmethod
Return structured spec for input fields if input_type is provided; otherwise None.
Returns:

| Type | Description |
|---|---|
| list[dict[str, Any]] \| None | The input spec. |
get_normalized_score(raw_score)
Normalize raw score to 0-1 range based on metric's good_score and bad_score.
This method handles both:

- Different scales (e.g., 1-3 for completeness, 0-1 for language_consistency)
- Inverted scales (e.g., redundancy where lower is better)

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| raw_score | float | The raw score value from the metric evaluation. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| float | float | Normalized score between 0 and 1, where 1 is best and 0 is worst. |
Examples:
>>> # Completeness: good=3, bad=1 (higher is better)
>>> metric.get_normalized_score(2) # Returns 0.5
>>> # Redundancy: good=1, bad=3 (lower is better)
>>> metric.get_normalized_score(2) # Returns 0.5 (inverted)
>>> # Language Consistency: good=1, bad=0 (already 0-1)
>>> metric.get_normalized_score(0.5) # Returns 0.5
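For intuition, the examples above are consistent with a simple linear rescaling between bad_score and good_score. The sketch below is illustrative only and may not match the exact implementation:

```python
def normalize(raw_score: float, good_score: float, bad_score: float) -> float:
    # Maps bad_score -> 0.0 and good_score -> 1.0. Inverted scales
    # (good_score < bad_score) are handled by the sign of the denominator.
    return (raw_score - bad_score) / (good_score - bad_score)


assert normalize(2, good_score=3, bad_score=1) == 0.5    # completeness
assert normalize(2, good_score=1, bad_score=3) == 0.5    # redundancy (inverted)
assert normalize(0.5, good_score=1, bad_score=0) == 0.5  # language consistency
```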
CompletenessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: LMBasedMetric
Completeness metric.
This metric is used to evaluate the completeness of the model's output compared to the expected output.
Available Fields
- query (str): The query.
- generated_response (str): The generated response.
- expected_response (str): The expected response.
Scoring
- 1-3 (Continuous): Scale where 1 is not complete, 2 is incomplete, and 3 is complete.
Cookbook Example
Please refer to example_completeness.py in the gen-ai-sdk-cookbook repository.
Initialize the CompletenessMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| prompt_builder | PromptBuilder \| None | The prompt builder to use for the metric. Defaults to the default prompt builder. | None |
| response_schema | ResponseSchema \| None | The response schema to use for the metric. Defaults to CompletenessResponseSchema. | None |
| batch_status_check_interval | float | Time between batch status checks in seconds. Defaults to 30.0. | BATCH_STATUS_CHECK_INTERVAL |
| batch_max_iterations | int | Maximum number of status check iterations before timeout. Defaults to 120. | BATCH_MAX_ITERATIONS |
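For illustration, a minimal sketch of evaluating one item with this metric. The import path and credential handling are assumptions; see the cookbook example above for the authoritative version.

```python
import asyncio
import os

from gllm_evals.metrics import CompletenessMetric  # import path assumed


async def main():
    metric = CompletenessMetric(model_credentials=os.getenv("OPENAI_API_KEY"))
    result = await metric.evaluate({
        "query": "When was the Eiffel Tower completed?",
        "generated_response": "It was completed in 1889.",
        "expected_response": "The Eiffel Tower was completed in March 1889.",
    })
    print(result)  # scores namespaced by metric name, on the 1-3 scale


asyncio.run(main())
```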
DeepEvalAnswerRelevancyMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Answer Relevancy Metric Integration.
This metric uses LLM-as-a-judge to assess whether the output is relevant to the given input.
Available Fields
- query (str): The query to evaluate the metric. Similar to
inputinLLMTestCaseParams. - generated_response (str | list[str]): The generated response to evaluate the metric. Similar to
actual_outputinLLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
- 0-1 (Continuous): A higher score indicates better answer relevancy.
Cookbook Example
Please refer to example_deepeval_answer_relevancy.py in the gen-ai-sdk-cookbook repository.
Initializes the DeepEvalAnswerRelevancyMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
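A hedged sketch of constructing this metric and the input item it expects, based on the fields above. The import path and credential value are assumptions; the evaluate call must run inside an async context.

```python
from gllm_evals.metrics import DeepEvalAnswerRelevancyMetric  # import path assumed

metric = DeepEvalAnswerRelevancyMetric(
    threshold=0.5,
    model="openai/gpt-4.1",
    model_credentials="YOUR_API_KEY",  # required because `model` is a string
)
data = {
    "query": "How do I reset my password?",
    "generated_response": [
        "Open Settings and choose 'Security'.",
        "Click 'Reset password' and follow the emailed link.",
    ],  # a list is concatenated into a single string before scoring
}
# result = await metric.evaluate(data)  # inside an async context
```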
DeepEvalBiasMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Bias Metric Integration.
This metric uses LLM-as-a-judge to assess whether the LLM application's output contains racial, political, or other forms of offensive bias.
Available Fields
- query (str): The query to evaluate the metric. Similar to
inputinLLMTestCaseParams. - generated_response (str | list[str]): The generated response to evaluate the metric. Similar to
actual_outputinLLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
- 0.0-1.0 (Continuous): Scale where closer to 1.0 means more biased, closer to 0.0 means less biased.
Cookbook Example
Please refer to example_deepeval_bias.py in the gen-ai-sdk-cookbook repository.
Initializes the DeepEvalBiasMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalContextualPrecisionMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, evaluation_template=ContextualPrecisionTemplate)
Bases: DeepEvalMetricFactory
DeepEval Contextual Precision Metric.
Evaluates whether the retrieved context chunks that are relevant to the given query are ranked higher than irrelevant ones. A higher score indicates better contextual precision, meaning relevant context chunks appear earlier in the retrieved results.
Available Fields
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- expected_response (str | list[str]): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
- 0.0-1.0 (Continuous): A higher score indicates better contextual precision.
Cookbook Example
Please refer to example_deepeval_contextual_precision.py in the gen-ai-sdk-cookbook repository.
Initializes the DeepEvalContextualPrecisionMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
| batch_status_check_interval | float | Interval in seconds between batch status checks. Defaults to 30.0. | BATCH_STATUS_CHECK_INTERVAL |
| batch_max_iterations | int | Maximum number of batch status check iterations. Defaults to 120. | BATCH_MAX_ITERATIONS |
| evaluation_template | Type[ContextualPrecisionTemplate] | The evaluation template to use for the metric. Defaults to ContextualPrecisionTemplate. It is used to generate the reason for the metric. | ContextualPrecisionTemplate |
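A hedged sketch of one input item for this metric, using the fields listed above. The order of retrieved_context matters here, since relevant chunks ranked earlier score higher. The import path and credential value are assumptions.

```python
from gllm_evals.metrics import DeepEvalContextualPrecisionMetric  # import path assumed

data = {
    "query": "What is the return policy?",
    "expected_response": "Items can be returned within 30 days with a receipt.",
    "retrieved_context": [
        "Returns are accepted within 30 days of purchase with proof of payment.",
        "Our stores are open 9am-9pm on weekdays.",
    ],  # relevant chunk listed first should raise the precision score
}
# metric = DeepEvalContextualPrecisionMetric(model_credentials="YOUR_API_KEY")
# result = await metric.evaluate(data)  # inside an async context
```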
DeepEvalContextualRecallMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, evaluation_template=ContextualRecallTemplate)
Bases: DeepEvalMetricFactory
DeepEval Contextual Recall Metric.
Evaluates the extent to which the retrieved context aligns with the expected output. A higher score indicates better contextual recall, meaning the retrieval system successfully found the information needed to generate the expected response.
Available Fields
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- expected_response (str | list[str]): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
- 0.0-1.0 (Continuous): A higher score indicates better contextual recall.
Cookbook Example
Please refer to example_deepeval_contextual_recall.py in the gen-ai-sdk-cookbook repository.
Initializes the DeepEvalContextualRecallMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
| batch_status_check_interval | float | Interval in seconds between batch status checks. Defaults to 30.0. | BATCH_STATUS_CHECK_INTERVAL |
| batch_max_iterations | int | Maximum number of batch status check iterations. Defaults to 120. | BATCH_MAX_ITERATIONS |
| evaluation_template | Type[ContextualRecallTemplate] | The evaluation template to use for the metric. Defaults to ContextualRecallTemplate. | ContextualRecallTemplate |
DeepEvalContextualRelevancyMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Contextual Relevancy Metric.
Evaluates the overall relevance of the information presented in the retrieved context for a given query. A higher score indicates better contextual relevancy, meaning the retrieved context chunks contain less irrelevant or tangential information.
Available Fields
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
- 0.0-1.0 (Continuous): A higher score indicates better contextual relevancy.
Cookbook Example
Please refer to example_deepeval_contextual_relevancy.py in the gen-ai-sdk-cookbook repository.
Initializes the DeepEvalContextualRelevancyMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalFaithfulnessMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Faithfulness Metric Integration.
This metric uses LLM-as-a-judge to assess whether the answers rely solely on the retrieved context, without hallucinating or providing misinformation.
Available Fields
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
- 0-1 (Continuous): A higher score indicates better faithfulness.
Cookbook Example
Please refer to example_deepeval_faithfulness.py in the gen-ai-sdk-cookbook repository.
Initializes the DeepEvalFaithfulnessMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalGEvalMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: DeepEvalMetricFactory, PromptExtractionMixin
DeepEval GEval Metric Integration.
This class is a wrapper around DeepEval's GEval class, providing a unified interface for the DeepEval library.
GEval is a versatile evaluation metric framework that can be used to create custom evaluation metrics.
Available Fields
- query (str, optional): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
- expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to context in LLMTestCaseParams. If the expected retrieved context is a str, it will be converted into a list with a single element.
- retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
- 0.0-1.0 (Continuous): Or Boolean depending on the DeepEval GEval configuration.
Initializes the DeepEvalGEvalMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str \| None | The name of the metric. Defaults to None. Required if not provided via _defaults. | None |
| evaluation_params | list[LLMTestCaseParams] \| None | The evaluation parameters. Defaults to None. Required if not provided via _defaults. | None |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| criteria | str \| None | The criteria to use for the metric. Defaults to None. | None |
| evaluation_steps | list[str] \| None | The evaluation steps to use for the metric. Defaults to None. | None |
| rubric | list[Rubric] \| None | The rubric to use for the metric. Defaults to None. | None |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| additional_context | str \| None | Additional context like few-shot examples. Defaults to None. | None |
| batch_status_check_interval | float | Time between batch status checks in seconds. Defaults to 30.0. | BATCH_STATUS_CHECK_INTERVAL |
| batch_max_iterations | int | Maximum number of status check iterations before timeout. Defaults to 120. | BATCH_MAX_ITERATIONS |
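A hedged sketch of defining a custom metric through this wrapper. LLMTestCaseParams is DeepEval's enum of test-case fields; the wrapper's import path, criteria text, and credential value are assumptions for illustration.

```python
from deepeval.test_case import LLMTestCaseParams

from gllm_evals.metrics import DeepEvalGEvalMetric  # import path assumed

metric = DeepEvalGEvalMetric(
    name="geval_tone",
    criteria="Assess whether the generated response keeps a professional tone.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="openai/gpt-4.1",
    model_credentials="YOUR_API_KEY",  # required when `model` is a string
    threshold=0.5,
)
# result = await metric.evaluate({"query": "...", "generated_response": "..."})
```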
evaluate(data)
async
Evaluate with custom prompt lifecycle support.
Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.
For batch processing, checks for custom prompts and processes accordingly. Currently processes items individually; batch optimization can be added in future.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput \| list[MetricInput] | Single data item or list of data items to evaluate. | required |

Returns:

| Type | Description |
|---|---|
| MetricOutput \| list[MetricOutput] | Evaluation results with scores namespaced by metric name. |
get_custom_prompt_base_name()
Get the base name for custom prompt column lookup.
For GEval metrics, removes the 'geval_' prefix to align with CSV column conventions. This fixes Issue #3 by providing polymorphic naming for GEval metrics.
Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The base name without the 'geval_' prefix (e.g., "completeness" instead of "geval_completeness"). |

Example

metric.name = "geval_completeness"
metric.get_custom_prompt_base_name() -> "completeness"

CSV columns expected:

- fewshot_completeness
- fewshot_completeness_mode
- evaluation_step_completeness
get_full_prompt(data)
Get the full prompt that DeepEval generates for this metric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput | The metric input. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The complete prompt (system + user) as a string. |
DeepEvalHallucinationMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Hallucination Metric Integration.
This metric uses LLM-as-a-judge to determine whether the output contains hallucinated or incorrect information based on the retrieved context.
Available Fields
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- expected_retrieved_context (str | list[str]): The expected context to evaluate the metric. Similar to context in LLMTestCaseParams.
Scoring
- 0.0-1.0 (Continuous): Scale where closer to 1.0 means more hallucinated, closer to 0.0 means less hallucinated.
Cookbook Example
Please refer to example_deepeval_hallucination.py in the gen-ai-sdk-cookbook repository.
Initializes the DeepEvalHallucinationMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalJsonCorrectnessMetric(expected_schema, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval JSON Correctness Metric Integration.
This metric evaluates whether a response is JSON correct according to a specified schema. It helps ensure that AI responses follow the expected JSON structure.
Available Fields
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
- 0.0-1.0 (Categorical): 0.0 means the response is not JSON correct according to the schema, 1.0 means the response is JSON correct according to the schema.
Cookbook Example
Please refer to example_deepeval_json_correctness.py in the gen-ai-sdk-cookbook repository.
Initializes the DeepEvalJsonCorrectnessMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| expected_schema | Type[BaseModel] | The expected schema class (not instance) for the response. | required |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If expected_schema is not a valid BaseModel class. |
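A hedged sketch of supplying expected_schema as a Pydantic class. The schema, its field names, the import path, and the credential value are invented for illustration.

```python
from pydantic import BaseModel

from gllm_evals.metrics import DeepEvalJsonCorrectnessMetric  # import path assumed


class AnswerSchema(BaseModel):
    answer: str
    confidence: float


metric = DeepEvalJsonCorrectnessMetric(
    expected_schema=AnswerSchema,  # pass the class itself, not an instance
    model_credentials="YOUR_API_KEY",
)
data = {
    "query": "Return the capital of France as JSON with a confidence score.",
    "generated_response": '{"answer": "Paris", "confidence": 0.98}',
}
# result = await metric.evaluate(data)  # inside an async context
```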
DeepEvalMetric(metric, name)
Bases: BaseMetric
DeepEval Metric.
A wrapper for DeepEval metrics.
Available Fields
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
- expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
- expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to context in LLMTestCaseParams. If the expected retrieved context is a str, it will be converted into a list with a single element.
- retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
- 0.0-1.0 (Continuous): Or Boolean depending on the DeepEval metric.
Initializes the DeepEvalMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| metric | BaseMetric | The DeepEval metric to wrap. | required |
| name | str | The name of the metric. | required |
DeepEvalMetricFactory(name, model, model_credentials, model_config, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, **kwargs)
Bases: DeepEvalMetric, ABC
DeepEval Metric Factory.
Abstract base class for creating DeepEval metrics with a shared model invoker.
Available Fields
- (Dynamic): Depends on the specific DeepEval metric being created.
Scoring
- (Dynamic): Depends on the specific DeepEval metric.
Initializes the metric, handling common model invoker creation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name for the metric. | required |
| model | Union[str, ModelId, BaseLMInvoker] | The model identifier or an existing LM invoker instance. | required |
| model_credentials | Optional[str] | Credentials for the model, required if model is a string. | required |
| model_config | Optional[Dict[str, Any]] | Configuration for the model. | required |
| batch_status_check_interval | float | Time between batch status checks in seconds. Defaults to 30.0. | BATCH_STATUS_CHECK_INTERVAL |
| batch_max_iterations | int | Maximum number of status check iterations before timeout. Defaults to 120. | BATCH_MAX_ITERATIONS |
| **kwargs | | Additional arguments for the specific DeepEval metric constructor. | {} |
DeepEvalMisuseMetric(domain, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Misuse Metric Integration.
This metric evaluates whether a response constitutes misuse of the model within the given domain. It helps ensure that AI responses do not enable harmful or inappropriate use of the model.
Available Fields
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
- 0.0-1.0 (Continuous): Scale where closer to 1.0 means more misuse, closer to 0.0 means less misuse.
Cookbook Example
Please refer to example_deepeval_misuse.py in the gen-ai-sdk-cookbook repository.
Initializes the DeepEvalMisuseMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| domain | str | The domain to evaluate the metric. Common domains include: "finance", "health", "legal", "personal", "investment". | required |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If domain is empty or contains invalid values. |
DeepEvalNonAdviceMetric(advice_types, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Non-Advice Metric Integration.
This metric evaluates whether a response contains inappropriate advice types. It helps ensure that AI responses don't provide harmful or inappropriate advice.
Available Fields
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
- 0.0-1.0 (Continuous): Scale where closer to 0.0 means more inappropriate advice, closer to 1.0 means less inappropriate advice.
Cookbook Example
Please refer to example_deepeval_non_advice.py in the gen-ai-sdk-cookbook repository.
Initializes the DeepEvalNonAdviceMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| advice_types | List[str] | List of advice types to detect as inappropriate. Common types include: ["financial", "medical", "legal", "personal", "investment"]. | required |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If advice_types is empty or contains invalid values. |
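A hedged sketch of using this metric with a couple of advice types. The import path, credential value, and sample texts are assumptions for illustration.

```python
from gllm_evals.metrics import DeepEvalNonAdviceMetric  # import path assumed

metric = DeepEvalNonAdviceMetric(
    advice_types=["financial", "medical"],
    threshold=0.5,
    model="openai/gpt-4.1",
    model_credentials="YOUR_API_KEY",
)
data = {
    "query": "Should I sell all my stocks right now?",
    "generated_response": "I can't give financial advice, but here are some general resources on portfolio risk.",
}
# result = await metric.evaluate(data)  # inside an async context;
# scores near 1.0 indicate the response avoided inappropriate advice
```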
DeepEvalPIILeakageMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval PII Leakage Metric Integration.
This metric uses LLM-as-a-judge to assess whether the LLM application's output contains leaked PII.
Available Fields
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
- 0.0-1.0 (Continuous): Scale where closer to 1.0 means more privacy violations, closer to 0.0 means fewer privacy violations.
Cookbook Example
Please refer to example_deepeval_pii_leakage.py in the gen-ai-sdk-cookbook repository.
Initializes the DeepEvalPIILeakageMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
DeepEvalPromptAlignmentMetric(prompt_instructions, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Prompt Alignment Metric Integration.
This metric evaluates whether a response is aligned with the prompt instructions. It helps ensure that AI responses are aligned with the prompt instructions.
Available Fields
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
- 0.0-1.0 (Continuous): Scale where closer to 1.0 means more aligned with the prompt instructions, closer to 0.0 means less aligned with the prompt instructions.
Cookbook Example
Please refer to example_deepeval_prompt_alignment.py in the gen-ai-sdk-cookbook repository.
Initializes the DeepEvalPromptAlignmentMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| prompt_instructions | List[str] | A list of strings specifying the instructions you want followed in your prompt template. | required |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If prompt_instructions is empty or contains invalid values. |
DeepEvalRoleViolationMetric(role, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Role Violation Metric Integration.
This metric evaluates whether a response violates the assigned role. It helps ensure that AI responses stay within the behavior expected of that role.
Available Fields
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
- 0.0-1.0 (Continuous): Scale where closer to 0.0 means more role violations, closer to 1.0 means fewer role violations.
Cookbook Example
Please refer to example_deepeval_role_violation.py in the gen-ai-sdk-cookbook repository.
Initializes the DeepEvalRoleViolationMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| role | str | The role to evaluate the metric. Common roles include: "helpful customer assistant", "medical insurance agent". | required |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If role is empty or contains invalid values. |
DeepEvalToolCorrectnessMetric(threshold=0.5, model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, include_reason=True, strict_mode=False, should_exact_match=False, should_consider_ordering=False, available_tools=None, evaluation_params=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: DeepEvalMetricFactory
DeepEval Tool Correctness Metric.
This metric evaluates whether an agent correctly selects and calls tools based on the user query and available tool definitions.
Available Fields
- query (str): The input query.
- generated_response (str, optional): The actual output/response.
- expected_response (str, optional): The expected output/response.
- tools_called (list[dict], optional): The tools actually called by the agent. Each dict should have 'name', 'args', and optionally 'output'. If not provided, will be extracted from agent_trajectory.
- expected_tools (list[dict], optional): The expected tools to be called. Each dict should have 'name', 'args', and optionally 'output'. If not provided, will be extracted from expected_agent_trajectory.
- agent_trajectory (list[dict], optional): Agent messages in OpenAI format (role-based with 'user', 'assistant', 'tool' roles). Tool calls are extracted from assistant messages with 'tool_calls' field.
- expected_agent_trajectory (list[dict], optional): Expected agent messages in OpenAI format, same structure as agent_trajectory.
- available_tools (list[dict] | None, optional): All available tools for context. Each tool definition should have 'name', 'description', and 'parameters'.
Scoring
- 0.0-1.0 (Continuous): Scale where closer to 1.0 means more correct tool usage, closer to 0.0 means less correct tool usage.
Cookbook Example
Please refer to example_deepeval_tool_correctness.py in the gen-ai-sdk-cookbook repository.
Initializes DeepEvalToolCorrectnessMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. Also used as good_score for BaseEvaluator's global explanation generation. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | AGENT_EVALS_MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
| include_reason | bool | Include reasoning in output. Defaults to True. | True |
| strict_mode | bool | Binary mode (0 or 1). Defaults to False. | False |
| should_exact_match | bool | Require exact match of tools. Defaults to False. | False |
| should_consider_ordering | bool | Consider order of tools called. Defaults to False. | False |
| available_tools | list[dict[str, Any]] \| None | List of available tool definitions for context. Each tool should have 'name', 'description', and 'parameters'. Defaults to None. | None |
| evaluation_params | list[ToolCallParams] \| None | List of strictness criteria for tool correctness. Available options are ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT. Defaults to [ToolCallParams.INPUT_PARAMETERS, ToolCallParams.OUTPUT] to validate both. | None |
| batch_status_check_interval | float | Interval in seconds between batch status checks. Defaults to 30.0. | BATCH_STATUS_CHECK_INTERVAL |
| batch_max_iterations | int | Maximum number of batch status check iterations. Defaults to 120. | BATCH_MAX_ITERATIONS |
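A hedged sketch of one input item using the fields listed above. The tool name, arguments, and overall dict layout follow the field descriptions (name / args / output) but are invented for illustration.

```python
data = {
    "query": "What's the weather in Jakarta?",
    "tools_called": [
        {"name": "get_weather", "args": {"city": "Jakarta"}, "output": "31°C, sunny"},
    ],
    "expected_tools": [
        {"name": "get_weather", "args": {"city": "Jakarta"}},
    ],
    "available_tools": [
        {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {"city": "string"},
        },
    ],
}
# metric = DeepEvalToolCorrectnessMetric(model_credentials="YOUR_API_KEY")
# result = await metric.evaluate(data)  # inside an async context
```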
DeepEvalToxicityMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)
Bases: DeepEvalMetricFactory
DeepEval Toxicity Metric Integration.
This metric uses LLM-as-a-judge to assess whether a response contains toxic content.
Available Fields
- query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
- generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
- 0.0-1.0 (Continuous): Scale where closer to 1.0 means more toxic, closer to 0.0 means less toxic.
Cookbook Example
Please refer to example_deepeval_toxicity.py in the gen-ai-sdk-cookbook repository.
Initializes the DeepEvalToxicityMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | 'openai/gpt-4.1' |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
GEvalCompletenessMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: DeepEvalGEvalMetric
GEval Completeness Metric.
This metric is used to evaluate the completeness of the generated output.
Available Fields
- query (str): The query to evaluate the completeness of the model's output.
- generated_response (str): The generated response to evaluate the completeness of the model's output.
- expected_response (str): The expected response to evaluate the completeness of the model's output.
Scoring
- 1-3 (Continuous): Scale where 1 means not complete, 2 means incomplete, and 3 means complete.
Cookbook Example
Please refer to example_geval_completeness.py in the gen-ai-sdk-cookbook repository.
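For illustration, a minimal sketch of running this GEval metric. All constructor arguments have defaults, so typically only credentials are needed when `model` is a string; the import path and credential value are assumptions.

```python
import asyncio

from gllm_evals.metrics import GEvalCompletenessMetric  # import path assumed


async def main():
    metric = GEvalCompletenessMetric(model_credentials="YOUR_API_KEY")
    result = await metric.evaluate({
        "query": "List the primary colors.",
        "generated_response": "Red and blue.",
        "expected_response": "Red, blue, and yellow.",
    })
    print(result)  # expected to reflect an incomplete answer on the 1-3 scale


asyncio.run(main())
```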
GEvalContextSufficiencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: DeepEvalGEvalMetric
GEval Context Sufficiency Metric.
This metric is used to evaluate if the context contains enough information to answer the query.
Available Fields
- query (str): The query to evaluate.
- retrieved_context (str | list[str]): The retrieved context to check for sufficiency.
Scoring
- 0-1 (Boolean): Where 0 means insufficient context and 1 means sufficient context.
Cookbook Example
Please refer to example_geval_context_sufficiency.py in the gen-ai-sdk-cookbook repository.
GEvalGroundednessMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: DeepEvalGEvalMetric
GEval Groundedness Metric.
This metric is used to evaluate the groundedness of the generated output.
Available Fields
- query (str): The query to evaluate the groundedness of the model's output.
- generated_response (str): The generated response to evaluate the groundedness of the model's output.
- retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.
Scoring
- 1-3 (Continuous): Scale where 1 means not grounded, 2 means at least one grounded, and 3 means fully grounded.
Cookbook Example
Please refer to example_geval_groundedness.py in the gen-ai-sdk-cookbook repository.
GEvalLanguageConsistencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: DeepEvalGEvalMetric
GEval Language Consistency Metric.
This metric is used to predict whether the generated response is language-consistent with the query.
Available Fields
- query (str): The query to check for language consistency.
- generated_response (str): The generated response to check for language consistency with the query.
Scoring
- 0-1 (Categorical): 0 means not consistent, 1 means fully consistent.
Cookbook Example
Please refer to example_geval_language_consistency.py in the gen-ai-sdk-cookbook repository.
GEvalRedundancyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: DeepEvalGEvalMetric
GEval Redundancy Metric.
This metric is used to evaluate the redundancy of the generated output.
Available Fields
- query (str): The query to evaluate the redundancy of the model's output.
- generated_response (str): The generated response to evaluate the redundancy of the model's output.
Scoring
- 1-3 (Continuous): Scale where 1 means no redundancy, 2 means at least one redundant element, and 3 means high redundancy.
Cookbook Example
Please refer to example_geval_redundancy.py in the gen-ai-sdk-cookbook repository.
get_normalized_score(raw_score)
Normalize raw score to 0-1 range.
For redundancy:

- Score ≤ 2: Good (normalized to 1.0)
- Score ≥ 3: Bad (normalized to 0.0)

This override handles scores outside the [good_score, bad_score] range that would otherwise produce values outside 0-1.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| raw_score | float | The raw score value from the metric evaluation (1-3). | required |

Returns:

| Name | Type | Description |
|---|---|---|
| float | float | Normalized score between 0 and 1, where 1 is best and 0 is worst. |
GEvalRefusalAlignmentMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: DeepEvalGEvalMetric
GEval Refusal Alignment Metric.
This metric evaluates whether the generated response correctly aligns with the expected refusal behavior. It checks if both the expected and generated responses have the same refusal status.
Available Fields
- query (str): The query to evaluate the metric.
- expected_response (str): The expected response to evaluate the metric.
- generated_response (str): The generated response to evaluate the metric.
- is_refusal (bool, optional): Whether the sample should be treated as a refusal response.
Scoring
- 0-1 (Categorical): 0 indicates incorrect alignment, 1 indicates correct alignment.
Cookbook Example
Please refer to example_geval_refusal_alignment.py in the gen-ai-sdk-cookbook repository.
GEvalRefusalMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: DeepEvalGEvalMetric
GEval Refusal Metric.
This metric is used to predict whether the expected response to the question is a refusal.
Available Fields
- query (str): The query to predict if it is a refusal response.
- expected_response (str): The expected response to predict if it is a refusal response.
Scoring
- 0-1 (Categorical): 0 means not refusal, 1 means refusal.
Cookbook Example
Please refer to example_geval_refusal.py in the gen-ai-sdk-cookbook repository.
GEvalSummarizationCoherenceMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: GEvalSummarizationBaseMetric
GEval Summarization Coherence metric.
This metric is used to evaluate the coherence quality of summarization output using GEval.
Available Fields
- input (str): Source text or transcript.
- summary (str): Generated summary.
Scoring
- 1-3 (Continuous): A higher score indicates better coherence.
Cookbook Example
Please refer to example_geval_summarization_coherence.py in the gen-ai-sdk-cookbook repository.
GEvalSummarizationConsistencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: GEvalSummarizationBaseMetric
GEval Summarization Consistency metric.
This metric is used to evaluate factual consistency quality of summarization output using GEval.
Available Fields
- input (str): Source text or transcript.
- summary (str): Generated summary.
Scoring
- 1-3 (Continuous): A higher score indicates better consistency.
Cookbook Example
Please refer to example_geval_summarization_consistency.py in the gen-ai-sdk-cookbook repository.
GEvalSummarizationFluencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: GEvalSummarizationBaseMetric
GEval Summarization Fluency metric.
This metric is used to evaluate fluency quality of summarization output using GEval.
Available Fields
- input (str): Source text or transcript.
- summary (str): Generated summary.
Scoring
- 1-3 (Continuous): A higher score indicates better fluency.
Cookbook Example
Please refer to example_geval_summarization_fluency.py in the gen-ai-sdk-cookbook repository.
GEvalSummarizationRelevanceMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: GEvalSummarizationBaseMetric
GEval Summarization Relevance metric.
This metric is used to evaluate the relevance quality of summarization output using GEval.
Available Fields
- input (str): Source text or transcript.
- summary (str): Generated summary.
Scoring
- 1-3 (Continuous): A higher score indicates better relevance.
Cookbook Example
Please refer to example_geval_summarization_relevance.py in the gen-ai-sdk-cookbook repository.
GroundednessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: LMBasedMetric
Groundedness metric.
This metric is used to evaluate how grounded the generated response is based on the retrieved context.
Available Fields
- query (str): The query to evaluate.
- generated_response (str): The generated response to evaluate.
- retrieved_context (str): The retrieved context to evaluate.
Scoring
- 1-3 (Continuous): Scale where 1 is not grounded, 2 is at least one grounded, and 3 is fully grounded.
Cookbook Example
Please refer to example_groundedness.py in the gen-ai-sdk-cookbook repository.
Initialize the GroundednessMetric class.
Default expected input:

- query (str): The query to evaluate the groundedness of the model's output.
- retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.
- generated_response (str): The generated response to evaluate the groundedness of the model's output.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| prompt_builder | PromptBuilder \| None | The prompt builder to use for the metric. Defaults to the default prompt builder. | None |
| response_schema | ResponseSchema \| None | The response schema to use for the metric. Defaults to GroundednessResponseSchema. | None |
| batch_status_check_interval | float | Time between batch status checks in seconds. Defaults to 30.0. | BATCH_STATUS_CHECK_INTERVAL |
| batch_max_iterations | int | Maximum number of status check iterations before timeout. Defaults to 120. | BATCH_MAX_ITERATIONS |
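A hedged sketch of one input item using the default expected fields listed above. The import path, credential value, and sample texts are assumptions.

```python
from gllm_evals.metrics import GroundednessMetric  # import path assumed

data = {
    "query": "What does the warranty cover?",
    "retrieved_context": "The warranty covers manufacturing defects for 24 months.",
    "generated_response": "It covers manufacturing defects for two years.",
}
# metric = GroundednessMetric(model_credentials="YOUR_API_KEY")
# result = await metric.evaluate(data)  # inside an async context; scored on the 1-3 scale
```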
LMBasedMetric(name, response_schema, prompt_builder, model=DefaultValues.MODEL, model_credentials=None, model_config=None, parse_response_fn=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: BaseMetric
A multi-purpose LM-based metric class.
This class is a general-purpose LM-based metric that can be adapted to different evaluation tasks by providing a response schema, a prompt builder, a model, and model credentials.
Available Fields
- (Dynamic): Depends on the prompt_builder and the specific metric implementation.
Scoring
- (Dynamic): Depends on the specific metric implementation and response validation.
Initialize the LMBasedMetric class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| response_schema | ResponseSchema | The response schema to use for the metric. | required |
| prompt_builder | PromptBuilder | The prompt builder to use for the metric. | required |
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| parse_response_fn | Callable[[str \| LMOutput], MetricOutput] \| None | The function to use to parse the response from the LM. Defaults to a function that parses the response from the LM. | None |
| batch_status_check_interval | float | Time between batch status checks in seconds. Defaults to 30.0. | BATCH_STATUS_CHECK_INTERVAL |
| batch_max_iterations | int | Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval). | BATCH_MAX_ITERATIONS |
evaluate(data)
async
Evaluate with custom prompt lifecycle support.
Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.
For batch processing, uses efficient batch API when all items have the same custom prompts. Falls back to per-item processing when items have different custom prompts.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput \| list[MetricInput] | Single data item or list of data items to evaluate. | required |

Returns:

| Type | Description |
|---|---|
| MetricOutput \| list[MetricOutput] | Evaluation results with scores namespaced by metric name. |
LangChainAgentEvalsLLMAsAJudgeMetric(name, prompt, model, credentials=None, config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: LangChainAgentEvalsMetric
LangChain AgentEvals LLM as a Judge Metric.
A metric that uses LangChain AgentEvals to evaluate an agent, using an LLM as the judge.
Available Fields
- agent_trajectory (list[dict[str, Any]]): The agent trajectory.
- expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.
Scoring
- 0.0-1.0 (Continuous): An evaluation score assigned by the LLM judge based on the trajectory.
Initialize the LangChainAgentEvalsLLMAsAJudgeMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| prompt | str | The evaluation prompt; can be a string template, LangChain prompt template, or callable that returns a list of chat messages. Note that the default prompt allows a rubric in addition to the typical "inputs", "outputs", and "reference_outputs" parameters. | required |
| model | str \| ModelId \| BaseLMInvoker | The model to use. | required |
| credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result. Defaults to "trajectory_accuracy". | 'trajectory_accuracy' |
| continuous | bool | If True, the score will be a float between 0 and 1. If False, the score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes an explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
| batch_status_check_interval | float | Interval in seconds between batch status checks. Defaults to 30.0. | BATCH_STATUS_CHECK_INTERVAL |
| batch_max_iterations | int | Maximum number of batch status check iterations. Defaults to 120. | BATCH_MAX_ITERATIONS |
evaluate(data)
async
Evaluate with custom prompt lifecycle support.
Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.
For batch processing, uses efficient batch API when all items have the same custom prompts. Falls back to per-item processing when items have different custom prompts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput \| list[MetricInput] | Single data item or list of data items to evaluate. | required |
Returns:
| Type | Description |
|---|---|
| MetricOutput \| list[MetricOutput] | Evaluation results with scores namespaced by metric name. |
LangChainAgentEvalsMetric(name, evaluator)
Bases: BaseMetric
LangChain AgentEvals Metric.
A metric that uses LangChain AgentEvals to evaluate an agent.
Available Fields
- agent_trajectory (list[dict[str, Any]]): The agent trajectory.
- expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.
Scoring
- 0.0-1.0 (Continuous): An evaluation score based on the trajectory.
Initialize the LangChainAgentEvalsMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| evaluator | SimpleAsyncEvaluator | The evaluator to use. | required |
LangChainAgentTrajectoryAccuracyMetric(model, prompt=None, model_credentials=None, model_config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, use_reference=True, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: LangChainAgentEvalsLLMAsAJudgeMetric
LangChain Agent Trajectory Accuracy Metric.
A metric that uses LangChain AgentEvals to evaluate the trajectory accuracy of the agent.
Available Fields
- agent_trajectory (list[dict[str, Any]]): The agent trajectory.
- expected_agent_trajectory (list[dict[str, Any]] | None, optional): The expected agent trajectory.
Scoring
- 0-1 (Continuous/Boolean): A higher score indicates better trajectory accuracy.
Cookbook Example
Please refer to example_langchain_agent_trajectory_accuracy.py in the gen-ai-sdk-cookbook repository.
Initialize the LangChainAgentTrajectoryAccuracyMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | required |
| prompt | str \| None | The prompt to use. Defaults to None. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result. Defaults to "trajectory_accuracy". | 'trajectory_accuracy' |
| continuous | bool | If True, the score will be a float between 0 and 1. If False, the score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes an explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
| use_reference | bool | If True, uses the expected agent trajectory to evaluate trajectory accuracy. If False, the TRAJECTORY_ACCURACY_CUSTOM_PROMPT is used instead. Ignored if a custom prompt is provided. Defaults to True. | True |
| batch_status_check_interval | float | Interval in seconds between batch status checks. Defaults to 30.0. | BATCH_STATUS_CHECK_INTERVAL |
| batch_max_iterations | int | Maximum number of batch status check iterations. Defaults to 120. | BATCH_MAX_ITERATIONS |
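A minimal usage sketch, assuming the import path below and OpenAI-style message dicts for the trajectories (the exact trajectory format follows LangChain AgentEvals conventions and may differ in your setup):

```python
import asyncio
import os

# Assumed import path -- adjust to your installation.
from gllm_evals.metrics import LangChainAgentTrajectoryAccuracyMetric


async def main():
    metric = LangChainAgentTrajectoryAccuracyMetric(
        model="openai/gpt-4o-mini",  # assumed model identifier; any str | ModelId | BaseLMInvoker
        model_credentials=os.getenv("OPENAI_API_KEY"),
        continuous=True,  # return a float in [0, 1] instead of a boolean
    )

    trajectory = [
        {"role": "user", "content": "What is the weather in Jakarta?"},
        {"role": "assistant", "content": "", "tool_calls": [{"name": "get_weather", "args": {"city": "Jakarta"}}]},
        {"role": "tool", "content": "31°C, sunny"},
        {"role": "assistant", "content": "It is 31°C and sunny in Jakarta."},
    ]

    result = await metric.evaluate(
        {
            "agent_trajectory": trajectory,
            "expected_agent_trajectory": trajectory,  # optional when use_reference=False
        }
    )
    print(result)


asyncio.run(main())
```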
LangChainConcisenessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsLLMAsAJudgeMetric
LangChain Conciseness Metric.
A metric that uses LangChain and OpenEvals to evaluate the conciseness of the LLM.
Available Fields
- query (str): The query to evaluate the conciseness of.
- generated_response (str): The generated response to evaluate the conciseness of.
Scoring
- 0-1 (Continuous/Boolean): A higher score indicates better conciseness.
Cookbook Example
Please refer to example_langchain_conciseness.py in the gen-ai-sdk-cookbook repository.
Initialize the LangChainConcisenessMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | MODEL |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result. Defaults to "score". | 'score' |
| continuous | bool | If True, the score will be a float between 0 and 1. If False, the score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes an explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
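A minimal sketch of evaluating conciseness on a single query/response pair (import path assumed; the default model is used, so only credentials are passed):

```python
import asyncio
import os

# Assumed import path -- adjust to your installation.
from gllm_evals.metrics import LangChainConcisenessMetric


async def main():
    metric = LangChainConcisenessMetric(
        model_credentials=os.getenv("GOOGLE_API_KEY"),
        continuous=True,  # float score in [0, 1] instead of a boolean verdict
    )
    result = await metric.evaluate(
        {
            "query": "In one sentence, what does this library do?",
            "generated_response": "It provides metrics for evaluating generated text, retrieval, and agents.",
        }
    )
    print(result)


asyncio.run(main())
```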
LangChainCorrectnessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsLLMAsAJudgeMetric
LangChain Correctness Metric.
A metric that uses LangChain and OpenEvals to evaluate the correctness of the LLM.
Available Fields
- query (str): The query to evaluate the correctness of.
- generated_response (str): The generated response to evaluate the correctness of.
- expected_response (str): The expected response to evaluate the correctness of.
Scoring
- 0-1 (Continuous/Boolean): A higher score indicates better correctness.
Cookbook Example
Please refer to example_langchain_correctness.py in the gen-ai-sdk-cookbook repository.
Initialize the LangChainCorrectnessMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | MODEL |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result. Defaults to "score". | 'score' |
| continuous | bool | If True, the score will be a float between 0 and 1. If False, the score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes an explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
LangChainGroundednessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsLLMAsAJudgeMetric
LangChain Groundedness Metric.
A metric that uses LangChain and OpenEvals to evaluate the groundedness of the LLM.
Available Fields
- generated_response (str | list[str]): The generated response to evaluate the groundedness of.
- retrieved_context (str | list[str]): The retrieved context to evaluate the groundedness of.
Scoring
- 0-1 (Continuous/Boolean): A higher score indicates better groundedness.
Cookbook Example
Please refer to example_langchain_groundedness.py in the gen-ai-sdk-cookbook repository.
Initialize the LangChainGroundednessMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | MODEL |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result. Defaults to "score". | 'score' |
| continuous | bool | If True, the score will be a float between 0 and 1. If False, the score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes an explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
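A minimal sketch of checking a response against its retrieved context (import path assumed; default model, credentials only):

```python
import asyncio
import os

# Assumed import path -- adjust to your installation.
from gllm_evals.metrics import LangChainGroundednessMetric


async def main():
    metric = LangChainGroundednessMetric(model_credentials=os.getenv("GOOGLE_API_KEY"))
    result = await metric.evaluate(
        {
            "generated_response": "The warranty covers parts and labour for two years.",
            "retrieved_context": [
                "Warranty terms: parts and labour are covered for 24 months from the date of purchase.",
            ],
        }
    )
    print(result)


asyncio.run(main())
```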
LangChainHallucinationMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsLLMAsAJudgeMetric
LangChain Hallucination Metric.
A metric that uses LangChain and OpenEvals to evaluate whether the LLM's response contains hallucinations.
Available Fields
- query (str): The query to evaluate the hallucination of.
- generated_response (str): The generated response to evaluate the hallucination of.
- expected_retrieved_context (str): The expected retrieved context to evaluate the hallucination of.
- expected_response (str, optional): Additional information to help the model evaluate the hallucination.
Scoring
- 0-1 (Continuous/Boolean): 0 indicates no hallucination, 1 indicates hallucination.
Cookbook Example
Please refer to example_langchain_hallucination.py in the gen-ai-sdk-cookbook repository.
Initialize the LangChainHallucinationMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | MODEL |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result. Defaults to "score". | 'score' |
| continuous | bool | If True, the score will be a float between 0 and 1. If False, the score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes an explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
LangChainHelpfulnessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsLLMAsAJudgeMetric
LangChain Helpfulness Metric.
A metric that uses LangChain and OpenEvals to evaluate the helpfulness of the LLM.
Available Fields
- query (str): The query to evaluate the helpfulness of.
- generated_response (str): The generated response to evaluate the helpfulness of.
Scoring
- 0-1 (Continuous/Boolean): A higher score indicates better helpfulness.
Cookbook Example
Please refer to example_langchain_helpfulness.py in the gen-ai-sdk-cookbook repository.
Initialize the LangChainHelpfulnessMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| ModelId \| BaseLMInvoker | The model to use. | MODEL |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| model_credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result. Defaults to "score". | 'score' |
| continuous | bool | If True, the score will be a float between 0 and 1. If False, the score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes an explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
LangChainOpenEvalsLLMAsAJudgeMetric(name, prompt, model, system=None, credentials=None, config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)
Bases: LangChainOpenEvalsMetric
LangChain OpenEvals LLM as a Judge Metric.
A metric that uses LangChain and OpenEvals to evaluate the LLM as a judge.
Available Fields
- query (str | None, optional): The query / inputs to evaluate.
- generated_response (str | list[str] | None, optional): The generated response / outputs to evaluate.
- expected_response (str | list[str] | None, optional): The expected response / reference outputs to evaluate.
- expected_retrieved_context (str | list[str] | None, optional): The expected retrieved context / reference context.
- retrieved_context (str | list[str] | None, optional): The list of retrieved contexts.
Scoring
- 0.0-1.0 (Continuous): An evaluation score assigned by the LLM judge.
Initialize the LangChainOpenEvalsLLMAsAJudgeMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| prompt | str | The evaluation prompt; can be a string template, LangChain prompt template, or callable that returns a list of chat messages. | required |
| model | str \| ModelId \| BaseLMInvoker | The model to use. | required |
| system | str \| None | Optional system message to prepend to the prompt. | None |
| credentials | str \| None | The credentials to use for the model. Defaults to None. | None |
| config | dict[str, Any] \| None | The config to use for the model. Defaults to None. | None |
| schema | ResponseSchema \| None | The schema to use for the model. Defaults to None. | None |
| feedback_key | str | Key used to store the evaluation result. Defaults to "score". | 'score' |
| continuous | bool | If True, the score will be a float between 0 and 1. If False, the score will be boolean. Defaults to False. | False |
| choices | list[float] \| None | Optional list of specific float values the score must be chosen from. Defaults to None. | None |
| use_reasoning | bool | If True, includes an explanation for the score in the output. Defaults to True. | True |
| few_shot_examples | list[FewShotExample] \| None | Optional list of example evaluations to append to the prompt. Defaults to None. | None |
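A minimal sketch of building a custom judge on top of this class. The template variables follow the OpenEvals convention ("inputs", "outputs", "reference_outputs"); the import path, model identifier, and politeness prompt are illustrative assumptions:

```python
import asyncio
import os

# Assumed import path -- adjust to your installation.
from gllm_evals.metrics import LangChainOpenEvalsLLMAsAJudgeMetric

# Hypothetical custom criterion; template variables follow the OpenEvals convention.
POLITENESS_PROMPT = (
    "Rate how polite the response below is.\n\n"
    "Question: {inputs}\n"
    "Response: {outputs}"
)


async def main():
    metric = LangChainOpenEvalsLLMAsAJudgeMetric(
        name="politeness",
        prompt=POLITENESS_PROMPT,
        model="openai/gpt-4o-mini",  # assumed model identifier
        credentials=os.getenv("OPENAI_API_KEY"),
        feedback_key="politeness",
        continuous=True,
    )
    result = await metric.evaluate(
        {
            "query": "Where is my order?",
            "generated_response": "It shipped yesterday and should arrive on Friday.",
        }
    )
    print(result)


asyncio.run(main())
```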
evaluate(data)
async
Evaluate with custom prompt lifecycle support.
Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.
For batch processing, checks for custom prompts and processes accordingly. Items are currently processed individually; batch optimization may be added in the future.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput \| list[MetricInput] | Single data item or list of data items to evaluate. | required |
Returns:
| Type | Description |
|---|---|
| MetricOutput \| list[MetricOutput] | Evaluation results with scores namespaced by metric name. |
LangChainOpenEvalsMetric(name, evaluator)
Bases: BaseMetric
LangChain OpenEvals Metric.
A metric that uses LangChain and OpenEvals.
Available Fields
- query (str | None, optional): The query / inputs to evaluate.
- generated_response (str | list[str] | None, optional): The generated response / outputs to evaluate.
- expected_response (str | list[str] | None, optional): The expected response / reference outputs to evaluate.
- expected_retrieved_context (str | list[str] | None, optional): The expected retrieved context / reference context.
- retrieved_context (str | list[str] | None, optional): The list of retrieved contexts.
Scoring
- 0.0-1.0 (Continuous): Depending on the specific OpenEval metric.
Initialize the LangChainOpenEvalsMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the metric. | required |
| evaluator | Union[SimpleAsyncEvaluator, Callable[..., Awaitable[Any]]] | The evaluator to use. | required |
LanguageConsistencyMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: LMBasedMetric
Language Consistency Metric.
This metric is used to evaluate whether the language of the generated response is consistent with the query.
Available Fields
- query (str): The query.
- generated_response (list[str]): The generated response.
Scoring
- 0-1 (Categorical): 0 means not consistent, 1 means fully consistent.
Cookbook Example
Please refer to example_language_consistency.py in the gen-ai-sdk-cookbook repository.
Initialize the LanguageConsistencyMetric class.
Default expected input:
- query (str): The query to evaluate the language consistency of the model's output.
- generated_response (str): The generated response to evaluate the language consistency of the model's output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| prompt_builder | PromptBuilder \| None | The prompt builder to use for the metric. Defaults to default prompt builder. | None |
| response_schema | ResponseSchema \| None | The response schema to use for the metric. Defaults to LanguageConsistencyResponseSchema. | None |
| batch_status_check_interval | float | Time between batch status checks in seconds. Defaults to 30.0. | BATCH_STATUS_CHECK_INTERVAL |
| batch_max_iterations | int | Maximum number of status check iterations before timeout. Defaults to 120. | BATCH_MAX_ITERATIONS |
PyTrecMetric(metrics=None, k=20)
Bases: BaseMetric
PyTrec Metric.
A wrapper for pytrec_eval to evaluate common Information Retrieval (IR) metrics.
This metric allows you to compute various standard IR scores such as NDCG, MAP, MRR, and reciprocal rank, based on retrieved chunks and ground truth chunk IDs.
Available Fields
- retrieved_chunks (dict[str, float]): The retrieved chunk ids with their similarity score.
- ground_truth_chunk_ids (list[str]): The ground truth chunk ids.
Scoring
- 0.0-1.0 (Continuous): A higher score indicates better retrieval performance.
Cookbook Example
Please refer to example_pytrec_metric.py in the gen-ai-sdk-cookbook repository.
Initializes the PyTrecMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metrics | list[PyTrecEvalMetric \| str] \| set[PyTrecEvalMetric \| str] \| None | The metrics to evaluate. Defaults to all metrics. | None |
| k | int \| list[int] | The number of retrieved chunks to consider. Defaults to 20. | 20 |
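A minimal sketch of computing IR scores for a single query (import path assumed; the metric names are standard pytrec_eval measure identifiers and may need adjusting for your setup):

```python
import asyncio

# Assumed import path -- adjust to your installation.
from gllm_evals.metrics import PyTrecMetric


async def main():
    # Restrict evaluation to two pytrec_eval measures at k=5; by default all
    # supported measures are computed at k=20.
    metric = PyTrecMetric(metrics=["ndcg_cut", "recip_rank"], k=5)
    result = await metric.evaluate(
        {
            "retrieved_chunks": {"chunk1": 0.92, "chunk2": 0.85, "chunk3": 0.40},
            "ground_truth_chunk_ids": ["chunk2"],
        }
    )
    print(result)


asyncio.run(main())
```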
RAGASMetric(metric, name=None, callbacks=None, timeout=None)
Bases: BaseMetric
RAGAS Metric.
RAGAS is a metric for evaluating the quality of RAG systems.
Available Fields
- query (str): The query to evaluate the metric. Similar to user_input in SingleTurnSample.
- generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to response in SingleTurnSample. If the generated response is a list, the responses are concatenated into a single string. For multiple responses, use list[str].
- expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to reference in SingleTurnSample. If the expected response is a list, the responses are concatenated into a single string.
- expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to reference_contexts in SingleTurnSample. If the expected retrieved context is a str, it will be converted into a list with a single element.
- retrieved_context (str | list[str], optional): The retrieved context to evaluate the metric. Similar to retrieved_contexts in SingleTurnSample. If the retrieved context is a str, it will be converted into a list with a single element.
- rubrics (dict[str, str], optional): The rubrics to evaluate the metric. Similar to rubrics in SingleTurnSample.
Scoring
- 0.0-1.0 (Continuous): A score evaluating the RAG aspect being tested.
Initialize the RAGASMetric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metric | SingleTurnMetric | The Ragas metric to use. | required |
| name | str | The name of the metric. Defaults to the name of the wrapped Ragas metric. | None |
| callbacks | Callbacks | The callbacks to use. Default is None. | None |
| timeout | int | The timeout for the metric. Default is None. | None |
evaluate(data)
async
Evaluate with custom prompt lifecycle support.
Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.
For batch processing, uses efficient parallel processing when all items have the same custom prompts. Falls back to per-item processing when items have different custom prompts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput \| list[MetricInput] | Single data item or list of data items to evaluate. | required |
Returns:
| Type | Description |
|---|---|
| MetricOutput \| list[MetricOutput] | Evaluation results with scores namespaced by metric name. |
RagasContextPrecisionWithoutReference(lm_model, lm_model_credentials=None, lm_model_config=None, **kwargs)
Bases: RAGASMetric
RAGAS Context Precision Metric.
Measures the proportion of relevant chunks in the retrieved contexts without requiring a ground truth reference. It evaluates whether the retrieved context chunks are actually useful for generating the provided response to the user's query.
Available Fields
- query (str): The query to recall the context for.
- generated_response (str): The generated response to recall the context for.
- retrieved_contexts (list[str]): The retrieved contexts to recall the context for.
Scoring
- 0.0-1.0 (Continuous): A higher score indicates better context precision.
Cookbook Example
Please refer to example_ragas_context_precision.py in the gen-ai-sdk-cookbook repository.
Initialize the RagasContextPrecisionWithoutReference metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| lm_model | str \| ModelId \| BaseLMInvoker | The language model to use. | required |
| lm_model_credentials | str \| None | The credentials to use for the language model. Default is None. | None |
| lm_model_config | dict[str, Any] \| None | The configuration to use for the language model. Default is None. | None |
| **kwargs | | Additional keyword arguments to pass to the underlying Ragas context precision metric. | {} |
RagasContextRecall(lm_model, lm_model_credentials=None, lm_model_config=None, **kwargs)
Bases: RAGASMetric
RAGAS Context Recall Metric.
Measures how many of the relevant documents (or pieces of information) needed to answer the query were successfully retrieved. It evaluates the retrieval system's ability to find all the necessary context based on the generated response and the expected response.
Available Fields
- query (str): The query to recall the context for.
- generated_response (str): The generated response to recall the context for.
- expected_response (str): The expected response to recall the context for.
- retrieved_contexts (list[str]): The retrieved contexts to recall the context for.
Scoring
- 0.0-1.0 (Continuous): A higher score indicates better context recall.
Cookbook Example
Please refer to example_ragas_context_recall.py in the gen-ai-sdk-cookbook repository.
Initialize the RagasContextRecall metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| lm_model | str \| ModelId \| BaseLMInvoker | The language model to use. | required |
| lm_model_credentials | str \| None | The credentials to use for the language model. Default is None. | None |
| lm_model_config | dict[str, Any] \| None | The configuration to use for the language model. Default is None. | None |
| **kwargs | | Additional keyword arguments to pass to the RagasContextRecall metric. | {} |
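A minimal sketch of measuring context recall for one sample (import path and model identifier assumed; the retrieved context field name follows the RAGASMetric base class described above):

```python
import asyncio
import os

# Assumed import path -- adjust to your installation.
from gllm_evals.metrics import RagasContextRecall


async def main():
    metric = RagasContextRecall(
        lm_model="openai/gpt-4o-mini",  # assumed model identifier
        lm_model_credentials=os.getenv("OPENAI_API_KEY"),
    )
    result = await metric.evaluate(
        {
            "query": "When was the company founded?",
            "generated_response": "The company was founded in 1998.",
            "expected_response": "It was founded in 1998 in Jakarta.",
            "retrieved_context": ["Founded in 1998, the company is headquartered in Jakarta."],
        }
    )
    print(result)


asyncio.run(main())
```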
RagasFactualCorrectness(lm_model=DefaultValues.MODEL, lm_model_credentials=None, lm_model_config=None, **kwargs)
Bases: RAGASMetric
RAGAS Factual Correctness metric.
This metric evaluates the factual accuracy of the generated response against the reference.
Available Fields
- query (str): The query.
- generated_response (str): The generated response.
Scoring
- 0-1 (Continuous): A higher score indicates better factual correctness.
Cookbook Example
Please refer to example_ragas_factual_correctness.py in the gen-ai-sdk-cookbook repository.
Initialize the RagasFactualCorrectness metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| lm_model | str \| ModelId \| BaseLMInvoker | The language model to use. | MODEL |
| lm_model_credentials | str \| None | The credentials to use for the language model. Default is None. | None |
| lm_model_config | dict[str, Any] \| None | The configuration to use for the language model. Default is None. | None |
| **kwargs | | Additional keyword arguments to pass to the RagasFactualCorrectness metric. | {} |
RedundancyMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: LMBasedMetric
Redundancy metric.
This metric is used to evaluate the redundancy of the model's output.
Available Fields
- query (str): The query.
- generated_response (str): The generated response.
Scoring
- 1-3 (Continuous): A scale where 1 means no redundancy, 2 means at least one redundant statement, and 3 means high redundancy.
Cookbook Example
Please refer to example_redundancy.py in the gen-ai-sdk-cookbook repository.
Initialize the RedundancyMetric class.
Default expected input:
- query (str): The query to evaluate the redundancy of the model's output.
- generated_response (str): The generated response to evaluate the redundancy of the model's output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| prompt_builder | PromptBuilder \| None | The prompt builder to use for the metric. Defaults to default prompt builder. | None |
| response_schema | ResponseSchema \| None | The response schema to use for the metric. Defaults to RedundancyResponseSchema. | None |
| batch_status_check_interval | float | Time between batch status checks in seconds. Defaults to 30.0. | BATCH_STATUS_CHECK_INTERVAL |
| batch_max_iterations | int | Maximum number of status check iterations before timeout. Defaults to 120. | BATCH_MAX_ITERATIONS |
RefusalAlignmentMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: LMBasedMetric
Refusal Alignment metric.
This metric evaluates whether the generated response correctly aligns with the expected refusal behavior. It checks if both the expected and generated responses have the same refusal status.
Available Fields
- query (str): The query to evaluate the metric.
- expected_response (str): The expected response to evaluate the metric.
- generated_response (str): The generated response to evaluate the metric.
- is_refusal (bool, optional): Whether the sample should be treated as a refusal response.
Scoring
- 0-1 (Categorical): 0 indicates incorrect alignment, 1 indicates correct alignment.
Cookbook Example
Please refer to example_refusal_alignment.py in the gen-ai-sdk-cookbook repository.
Initialize the RefusalAlignmentMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| prompt_builder | PromptBuilder \| None | The prompt builder to use for the metric. Defaults to default prompt builder. | None |
| response_schema | ResponseSchema \| None | The response schema to use for the metric. Defaults to RefusalAlignmentResponseSchema. | None |
| batch_status_check_interval | float | Time between batch status checks in seconds. Defaults to 30.0. | BATCH_STATUS_CHECK_INTERVAL |
| batch_max_iterations | int | Maximum number of status check iterations before timeout. Defaults to 120. | BATCH_MAX_ITERATIONS |
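A minimal sketch of checking refusal alignment between an expected and a generated response (import path assumed; default model, credentials only):

```python
import asyncio
import os

# Assumed import path -- adjust to your installation.
from gllm_evals.metrics import RefusalAlignmentMetric


async def main():
    metric = RefusalAlignmentMetric(model_credentials=os.getenv("GOOGLE_API_KEY"))
    result = await metric.evaluate(
        {
            "query": "Please share another customer's home address.",
            "expected_response": "I can't share other customers' personal information.",
            "generated_response": "Sorry, I'm not able to share that information.",
            "is_refusal": True,  # optional hint that this sample expects a refusal
        }
    )
    print(result)  # 1 when both responses share the same refusal status, 0 otherwise


asyncio.run(main())
```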
RefusalMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: LMBasedMetric
Refusal metric.
This metric is used to evaluate whether the model's output is a refusal.
Available Fields
- query (str): The query.
- expected_response (str): The expected response.
Scoring
- 0-1 (Categorical): 0 means the response is not a refusal, 1 means it is a refusal.
Cookbook Example
Please refer to example_refusal.py in the gen-ai-sdk-cookbook repository.
Initialize the RefusalMetric class.
Default expected input:
- query (str): The query to evaluate the refusal of the model's output.
- expected_response (str): The expected response to evaluate the refusal of the model's output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Union[str, ModelId, BaseLMInvoker] | The model to use for the metric. | MODEL |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to an empty dictionary. | None |
| prompt_builder | PromptBuilder \| None | The prompt builder to use for the metric. Defaults to default prompt builder. | None |
| response_schema | ResponseSchema \| None | The response schema to use for the metric. Defaults to RefusalResponseSchema. | None |
| batch_status_check_interval | float | Time between batch status checks in seconds. Defaults to 30.0. | BATCH_STATUS_CHECK_INTERVAL |
| batch_max_iterations | int | Maximum number of status check iterations before timeout. Defaults to 120. | BATCH_MAX_ITERATIONS |
TopKAccuracy(k=20)
Bases: BaseMetric
Top-K Accuracy Metric.
Evaluates whether the ground truth chunk IDs are present within the top K retrieved chunks. This is a boolean-style hit/miss metric averaged over the dataset; a score of 1.0 means the relevant document was always found in the top K results.
Available Fields
- retrieved_chunks (dict[str, float]): The retrieved chunk ids with their similarity score.
- ground_truth_chunk_ids (list[str]): The ground truth chunk ids.
Scoring
- 0.0-1.0 (Continuous): A higher score indicates better top-k accuracy.
Cookbook Example
Please refer to example_top_k_accuracy.py in the gen-ai-sdk-cookbook repository.
Initializes the TopKAccuracy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| k | list[int] \| int | The number of retrieved chunks to consider. Defaults to 20. | 20 |
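A minimal sketch of computing hit rates at several cut-offs (import path assumed):

```python
import asyncio

# Assumed import path -- adjust to your installation.
from gllm_evals.metrics import TopKAccuracy


async def main():
    # Evaluate hit/miss at multiple cut-offs in one pass.
    metric = TopKAccuracy(k=[1, 5, 20])
    result = await metric.evaluate(
        {
            "retrieved_chunks": {"chunk1": 0.9, "chunk2": 0.8, "chunk3": 0.7},
            "ground_truth_chunk_ids": ["chunk2"],
        }
    )
    print(result)


asyncio.run(main())
```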
top_k_accuracy(qrels, results)
Evaluates the top k accuracy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| qrels | dict[str, dict[str, int]] | The ground truth of the retrieved chunks. There are two possible values: 1 means the chunk is relevant to the query, and 0 means the chunk is not relevant to the query. | required |
| results | dict[str, dict[str, float]] | The retrieved chunks with their similarity score. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, float] | dict[str, float]: The top k accuracy. |
Example
qrels = {
"q1": {"chunk1": 1, "chunk2": 1},
}
results = {"q1": {"chunk1": 0.9, "chunk2": 0.8, "chunk3": 0.7}}