Metrics

Metrics module for evaluating AI model outputs.

This module provides a comprehensive collection of evaluation metrics for assessing the quality of generated content, retrieval systems, and AI agent responses. It includes both traditional metrics and LLM-based metrics, as well as integrations with popular evaluation frameworks.

Metric categories:
  • Generation metrics: Evaluate quality of generated text (completeness, groundedness, redundancy, language consistency, refusal alignment)
  • Retrieval metrics: Assess retrieval system performance (precision, recall, accuracy)
  • Agent metrics: Evaluate AI agent behavior and responses
  • Open-source integrations: Wrappers for RAGAS, DeepEval, and LangChain evaluators

BaseMetric

Bases: ABC

Abstract class for metrics.

This class defines the interface for all metrics.

Attributes:

Name Type Description
name str

The name of the metric.

required_fields set[str]

The required fields for this metric to evaluate data.

input_type type | None

The type of the input data.

Example

Adding custom prompts to existing evaluator metrics:

import os

from gllm_evals import load_simple_qa_dataset
from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.utils.shared_functionality import inference_fn


async def main():
    # Main function with custom prompts

    # Load your dataset
    dataset = load_simple_qa_dataset()

    # Create evaluator with default metrics
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Add custom prompts polymorphically (works for any metric)
    for metric in evaluator.metrics:
        if hasattr(metric, 'name'):  # Ensure metric has name attribute
            # Add custom prompts based on metric name
            if metric.name == "geval_completeness":
                # Add domain-specific few-shot examples
                metric.additional_context += "\n\nCUSTOM MEDICAL EXAMPLES: ..."
            elif metric.name == "geval_groundedness":
                # Add grounding examples
                metric.additional_context += "\n\nMEDICAL GROUNDING EXAMPLES: ..."

    # Evaluate with custom prompts applied automatically
    results = await evaluate(
        data=dataset,
        inference_fn=inference_fn,
        evaluators=[evaluator],  # ← Custom prompts applied to metrics
    )

can_evaluate(data)

Check if this metric can evaluate the given data.

Parameters:

Name Type Description Default
data MetricInput

The input data to check.

required

Returns:

Name Type Description
bool bool

True if the metric can evaluate the data, False otherwise.
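A plausible implementation of this check, expressed in terms of the metric's required_fields attribute (a minimal sketch; the actual logic may differ, e.g. by also validating types against input_type):

```python
def can_evaluate(required_fields, data):
    # The metric can run only when every required field is present
    # and non-None in the input mapping.
    return all(data.get(field) is not None for field in required_fields)

# A completeness-style metric needs query, generated and expected responses:
can_evaluate(
    {"query", "generated_response", "expected_response"},
    {"query": "Q?", "generated_response": "A.", "expected_response": "A!"},
)  # True
```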

evaluate(data) async

Evaluate the metric on the given dataset (single item or batch).

Automatically handles batch processing by default. Subclasses can override _evaluate to accept lists for optimized batch processing.

Parameters:

Name Type Description Default
data MetricInput | list[MetricInput]

The data to evaluate the metric on. Can be a single item or a list for batch processing.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

MetricOutput | list[MetricOutput]: A dictionary mapping each metric's namespace to its scores. Returns a list if the input is a list.
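The single-vs-batch dispatch described above can be sketched with a minimal stand-in metric (SketchMetric and its length-based score are illustrative, not part of the library; real metrics typically call an LLM inside _evaluate):

```python
import asyncio

class SketchMetric:
    """Minimal stand-in showing the single-vs-batch dispatch."""
    name = "sketch"

    async def _evaluate(self, item):
        # Score one item; here just the response length for illustration.
        return {self.name: {"score": len(item["generated_response"])}}

    async def evaluate(self, data):
        # A list fans out to concurrent per-item calls and returns a list;
        # a single item is evaluated directly.
        if isinstance(data, list):
            return await asyncio.gather(*(self._evaluate(item) for item in data))
        return await self._evaluate(data)

batch_results = asyncio.run(
    SketchMetric().evaluate([{"generated_response": "abc"}, {"generated_response": "de"}])
)
```

Subclasses that can score many items in one model call would instead override _evaluate to accept the whole list.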

get_input_fields() classmethod

Return declared input field names if input_type is provided; otherwise None.

Returns:

Type Description
list[str] | None

list[str] | None: The input fields.

get_input_spec() classmethod

Return structured spec for input fields if input_type is provided; otherwise None.

Returns:

Type Description
list[dict[str, Any]] | None

list[dict[str, Any]] | None: The input spec.

get_normalized_score(raw_score)

Normalize raw score to 0-1 range based on metric's good_score and bad_score.

This method handles both:
  • Different scales (e.g., 1-3 for completeness, 0-1 for language_consistency)
  • Inverted scales (e.g., redundancy where lower is better)

Parameters:

Name Type Description Default
raw_score float

The raw score value from the metric evaluation.

required

Returns:

Name Type Description
float float

Normalized score between 0 and 1, where 1 is best and 0 is worst.

Examples:

>>> # Completeness: good=3, bad=1 (higher is better)
>>> metric.get_normalized_score(2)  # Returns 0.5
>>> # Redundancy: good=1, bad=3 (lower is better)
>>> metric.get_normalized_score(2)  # Returns 0.5 (inverted)
>>> # Language Consistency: good=1, bad=0 (already 0-1)
>>> metric.get_normalized_score(0.5)  # Returns 0.5
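The examples above are consistent with a simple linear interpolation between bad_score (mapped to 0.0) and good_score (mapped to 1.0); a sketch of that likely formula (the library's implementation may add clamping or other handling):

```python
def normalize(raw_score, good_score, bad_score):
    # Linear interpolation: bad_score -> 0.0, good_score -> 1.0.
    # Inverted scales (good < bad) work automatically because the sign
    # of (good_score - bad_score) flips.
    return (raw_score - bad_score) / (good_score - bad_score)

normalize(2, good_score=3, bad_score=1)    # completeness
normalize(2, good_score=1, bad_score=3)    # redundancy (inverted)
normalize(0.5, good_score=1, bad_score=0)  # language consistency
```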

CompletenessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LMBasedMetric

Completeness metric.

This metric is used to evaluate the completeness of the model's output compared to the expected output.

Available Fields
  • query (str): The query.
  • generated_response (str): The generated response.
  • expected_response (str): The expected response.
Scoring
  • 1-3 (Continuous): Scale where 1 is not complete, 2 is partially complete, and 3 is complete.
Cookbook Example

Please refer to example_completeness.py in the gen-ai-sdk-cookbook repository.

Initialize the CompletenessMetric class.

Parameters:

Name Type Description Default
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
prompt_builder PromptBuilder | None

The prompt builder to use for the metric. Defaults to default prompt builder.

None
response_schema ResponseSchema | None

The response schema to use for the metric. Defaults to CompletenessResponseSchema.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS

DeepEvalAnswerRelevancyMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Answer Relevancy Metric Integration.

This metric uses LLM-as-a-judge to assess whether the output is relevant to the given input.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0-1 (Continuous): A higher score indicates better answer relevancy.
Cookbook Example

Please refer to example_deepeval_answer_relevancy.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalAnswerRelevancyMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalBiasMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Bias Metric Integration.

This metric uses LLM-as-a-judge to assess whether the LLM application's output contains racial, political, or other forms of offensive bias.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more biased, closer to 0.0 means less biased.
Cookbook Example

Please refer to example_deepeval_bias.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalBiasMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalContextualPrecisionMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, evaluation_template=ContextualPrecisionTemplate)

Bases: DeepEvalMetricFactory

DeepEval Contextual Precision Metric.

Evaluates whether the retrieved context chunks that are relevant to the given query are ranked higher than irrelevant ones. A higher score indicates better contextual precision, meaning relevant context chunks appear earlier in the retrieved results.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • expected_response (str | list[str]): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better contextual precision.
Cookbook Example

Please refer to example_deepeval_contextual_precision.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalContextualPrecisionMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS
evaluation_template Type[ContextualPrecisionTemplate]

The evaluation template to use for the metric. Defaults to ContextualPrecisionTemplate. It is used to generate the reason for the metric.

ContextualPrecisionTemplate
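The rank-sensitivity described above can be illustrated with the weighted cumulative precision formula commonly used for contextual precision (a sketch of the scoring math only; the LLM judge decides which chunks count as relevant):

```python
def contextual_precision(relevance):
    # relevance: booleans per retrieved chunk, in rank order, True when the
    # judge deems that chunk relevant to the query. Each relevant chunk
    # contributes precision@k at its rank, so earlier hits weigh more.
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

contextual_precision([True, False, True])  # relevant chunks at ranks 1 and 3
contextual_precision([False, True, True])  # same chunks ranked lower scores worse
```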

DeepEvalContextualRecallMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, evaluation_template=ContextualRecallTemplate)

Bases: DeepEvalMetricFactory

DeepEval Contextual Recall Metric.

Evaluates the extent to which the retrieved context aligns with the expected output. A higher score indicates better contextual recall, meaning the retrieval system successfully found the information needed to generate the expected response.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • expected_response (str | list[str]): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better contextual recall.
Cookbook Example

Please refer to example_deepeval_contextual_recall.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalContextualRecallMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS
evaluation_template Type[ContextualRecallTemplate]

The evaluation template to use for the metric. Defaults to ContextualRecallTemplate.

ContextualRecallTemplate

DeepEvalContextualRelevancyMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Contextual Relevancy Metric.

Evaluates the overall relevance of the information presented in the retrieved context for a given query. A higher score indicates better contextual relevancy, meaning the retrieved context chunks contain less irrelevant or tangential information.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better contextual relevancy.
Cookbook Example

Please refer to example_deepeval_contextual_relevancy.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalContextualRelevancyMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalFaithfulnessMetric(threshold=0.5, model=DefaultValues.MODEL, model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Faithfulness Metric Integration.

This metric uses LLM-as-a-judge to assess whether the answers rely solely on the retrieved context, without hallucinating or providing misinformation.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
  • retrieved_context (str | list[str]): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
  • 0-1 (Continuous): A higher score indicates better faithfulness.
Cookbook Example

Please refer to example_deepeval_faithfulness.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalFaithfulnessMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalGEvalMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalMetricFactory, PromptExtractionMixin

DeepEval GEval Metric Integration.

This class wraps DeepEval's GEval class and provides a unified interface for the DeepEval library.

GEval is a versatile evaluation metric framework that can be used to create custom evaluation metrics.

Available Fields
  • query (str, optional): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated to a single string.
  • expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated to a single string.
  • expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to context in LLMTestCaseParams. If the expected retrieved context is a str, it will be converted to a list with a single element.
  • retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): Or Boolean depending on the DeepEval GEval configuration.

Initializes the DeepEvalGEvalMetric class.

Parameters:

Name Type Description Default
name str | None

The name of the metric. Defaults to None. Required if not provided via _defaults.

None
evaluation_params list[LLMTestCaseParams] | None

The evaluation parameters. Defaults to None. Required if not provided via _defaults.

None
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

MODEL
criteria str | None

The criteria to use for the metric. Defaults to None.

None
evaluation_steps list[str] | None

The evaluation steps to use for the metric. Defaults to None.

None
rubric list[Rubric] | None

The rubric to use for the metric. Defaults to None.

None
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
additional_context str | None

Additional context like few-shot examples. Defaults to None.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, checks for custom prompts and processes accordingly. Items are currently processed individually; batch optimization may be added in the future.

Parameters:

Name Type Description Default
data MetricInput | list[MetricInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.

get_custom_prompt_base_name()

Get the base name for custom prompt column lookup.

For GEval metrics, removes the 'geval_' prefix to align with CSV column conventions. This fixes Issue #3 by providing polymorphic naming for GEval metrics.

Returns:

Name Type Description
str str

The base name without 'geval_' prefix (e.g., "completeness" instead of "geval_completeness").

Example

metric.name = "geval_completeness"
metric.get_custom_prompt_base_name() -> "completeness"

CSV columns expected:
  • fewshot_completeness
  • fewshot_completeness_mode
  • evaluation_step_completeness
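The prefix stripping this method performs can be sketched as follows (a minimal stand-alone version; the real method operates on the metric instance's name attribute):

```python
def custom_prompt_base_name(metric_name, prefix="geval_"):
    # Drop the 'geval_' prefix so metric names line up with CSV columns
    # such as fewshot_<base> and evaluation_step_<base>.
    if metric_name.startswith(prefix):
        return metric_name[len(prefix):]
    return metric_name

custom_prompt_base_name("geval_completeness")  # "completeness"
```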

get_full_prompt(data)

Get the full prompt that DeepEval generates for this metric.

Parameters:

Name Type Description Default
data MetricInput

The metric input.

required

Returns:

Name Type Description
str str

The complete prompt (system + user) as a string.

DeepEvalHallucinationMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Hallucination Metric Integration.

This metric uses LLM-as-a-judge to determine whether the output contains hallucinated or incorrect information based on the retrieved context.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
  • expected_retrieved_context (str | list[str]): The expected context to evaluate the metric. Similar to context in LLMTestCaseParams.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more hallucinated, closer to 0.0 means less hallucinated.
Cookbook Example

Please refer to example_deepeval_hallucination.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalHallucinationMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalJsonCorrectnessMetric(expected_schema, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval JSON Correctness Metric Integration.

This metric evaluates whether a response conforms to a specified JSON schema. It helps ensure that AI responses follow the expected JSON structure.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0.0-1.0 (Categorical): 0.0 if the response does not conform to the schema, 1.0 if it does.
Cookbook Example

Please refer to example_deepeval_json_correctness.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalJsonCorrectnessMetric class.

Parameters:

Name Type Description Default
expected_schema Type[BaseModel]

The expected schema class (not instance) for the response. Example: ExampleSchema (the class, not an instance).

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If expected_schema is not a valid BaseModel class.
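The kind of structural check this metric applies can be approximated without the library (a rough stand-in using only the standard library; the real metric validates against the full Pydantic BaseModel schema, including types):

```python
import json

def json_correctness_score(response_text, required_fields):
    # Parse the response and check that every expected top-level field
    # is present. Unparseable or non-object responses score 0.0.
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(payload, dict):
        return 0.0
    return 1.0 if all(field in payload for field in required_fields) else 0.0

json_correctness_score('{"name": "Ada", "age": 36}', ["name", "age"])  # 1.0
```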

DeepEvalMetric(metric, name)

Bases: BaseMetric

DeepEval Metric.

A wrapper for DeepEval metrics.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
  • expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to expected_output in LLMTestCaseParams. If the expected response is a list, the responses are concatenated into a single string.
  • expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to context in LLMTestCaseParams. If the expected retrieved context is a str, it will be converted into a list with a single element.
  • retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. Similar to retrieval_context in LLMTestCaseParams. If the retrieved context is a str, it will be converted into a list with a single element.
Scoring
  • 0.0-1.0 (Continuous): Or Boolean depending on the DeepEval metric.

Initializes the DeepEvalMetric class.

Parameters:

Name Type Description Default
metric BaseMetric

The DeepEval metric to wrap.

required
name str

The name of the metric.

required

DeepEvalMetricFactory(name, model, model_credentials, model_config, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS, **kwargs)

Bases: DeepEvalMetric, ABC

DeepEval Metric Factory.

Abstract base class for creating DeepEval metrics with a shared model invoker.

Available Fields
  • (Dynamic): Depends on the specific DeepEval metric being created.
Scoring
  • (Dynamic): Depends on the specific DeepEval metric.

Initializes the metric, handling common model invoker creation.

Parameters:

Name Type Description Default
name str

The name for the metric.

required
model Union[str, ModelId, BaseLMInvoker]

The model identifier or an existing LM invoker instance.

required
model_credentials Optional[str]

Credentials for the model, required if model is a string.

required
model_config Optional[Dict[str, Any]]

Configuration for the model.

required
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS
**kwargs

Additional arguments for the specific DeepEval metric constructor.

{}

DeepEvalMisuseMetric(domain, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Misuse Metric Integration.

This metric evaluates whether a response constitutes inappropriate misuse of the model for the given domain. It helps ensure that AI responses do not facilitate harmful or out-of-domain use.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more misuse, closer to 0.0 means less misuse.
Cookbook Example

Please refer to example_deepeval_misuse.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalMisuseMetric class.

Parameters:

Name Type Description Default
domain str

The domain to evaluate the metric. Common domains include: "finance", "health", "legal", "personal", "investment".

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If domain is empty or contains invalid values.

DeepEvalNonAdviceMetric(advice_types, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Non-Advice Metric Integration.

This metric evaluates whether a response contains inappropriate advice types. It helps ensure that AI responses don't provide harmful or inappropriate advice.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 0.0 means more inappropriate advice, closer to 1.0 means less inappropriate advice.
Cookbook Example

Please refer to example_deepeval_non_advice.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalNonAdviceMetric class.

Parameters:

Name Type Description Default
advice_types List[str]

List of advice types to detect as inappropriate. Common types include: ["financial", "medical", "legal", "personal", "investment"].

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If advice_types is empty or contains invalid values.

DeepEvalPIILeakageMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval PII Leakage Metric Integration.

This metric uses LLM-as-a-judge to assess whether the LLM application's output contains leaked PII.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more privacy violations, closer to 0.0 means fewer privacy violations.
Cookbook Example

Please refer to example_deepeval_pii_leakage.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalPIILeakageMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

DeepEvalPromptAlignmentMetric(prompt_instructions, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Prompt Alignment Metric Integration.

This metric evaluates whether a response follows the instructions specified in the prompt template.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more aligned with the prompt instructions, closer to 0.0 means less aligned with the prompt instructions.
Cookbook Example

Please refer to example_deepeval_prompt_alignment.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalPromptAlignmentMetric class.

Parameters:

Name Type Description Default
prompt_instructions List[str]

A list of strings specifying the instructions you want followed in your prompt template.

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If prompt_instructions is empty or contains invalid values.

DeepEvalRoleViolationMetric(role, threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Role Violation Metric Integration.

This metric evaluates whether a response contains role violations. It helps ensure that AI responses stay within their assigned role and do not behave inappropriately.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 0.0 means more role violations, closer to 1.0 means fewer role violations.
Cookbook Example

Please refer to example_deepeval_role_violation.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalRoleViolationMetric class.

Parameters:

Name Type Description Default
role str

The role to evaluate the metric. Common roles include: "helpful customer assistant", "medical insurance agent".

required
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

Raises:

Type Description
ValueError

If role is empty or contains invalid values.

DeepEvalToolCorrectnessMetric(threshold=0.5, model=DefaultValues.AGENT_EVALS_MODEL, model_credentials=None, model_config=None, include_reason=True, strict_mode=False, should_exact_match=False, should_consider_ordering=False, available_tools=None, evaluation_params=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalMetricFactory

DeepEval Tool Correctness Metric.

This metric evaluates whether an agent correctly selects and calls tools based on the user query and available tool definitions.

Available Fields
  • query (str): The input query.
  • generated_response (str, optional): The actual output/response.
  • expected_response (str, optional): The expected output/response.
  • tools_called (list[dict], optional): The tools actually called by the agent. Each dict should have 'name', 'args', and optionally 'output'. If not provided, will be extracted from agent_trajectory.
  • expected_tools (list[dict], optional): The expected tools to be called. Each dict should have 'name', 'args', and optionally 'output'. If not provided, will be extracted from expected_agent_trajectory.
  • agent_trajectory (list[dict], optional): Agent messages in OpenAI format (role-based with 'user', 'assistant', 'tool' roles). Tool calls are extracted from assistant messages with 'tool_calls' field.
  • expected_agent_trajectory (list[dict], optional): Expected agent messages in OpenAI format, same structure as agent_trajectory.
  • available_tools (list[dict] | None, optional): All available tools for context. Each tool definition should have 'name', 'description', and 'parameters'.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more correct tool usage, closer to 0.0 means less correct tool usage.
Cookbook Example

Please refer to example_deepeval_tool_correctness.py in the gen-ai-sdk-cookbook repository.
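When tools_called is not provided, the metric extracts tool calls from assistant messages in the OpenAI-format agent_trajectory described above. A simplified sketch of that extraction (a hypothetical helper, not the SDK's internal implementation; the real trajectory format may carry extra fields such as call ids):

```python
from typing import Any


def extract_tools_called(agent_trajectory: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collect tool calls from assistant messages in an OpenAI-format trajectory."""
    tools_called = []
    for message in agent_trajectory:
        if message.get("role") != "assistant":
            continue
        # Tool calls live in the assistant message's 'tool_calls' field.
        for call in message.get("tool_calls", []):
            tools_called.append(
                {
                    "name": call["function"]["name"],
                    "args": call["function"]["arguments"],
                }
            )
    return tools_called


trajectory = [
    {"role": "user", "content": "What's the weather in Jakarta?"},
    {
        "role": "assistant",
        "tool_calls": [{"function": {"name": "get_weather", "arguments": {"city": "Jakarta"}}}],
    },
    {"role": "tool", "content": "31 degrees C"},
]
print(extract_tools_called(trajectory))
```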

Initializes the DeepEvalToolCorrectnessMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5. Also used as good_score for BaseEvaluator's global explanation generation.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

AGENT_EVALS_MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None
include_reason bool

Include reasoning in output. Defaults to True.

True
strict_mode bool

Binary mode (0 or 1). Defaults to False.

False
should_exact_match bool

Require exact match of tools. Defaults to False.

False
should_consider_ordering bool

Consider order of tools called. Defaults to False.

False
available_tools list[dict[str, Any]] | None

List of available tool definitions for context. Each tool should have 'name', 'description', and 'parameters'. Defaults to None.

None
evaluation_params list[ToolCallParams] | None

List of strictness criteria for tool correctness. Available options are ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT. Defaults to [ToolCallParams.INPUT_PARAMETERS, ToolCallParams.OUTPUT] to validate both.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS

DeepEvalToxicityMetric(threshold=0.5, model='openai/gpt-4.1', model_credentials=None, model_config=None)

Bases: DeepEvalMetricFactory

DeepEval Toxicity Metric Integration.

This metric uses LLM-as-a-judge to assess whether a response contains toxic content.

Available Fields
  • query (str): The query to evaluate the metric. Similar to input in LLMTestCaseParams.
  • generated_response (str | list[str]): The generated response to evaluate the metric. Similar to actual_output in LLMTestCaseParams. If the generated response is a list, the responses are concatenated into a single string.
Scoring
  • 0.0-1.0 (Continuous): Scale where closer to 1.0 means more toxic, closer to 0.0 means less toxic.
Cookbook Example

Please refer to example_deepeval_toxicity.py in the gen-ai-sdk-cookbook repository.

Initializes the DeepEvalToxicityMetric class.

Parameters:

Name Type Description Default
threshold float

The threshold to use for the metric. Defaults to 0.5.

0.5
model str | ModelId | BaseLMInvoker

The model to use for the metric. Defaults to "openai/gpt-4.1".

'openai/gpt-4.1'
model_credentials str | None

The model credentials to use for the metric. Defaults to None. Required when model is a string.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to None.

None

GEvalCompletenessMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalGEvalMetric

GEval Completeness Metric.

This metric is used to evaluate the completeness of the generated output.

Available Fields
  • query (str): The query to evaluate the completeness of the model's output.
  • generated_response (str): The generated response to evaluate the completeness of the model's output.
  • expected_response (str): The expected response to evaluate the completeness of the model's output.
Scoring
  • 1-3 (Continuous): Scale where 1 means not complete, 2 means partially complete, and 3 means complete.
Cookbook Example

Please refer to example_geval_completeness.py in the gen-ai-sdk-cookbook repository.

GEvalContextSufficiencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalGEvalMetric

GEval Context Sufficiency Metric.

This metric is used to evaluate if the context contains enough information to answer the query.

Available Fields
  • query (str): The query to evaluate.
  • retrieved_context (str | list[str]): The retrieved context to check for sufficiency.
Scoring
  • 0-1 (Boolean): Where 0 means insufficient context and 1 means sufficient context.
Cookbook Example

Please refer to example_geval_context_sufficiency.py in the gen-ai-sdk-cookbook repository.

GEvalGroundednessMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalGEvalMetric

GEval Groundedness Metric.

This metric is used to evaluate the groundedness of the generated output.

Available Fields
  • query (str): The query to evaluate the groundedness of the model's output.
  • generated_response (str): The generated response to evaluate the groundedness of the model's output.
  • retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.
Scoring
  • 1-3 (Continuous): Scale where 1 means not grounded, 2 means partially grounded (at least one claim is grounded), and 3 means fully grounded.
Cookbook Example

Please refer to example_geval_groundedness.py in the gen-ai-sdk-cookbook repository.

GEvalLanguageConsistencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalGEvalMetric

GEval Language Consistency Metric.

This metric is used to predict whether the generated response uses the same language as the query.

Available Fields
  • query (str): The query whose language is compared against the generated response.
  • generated_response (str): The generated response whose language is compared against the query.
Scoring
  • 0-1 (Categorical): 0 means not consistent, 1 means fully consistent.
Cookbook Example

Please refer to example_geval_language_consistency.py in the gen-ai-sdk-cookbook repository.

GEvalRedundancyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalGEvalMetric

GEval Redundancy Metric.

This metric is used to evaluate the redundancy of the generated output.

Available Fields
  • query (str): The query to evaluate the redundancy of the model's output.
  • generated_response (str): The generated response to evaluate the redundancy of the model's output.
Scoring
  • 1-3 (Continuous): Scale where 1 means no redundancy, 2 means at least one redundant element, and 3 means high redundancy.
Cookbook Example

Please refer to example_geval_redundancy.py in the gen-ai-sdk-cookbook repository.

get_normalized_score(raw_score)

Normalize raw score to 0-1 range.

For redundancy
  • Score ≤ 2: Good (normalized to 1.0)
  • Score ≥ 3: Bad (normalized to 0.0)

This override handles scores outside the [good_score, bad_score] range that would otherwise produce values outside 0-1.

Parameters:

Name Type Description Default
raw_score float

The raw score value from the metric evaluation (1-3).

required

Returns:

Name Type Description
float float

Normalized score between 0 and 1, where 1 is best and 0 is worst.
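The clamping behavior described above can be sketched as a plain function. The linear formula below is an assumption inferred from the documented endpoints (good_score=2 maps to 1.0, bad_score=3 maps to 0.0), not the SDK's verbatim implementation:

```python
def normalize_redundancy_score(
    raw_score: float, good_score: float = 2.0, bad_score: float = 3.0
) -> float:
    """Map a 1-3 redundancy score to [0, 1], where 1.0 is best (no redundancy)."""
    normalized = (bad_score - raw_score) / (bad_score - good_score)
    # Clamp so raw scores outside [good_score, bad_score] stay within [0, 1].
    return max(0.0, min(1.0, normalized))


print(normalize_redundancy_score(1.0))  # 1.0 (below good_score, clamped)
print(normalize_redundancy_score(3.0))  # 0.0
```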

GEvalRefusalAlignmentMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalGEvalMetric

GEval Refusal Alignment Metric.

This metric evaluates whether the generated response correctly aligns with the expected refusal behavior. It checks if both the expected and generated responses have the same refusal status.

Available Fields
  • query (str): The query to evaluate the metric.
  • expected_response (str): The expected response to evaluate the metric.
  • generated_response (str): The generated response to evaluate the metric.
  • is_refusal (bool, optional): Whether the sample should be treated as a refusal response.
Scoring
  • 0-1 (Categorical): 0 indicates incorrect alignment, 1 indicates correct alignment.
Cookbook Example

Please refer to example_geval_refusal_alignment.py in the gen-ai-sdk-cookbook repository.
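Conceptually, the alignment check reduces to comparing the refusal status of the two responses. A toy sketch with a hypothetical keyword-based refusal detector (the actual metric uses an LLM judge, not keyword matching):

```python
def looks_like_refusal(response: str) -> bool:
    """Toy refusal detector; the real metric delegates this judgment to an LLM."""
    markers = ("i can't", "i cannot", "i'm unable", "i am unable")
    return response.lower().startswith(markers)


def refusal_alignment_score(expected_response: str, generated_response: str) -> int:
    """1 when both responses share the same refusal status, else 0."""
    return int(looks_like_refusal(expected_response) == looks_like_refusal(generated_response))


# Both refuse: aligned.
print(refusal_alignment_score("I cannot help with that.", "I can't assist with this request."))  # 1
# Expected refusal but the model answered anyway: misaligned.
print(refusal_alignment_score("I cannot help with that.", "Sure, here is how you do it."))  # 0
```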

GEvalRefusalMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: DeepEvalGEvalMetric

GEval Refusal Metric.

This metric is used to predict whether the expected response to the query is a refusal.

Available Fields
  • query (str): The query used to predict whether the response is a refusal.
  • expected_response (str): The expected response to classify as refusal or non-refusal.
Scoring
  • 0-1 (Categorical): 0 means not refusal, 1 means refusal.
Cookbook Example

Please refer to example_geval_refusal.py in the gen-ai-sdk-cookbook repository.

GEvalSummarizationCoherenceMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: GEvalSummarizationBaseMetric

GEval Summarization Coherence metric.

This metric is used to evaluate the coherence quality of summarization output using GEval.

Available Fields
  • input (str): Source text or transcript.
  • summary (str): Generated summary.
Scoring
  • 1-3 (Continuous): A higher score indicates better coherence.
Cookbook Example

Please refer to example_geval_summarization_coherence.py in the gen-ai-sdk-cookbook repository.

GEvalSummarizationConsistencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: GEvalSummarizationBaseMetric

GEval Summarization Consistency metric.

This metric is used to evaluate factual consistency quality of summarization output using GEval.

Available Fields
  • input (str): Source text or transcript.
  • summary (str): Generated summary.
Scoring
  • 1-3 (Continuous): A higher score indicates better consistency.
Cookbook Example

Please refer to example_geval_summarization_consistency.py in the gen-ai-sdk-cookbook repository.

GEvalSummarizationFluencyMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: GEvalSummarizationBaseMetric

GEval Summarization Fluency metric.

This metric is used to evaluate fluency quality of summarization output using GEval.

Available Fields
  • input (str): Source text or transcript.
  • summary (str): Generated summary.
Scoring
  • 1-3 (Continuous): A higher score indicates better fluency.
Cookbook Example

Please refer to example_geval_summarization_fluency.py in the gen-ai-sdk-cookbook repository.

GEvalSummarizationRelevanceMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: GEvalSummarizationBaseMetric

GEval Summarization Relevance metric.

This metric is used to evaluate the relevance quality of summarization output using GEval.

Available Fields
  • input (str): Source text or transcript.
  • summary (str): Generated summary.
Scoring
  • 1-3 (Continuous): A higher score indicates better relevance.
Cookbook Example

Please refer to example_geval_summarization_relevance.py in the gen-ai-sdk-cookbook repository.

GroundednessMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LMBasedMetric

Groundedness metric.

This metric is used to evaluate how grounded the generated response is based on the retrieved context.

Available Fields
  • query (str): The query to evaluate.
  • generated_response (str): The generated response to evaluate.
  • retrieved_context (str): The retrieved context to evaluate.
Scoring
  • 1-3 (Continuous): Scale where 1 means not grounded, 2 means partially grounded (at least one claim is grounded), and 3 means fully grounded.
Cookbook Example

Please refer to example_groundedness.py in the gen-ai-sdk-cookbook repository.

Initialize the GroundednessMetric class.

Default expected input:
  • query (str): The query to evaluate the groundedness of the model's output.
  • retrieved_context (str): The retrieved context to evaluate the groundedness of the model's output.
  • generated_response (str): The generated response to evaluate the groundedness of the model's output.

Parameters:

Name Type Description Default
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
prompt_builder PromptBuilder | None

The prompt builder to use for the metric. Defaults to default prompt builder.

None
response_schema ResponseSchema | None

The response schema to use for the metric. Defaults to GroundednessResponseSchema.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS

LMBasedMetric(name, response_schema, prompt_builder, model=DefaultValues.MODEL, model_credentials=None, model_config=None, parse_response_fn=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: BaseMetric

A multi-purpose LM-based metric class.

This class provides a general-purpose LM-based metric. It evaluates data by combining a response schema, a prompt builder, a model, and model credentials.

Available Fields
  • (Dynamic): Depends on the prompt_builder and specific metric implementation.
Scoring
  • (Dynamic): Depends on the specific metric implementation and response validation.

Initialize the LMBasedMetric class.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
response_schema ResponseSchema

The response schema to use for the metric.

required
prompt_builder PromptBuilder

The prompt builder to use for the metric.

required
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
parse_response_fn Callable[[str | LMOutput], MetricOutput] | None

The function used to parse the LM response into a metric output. Defaults to the built-in parser.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120 (60 minutes with default interval).

BATCH_MAX_ITERATIONS

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, uses efficient batch API when all items have the same custom prompts. Falls back to per-item processing when items have different custom prompts.

Parameters:

Name Type Description Default
data MetricInput | list[MetricInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.
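The batch fallback described above hinges on whether every item carries the same custom prompt. A simplified sketch of that check (the custom_prompt field name is illustrative, not necessarily the SDK's actual key):

```python
from typing import Any


def can_use_batch_api(items: list[dict[str, Any]]) -> bool:
    """Batch API is usable only when every item shares one custom prompt (or none)."""
    prompts = {item.get("custom_prompt") for item in items}
    return len(prompts) <= 1


items_same = [{"query": "a", "custom_prompt": "p"}, {"query": "b", "custom_prompt": "p"}]
items_diff = [{"query": "a", "custom_prompt": "p1"}, {"query": "b", "custom_prompt": "p2"}]
print(can_use_batch_api(items_same))  # True  -> efficient batch path
print(can_use_batch_api(items_diff))  # False -> per-item processing
```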

LangChainAgentEvalsLLMAsAJudgeMetric(name, prompt, model, credentials=None, config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LangChainAgentEvalsMetric

LangChain AgentEvals LLM as a Judge Metric.

A metric that uses LangChain AgentEvals to evaluate Agent as a judge.

Available Fields
  • agent_trajectory (list[dict[str, Any]]): The agent trajectory.
  • expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.
Scoring
  • 0.0-1.0 (Continuous): An evaluation score assigned by the LLM judge based on the trajectory.

Initialize the LangChainAgentEvalsLLMAsAJudgeMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
prompt str

The evaluation prompt, can be a string template, LangChain prompt template, or callable that returns a list of chat messages. Note that the default prompt allows a rubric in addition to the typical "inputs", "outputs", and "reference_outputs" parameters.

required
model str | ModelId | BaseLMInvoker

The model to use.

required
credentials str | None

The credentials to use for the model. Defaults to None.

None
config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "trajectory_accuracy".

'trajectory_accuracy'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, uses efficient batch API when all items have the same custom prompts. Falls back to per-item processing when items have different custom prompts.

Parameters:

Name Type Description Default
data MetricInput | list[MetricInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.

LangChainAgentEvalsMetric(name, evaluator)

Bases: BaseMetric

LangChain AgentEvals Metric.

A metric that uses LangChain AgentEvals to evaluate Agent.

Available Fields
  • agent_trajectory (list[dict[str, Any]]): The agent trajectory.
  • expected_agent_trajectory (list[dict[str, Any]]): The expected agent trajectory.
Scoring
  • 0.0-1.0 (Continuous): An evaluation score based on the trajectory.

Initialize the LangChainAgentEvalsMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
evaluator SimpleAsyncEvaluator

The evaluator to use.

required

LangChainAgentTrajectoryAccuracyMetric(model, prompt=None, model_credentials=None, model_config=None, schema=None, feedback_key='trajectory_accuracy', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None, use_reference=True, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LangChainAgentEvalsLLMAsAJudgeMetric

LangChain Agent Trajectory Accuracy Metric.

A metric that uses LangChain AgentEvals to evaluate the trajectory accuracy of the agent.

Available Fields
  • agent_trajectory (list[dict[str, Any]]): The agent trajectory.
  • expected_agent_trajectory (list[dict[str, Any]] | None, optional): The expected agent trajectory.
Scoring
  • 0-1 (Continuous/Boolean): A higher score indicates better trajectory accuracy.
Cookbook Example

Please refer to example_langchain_agent_trajectory_accuracy.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainAgentTrajectoryAccuracyMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

required
prompt str | None

The prompt to use. Defaults to None.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "trajectory_accuracy".

'trajectory_accuracy'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None
use_reference bool

If True, uses the expected agent trajectory to evaluate trajectory accuracy. If False, the TRAJECTORY_ACCURACY_CUSTOM_PROMPT is used instead. Defaults to True. If a custom prompt is provided, this parameter is ignored.

True
batch_status_check_interval float

Interval in seconds between batch status checks. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of batch status check iterations. Defaults to 120.

BATCH_MAX_ITERATIONS

LangChainConcisenessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Conciseness Metric.

A metric that uses LangChain and OpenEvals to evaluate the conciseness of the LLM.

Available Fields
  • query (str): The query to evaluate the conciseness of.
  • generated_response (str): The generated response to evaluate the conciseness of.
Scoring
  • 0-1 (Continuous/Boolean): A higher score indicates better conciseness.
Cookbook Example

Please refer to example_langchain_conciseness.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainConcisenessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

LangChainCorrectnessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Correctness Metric.

A metric that uses LangChain and OpenEvals to evaluate the correctness of the LLM.

Available Fields
  • query (str): The query that the response answers.
  • generated_response (str): The generated response whose correctness is evaluated.
  • expected_response (str): The reference response to judge correctness against.
Scoring
  • 0-1 (Continuous/Boolean): A higher score indicates better correctness.
Cookbook Example

Please refer to example_langchain_correctness.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainCorrectnessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

LangChainGroundednessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Groundedness Metric.

A metric that uses LangChain and OpenEvals to evaluate how well the LLM's response is grounded in the retrieved context.

Available Fields
  • generated_response (str | list[str]): The generated response whose groundedness is evaluated.
  • retrieved_context (str | list[str]): The retrieved context the response should be grounded in.
Scoring
  • 0-1 (Continuous/Boolean): A higher score indicates better groundedness.
Cookbook Example

Please refer to example_langchain_groundedness.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainGroundednessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

LangChainHallucinationMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Hallucination Metric.

A metric that uses LangChain and OpenEvals to detect hallucination in the LLM's response.

Available Fields
  • query (str): The query that the response answers.
  • generated_response (str): The generated response to check for hallucination.
  • expected_retrieved_context (str): The reference context to check the response against.
  • expected_response (str, optional): Additional reference information to help the judge detect hallucination.
Scoring
  • 0-1 (Continuous/Boolean): 0 indicates no hallucination, 1 indicates hallucination.
Cookbook Example

Please refer to example_langchain_hallucination.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainHallucinationMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

LangChainHelpfulnessMetric(model=DefaultValues.MODEL, system=None, model_credentials=None, model_config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsLLMAsAJudgeMetric

LangChain Helpfulness Metric.

A metric that uses LangChain and OpenEvals to evaluate the helpfulness of the LLM's response.

Available Fields
  • query (str): The query that the response answers.
  • generated_response (str): The generated response whose helpfulness is evaluated.
Scoring
  • 0-1 (Continuous/Boolean): A higher score indicates better helpfulness.
Cookbook Example

Please refer to example_langchain_helpfulness.py in the gen-ai-sdk-cookbook repository.

Initialize the LangChainHelpfulnessMetric.

Parameters:

Name Type Description Default
model str | ModelId | BaseLMInvoker

The model to use.

MODEL
system str | None

Optional system message to prepend to the prompt.

None
model_credentials str | None

The credentials to use for the model. Defaults to None.

None
model_config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

LangChainOpenEvalsLLMAsAJudgeMetric(name, prompt, model, system=None, credentials=None, config=None, schema=None, feedback_key='score', continuous=False, choices=None, use_reasoning=True, few_shot_examples=None)

Bases: LangChainOpenEvalsMetric

LangChain OpenEvals LLM as a Judge Metric.

A metric that uses LangChain and OpenEvals to evaluate the LLM as a judge.

Available Fields
  • query (str | None, optional): The query / inputs to evaluate.
  • generated_response (str | list[str] | None, optional): The generated response / outputs to evaluate.
  • expected_response (str | list[str] | None, optional): The expected response / reference outputs to evaluate.
  • expected_retrieved_context (str | list[str] | None, optional): The expected retrieved context / reference context.
  • retrieved_context (str | list[str] | None, optional): The list of retrieved contexts.
Scoring
  • 0.0-1.0 (Continuous): An evaluation score assigned by the LLM judge.

Initialize the LangChainOpenEvalsLLMAsAJudgeMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
prompt str

The evaluation prompt, can be a string template, LangChain prompt template, or callable that returns a list of chat messages.

required
model str | ModelId | BaseLMInvoker

The model to use.

required
system str | None

Optional system message to prepend to the prompt.

None
credentials str | None

The credentials to use for the model. Defaults to None.

None
config dict[str, Any] | None

The config to use for the model. Defaults to None.

None
schema ResponseSchema | None

The schema to use for the model. Defaults to None.

None
feedback_key str

Key used to store the evaluation result, defaults to "score".

'score'
continuous bool

If True, score will be a float between 0 and 1. If False, score will be boolean. Defaults to False.

False
choices list[float] | None

Optional list of specific float values the score must be chosen from. Defaults to None.

None
use_reasoning bool

If True, includes explanation for the score in the output. Defaults to True.

True
few_shot_examples list[FewShotExample] | None

Optional list of example evaluations to append to the prompt. Defaults to None.

None

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, checks for custom prompts and processes accordingly. Currently processes items individually; batch optimization can be added in the future.

Parameters:

Name Type Description Default
data MetricInput | list[MetricInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.
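The apply/evaluate/restore lifecycle described above can be sketched in plain Python. Everything below (SketchMetric, _evaluate_single, the custom_prompt key) is invented for illustration and is not the SDK's actual internals:

```python
import asyncio


class SketchMetric:
    """Toy metric illustrating the custom-prompt lifecycle: apply the
    item's custom prompt, evaluate, then restore the original prompt so
    later evaluations are unaffected."""

    def __init__(self, name, prompt):
        self.name = name
        self.prompt = prompt

    async def _evaluate_single(self, item):
        # Stand-in for the real LLM-as-a-judge call.
        return {
            f"{self.name}_score": 1.0 if item["generated_response"] else 0.0,
            "prompt_used": self.prompt,
        }

    async def evaluate(self, data):
        items = data if isinstance(data, list) else [data]
        results = []
        for item in items:
            original_prompt = self.prompt
            try:
                # Apply a per-item custom prompt if one is provided.
                if "custom_prompt" in item:
                    self.prompt = item["custom_prompt"]
                results.append(await self._evaluate_single(item))
            finally:
                # Restore state even if evaluation raises.
                self.prompt = original_prompt
        return results if isinstance(data, list) else results[0]


metric = SketchMetric("conciseness", "Default judge prompt")
out = asyncio.run(metric.evaluate(
    {"generated_response": "Paris.", "custom_prompt": "Be strict."}))
```

Note how the `finally` clause guarantees the restore step, which is what makes per-item custom prompts safe to mix in one batch.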

LangChainOpenEvalsMetric(name, evaluator)

Bases: BaseMetric

LangChain OpenEvals Metric.

A metric that uses LangChain and OpenEvals.

Available Fields
  • query (str | None, optional): The query / inputs to evaluate.
  • generated_response (str | list[str] | None, optional): The generated response / outputs to evaluate.
  • expected_response (str | list[str] | None, optional): The expected response / reference outputs to evaluate.
  • expected_retrieved_context (str | list[str] | None, optional): The expected retrieved context / reference context.
  • retrieved_context (str | list[str] | None, optional): The list of retrieved contexts.
Scoring
  • 0.0-1.0 (Continuous): Depending on the specific OpenEval metric.

Initialize the LangChainOpenEvalsMetric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
evaluator Union[SimpleAsyncEvaluator, Callable[..., Awaitable[Any]]]

The evaluator to use.

required

LanguageConsistencyMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LMBasedMetric

Language Consistency Metric.

This metric is used to evaluate whether the language of the generated response is consistent with the query.

Available Fields
  • query (str): The query.
  • generated_response (list[str]): The generated response.
Scoring
  • 0-1 (Categorical): 0 means not consistent, 1 means fully consistent.
Cookbook Example

Please refer to example_language_consistency.py in the gen-ai-sdk-cookbook repository.

Initialize the LanguageConsistencyMetric class.

Default expected input:
  • query (str): The query to evaluate the language consistency of the model's output.
  • generated_response (str): The generated response to evaluate the language consistency of the model's output.

Parameters:

Name Type Description Default
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
prompt_builder PromptBuilder | None

The prompt builder to use for the metric. Defaults to default prompt builder.

None
response_schema ResponseSchema | None

The response schema to use for the metric. Defaults to LanguageConsistencyResponseSchema.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS
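As a rough intuition for what this metric checks (the SDK delegates the judgment to an LLM), here is a toy stopword-based sketch; EN_WORDS, ES_WORDS, and guess_language are invented for illustration and are far cruder than a real language-identification step:

```python
# Toy sketch of the language-consistency idea: score 1 if the query and
# the generated response appear to be in the same language, else 0.
EN_WORDS = {"the", "is", "what", "of", "and", "a", "to", "in"}
ES_WORDS = {"el", "la", "es", "que", "de", "y", "un", "en"}


def guess_language(text: str) -> str:
    # Crude stopword-overlap heuristic, English vs. Spanish only.
    tokens = set(text.lower().split())
    en, es = len(tokens & EN_WORDS), len(tokens & ES_WORDS)
    return "en" if en >= es else "es"


def language_consistency(query: str, generated_response: str) -> int:
    # Categorical 0/1 score, mirroring the metric's scoring contract.
    return int(guess_language(query) == guess_language(generated_response))
```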

PyTrecMetric(metrics=None, k=20)

Bases: BaseMetric

PyTrec Metric.

A wrapper for pytrec_eval to evaluate common Information Retrieval (IR) metrics. This metric allows you to compute various standard IR scores like NDCG, MAP, MRR, Reciprocal Rank, etc., based on retrieved chunks and ground truth chunk IDs.

Available Fields
  • retrieved_chunks (dict[str, float]): The retrieved chunk ids with their similarity score.
  • ground_truth_chunk_ids (list[str]): The ground truth chunk ids.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better retrieval performance.
Cookbook Example

Please refer to example_pytrec_metric.py in the gen-ai-sdk-cookbook repository.

Initializes the PyTrecMetric.

Parameters:

Name Type Description Default
metrics list[PyTrecEvalMetric | str] | set[PyTrecEvalMetric | str] | None

The metrics to evaluate. Defaults to all metrics.

None
k int | list[int]

The number of retrieved chunks to consider. Defaults to 20.

20
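For intuition, one of the scores PyTrecMetric can report, reciprocal rank, can be computed directly from the documented input fields. This is a plain re-implementation for illustration, not a call into pytrec_eval:

```python
def reciprocal_rank(retrieved_chunks: dict[str, float],
                    ground_truth_chunk_ids: list[str],
                    k: int = 20) -> float:
    """Reciprocal rank of the first relevant chunk within the top k."""
    # Rank chunks by similarity score, highest first.
    ranked = sorted(retrieved_chunks, key=retrieved_chunks.get, reverse=True)
    for rank, chunk_id in enumerate(ranked[:k], start=1):
        if chunk_id in ground_truth_chunk_ids:
            return 1.0 / rank
    return 0.0


# c2 (0.9) ranks first, the relevant c3 (0.7) second -> 1/2.
score = reciprocal_rank({"c1": 0.4, "c2": 0.9, "c3": 0.7}, ["c3"])
```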

RAGASMetric(metric, name=None, callbacks=None, timeout=None)

Bases: BaseMetric

RAGAS Metric.

RAGAS is a metric for evaluating the quality of RAG systems.

Available Fields
  • query (str): The query to evaluate the metric. Similar to user_input in SingleTurnSample.
  • generated_response (str | list[str], optional): The generated response to evaluate the metric. Similar to response in SingleTurnSample. If a list is provided, the responses are concatenated into a single string.
  • expected_response (str | list[str], optional): The expected response to evaluate the metric. Similar to reference in SingleTurnSample. If the expected response is a list, the responses are concatenated into a single string.
  • expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to reference_contexts in SingleTurnSample. If the expected retrieved context is a str, it will be converted into a list with a single element.
  • retrieved_context (str | list[str], optional): The retrieved context to evaluate the metric. Similar to retrieved_contexts in SingleTurnSample. If the retrieved context is a str, it will be converted into a list with a single element.
  • rubrics (dict[str, str], optional): The rubrics to evaluate the metric. Similar to rubrics in SingleTurnSample.
Scoring
  • 0.0-1.0 (Continuous): A score evaluating the RAG aspect being tested.

Initialize the RAGASMetric.

Parameters:

Name Type Description Default
metric SingleTurnMetric

The Ragas metric to use.

required
name str

The name of the metric. Defaults to the name of the wrapped Ragas metric.

None
callbacks Callbacks

The callbacks to use. Default is None.

None
timeout int

The timeout for the metric. Default is None.

None

evaluate(data) async

Evaluate with custom prompt lifecycle support.

Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.

For batch processing, uses efficient parallel processing when all items have the same custom prompts. Falls back to per-item processing when items have different custom prompts.

Parameters:

Name Type Description Default
data MetricInput | list[MetricInput]

Single data item or list of data items to evaluate.

required

Returns:

Type Description
MetricOutput | list[MetricOutput]

Evaluation results with scores namespaced by metric name.
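The batching strategy described above can be sketched as follows; evaluate_batch and judge are hypothetical helpers, not SDK functions:

```python
import asyncio


async def evaluate_batch(items, evaluate_one):
    """Parallelize when all items share one custom prompt, else fall back."""
    prompts = {item.get("custom_prompt") for item in items}
    if len(prompts) <= 1:
        # Uniform prompts: safe to run all items concurrently.
        return await asyncio.gather(*(evaluate_one(i) for i in items))
    # Mixed prompts: process sequentially so each item's prompt can be
    # applied and restored around its own evaluation.
    return [await evaluate_one(i) for i in items]


async def judge(item):
    # Stand-in for a real metric call.
    return {"score": float(len(item["generated_response"]) > 0)}


scores = asyncio.run(evaluate_batch(
    [{"generated_response": "a"}, {"generated_response": ""}], judge))
```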

RagasContextPrecisionWithoutReference(lm_model, lm_model_credentials=None, lm_model_config=None, **kwargs)

Bases: RAGASMetric

RAGAS Context Precision Metric.

Measures the proportion of relevant chunks in the retrieved contexts without requiring a ground truth reference. It evaluates whether the retrieved context chunks are actually useful for generating the provided response to the user's query.

Available Fields
  • query (str): The user query.
  • generated_response (str): The generated response.
  • retrieved_contexts (list[str]): The retrieved contexts whose precision is evaluated.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better context precision.
Cookbook Example

Please refer to example_ragas_context_precision.py in the gen-ai-sdk-cookbook repository.

Initialize the RagasContextPrecisionWithoutReference metric.

Parameters:

Name Type Description Default
lm_model str | ModelId | BaseLMInvoker

The language model to use.

required
lm_model_credentials str | None

The credentials to use for the language model. Default is None.

None
lm_model_config dict[str, Any] | None

The configuration to use for the language model. Default is None.

None
**kwargs

Additional keyword arguments to pass to the underlying Ragas context precision metric.

{}
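A simplified sketch of the idea: Ragas asks an LLM for each relevance judgment and averages precision@k over the relevant positions; the unweighted fraction and word-overlap judge below are only illustrative stand-ins:

```python
from typing import Callable


def context_precision(query: str,
                      generated_response: str,
                      retrieved_contexts: list[str],
                      is_relevant: Callable[[str, str, str], bool]) -> float:
    """Fraction of retrieved chunks judged useful for the response."""
    if not retrieved_contexts:
        return 0.0
    hits = sum(is_relevant(query, generated_response, c)
               for c in retrieved_contexts)
    return hits / len(retrieved_contexts)


# Toy judge: a chunk is "relevant" if it shares a word with the response.
judge = lambda q, r, c: bool(set(r.lower().split()) & set(c.lower().split()))
score = context_precision("Capital of France?", "paris is the capital",
                          ["paris france", "berlin germany"], judge)
```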

RagasContextRecall(lm_model, lm_model_credentials=None, lm_model_config=None, **kwargs)

Bases: RAGASMetric

RAGAS Context Recall Metric.

Measures how many of the relevant documents (or pieces of information) needed to answer the query were successfully retrieved. It evaluates the retrieval system's ability to find all the necessary context based on the generated response and the expected response.

Available Fields
  • query (str): The query the context was retrieved for.
  • generated_response (str): The generated response.
  • expected_response (str): The expected response used as the recall reference.
  • retrieved_contexts (list[str]): The retrieved contexts whose recall is evaluated.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better context recall.
Cookbook Example

Please refer to example_ragas_context_recall.py in the gen-ai-sdk-cookbook repository.

Initialize the RagasContextRecall metric.

Parameters:

Name Type Description Default
lm_model str | ModelId | BaseLMInvoker

The language model to use.

required
lm_model_credentials str | None

The credentials to use for the language model. Default is None.

None
lm_model_config dict[str, Any] | None

The configuration to use for the language model. Default is None.

None
**kwargs

Additional keyword arguments to pass to the RagasContextRecall metric.

{}
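A simplified sketch of the idea: Ragas decomposes the reference into claims with an LLM and checks attribution per claim; the sentence split and word-overlap attribution below are toy stand-ins:

```python
def context_recall(expected_response: str,
                   retrieved_contexts: list[str]) -> float:
    """Fraction of reference 'claims' supported by some retrieved context."""
    # Sentences stand in for LLM-extracted claims.
    claims = [s.strip() for s in expected_response.split(".") if s.strip()]
    if not claims:
        return 0.0

    def supported(claim: str) -> bool:
        # Word overlap stands in for LLM-judged attribution.
        words = set(claim.lower().split())
        return any(len(words & set(c.lower().split())) >= 2
                   for c in retrieved_contexts)

    return sum(supported(c) for c in claims) / len(claims)


# One of the two reference claims is supported by the context -> 0.5.
score = context_recall(
    "Paris is the capital. It hosted the 1900 Olympics.",
    ["Paris is the capital of France."])
```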

RagasFactualCorrectness(lm_model=DefaultValues.MODEL, lm_model_credentials=None, lm_model_config=None, **kwargs)

Bases: RAGASMetric

RAGAS Factual Correctness metric.

This metric evaluates the factual accuracy of the generated response against the reference.

Available Fields
  • query (str): The query.
  • generated_response (str): The generated response.
  • expected_response (str): The reference response to check factual accuracy against.
Scoring
  • 0-1 (Continuous): A higher score indicates better factual correctness.
Cookbook Example

Please refer to example_ragas_factual_correctness.py in the gen-ai-sdk-cookbook repository.

Initialize the RagasFactualCorrectness metric.

Parameters:

Name Type Description Default
lm_model str | ModelId | BaseLMInvoker

The language model to use.

MODEL
lm_model_credentials str | None

The credentials to use for the language model. Default is None.

None
lm_model_config dict[str, Any] | None

The configuration to use for the language model. Default is None.

None
**kwargs

Additional keyword arguments to pass to the RagasFactualCorrectness metric.

{}

RedundancyMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LMBasedMetric

Redundancy metric.

This metric is used to evaluate the redundancy of the model's output.

Available Fields
  • query (str): The query.
  • generated_response (str): The generated response.
Scoring
  • 1-3 (Continuous): Scale where 1 means no redundancy, 2 means at least one redundancy, and 3 means high redundancy.
Cookbook Example

Please refer to example_redundancy.py in the gen-ai-sdk-cookbook repository.

Initialize the RedundancyMetric class.

Default expected input:
  • query (str): The query to evaluate the redundancy of the model's output.
  • generated_response (str): The generated response to evaluate the redundancy of the model's output.

Parameters:

Name Type Description Default
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
prompt_builder PromptBuilder | None

The prompt builder to use for the metric. Defaults to default prompt builder.

None
response_schema ResponseSchema | None

The response schema to use for the metric. Defaults to RedundancyResponseSchema.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS

RefusalAlignmentMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LMBasedMetric

Refusal Alignment metric.

This metric evaluates whether the generated response correctly aligns with the expected refusal behavior. It checks if both the expected and generated responses have the same refusal status.

Available Fields
  • query (str): The query to evaluate the metric.
  • expected_response (str): The expected response to evaluate the metric.
  • generated_response (str): The generated response to evaluate the metric.
  • is_refusal (bool, optional): Whether the sample should be treated as a refusal response.
Scoring
  • 0-1 (Categorical): 0 indicates incorrect alignment, 1 indicates correct alignment.
Cookbook Example

Please refer to example_refusal_alignment.py in the gen-ai-sdk-cookbook repository.

Initialize the RefusalAlignmentMetric class.

Parameters:

Name Type Description Default
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
prompt_builder PromptBuilder | None

The prompt builder to use for the metric. Defaults to default prompt builder.

None
response_schema ResponseSchema | None

The response schema to use for the metric. Defaults to RefusalAlignmentResponseSchema.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS
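The alignment check itself reduces to comparing two refusal labels. A sketch of that comparison, with a keyword classifier standing in for the SDK's LLM-based refusal judgment (REFUSAL_MARKERS and looks_like_refusal are invented for illustration):

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't")


def looks_like_refusal(text: str) -> bool:
    # Toy stand-in for the LLM's refusal classification.
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_alignment(expected_response: str, generated_response: str,
                      is_refusal=None) -> int:
    """Score 1 when both responses share the same refusal status, else 0."""
    # If the dataset labels the sample, trust the label for the expected side.
    expected_refuses = (is_refusal if is_refusal is not None
                        else looks_like_refusal(expected_response))
    return int(expected_refuses == looks_like_refusal(generated_response))
```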

RefusalMetric(model=DefaultValues.MODEL, model_credentials=None, model_config=None, prompt_builder=None, response_schema=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)

Bases: LMBasedMetric

Refusal metric.

This metric is used to evaluate the refusal of the model's output.

Available Fields
  • query (str): The query.
  • expected_response (str): The expected response.
Scoring
  • 0-1 (Categorical): 0 means not refusal, 1 means refusal.
Cookbook Example

Please refer to example_refusal.py in the gen-ai-sdk-cookbook repository.

Initialize the RefusalMetric class.

Default expected input:
  • query (str): The query to evaluate the refusal of the model's output.
  • expected_response (str): The expected response to evaluate the refusal of the model's output.

Parameters:

Name Type Description Default
model Union[str, ModelId, BaseLMInvoker]

The model to use for the metric.

MODEL
model_credentials str | None

The model credentials to use for the metric. Defaults to None.

None
model_config dict[str, Any] | None

The model config to use for the metric. Defaults to an empty dictionary.

None
prompt_builder PromptBuilder | None

The prompt builder to use for the metric. Defaults to default prompt builder.

None
response_schema ResponseSchema | None

The response schema to use for the metric. Defaults to RefusalResponseSchema.

None
batch_status_check_interval float

Time between batch status checks in seconds. Defaults to 30.0.

BATCH_STATUS_CHECK_INTERVAL
batch_max_iterations int

Maximum number of status check iterations before timeout. Defaults to 120.

BATCH_MAX_ITERATIONS

TopKAccuracy(k=20)

Bases: BaseMetric

Top-K Accuracy Metric.

Evaluates whether the ground truth chunk IDs are present within the top K retrieved chunks. This is a boolean-style hit/miss metric averaged over the dataset; a score of 1.0 means the relevant document was always found in the top K results.

Available Fields
  • retrieved_chunks (dict[str, float]): The retrieved chunk ids with their similarity score.
  • ground_truth_chunk_ids (list[str]): The ground truth chunk ids.
Scoring
  • 0.0-1.0 (Continuous): A higher score indicates better top-k accuracy.
Cookbook Example

Please refer to example_top_k_accuracy.py in the gen-ai-sdk-cookbook repository.

Initializes the TopKAccuracy.

Parameters:

Name Type Description Default
k list[int] | int

The number of retrieved chunks to consider. Defaults to 20.

20

top_k_accuracy(qrels, results)

Evaluates the top k accuracy.

Parameters:

Name Type Description Default
qrels dict[str, dict[str, int]]

The ground truth relevance of the retrieved chunks: 1 means the chunk is relevant to the query, 0 means it is not.

required
results dict[str, dict[str, float]]

The retrieved chunks with their similarity score.

required

Returns:

Type Description
dict[str, float]

dict[str, float]: The top k accuracy.

Example
qrels = {
    "q1": {"chunk1": 1, "chunk2": 1},
}
results = {"q1": {"chunk1": 0.9, "chunk2": 0.8, "chunk3": 0.7}}
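The hit/miss logic can be re-implemented in a few lines for intuition, reusing the example data above (this is an illustrative sketch, not the SDK's code):

```python
def top_k_accuracy(qrels: dict[str, dict[str, int]],
                   results: dict[str, dict[str, float]],
                   k: int = 20) -> dict[str, float]:
    """Average, over queries, whether any relevant chunk appears in the
    top-k results ranked by similarity score."""
    hits = []
    for query_id, relevance in qrels.items():
        relevant_ids = {cid for cid, label in relevance.items() if label == 1}
        # Rank retrieved chunks by similarity score, highest first.
        ranked = sorted(results.get(query_id, {}),
                        key=results.get(query_id, {}).get, reverse=True)
        hits.append(float(bool(relevant_ids & set(ranked[:k]))))
    return {f"top_{k}_accuracy": sum(hits) / len(hits) if hits else 0.0}


qrels = {"q1": {"chunk1": 1, "chunk2": 1}}
results = {"q1": {"chunk1": 0.9, "chunk2": 0.8, "chunk3": 0.7}}
# The top 20 results contain chunk1, so q1 is a hit.
```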