
Metric

Base class for metrics.

BaseMetric

Bases: ABC

Abstract class for metrics.

This class defines the interface for all metrics.

Attributes:

name (str): The name of the metric.
required_fields (set[str]): The fields required for this metric to evaluate data.
input_type (type | None): The type of the input data.

Example

Adding custom prompts to existing evaluator metrics:

import os

from gllm_evals import load_simple_qa_dataset
from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.utils.shared_functionality import inference_fn


async def main():
    # Main function with custom prompts

    # Load your dataset
    dataset = load_simple_qa_dataset()

    # Create evaluator with default metrics
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Add custom prompts polymorphically (works for any metric)
    for metric in evaluator.metrics:
        if hasattr(metric, 'name'):  # Ensure metric has name attribute
            # Add custom prompts based on metric name
            if metric.name == "geval_completeness":
                # Add domain-specific few-shot examples
                metric.additional_context += "\n\nCUSTOM MEDICAL EXAMPLES: ..."
            elif metric.name == "geval_groundedness":
                # Add grounding examples
                metric.additional_context += "\n\nMEDICAL GROUNDING EXAMPLES: ..."

    # Evaluate with custom prompts applied automatically
    results = await evaluate(
        data=dataset,
        inference_fn=inference_fn,
        evaluators=[evaluator],  # ← Custom prompts applied to metrics
    )

can_evaluate(data)

Check if this metric can evaluate the given data.

Parameters:

data (MetricInput): The input data to check. Required.

Returns:

bool: True if the metric can evaluate the data, False otherwise.
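The check above can be sketched as a subset test of the metric's required_fields against the keys present in the input data. This is a hypothetical sketch, not the library's actual implementation; the function name and dict-like data shape are illustrative assumptions:

```python
# Hypothetical sketch: a metric can evaluate the data when every one of its
# required fields is present among the data's keys.
def can_evaluate_sketch(required_fields: set[str], data: dict) -> bool:
    """Return True when all required fields are present in the data."""
    return required_fields.issubset(data.keys())


print(can_evaluate_sketch({"question", "answer"}, {"question": "q", "answer": "a", "context": "c"}))  # True
print(can_evaluate_sketch({"question", "answer"}, {"question": "q"}))  # False
```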

evaluate(data) async

Evaluate the metric on the given dataset (single item or batch).

Automatically handles batch processing by default. Subclasses can override _evaluate to accept lists for optimized batch processing.

Parameters:

data (MetricInput | list[MetricInput]): The data to evaluate the metric on. Can be a single item or a list for batch processing. Required.

Returns:

MetricOutput | list[MetricOutput]: A dictionary where the keys are the namespaces and the values are the scores. Returns a list if the input is a list.
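The single-versus-batch dispatch described above can be sketched roughly as follows. This is an illustrative sketch only; the class name and the per-item scoring logic are assumptions, not the library's actual code:

```python
import asyncio


# Illustrative sketch of the dispatch pattern described above.
class SketchMetric:
    async def _evaluate(self, item: dict) -> dict:
        # Subclasses implement per-item scoring here (toy logic for the sketch).
        return {"score": len(item.get("answer", ""))}

    async def evaluate(self, data):
        # A list input is fanned out item by item by default; subclasses may
        # override _evaluate to accept lists for optimized batch processing.
        if isinstance(data, list):
            return await asyncio.gather(*(self._evaluate(item) for item in data))
        return await self._evaluate(data)


print(asyncio.run(SketchMetric().evaluate({"answer": "hi"})))  # {'score': 2}
```

asyncio.gather preserves input order, so the returned list lines up with the input list item by item.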

get_input_fields() classmethod

Return declared input field names if input_type is provided; otherwise None.

Returns:

list[str] | None: The input fields.

get_input_spec() classmethod

Return structured spec for input fields if input_type is provided; otherwise None.

Returns:

list[dict[str, Any]] | None: The input spec.

get_normalized_score(raw_score)

Normalize raw score to 0-1 range based on metric's good_score and bad_score.

This method handles both:

- Different scales (e.g., 1-3 for completeness, 0-1 for language_consistency)
- Inverted scales (e.g., redundancy, where lower is better)

Parameters:

raw_score (float): The raw score value from the metric evaluation. Required.

Returns:

float: Normalized score between 0 and 1, where 1 is best and 0 is worst.

Examples:

>>> # Completeness: good=3, bad=1 (higher is better)
>>> metric.get_normalized_score(2)  # Returns 0.5
>>> # Redundancy: good=1, bad=3 (lower is better)
>>> metric.get_normalized_score(2)  # Returns 0.5 (inverted)
>>> # Language Consistency: good=1, bad=0 (already 0-1)
>>> metric.get_normalized_score(0.5)  # Returns 0.5