# Metric

Base class for metrics.

## BaseMetric

Bases: `ABC`

Abstract class for metrics. This class defines the interface for all metrics.
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | The name of the metric. |
| `required_fields` | `set[str]` | The required fields for this metric to evaluate data. |
| `input_type` | `type \| None` | The type of the input data. |
### Example

Adding custom prompts to existing evaluator metrics:

```python
import asyncio
import os

from gllm_evals import load_simple_qa_dataset
from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.utils.shared_functionality import inference_fn


async def main():
    # Load your dataset.
    dataset = load_simple_qa_dataset()

    # Create an evaluator with the default metrics.
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Add custom prompts polymorphically (works for any metric).
    for metric in evaluator.metrics:
        if hasattr(metric, "name"):  # Ensure the metric has a name attribute.
            if metric.name == "geval_completeness":
                # Add domain-specific few-shot examples.
                metric.additional_context += "\n\nCUSTOM MEDICAL EXAMPLES: ..."
            elif metric.name == "geval_groundedness":
                # Add grounding examples.
                metric.additional_context += "\n\nMEDICAL GROUNDING EXAMPLES: ..."

    # Evaluate with the custom prompts applied automatically.
    results = await evaluate(
        data=dataset,
        inference_fn=inference_fn,
        evaluators=[evaluator],  # Custom prompts are applied to the metrics.
    )


asyncio.run(main())
```
### can_evaluate(data)

Check if this metric can evaluate the given data.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `MetricInput` | The input data to check. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `bool` | `bool` | True if the metric can evaluate the data, False otherwise. |
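As a rough illustration, a check like this typically verifies that every required field is present in the input. This is a hypothetical sketch, assuming `MetricInput` behaves like a mapping and the check is based on `required_fields`; the actual implementation may differ.

```python
# Hypothetical sketch of a required-fields check (assumption: the input
# data behaves like a dict and the metric declares required_fields).
def can_evaluate(required_fields: set[str], data: dict) -> bool:
    # The metric can run only if every required field is present in the data.
    return required_fields.issubset(data.keys())


print(can_evaluate({"question", "answer"}, {"question": "q", "answer": "a"}))  # True
print(can_evaluate({"question", "context"}, {"question": "q"}))  # False
```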
### evaluate(data) `async`

Evaluate the metric on the given dataset (single item or batch).

Batch processing is handled automatically by default. Subclasses can override `_evaluate` to accept lists for optimized batch processing.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `MetricInput \| list[MetricInput]` | The data to evaluate the metric on. Can be a single item or a list for batch processing. | required |

Returns:

| Type | Description |
|---|---|
| `MetricOutput \| list[MetricOutput]` | A dictionary where the keys are the namespaces and the values are the scores. Returns a list if the input is a list. |
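The single-item-or-batch dispatch described above can be sketched as follows. This is a minimal toy illustration, not the library's implementation: `ToyMetric` and its `_evaluate` are hypothetical stand-ins.

```python
import asyncio


# Toy sketch (assumption) of the dispatch pattern: evaluate() accepts a
# single item or a list and fans out to a per-item _evaluate coroutine.
class ToyMetric:
    async def _evaluate(self, item: dict) -> dict:
        # Stand-in scoring logic for illustration only.
        return {"toy_score": len(item.get("answer", ""))}

    async def evaluate(self, data):
        if isinstance(data, list):
            # Batch input: evaluate all items concurrently.
            return await asyncio.gather(*(self._evaluate(d) for d in data))
        # Single input: return a single result.
        return await self._evaluate(data)


results = asyncio.run(ToyMetric().evaluate([{"answer": "ok"}, {"answer": "good"}]))
print(results)  # [{'toy_score': 2}, {'toy_score': 4}]
```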
### get_input_fields() `classmethod`

Return the declared input field names if `input_type` is provided; otherwise None.

Returns:

| Type | Description |
|---|---|
| `list[str] \| None` | The input fields. |
### get_input_spec() `classmethod`

Return a structured spec for the input fields if `input_type` is provided; otherwise None.

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]] \| None` | The input spec. |
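To make the `input_type`-driven behavior concrete, here is a hedged sketch of how field names could be derived when `input_type` is, for example, a dataclass. `QAInput` and this `get_input_fields` helper are hypothetical; the library's actual introspection may work differently.

```python
from dataclasses import dataclass, fields


# Hypothetical input type for illustration only.
@dataclass
class QAInput:
    query: str
    response: str


def get_input_fields(input_type):
    # If no input type is declared, there is nothing to introspect.
    if input_type is None:
        return None
    # Derive the declared field names from the dataclass definition.
    return [f.name for f in fields(input_type)]


print(get_input_fields(QAInput))  # ['query', 'response']
```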
### get_normalized_score(raw_score)

Normalize a raw score to the 0-1 range based on the metric's `good_score` and `bad_score`.

This method handles both:

- Different scales (e.g., 1-3 for completeness, 0-1 for language_consistency)
- Inverted scales (e.g., redundancy, where lower is better)

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `raw_score` | `float` | The raw score value from the metric evaluation. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `float` | `float` | Normalized score between 0 and 1, where 1 is best and 0 is worst. |
Examples:

```python
>>> # Completeness: good=3, bad=1 (higher is better)
>>> metric.get_normalized_score(2)  # Returns 0.5

>>> # Redundancy: good=1, bad=3 (lower is better)
>>> metric.get_normalized_score(2)  # Returns 0.5 (inverted)

>>> # Language Consistency: good=1, bad=0 (already 0-1)
>>> metric.get_normalized_score(0.5)  # Returns 0.5
```