# Metric

Base class for metrics.

## BaseMetric

Bases: `ABC`

Abstract class for metrics. This class defines the interface for all metrics.
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | The name of the metric. |
| `required_fields` | `set[str]` | The required fields for this metric to evaluate data. |
| `input_type` | `type \| None` | The type of the input data. |
### Example

Adding custom prompts to existing evaluator metrics:

```python
import asyncio
import os

from gllm_evals import load_simple_qa_dataset
from gllm_evals.evaluate import evaluate
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.utils.shared_functionality import inference_fn


async def main():
    # Load your dataset.
    dataset = load_simple_qa_dataset()

    # Create an evaluator with the default metrics.
    evaluator = GEvalGenerationEvaluator(
        model_credentials=os.getenv("GOOGLE_API_KEY")
    )

    # Add custom prompts polymorphically (works for any metric).
    for metric in evaluator.metrics:
        if hasattr(metric, "name"):  # Ensure the metric has a name attribute.
            if metric.name == "geval_completeness":
                # Add domain-specific few-shot examples.
                metric.additional_context += "\n\nCUSTOM MEDICAL EXAMPLES: ..."
            elif metric.name == "geval_groundedness":
                # Add grounding examples.
                metric.additional_context += "\n\nMEDICAL GROUNDING EXAMPLES: ..."

    # Evaluate with the custom prompts applied automatically.
    results = await evaluate(
        data=dataset,
        inference_fn=inference_fn,
        evaluators=[evaluator],  # Custom prompts are applied to the metrics.
    )


asyncio.run(main())
```
### can_evaluate(data)

Check if this metric can evaluate the given data.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `MetricInput` | The input data to check. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `bool` | `bool` | True if the metric can evaluate the data, False otherwise. |
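As a rough illustration, a check like this typically verifies that every required field is present in the input. This is a hypothetical sketch, assuming `MetricInput` behaves like a mapping and the check is based on `required_fields`; the actual implementation may differ.

```python
# Hypothetical sketch of a required-fields check (assumption: the input
# data behaves like a dict and the metric declares required_fields).
def can_evaluate(required_fields: set[str], data: dict) -> bool:
    # The metric can run only if every required field is present in the data.
    return required_fields.issubset(data.keys())


print(can_evaluate({"question", "answer"}, {"question": "q", "answer": "a"}))  # True
print(can_evaluate({"question", "context"}, {"question": "q"}))  # False
```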
### evaluate(data) `async`

Evaluate the metric on the given dataset (single item or batch).

Batch processing is handled automatically by default. Subclasses can override `_evaluate` to accept lists for optimized batch processing.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `MetricInput \| list[MetricInput]` | The data to evaluate the metric on. Can be a single item or a list for batch processing. | required |

Returns:

| Type | Description |
|---|---|
| `MetricOutput \| list[MetricOutput]` | A dictionary where the keys are the namespaces and the values are the scores. Returns a list if the input is a list. |
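The single-item-or-batch dispatch described above can be sketched as follows. This is a minimal toy illustration, not the library's implementation: `ToyMetric` and its `_evaluate` are hypothetical stand-ins.

```python
import asyncio


# Toy sketch (assumption) of the dispatch pattern: evaluate() accepts a
# single item or a list and fans out to a per-item _evaluate coroutine.
class ToyMetric:
    async def _evaluate(self, item: dict) -> dict:
        # Stand-in scoring logic for illustration only.
        return {"toy_score": len(item.get("answer", ""))}

    async def evaluate(self, data):
        if isinstance(data, list):
            # Batch input: evaluate all items concurrently.
            return await asyncio.gather(*(self._evaluate(d) for d in data))
        # Single input: return a single result.
        return await self._evaluate(data)


results = asyncio.run(ToyMetric().evaluate([{"answer": "ok"}, {"answer": "good"}]))
print(results)  # [{'toy_score': 2}, {'toy_score': 4}]
```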
### get_input_fields() `classmethod`

Return the declared input field names if `input_type` is provided; otherwise None.

Returns:

| Type | Description |
|---|---|
| `list[str] \| None` | The input fields. |
### get_input_spec() `classmethod`

Return a structured spec for the input fields if `input_type` is provided; otherwise None.

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]] \| None` | The input spec. |
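To make the `input_type`-driven behavior concrete, here is a hedged sketch of how field names could be derived when `input_type` is, for example, a dataclass. `QAInput` and this `get_input_fields` helper are hypothetical; the library's actual introspection may work differently.

```python
from dataclasses import dataclass, fields


# Hypothetical input type for illustration only.
@dataclass
class QAInput:
    query: str
    response: str


def get_input_fields(input_type):
    # If no input type is declared, there is nothing to introspect.
    if input_type is None:
        return None
    # Derive the declared field names from the dataclass definition.
    return [f.name for f in fields(input_type)]


print(get_input_fields(QAInput))  # ['query', 'response']
```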
### get_normalized_score(raw_score)

Normalize a raw score to the 0-1 range based on the metric's `good_score` and `bad_score`.

This method handles both:

- Different scales (e.g., 1-3 for completeness, 0-1 for language_consistency)
- Inverted scales (e.g., redundancy, where lower is better)

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `raw_score` | `float` | The raw score value from the metric evaluation. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `float` | `float` | Normalized score between 0 and 1, where 1 is best and 0 is worst. |
Examples:

```python
>>> # Completeness: good=3, bad=1 (higher is better)
>>> metric.get_normalized_score(2)  # Returns 0.5

>>> # Redundancy: good=1, bad=3 (lower is better)
>>> metric.get_normalized_score(2)  # Returns 0.5 (inverted)

>>> # Language Consistency: good=1, bad=0 (already 0-1)
>>> metric.get_normalized_score(0.5)  # Returns 0.5
```