
Generation Evaluator

An evaluator for generation tasks, e.g.:
  • Evaluating RAG output
  • Evaluating LLM output
  • Evaluating AI agent output

Authors

Surya Mahadi (made.r.s.mahadi@gdplabs.id)


GenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None)

Bases: BaseEvaluator

Evaluator for generation tasks.

Default expected input
  • query (str): The query that produced the model's output.
  • retrieved_context (str): The retrieved context, used to evaluate the groundedness of the model's output.
  • expected_response (str): The expected response, used to evaluate the completeness of the model's output.
  • generated_response (str): The model's generated response, to be evaluated.
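
For instance, a single input record using these keys might look like the following sketch (values are illustrative only):

```python
# One evaluation record with the documented input keys.
record = {
    "query": "What is the capital of France?",
    "retrieved_context": "France is a country in Western Europe. Its capital is Paris.",
    "expected_response": "The capital of France is Paris.",
    "generated_response": "Paris is the capital of France.",
}
```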

Attributes:
  • name (str): The name of the evaluator.
  • metrics (List[BaseMetric]): The list of metrics to evaluate.
  • run_parallel (bool): Whether to run the metrics in parallel.
  • rule_book (RuleBook | None): The rule book.
  • generation_rule_engine (GenerationRuleEngine | None): The generation rule engine.
  • judge (MultipleLLMAsJudge | None): Optional multiple-LLM judge for ensemble evaluation.

Initialize the GenerationEvaluator.

Parameters:
  • metrics (List[BaseMetric] | None, default: None): A list of metric instances to use as a base pool. If None, defaults to [CompletenessMetric, RedundancyMetric, GroundednessMetric]. Each custom metric must produce a score key in its output.
  • enabled_metrics (List[type[BaseMetric] | str] | None, default: None): A list of metric classes or names to enable from the pool. If None, all metrics in the pool are used.
  • model (str | ModelId | BaseLMInvoker, default: DefaultValues.MODEL): The model to use for the metrics.
  • model_config (dict[str, Any] | None, default: None): The model config to use for the metrics.
  • model_credentials (str | None, default: None): The model credentials, used for initializing the default metrics. Required if any of the default metrics are used.
  • run_parallel (bool, default: True): Whether to run the metrics in parallel.
  • rule_book (RuleBook | None, default: None): The rule book for evaluation. If not provided, a default one is generated, but only if all enabled metrics are from the default set.
  • generation_rule_engine (GenerationRuleEngine | None, default: None): The generation rule engine. Defaults to a new instance built with the determined rule book.
  • judge (MultipleLLMAsJudge | None, default: None): Optional multiple-LLM judge for ensemble evaluation. If provided, multiple judges are used instead of single-model evaluation; composition keeps this concern separate from the evaluator itself.

Raises:
  • ValueError: If model_credentials is not provided when using default metrics.
  • ValueError: If custom metrics, or a mix of custom and default metrics, are used without an explicit rule_book.
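
A minimal construction sketch. The import path is an assumption, and the final call is hypothetical: this page documents only the constructor, and the evaluation entry point is inherited from BaseEvaluator.

```python
# Import path is an assumption; adjust to your installation.
from gllm_evaluation.generation_evaluator import GenerationEvaluator

# The default metric pool (Completeness, Redundancy, Groundedness)
# requires model credentials; omitting them raises ValueError.
evaluator = GenerationEvaluator(
    model_credentials="<your-api-key>",  # placeholder
    run_parallel=True,
)

# Hypothetical call: the evaluation method is inherited from
# BaseEvaluator and is not documented on this page.
# results = evaluator.evaluate(record)
```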

GenerationRuleEngine(rules)

Classify metric dictionaries using an injected RuleBook.

Initialize the GenerationRuleEngine.

Parameters:
  • rules (RuleBook, required): The rules to use for classification.

from_default_rule_book(enabled_metrics=None) staticmethod

Create a GenerationRuleEngine from the default rule book.

Parameters:
  • enabled_metrics (List[str] | None, default: None): A list of metric names to include. If None, all are included.

Returns:
  • GenerationRuleEngine: A rule engine built from the default rule book.
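
For example, a one-line sketch (the metric names passed to enabled_metrics are assumptions; use the names your metric pool actually emits):

```python
# Build a rule engine backed by the default rule book, restricted
# to a subset of the default metric names.
engine = GenerationRuleEngine.from_default_rule_book(
    enabled_metrics=["completeness", "groundedness"]
)
```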

get_adaptive_rule_book(available_metrics) staticmethod

Generate a rule book based on the available metrics.

Parameters:
  • available_metrics (List[str], required): The list of available metrics.

Returns:
  • RuleBook: The adaptive rule book.
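
A sketch of pairing the adaptive rule book with the constructor above (metric names are assumptions):

```python
# Generate rules for just the metrics you actually computed, then
# inject them into a rule engine.
rule_book = GenerationRuleEngine.get_adaptive_rule_book(
    available_metrics=["completeness", "groundedness"]
)
engine = GenerationRuleEngine(rules=rule_book)
```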

get_default_bad_rule(enabled_metrics=None) staticmethod

Get the bad rule.

Parameters:
  • enabled_metrics (List[str] | None, default: None): A list of metric names to include. If None, all are included.

Returns:
  • MetricSpec: The bad rule.

get_default_good_rule(enabled_metrics=None) staticmethod

Get the good rule.

Parameters:
  • enabled_metrics (List[str] | None, default: None): A list of metric names to include. If None, all are included.

Returns:
  • MetricSpec: The good rule.

get_default_issue_rules(enabled_metrics=None) staticmethod

Get the issue rules.

Parameters:
  • enabled_metrics (List[str] | None, default: None): A list of metric names to include. If None, all are included.

Returns:
  • Mapping[Issue, MetricSpec]: The issue rules.
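
The three getters above pair naturally with the RuleBook dataclass documented below; a sketch, equivalent in spirit to calling get_default_rule_book:

```python
good = GenerationRuleEngine.get_default_good_rule()
bad = GenerationRuleEngine.get_default_bad_rule()
issue_rules = GenerationRuleEngine.get_default_issue_rules()

# Assemble the pieces into a RuleBook (fields per the dataclass below).
rule_book = RuleBook(good=good, bad=bad, issue_rules=issue_rules)
```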

get_default_rule_book(enabled_metrics=None) staticmethod

Get the default rule book.

Parameters:
  • enabled_metrics (List[str] | None, default: None): A list of metric names to include. If None, all are included.

Returns:
  • RuleBook: The default rule book.

infer(metrics)

Return a dict ready for CSV/Sheets export.

Parameters:
  • metrics (Mapping[str, Any], required): The metrics to classify.

Returns:
  • Dict[str, Any]: The classified metrics.
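
A sketch of a classification call; the metric keys and the 0-1 score scale are assumptions, so use whatever score keys your metrics emit:

```python
engine = GenerationRuleEngine.from_default_rule_book()

metrics = {"completeness": 0.9, "redundancy": 0.1, "groundedness": 1.0}
row = engine.infer(metrics)
# `row` is a flat dict, ready to append to a CSV or Sheets export.
```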

Issue

Bases: StrEnum

Enumeration of potential issues with the model's output.

Attributes:
  • RETRIEVAL_ISSUE (str): Indicates a potential issue with the retrieval process.
  • GENERATION_ISSUE (str): Indicates a potential issue with the generation process.

Relevancy

Bases: StrEnum

Enumeration of relevance conditions.

Attributes:
  • GOOD (str): Indicates that the model's output meets the good relevance conditions.
  • BAD (str): Indicates that the model's output meets the bad relevance conditions.
  • INCOMPLETE (str): Indicates that the model's output is incomplete or requires manual evaluation.
  • NEED_MANUAL_EVAL (str): Indicates that the model's output requires manual evaluation.

RuleBook(good, bad, issue_rules) dataclass

Bundle of good and bad composite specs.

Both are plain MetricSpec objects, so you can build them with & / | / ~ or any other sutoppu combinator.

Attributes:
  • good (MetricSpec): The good rule.
  • bad (MetricSpec): The bad rule.
  • issue_rules (Mapping[Issue, MetricSpec]): The issue rules.
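
Since good, bad, and issue_rules are plain MetricSpec values, a custom RuleBook can be sketched with the combinators mentioned above. Here completeness_ok and grounded_ok stand in for existing MetricSpec instances; how MetricSpec itself is constructed is not documented on this page.

```python
good = completeness_ok & grounded_ok   # both specs must hold
bad = ~grounded_ok                     # ungrounded output is bad
issue_rules = {Issue.RETRIEVAL_ISSUE: ~grounded_ok}

rule_book = RuleBook(good=good, bad=bad, issue_rules=issue_rules)
```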