
Generation Evaluator

An evaluator for generation tasks, e.g.:
  • Evaluating RAG output
  • Evaluating LLM output
  • Evaluating AI agent output

Authors

Surya Mahadi (made.r.s.mahadi@gdplabs.id)


GenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None)

Bases: BaseEvaluator

Evaluator for generation tasks.

Default expected input
  • query (str): The query that produced the model's output.
  • retrieved_context (str): The retrieved context, used to evaluate the groundedness of the model's output.
  • expected_response (str): The expected response, used to evaluate the completeness of the model's output.
  • generated_response (str): The model's generated response, to be evaluated.
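
For instance, a single input record using these keys might look like the following sketch (values are illustrative only):

```python
# One evaluation record with the documented input keys.
record = {
    "query": "What is the capital of France?",
    "retrieved_context": "France is a country in Western Europe. Its capital is Paris.",
    "expected_response": "The capital of France is Paris.",
    "generated_response": "Paris is the capital of France.",
}
```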

Attributes:
  • name (str): The name of the evaluator.
  • metrics (List[BaseMetric]): The list of metrics to evaluate.
  • run_parallel (bool): Whether to run the metrics in parallel.
  • rule_book (RuleBook | None): The rule book.
  • generation_rule_engine (GenerationRuleEngine | None): The generation rule engine.
  • judge (MultipleLLMAsJudge | None): Optional multiple-LLM judge for ensemble evaluation.

Initialize the GenerationEvaluator.

Parameters:
  • metrics (List[BaseMetric] | None, default: None): A list of metric instances to use as a base pool. If None, defaults to [CompletenessMetric, RedundancyMetric, GroundednessMetric]. Each custom metric must produce a score key in its output.
  • enabled_metrics (List[type[BaseMetric] | str] | None, default: None): A list of metric classes or names to enable from the pool. If None, all metrics in the pool are used.
  • model (str | ModelId | BaseLMInvoker, default: DefaultValues.MODEL): The model to use for the metrics.
  • model_config (dict[str, Any] | None, default: None): The model config to use for the metrics.
  • model_credentials (str | None, default: None): The model credentials, used for initializing the default metrics. Required if any of the default metrics are used.
  • run_parallel (bool, default: True): Whether to run the metrics in parallel.
  • rule_book (RuleBook | None, default: None): The rule book for evaluation. If not provided, a default one is generated, but only if all enabled metrics are from the default set.
  • generation_rule_engine (GenerationRuleEngine | None, default: None): The generation rule engine. Defaults to a new instance built with the determined rule book.
  • judge (MultipleLLMAsJudge | None, default: None): Optional multiple-LLM judge for ensemble evaluation. If provided, multiple judges are used instead of single-model evaluation; composition keeps this concern separate from the evaluator itself.

Raises:
  • ValueError: If model_credentials is not provided when using default metrics.
  • ValueError: If custom metrics, or a mix of custom and default metrics, are used without an explicit rule_book.
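
A minimal construction sketch. The import path is an assumption, and the final call is hypothetical: this page documents only the constructor, and the evaluation entry point is inherited from BaseEvaluator.

```python
# Import path is an assumption; adjust to your installation.
from gllm_evaluation.generation_evaluator import GenerationEvaluator

# The default metric pool (Completeness, Redundancy, Groundedness)
# requires model credentials; omitting them raises ValueError.
evaluator = GenerationEvaluator(
    model_credentials="<your-api-key>",  # placeholder
    run_parallel=True,
)

# Hypothetical call: the evaluation method is inherited from
# BaseEvaluator and is not documented on this page.
# results = evaluator.evaluate(record)
```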

GenerationRuleEngine(rules)

Classify metric dictionaries using an injected RuleBook.

Initialize the GenerationRuleEngine.

Parameters:
  • rules (RuleBook, required): The rules to use for classification.

from_default_rule_book(enabled_metrics=None) staticmethod

Create a GenerationRuleEngine from the default rule book.

Parameters:
  • enabled_metrics (List[str] | None, default: None): A list of metric names to include. If None, all are included.

Returns:
  • GenerationRuleEngine: A rule engine built from the default rule book.
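
For example, a one-line sketch (the metric names passed to enabled_metrics are assumptions; use the names your metric pool actually emits):

```python
# Build a rule engine backed by the default rule book, restricted
# to a subset of the default metric names.
engine = GenerationRuleEngine.from_default_rule_book(
    enabled_metrics=["completeness", "groundedness"]
)
```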

get_adaptive_rule_book(available_metrics) staticmethod

Generate a rule book based on the available metrics.

Parameters:
  • available_metrics (List[str], required): The list of available metrics.

Returns:
  • RuleBook: The adaptive rule book.
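
A sketch of pairing the adaptive rule book with the constructor above (metric names are assumptions):

```python
# Generate rules for just the metrics you actually computed, then
# inject them into a rule engine.
rule_book = GenerationRuleEngine.get_adaptive_rule_book(
    available_metrics=["completeness", "groundedness"]
)
engine = GenerationRuleEngine(rules=rule_book)
```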

get_default_bad_rule(enabled_metrics=None) staticmethod

Get the bad rule.

Parameters:
  • enabled_metrics (List[str] | None, default: None): A list of metric names to include. If None, all are included.

Returns:
  • MetricSpec: The bad rule.

get_default_good_rule(enabled_metrics=None) staticmethod

Get the good rule.

Parameters:
  • enabled_metrics (List[str] | None, default: None): A list of metric names to include. If None, all are included.

Returns:
  • MetricSpec: The good rule.

get_default_issue_rules(enabled_metrics=None) staticmethod

Get the issue rules.

Parameters:
  • enabled_metrics (List[str] | None, default: None): A list of metric names to include. If None, all are included.

Returns:
  • Mapping[Issue, MetricSpec]: The issue rules.
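
The three getters above pair naturally with the RuleBook dataclass documented below; a sketch, equivalent in spirit to calling get_default_rule_book:

```python
good = GenerationRuleEngine.get_default_good_rule()
bad = GenerationRuleEngine.get_default_bad_rule()
issue_rules = GenerationRuleEngine.get_default_issue_rules()

# Assemble the pieces into a RuleBook (fields per the dataclass below).
rule_book = RuleBook(good=good, bad=bad, issue_rules=issue_rules)
```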

get_default_rule_book(enabled_metrics=None) staticmethod

Get the default rule book.

Parameters:
  • enabled_metrics (List[str] | None, default: None): A list of metric names to include. If None, all are included.

Returns:
  • RuleBook: The default rule book.

infer(metrics)

Return a dict ready for CSV/Sheets export.

Parameters:
  • metrics (Mapping[str, Any], required): The metrics to classify.

Returns:
  • Dict[str, Any]: The classified metrics.
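
A sketch of a classification call; the metric keys and the 0-1 score scale are assumptions, so use whatever score keys your metrics emit:

```python
engine = GenerationRuleEngine.from_default_rule_book()

metrics = {"completeness": 0.9, "redundancy": 0.1, "groundedness": 1.0}
row = engine.infer(metrics)
# `row` is a flat dict, ready to append to a CSV or Sheets export.
```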

Issue

Bases: StrEnum

Enumeration of potential issues with the model's output.

Attributes:
  • RETRIEVAL_ISSUE (str): Indicates a potential issue with the retrieval process.
  • GENERATION_ISSUE (str): Indicates a potential issue with the generation process.

Relevancy

Bases: StrEnum

Enumeration of relevance conditions.

Attributes:
  • GOOD (str): Indicates that the model's output meets the good relevance conditions.
  • BAD (str): Indicates that the model's output meets the bad relevance conditions.
  • INCOMPLETE (str): Indicates that the model's output is incomplete or requires manual evaluation.
  • NEED_MANUAL_EVAL (str): Indicates that the model's output requires manual evaluation.

RuleBook(good, bad, issue_rules) dataclass

Bundle of good and bad composite specs.

Both are plain MetricSpec objects, so you can build them with & / | / ~ or any other sutoppu combinator.

Attributes:
  • good (MetricSpec): The good rule.
  • bad (MetricSpec): The bad rule.
  • issue_rules (Mapping[Issue, MetricSpec]): The issue rules.
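
Since good, bad, and issue_rules are plain MetricSpec values, a custom RuleBook can be sketched with the combinators mentioned above. Here completeness_ok and grounded_ok stand in for existing MetricSpec instances; how MetricSpec itself is constructed is not documented on this page.

```python
good = completeness_ok & grounded_ok   # both specs must hold
bad = ~grounded_ok                     # ungrounded output is bad
issue_rules = {Issue.RETRIEVAL_ISSUE: ~grounded_ok}

rule_book = RuleBook(good=good, bad=bad, issue_rules=issue_rules)
```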