# Generation evaluator

An evaluator for generation tasks, e.g.:

- Evaluating RAG output
- Evaluating LLM output
- Evaluating AI agent output
## `GenerationEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, generation_rule_engine=None, judge=None)`

Bases: `BaseEvaluator`

Evaluator for generation tasks.
Default expected input:

- `query` (str): The input query.
- `retrieved_context` (str): The retrieved context, used to evaluate the groundedness of the model's output.
- `expected_response` (str): The expected response, used to evaluate the completeness of the model's output.
- `generated_response` (str): The model's generated response to be evaluated.
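As an illustrative sketch, an input record matching this schema might look like the following; the field names come from the list above, while the values are invented for the example:

```python
# Hypothetical input record for GenerationEvaluator. The keys follow the
# "default expected input" fields documented above; the values are invented.
evaluation_input = {
    "query": "What is the capital of France?",
    "retrieved_context": "France is a country in Europe. Its capital is Paris.",
    "expected_response": "The capital of France is Paris.",
    "generated_response": "Paris is the capital of France.",
}
```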
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | The name of the evaluator. |
| `metrics` | `List[BaseMetric]` | The list of metrics to evaluate. |
| `run_parallel` | `bool` | Whether to run the metrics in parallel. |
| `rule_book` | `RuleBook \| None` | The rule book. |
| `generation_rule_engine` | `GenerationRuleEngine \| None` | The generation rule engine. |
| `judge` | `MultipleLLMAsJudge \| None` | Optional multiple-LLM judge for ensemble evaluation. |
Initialize the GenerationEvaluator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `List[BaseMetric] \| None` | A list of metric instances to use as a base pool. If None, a default set of metrics is used. | `None` |
| `enabled_metrics` | `List[type[BaseMetric] \| str] \| None` | A list of metric classes or names to enable from the pool. If None, all metrics in the pool are used. | `None` |
| `model` | `str \| ModelId \| BaseLMInvoker` | The model to use for the metrics. | `MODEL` |
| `model_config` | `dict[str, Any] \| None` | The model config to use for the metrics. | `None` |
| `model_credentials` | `str \| None` | The model credentials, used for initializing default metrics. Required if any of the default metrics are used. | `None` |
| `run_parallel` | `bool` | Whether to run the metrics in parallel. Defaults to True. | `True` |
| `rule_book` | `RuleBook \| None` | The rule book for evaluation. If not provided, a default one is generated, but only if all enabled metrics are from the default set. | `None` |
| `generation_rule_engine` | `GenerationRuleEngine \| None` | The generation rule engine. Defaults to a new instance built from the determined rule book. | `None` |
| `judge` | `MultipleLLMAsJudge \| None` | Optional multiple-LLM judge for ensemble evaluation. If provided, multiple judges are used instead of single-model evaluation. Uses a composition pattern for clean separation of concerns. | `None` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If |
| `ValueError` | If a custom |
## `GenerationRuleEngine(rules)`

Classify metric dictionaries using an injected `RuleBook`.

Initialize the GenerationRuleEngine.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `rules` | `RuleBook` | The rules to use for classification. | *required* |
### `from_default_rule_book(enabled_metrics=None)`

*staticmethod*

Build a rule engine from the default rule book.

Returns:

| Name | Type | Description |
|---|---|---|
| `GenerationRuleEngine` | `GenerationRuleEngine` | A rule engine built from the default rule book. |
### `get_adaptive_rule_book(available_metrics)`

*staticmethod*

Generate a rule book based on the available metrics.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `available_metrics` | `List[str]` | The list of available metrics. | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `RuleBook` | `RuleBook` | The adaptive rule book. |
### `get_default_bad_rule(enabled_metrics=None)`

*staticmethod*

Get the default bad rule.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `enabled_metrics` | `List[str] \| None` | A list of metric names to include. If None, all are included. | `None` |

Returns:

| Name | Type | Description |
|---|---|---|
| `MetricSpec` | `MetricSpec` | The bad rule. |
### `get_default_good_rule(enabled_metrics=None)`

*staticmethod*

Get the default good rule.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `enabled_metrics` | `List[str] \| None` | A list of metric names to include. If None, all are included. | `None` |

Returns:

| Name | Type | Description |
|---|---|---|
| `MetricSpec` | `MetricSpec` | The good rule. |
### `get_default_issue_rules(enabled_metrics=None)`

*staticmethod*

Get the default issue rules.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `enabled_metrics` | `List[str] \| None` | A list of metric names to include. If None, all are included. | `None` |

Returns:

| Type | Description |
|---|---|
| `Mapping[Issue, MetricSpec]` | The issue rules. |
### `get_default_rule_book(enabled_metrics=None)`

*staticmethod*

Get the default rule book.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `enabled_metrics` | `List[str] \| None` | A list of metric names to include. If None, all are included. | `None` |

Returns:

| Name | Type | Description |
|---|---|---|
| `RuleBook` | `RuleBook` | The default rule book. |
### `infer(metrics)`

Classify the given metrics and return a dict ready for CSV/Sheets export.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `Mapping[str, Any]` | The metrics to classify. | *required* |

Returns:

| Type | Description |
|---|---|
| `Dict[str, Any]` | The classified metrics. |
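A minimal sketch of what rule-based classification of a metric dictionary can look like; the metric names, thresholds, and labels here are assumptions for illustration, not the library's actual rules:

```python
# Illustrative analogue of GenerationRuleEngine.infer: classify a flat
# metric dict and return a flat dict ready for CSV export. The metric
# names ("groundedness", "completeness") and thresholds are hypothetical.
from typing import Any, Dict, Mapping


def classify(metrics: Mapping[str, float]) -> Dict[str, Any]:
    # A "good" rule: all tracked metrics clear their thresholds.
    good = metrics.get("groundedness", 0.0) >= 0.8 and metrics.get("completeness", 0.0) >= 0.8
    # A "bad" rule: groundedness is clearly failing.
    bad = metrics.get("groundedness", 1.0) < 0.5
    row: Dict[str, Any] = dict(metrics)  # keep the raw scores in the export row
    row["relevancy"] = "good" if good else ("bad" if bad else "need_manual_eval")
    return row
```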
## `Issue`

Bases: `StrEnum`

Enumeration of potential issues with the model's output.

Attributes:

| Name | Type | Description |
|---|---|---|
| `RETRIEVAL_ISSUE` | `str` | Indicates a potential issue with the retrieval process. |
| `GENERATION_ISSUE` | `str` | Indicates a potential issue with the generation process. |
## `Relevancy`

Bases: `StrEnum`

Enumeration of relevance conditions.

Attributes:

| Name | Type | Description |
|---|---|---|
| `GOOD` | `str` | Indicates that the model's output meets the good relevance conditions. |
| `BAD` | `str` | Indicates that the model's output meets the bad relevance conditions. |
| `INCOMPLETE` | `str` | Indicates that the model's output is incomplete or requires manual evaluation. |
| `NEED_MANUAL_EVAL` | `str` | Indicates that the model's output requires manual evaluation. |
## `RuleBook(good, bad, issue_rules)`

*dataclass*

Bundle of good and bad composite specs.

Both are plain `MetricSpec` objects, so you can build them with `&` / `|` / `~` or any other sutoppu combinator.

Attributes:

| Name | Type | Description |
|---|---|---|
| `good` | `MetricSpec` | The good rule. |
| `bad` | `MetricSpec` | The bad rule. |
| `issue_rules` | `Mapping[Issue, MetricSpec]` | The issue rules. |
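To illustrate the combinator style the `&` / `|` / `~` operators refer to, here is a minimal specification-pattern class in the spirit of sutoppu; it is a sketch, not the library's actual `MetricSpec`, and the metric names and thresholds are invented:

```python
# Minimal specification-pattern sketch: composite rules built with & / | / ~,
# in the style of sutoppu combinators. This Spec class and the thresholds
# below are illustrative, not the library's MetricSpec implementation.
from typing import Callable, Mapping


class Spec:
    def __init__(self, pred: Callable[[Mapping[str, float]], bool]):
        self.pred = pred

    def is_satisfied_by(self, candidate: Mapping[str, float]) -> bool:
        return self.pred(candidate)

    def __and__(self, other: "Spec") -> "Spec":
        return Spec(lambda c: self.is_satisfied_by(c) and other.is_satisfied_by(c))

    def __or__(self, other: "Spec") -> "Spec":
        return Spec(lambda c: self.is_satisfied_by(c) or other.is_satisfied_by(c))

    def __invert__(self) -> "Spec":
        return Spec(lambda c: not self.is_satisfied_by(c))


# Hypothetical atomic rules over a flat metric dict.
grounded = Spec(lambda m: m.get("groundedness", 0.0) >= 0.8)
complete = Spec(lambda m: m.get("completeness", 0.0) >= 0.8)

good = grounded & complete   # both conditions must hold
bad = ~grounded | ~complete  # either condition fails
```

A `RuleBook` then bundles such composites, so the engine only needs to ask `good.is_satisfied_by(metrics)` or `bad.is_satisfied_by(metrics)` when classifying a row.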