RAG Evaluator
This evaluator combines retrieval and generation evaluation for RAG pipelines using:
- LMBasedRetrievalEvaluator for retrieval quality
- GEvalGenerationEvaluator for generation quality

It applies a customizable rule-based combiner over both retrieval and generation metrics to derive an overall RAG rating and explanation. By default, it uses the rule book from the generation evaluator.
HybridRuleBook(good, bad, issue_rules)
dataclass
Rule book that supports both MetricSpec (int) and FloatMetricSpec (float).
This allows combining generation metrics (int-based) and retrieval metrics (float-based) in a single rule book.
Attributes:

| Name | Type | Description |
|---|---|---|
| good | Specification | The good rule (can be MetricSpec, FloatMetricSpec, or a combination). |
| bad | Specification | The bad rule (can be MetricSpec, FloatMetricSpec, or a combination). |
| issue_rules | Mapping[Issue, Specification] | Issue detection rules. |
HybridRuleEngine(rules)
Bases: BaseRuleEngine[HybridRuleBook, Specification, Relevancy]
Rule engine that handles both int-based (MetricSpec) and float-based (FloatMetricSpec) metrics.
This engine can evaluate rules that combine generation metrics (int 0-4) and retrieval metrics (float 0.0-1.0) in a single rule book.
Initialize the HybridRuleEngine.
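As a rough sketch, a hybrid rule book can be built from both metric kinds and handed to the engine. The import paths, the MetricSpec / FloatMetricSpec constructor signatures, the metric names, and the `&` / `|` combination operators below are illustrative assumptions, not the confirmed API:

```python
# NOTE: all import paths and spec signatures here are assumptions for illustration.
from rag_evaluator import HybridRuleBook, HybridRuleEngine, Issue
from rag_evaluator import MetricSpec, FloatMetricSpec

# Assumed behavior: MetricSpec thresholds an int generation metric (0-4) and
# FloatMetricSpec thresholds a float retrieval metric (0.0-1.0); specs combine
# into a single Specification with & (and) / | (or).
good_rule = MetricSpec("completeness", min_value=3) & FloatMetricSpec(
    "context_relevancy", min_value=0.7
)
bad_rule = MetricSpec("completeness", max_value=1) | FloatMetricSpec(
    "context_relevancy", max_value=0.3
)

rule_book = HybridRuleBook(
    good=good_rule,
    bad=bad_rule,
    issue_rules={
        # Assumed Issue type; flags low-relevancy retrieval as a named issue.
        Issue("irrelevant_context"): FloatMetricSpec("context_relevancy", max_value=0.3),
    },
)
rule_engine = HybridRuleEngine(rules=rule_book)
```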
RAGEvaluator(retrieval_evaluator=None, generation_evaluator=None, retrieval_metrics=None, generation_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, rule_engine=None, judge=None, refusal_metric=None)
Bases: BaseEvaluator
Evaluator for RAG pipelines combining retrieval and generation evaluation.
This evaluator:
- Runs retrieval evaluation using LMBasedRetrievalEvaluator
- Runs generation evaluation using GEvalGenerationEvaluator
- Combines their scores using a customizable rule-based scheme to produce:
  - relevancy_rating (good / bad / incomplete)
  - score (aggregated RAG score)
  - possible_issues (list of textual issues)
Important Note on Rule Engine:
By default, this evaluator uses GenerationRuleEngine with RuleBook, which works
with generation metrics (int scores 0-4). To include retrieval metrics in rule-based
classification, use HybridRuleBook with HybridRuleEngine, which supports both
MetricSpec (int-based) and FloatMetricSpec (float-based) metrics.
Default expected input
- query (str): The query to evaluate.
- expected_response (str): The expected response.
- retrieved_context (str | list[str]): The retrieved contexts.
- generated_response (str): The generated response.
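For orientation, a minimal end-to-end sketch with the default configuration. The import path, the `evaluate` method name and its synchronous call style, and the shape of the returned result are assumptions; check the package for the actual entry point:

```python
# NOTE: import path and method name are assumptions for illustration.
from rag_evaluator import RAGEvaluator

evaluator = RAGEvaluator(
    model="gpt-4o",              # placeholder model identifier
    model_credentials="sk-...",  # needed when the model is given by id
)

# Input follows the default expected fields listed above.
result = evaluator.evaluate(
    {
        "query": "What is the capital of France?",
        "expected_response": "Paris is the capital of France.",
        "retrieved_context": ["Paris is the capital and largest city of France."],
        "generated_response": "The capital of France is Paris.",
    }
)

print(result["relevancy_rating"])  # "good" / "bad" / "incomplete"
print(result["score"])             # aggregated RAG score
print(result["possible_issues"])   # list of textual issues (possibly empty)
```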
Attributes:

| Name | Type | Description |
|---|---|---|
| name | str | The name of the evaluator. |
| retrieval_evaluator | LMBasedRetrievalEvaluator | The retrieval evaluator. |
| generation_evaluator | GEvalGenerationEvaluator | The generation evaluator. |
| rule_book | RuleBook \| HybridRuleBook \| None | The rule book for evaluation (uses generation metrics by default). Use HybridRuleBook to include retrieval metrics. |
| rule_engine | GenerationRuleEngine \| HybridRuleEngine \| None | The rule engine for classification. Uses GenerationRuleEngine by default; use HybridRuleEngine with a HybridRuleBook. |
| run_parallel | bool | Whether to run retrieval and generation evaluations in parallel. |
Initialize the RAG evaluator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| retrieval_evaluator | LMBasedRetrievalEvaluator \| None | Pre-configured retrieval evaluator. If provided, it is used directly and the retrieval_* parameters are ignored. Defaults to None. | None |
| generation_evaluator | GEvalGenerationEvaluator \| None | Pre-configured generation evaluator. If provided, it is used directly and the generation_* parameters are ignored. Defaults to None. | None |
| retrieval_metrics | Sequence[BaseMetric] \| None | Optional custom retrieval metric instances. Used only if retrieval_evaluator is None. If provided, these form the base pool and may override the default metrics by name. Defaults to None. | None |
| generation_metrics | Sequence[BaseMetric] \| None | Optional custom generation metric instances. Used only if generation_evaluator is None. If provided, these form the base pool and may override the default metrics by name. Defaults to None. | None |
| model | str \| ModelId \| BaseLMInvoker | Model for the default metrics. Used only if evaluators are None. Defaults to DefaultValues.MODEL. | DefaultValues.MODEL |
| model_credentials | str \| None | Credentials for the model, required when model is not a pre-configured invoker. Defaults to None. | None |
| model_config | dict[str, Any] \| None | Optional model configuration. Used only if evaluators are None. Defaults to None. | None |
| run_parallel | bool | Whether to run retrieval and generation evaluations in parallel. Used only if evaluators are None. Defaults to True. | True |
| rule_book | RuleBook \| HybridRuleBook \| None | The rule book for evaluation. If not provided, a default one is generated based on enabled generation metrics. Use HybridRuleBook to include retrieval metrics. Defaults to None. | None |
| rule_engine | GenerationRuleEngine \| HybridRuleEngine \| None | The rule engine for classification. If not provided, a new instance is created with the determined rule book. Use HybridRuleEngine with a HybridRuleBook. Defaults to None. | None |
| judge | MultipleLLMAsJudge \| None | Optional multiple-LLM judge for ensemble evaluation. Used only if generation_evaluator is None. Defaults to None. | None |
| refusal_metric | type[BaseMetric] \| None | The refusal metric for the generation evaluator. Used only if generation_evaluator is None; if None, the default refusal metric is used. Defaults to None. | None |
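Tying the constructor parameters to the rule-engine note above, a hedged sketch of opting into hybrid classification (same caveats: import paths and spec signatures are assumed):

```python
# NOTE: import paths and spec signatures are assumptions for illustration.
from rag_evaluator import (
    RAGEvaluator, HybridRuleBook, HybridRuleEngine, MetricSpec, FloatMetricSpec,
)

rule_book = HybridRuleBook(
    good=MetricSpec("completeness", min_value=3)
    & FloatMetricSpec("context_relevancy", min_value=0.7),
    bad=MetricSpec("completeness", max_value=1),
    issue_rules={},
)

evaluator = RAGEvaluator(
    model="gpt-4o",                          # placeholder model identifier
    model_credentials="sk-...",
    run_parallel=True,                       # retrieval + generation concurrently
    rule_book=rule_book,                     # hybrid: int + float metrics
    rule_engine=HybridRuleEngine(rules=rule_book),
)
```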
required_fields
property
Return the union of required fields from both evaluators.
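For example (assuming the property yields the field names as a set):

```python
from rag_evaluator import RAGEvaluator  # assumed import path

evaluator = RAGEvaluator(model="gpt-4o", model_credentials="sk-...")
print(evaluator.required_fields)
# e.g. {"query", "expected_response", "retrieved_context", "generated_response"}
```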