RAG Evaluator
This evaluator combines retrieval and generation evaluation for RAG pipelines using:

- LMBasedRetrievalEvaluator for retrieval quality
- GEvalGenerationEvaluator for generation quality

It applies a customizable rule-based combiner over both the retrieval and generation metrics to derive an overall RAG rating and explanation. By default, it uses the rule book from the generation evaluator.
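A minimal construction sketch follows. The import path, model identifier, and credentials are assumptions (the package layout is not shown on this page), so adjust them to your installation; the keyword arguments match the `RAGEvaluator` signature documented below.

```python
# Import path is an assumption; adjust to wherever RAGEvaluator lives in your package.
from your_package.evaluators import RAGEvaluator  # hypothetical path

# Default setup: both sub-evaluators are built internally from `model`, and the
# rule book/engine default to the generation evaluator's rules.
evaluator = RAGEvaluator(
    model="gpt-4o-mini",         # assumed model identifier; any str | ModelId | BaseLMInvoker
    model_credentials="sk-...",  # placeholder credentials
    run_parallel=True,           # run retrieval and generation evaluation concurrently
)
```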
HybridRuleBook(good, bad, issue_rules)
dataclass
Rule book that supports both MetricSpec (int) and FloatMetricSpec (float).
This allows combining generation metrics (int-based) and retrieval metrics (float-based) in a single rule book.
Attributes:

| Name | Type | Description |
|---|---|---|
| `good` | `Specification` | The good rule (can be `MetricSpec`, `FloatMetricSpec`, or a combination). |
| `bad` | `Specification` | The bad rule (can be `MetricSpec`, `FloatMetricSpec`, or a combination). |
| `issue_rules` | `Mapping[Issue, Specification]` | Issue detection rules. |
HybridRuleEngine(rules)
Bases: BaseRuleEngine[HybridRuleBook, Specification, Relevancy]
Rule engine that handles both int-based (MetricSpec) and float-based (FloatMetricSpec) metrics.
This engine can evaluate rules that combine generation metrics (int 0-4) and retrieval metrics (float 0.0-1.0) in a single rule book.
Initialize the HybridRuleEngine.
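A sketch of wiring the two classes together, using only the constructor shapes shown above. The import path is an assumption, and `good_spec` / `bad_spec` are placeholders standing in for `Specification` objects built from `MetricSpec` (int 0-4 generation metrics) and `FloatMetricSpec` (0.0-1.0 retrieval metrics), whose constructors are not documented on this page.

```python
# Import path is an assumption; adjust to your package layout.
from your_package.rules import HybridRuleBook, HybridRuleEngine  # hypothetical path

# Placeholder Specification objects: build these with your package's spec
# constructors, combining int-scored generation metrics and float-scored
# retrieval metrics.
good_spec = ...  # e.g. "groundedness is high AND context recall is high"
bad_spec = ...   # e.g. "groundedness is low OR context recall is low"

rule_book = HybridRuleBook(
    good=good_spec,   # the "good" classification rule
    bad=bad_spec,     # the "bad" classification rule
    issue_rules={},   # Mapping[Issue, Specification] for issue detection
)

# The engine classifies combined generation (int) and retrieval (float)
# metric results against the hybrid rule book.
rule_engine = HybridRuleEngine(rules=rule_book)
```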
RAGEvaluator(retrieval_evaluator=None, generation_evaluator=None, retrieval_metrics=None, generation_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, rule_engine=None, judge=None, refusal_metric=None)
Bases: BaseEvaluator
Evaluator for RAG pipelines combining retrieval and generation evaluation.
This evaluator:

- Runs retrieval evaluation using LMBasedRetrievalEvaluator
- Runs generation evaluation using GEvalGenerationEvaluator
- Combines their scores using a customizable rule-based scheme to produce:
  - relevancy_rating (good / bad / incomplete)
  - score (aggregated RAG score)
  - possible_issues (list of textual issues)
Important Note on Rule Engine:
By default, this evaluator uses GenerationRuleEngine with RuleBook, which works with generation metrics (int scores 0-4). To include retrieval metrics in rule-based classification, use HybridRuleBook with HybridRuleEngine, which supports both MetricSpec (int-based) and FloatMetricSpec (float-based) metrics.
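To make retrieval metrics participate in classification, pass a hybrid rule book and engine (such as the ones from the sketch above) into the constructor. This is a sketch under the same assumptions; imports and credentials are placeholders.

```python
# Reuses `rule_book` and `rule_engine` from the HybridRuleEngine sketch above.
evaluator = RAGEvaluator(
    model="gpt-4o-mini",         # assumed model identifier
    model_credentials="sk-...",  # placeholder credentials
    rule_book=rule_book,         # HybridRuleBook: int and float metrics in one rule set
    rule_engine=rule_engine,     # HybridRuleEngine instead of the default GenerationRuleEngine
)
```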
Default expected input
- query (str): The query to evaluate.
- expected_response (str): The expected response.
- retrieved_context (str | list[str]): The retrieved contexts.
- generated_response (str): The generated response.
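An example payload matching the default expected input. The field names come from the list above; the evaluation entry point itself is inherited from BaseEvaluator and not documented on this page, so the dict is only shown being constructed.

```python
# Field names follow the default expected input documented above.
sample = {
    "query": "What is the capital of France?",
    "expected_response": "Paris is the capital of France.",
    "retrieved_context": [
        "Paris is the capital and most populous city of France.",
        "France is a country in Western Europe.",
    ],
    "generated_response": "The capital of France is Paris.",
}
# Pass `sample` to the evaluator's evaluation entry point (defined on BaseEvaluator).
```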
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | The name of the evaluator. |
| `retrieval_evaluator` | `LMBasedRetrievalEvaluator` | The retrieval evaluator. |
| `generation_evaluator` | `GEvalGenerationEvaluator` | The generation evaluator. |
| `rule_book` | `RuleBook \| HybridRuleBook \| None` | The rule book for evaluation (uses generation metrics by default). Use `HybridRuleBook` to include retrieval metrics. |
| `rule_engine` | `GenerationRuleEngine \| HybridRuleEngine \| None` | The rule engine for classification. Uses `GenerationRuleEngine` by default. |
| `run_parallel` | `bool` | Whether to run retrieval and generation evaluations in parallel. |
Initialize the RAG evaluator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `retrieval_evaluator` | `LMBasedRetrievalEvaluator \| None` | Pre-configured retrieval evaluator. If provided, it is used directly and the `retrieval_*` parameters are ignored. Defaults to None. | `None` |
| `generation_evaluator` | `GEvalGenerationEvaluator \| None` | Pre-configured generation evaluator. If provided, it is used directly and the `generation_*` parameters are ignored. Defaults to None. | `None` |
| `retrieval_metrics` | `Sequence[BaseMetric] \| None` | Optional custom retrieval metric instances. Used only if `retrieval_evaluator` is None. If provided, they are used as the base pool and may override the default metrics by name. Defaults to None. | `None` |
| `generation_metrics` | `Sequence[BaseMetric] \| None` | Optional custom generation metric instances. Used only if `generation_evaluator` is None. If provided, they are used as the base pool and may override the default metrics by name. Defaults to None. | `None` |
| `model` | `str \| ModelId \| BaseLMInvoker` | Model for the default metrics. Used only if the evaluators are None. Defaults to `DefaultValues.MODEL`. | `DefaultValues.MODEL` |
| `model_credentials` | `str \| None` | Credentials for the model, required when the chosen model needs them. Defaults to None. | `None` |
| `model_config` | `dict[str, Any] \| None` | Optional model configuration. Used only if the evaluators are None. Defaults to None. | `None` |
| `run_parallel` | `bool` | Whether to run retrieval and generation evaluations in parallel. Used only if the evaluators are None. Defaults to True. | `True` |
| `rule_book` | `RuleBook \| HybridRuleBook \| None` | The rule book for evaluation. If not provided, a default one is generated based on the enabled generation metrics. Use `HybridRuleBook` to include retrieval metrics. Defaults to None. | `None` |
| `rule_engine` | `GenerationRuleEngine \| HybridRuleEngine \| None` | The rule engine for classification. If not provided, a new instance is created with the determined rule book. Use `HybridRuleEngine` with a `HybridRuleBook`. Defaults to None. | `None` |
| `judge` | `MultipleLLMAsJudge \| None` | Optional multiple-LLM judge for ensemble evaluation. Used only if `generation_evaluator` is None. Defaults to None. | `None` |
| `refusal_metric` | `type[BaseMetric] \| None` | The refusal metric to use for the generation evaluator. Used only if `generation_evaluator` is None. If None, the default refusal metric is used. Defaults to None. | `None` |
required_fields
property
Return the union of required fields from both evaluators.
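For instance, you can inspect the combined requirements before building a dataset (a sketch assuming the `evaluator` instance from the examples above).

```python
# The property merges the required fields of the retrieval and generation evaluators,
# so every dataset row must provide each listed field.
print(evaluator.required_fields)
# Per the default expected input above, this typically covers:
# query, expected_response, retrieved_context, generated_response
```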