
RAG Evaluator

RAG Evaluator.

This evaluator combines retrieval and generation evaluation for RAG pipelines using:
  • LMBasedRetrievalEvaluator for retrieval quality
  • GEvalGenerationEvaluator for generation quality

It applies a customizable rule-based combiner over both retrieval and generation metrics to derive an overall RAG rating and explanation. By default, it uses the rule book from the generation evaluator.

Authors

Christina Alexandra (christina.alexandra@gdplabs.id)

References

NONE

HybridRuleBook(good, bad, issue_rules) dataclass

Rule book that supports both MetricSpec (int) and FloatMetricSpec (float).

This allows combining generation metrics (int-based) and retrieval metrics (float-based) in a single rule book.

Attributes:

good (Specification): The good rule (can be MetricSpec, FloatMetricSpec, or combination).

bad (Specification): The bad rule (can be MetricSpec, FloatMetricSpec, or combination).

issue_rules (Mapping[Issue, Specification]): Issue detection rules.

HybridRuleEngine(rules)

Bases: BaseRuleEngine[HybridRuleBook, Specification, Relevancy]

Rule engine that handles both int-based (MetricSpec) and float-based (FloatMetricSpec) metrics.

This engine can evaluate rules that combine generation metrics (int 0-4) and retrieval metrics (float 0.0-1.0) in a single rule book.

Initialize the HybridRuleEngine.
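
The construction API for specs and issues is not shown on this page, so the following is only a minimal sketch: the import path, the MetricSpec / FloatMetricSpec constructor arguments, the metric names, and the Issue value are all assumptions used for illustration, not the confirmed API.

```python
# Sketch only. The import path, constructor signatures, metric names, and the
# Issue value below are assumptions for illustration, not the confirmed API.
from rag_evaluation import (  # hypothetical module path
    FloatMetricSpec,
    HybridRuleBook,
    HybridRuleEngine,
    Issue,
    MetricSpec,
)

rule_book = HybridRuleBook(
    # Generation metric: int score 0-4 (threshold keyword is assumed).
    good=MetricSpec("correctness", min_score=3),
    # Retrieval metric: float score 0.0-1.0 (threshold keyword is assumed).
    bad=FloatMetricSpec("context_recall", max_score=0.3),
    issue_rules={
        # Hypothetical issue flagged when retrieval quality is low.
        Issue("weak_retrieval"): FloatMetricSpec("context_recall", max_score=0.5),
    },
)

# HybridRuleEngine takes the rule book as its `rules` argument (per the signature above).
engine = HybridRuleEngine(rule_book)
```

Per the attribute descriptions above, good and bad may also be combinations of specs; the combinator syntax is not documented here, so single specs are shown.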

RAGEvaluator(retrieval_evaluator=None, generation_evaluator=None, retrieval_metrics=None, generation_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, rule_engine=None, judge=None, refusal_metric=None)

Bases: BaseEvaluator

Evaluator for RAG pipelines combining retrieval and generation evaluation.

This evaluator:
  • Runs retrieval evaluation using LMBasedRetrievalEvaluator
  • Runs generation evaluation using GEvalGenerationEvaluator
  • Combines their scores using a customizable rule-based scheme to produce:
    • relevancy_rating (good / bad / incomplete)
    • score (aggregated RAG score)
    • possible_issues (list of textual issues)

Important Note on Rule Engine: By default, this evaluator uses GenerationRuleEngine with RuleBook, which works with generation metrics (int scores 0-4). To include retrieval metrics in rule-based classification, use HybridRuleBook with HybridRuleEngine, which supports both MetricSpec (int-based) and FloatMetricSpec (float-based) metrics.

Default expected input
  • query (str): The query to evaluate.
  • expected_response (str): The expected response.
  • retrieved_context (str | list[str]): The retrieved contexts.
  • generated_response (str): The generated response.
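
To make the field list concrete, a single evaluation record might look like the sketch below; the field names come directly from the list above, while representing a record as a plain dict is an assumption about how inputs are passed.

```python
# Field names follow the default expected input above; the dict shape itself
# is an assumption about how a single evaluation record is represented.
sample_input = {
    "query": "What is the capital of France?",
    "expected_response": "Paris is the capital of France.",
    "retrieved_context": [
        "Paris is the capital and most populous city of France.",
        "France is a country in Western Europe.",
    ],
    "generated_response": "The capital of France is Paris.",
}
```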

Attributes:

name (str): The name of the evaluator.

retrieval_evaluator (LMBasedRetrievalEvaluator): The retrieval evaluator.

generation_evaluator (GEvalGenerationEvaluator): The generation evaluator.

rule_book (RuleBook | HybridRuleBook | None): The rule book for evaluation (uses generation metrics by default). Use HybridRuleBook to combine both generation (int) and retrieval (float) metrics.

rule_engine (GenerationRuleEngine | HybridRuleEngine | None): The rule engine for classification. Uses GenerationRuleEngine for RuleBook or HybridRuleEngine for HybridRuleBook.

run_parallel (bool): Whether to run retrieval and generation evaluations in parallel.

Initialize the RAG evaluator.

Parameters:

retrieval_evaluator (LMBasedRetrievalEvaluator | None): Pre-configured retrieval evaluator. If provided, this will be used directly and retrieval_* parameters will be ignored. Defaults to None.

generation_evaluator (GEvalGenerationEvaluator | None): Pre-configured generation evaluator. If provided, this will be used directly and generation_* parameters will be ignored. Defaults to None.

retrieval_metrics (Sequence[BaseMetric] | None): Optional custom retrieval metric instances. Used only if retrieval_evaluator is None. If provided, these will be used as the base pool and may override the default metrics by name. Defaults to None.

generation_metrics (Sequence[BaseMetric] | None): Optional custom generation metric instances. Used only if generation_evaluator is None. If provided, these will be used as the base pool and may override the default metrics by name. Defaults to None.

model (str | ModelId | BaseLMInvoker): Model for the default metrics. Used only if evaluators are None. Defaults to DefaultValues.MODEL.

model_credentials (str | None): Credentials for the model, required when model is a string. Used only if evaluators are None. Defaults to None.

model_config (dict[str, Any] | None): Optional model configuration. Used only if evaluators are None. Defaults to None.

run_parallel (bool): Whether to run retrieval and generation evaluations in parallel. Used only if evaluators are None. Defaults to True.

rule_book (RuleBook | HybridRuleBook | None): The rule book for evaluation. If not provided, a default one is generated based on enabled generation metrics. Use RuleBook for generation-only metrics (int-based) or HybridRuleBook to combine both generation (int) and retrieval (float) metrics. Defaults to None.

rule_engine (GenerationRuleEngine | HybridRuleEngine | None): The rule engine for classification. If not provided, a new instance is created with the determined rule book. Use GenerationRuleEngine for RuleBook or HybridRuleEngine for HybridRuleBook. Defaults to None.

judge (MultipleLLMAsJudge | None): Optional multiple LLM judge for ensemble evaluation. Used only if generation_evaluator is None. Defaults to None.

refusal_metric (type[BaseMetric] | None): The refusal metric to use for the generation evaluator. Used only if generation_evaluator is None. If None, the default refusal metric will be used. Defaults to None.

required_fields property

Return the union of required fields from both evaluators.
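
Putting the parameters above together, a minimal construction might look like the sketch below; the import path and model identifier are placeholders, and only parameters documented above are used.

```python
# The import path and the model identifier are placeholders; keyword arguments
# mirror the parameters documented above.
from rag_evaluation import RAGEvaluator  # hypothetical module path

evaluator = RAGEvaluator(
    model="openai/gpt-4o-mini",      # placeholder model id; a ModelId or BaseLMInvoker also works
    model_credentials="<api-key>",   # required when `model` is given as a string
    run_parallel=True,               # evaluate retrieval and generation concurrently
)

# The union of required input fields from the retrieval and generation evaluators.
print(evaluator.required_fields)
```

How a record such as the one shown under "Default expected input" is then submitted for evaluation is defined by BaseEvaluator and is not covered on this page.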