
RAG Evaluator.

This evaluator combines retrieval and generation evaluation for RAG pipelines using:
  • LMBasedRetrievalEvaluator for retrieval quality
  • GEvalGenerationEvaluator for generation quality

It applies a customizable rule-based combiner over both retrieval and generation metrics to derive an overall RAG rating and explanation. By default, it uses the rule base from the generation evaluator.

HybridRuleBook(good, bad, issue_rules) dataclass

Rule book that supports both MetricSpec (int) and FloatMetricSpec (float).

This allows combining generation metrics (int-based) and retrieval metrics (float-based) in a single rule book.

Attributes:
  • good (Specification): The good rule (can be MetricSpec, FloatMetricSpec, or a combination).
  • bad (Specification): The bad rule (can be MetricSpec, FloatMetricSpec, or a combination).
  • issue_rules (Mapping[Issue, Specification]): Issue detection rules.

HybridRuleEngine(rules)

Bases: BaseRuleEngine[HybridRuleBook, Specification, Relevancy]

Rule engine that handles both int-based (MetricSpec) and float-based (FloatMetricSpec) metrics.

This engine can evaluate rules that combine generation metrics (int 0-4) and retrieval metrics (float 0.0-1.0) in a single rule book.

Initialize the HybridRuleEngine.

RAGEvaluator(retrieval_evaluator=None, generation_evaluator=None, retrieval_metrics=None, generation_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, rule_engine=None, judge=None, refusal_metric=None)

Bases: BaseEvaluator

Evaluator for RAG pipelines combining retrieval and generation evaluation.

This evaluator:
  • Runs retrieval evaluation using LMBasedRetrievalEvaluator
  • Runs generation evaluation using GEvalGenerationEvaluator
  • Combines their scores using a customizable rule-based scheme to produce:
    • relevancy_rating (good / bad / incomplete)
    • score (aggregated RAG score)
    • possible_issues (list of textual issues)

Important Note on Rule Engine: By default, this evaluator uses GenerationRuleEngine with RuleBook which works with generation metrics (int scores 0-4). To include retrieval metrics in rule-based classification, use HybridRuleBook with HybridRuleEngine, which supports both MetricSpec (int-based) and FloatMetricSpec (float-based) metrics.

Default expected input:
  • query (str): The query to evaluate.
  • expected_response (str): The expected response.
  • retrieved_context (str | list[str]): The retrieved contexts.
  • generated_response (str): The generated response.
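An input record matching these default fields might look like the following (the values are illustrative; only the field names come from the documentation above):

```python
# An input record with the evaluator's default expected fields.
# retrieved_context may be a single string or a list of strings.
sample = {
    "query": "What is the capital of France?",
    "expected_response": "Paris is the capital of France.",
    "retrieved_context": [
        "Paris is the capital and largest city of France.",
        "France is a country in Western Europe.",
    ],
    "generated_response": "The capital of France is Paris.",
}
```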

Attributes:
  • name (str): The name of the evaluator.
  • retrieval_evaluator (LMBasedRetrievalEvaluator): The retrieval evaluator.
  • generation_evaluator (GEvalGenerationEvaluator): The generation evaluator.
  • rule_book (RuleBook | HybridRuleBook | None): The rule book for evaluation (uses generation metrics by default). Use HybridRuleBook to combine both generation (int) and retrieval (float) metrics.
  • rule_engine (GenerationRuleEngine | HybridRuleEngine | None): The rule engine for classification. Uses GenerationRuleEngine for RuleBook or HybridRuleEngine for HybridRuleBook.
  • run_parallel (bool): Whether to run retrieval and generation evaluations in parallel.

Initialize the RAG evaluator.

Parameters:
  • retrieval_evaluator (LMBasedRetrievalEvaluator | None): Pre-configured retrieval evaluator. If provided, it is used directly and the retrieval_* parameters are ignored. Defaults to None.
  • generation_evaluator (GEvalGenerationEvaluator | None): Pre-configured generation evaluator. If provided, it is used directly and the generation_* parameters are ignored. Defaults to None.
  • retrieval_metrics (Sequence[BaseMetric] | None): Optional custom retrieval metric instances. Used only if retrieval_evaluator is None. If provided, these form the base pool and may override the default metrics by name. Defaults to None.
  • generation_metrics (Sequence[BaseMetric] | None): Optional custom generation metric instances. Used only if generation_evaluator is None. If provided, these form the base pool and may override the default metrics by name. Defaults to None.
  • model (str | ModelId | BaseLMInvoker): Model for the default metrics. Used only if evaluators are None. Defaults to DefaultValues.MODEL.
  • model_credentials (str | None): Credentials for the model, required when model is a string. Used only if evaluators are None. Defaults to None.
  • model_config (dict[str, Any] | None): Optional model configuration. Used only if evaluators are None. Defaults to None.
  • run_parallel (bool): Whether to run retrieval and generation evaluations in parallel. Used only if evaluators are None. Defaults to True.
  • rule_book (RuleBook | HybridRuleBook | None): The rule book for evaluation. If not provided, a default one is generated based on enabled generation metrics. Use RuleBook for generation-only metrics (int-based) or HybridRuleBook to combine both generation (int) and retrieval (float) metrics. Defaults to None.
  • rule_engine (GenerationRuleEngine | HybridRuleEngine | None): The rule engine for classification. If not provided, a new instance is created with the determined rule book. Use GenerationRuleEngine for RuleBook or HybridRuleEngine for HybridRuleBook. Defaults to None.
  • judge (MultipleLLMAsJudge | None): Optional multiple-LLM judge for ensemble evaluation. Used only if generation_evaluator is None. Defaults to None.
  • refusal_metric (type[BaseMetric] | None): The refusal metric to use for the generation evaluator. Used only if generation_evaluator is None. If None, the default refusal metric is used. Defaults to None.

required_fields property

Return the union of required fields from both evaluators.
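The union behaviour can be sketched with stand-in classes (these stubs and their field names are illustrative, not the library's implementation): each sub-evaluator exposes its own required fields, and the RAG evaluator simply takes the set union.

```python
# Stand-in evaluators exposing required_fields; the RAG evaluator's
# required_fields property is the union of both sets.
class StubRetrievalEvaluator:
    required_fields = {"query", "retrieved_context"}


class StubGenerationEvaluator:
    required_fields = {"query", "expected_response", "generated_response"}


class StubRAGEvaluator:
    def __init__(self, retrieval, generation):
        self.retrieval_evaluator = retrieval
        self.generation_evaluator = generation

    @property
    def required_fields(self) -> set[str]:
        # Fields shared by both evaluators (e.g. query) appear only once.
        return (self.retrieval_evaluator.required_fields
                | self.generation_evaluator.required_fields)


rag = StubRAGEvaluator(StubRetrievalEvaluator(), StubGenerationEvaluator())
```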