LM-based retrieval evaluator

This evaluator focuses on retrieval quality for RAG-style pipelines using:
  • DeepEval contextual precision
  • DeepEval contextual recall

It applies a simple rule-based combiner over precision and recall to derive an overall retrieval rating and a global explanation.
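
As a rough illustration of how the combiner behaves (a minimal sketch only: the 0.7 cut-off and the exact mapping to ratings are assumptions, and the real classification is driven by the configurable rule book described below):

    def combine(precision: float, recall: float, threshold: float = 0.7) -> str:
        """Illustrative only: both metrics high -> "good", both low -> "bad",
        a mixed result -> "incomplete"."""
        if precision >= threshold and recall >= threshold:
            return "good"
        if precision < threshold and recall < threshold:
            return "bad"
        return "incomplete"

    combine(0.9, 0.8)  # "good"
    combine(0.9, 0.3)  # "incomplete"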

Authors

Christina Alexandra (christina.alexandra@gdplabs.id)

References

NONE

FloatMetricSpec(metric, comparator, threshold)

Bases: Specification

Atomic specification for float-based metrics such as precision >= 0.7.

Attributes:
  • metric (str): The metric to evaluate.
  • threshold (float): The threshold for the metric.

Initialize the FloatMetricSpec.

Parameters:
  • metric (str): The metric to evaluate. Required.
  • comparator (str): The comparator to use. Required.
  • threshold (float): The threshold for the metric. Required.

__repr__()

Get the representation of the FloatMetricSpec.

Returns:
  • str: The representation of the FloatMetricSpec.

is_satisfied_by(candidate)

Check if the candidate satisfies the metric specification.

Parameters:
  • candidate (Mapping[str, float]): The candidate to check. Required.

Returns:
  • bool: True if the candidate satisfies the metric specification, False otherwise.
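
A hedged usage sketch for FloatMetricSpec. The import path, the metric key "contextual_precision", and the ">=" comparator string are assumptions based on the "precision >= 0.7" example above, not values confirmed by this reference:

    # Hypothetical import path; adjust to wherever FloatMetricSpec lives in your package.
    from lm_based_retrieval_evaluator import FloatMetricSpec

    spec = FloatMetricSpec(metric="contextual_precision", comparator=">=", threshold=0.7)
    repr(spec)                                             # human-readable form via __repr__
    spec.is_satisfied_by({"contextual_precision": 0.82})   # True
    spec.is_satisfied_by({"contextual_precision": 0.55})   # False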

LMBasedRetrievalEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, rule_engine=None)

Bases: BaseEvaluator

Evaluator for LM-based retrieval quality in RAG pipelines.

This evaluator:
  • Runs a configurable set of retrieval metrics (by default: DeepEval contextual precision and contextual recall)
  • Combines their scores using a simple rule-based scheme to produce:
    • relevancy_rating (good / bad / incomplete)
    • score (aggregated retrieval score)
    • possible_issues (list of textual issues)
Default expected input:
  • query (str): The query to evaluate the metric.
  • expected_response (str): The expected response to evaluate the metric.
  • retrieved_context (str | list[str]): The retrieved context(s) to evaluate the metric. A single str is converted into a list with one element.
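
For example, a single evaluation record in this default shape (the field values are made up for illustration; the field names come from the list above):

    sample = {
        "query": "What is the capital of France?",
        "expected_response": "Paris is the capital of France.",
        "retrieved_context": [
            "Paris is the capital and most populous city of France.",
            "France is a country in Western Europe.",
        ],
    }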

Attributes:
  • name (str): The name of the evaluator.
  • metrics (list[BaseMetric]): The list of metrics to evaluate.
  • enabled_metrics (Sequence[type[BaseMetric] | str] | None): The list of metrics to enable.
  • model (str | ModelId | BaseLMInvoker): The model to use for the metrics.
  • model_credentials (str | None): The model credentials to use for the metrics.
  • model_config (dict[str, Any] | None): The model configuration to use for the metrics.
  • run_parallel (bool): Whether to run the metrics in parallel.
  • rule_book (LMBasedRetrievalRuleBook | None): The rule book for evaluation.
  • rule_engine (LMBasedRetrievalRuleEngine | None): The rule engine for classification.

Initialize the LM-based retrieval evaluator.

Parameters:
  • metrics (Sequence[BaseMetric] | None): Optional custom retrieval metric instances. If provided, these are used as the base pool and may override the default metrics by name. Defaults to None.
  • enabled_metrics (Sequence[type[BaseMetric] | str] | None): Optional subset of metrics to enable from the metric pool. Each entry can be either a metric class or its name. If None, all metrics from the pool are used. Defaults to None.
  • model (str | ModelId | BaseLMInvoker): Model for the default DeepEval metrics. Defaults to DefaultValues.MODEL.
  • model_credentials (str | None): Credentials for the model, required when model is a string. Defaults to None.
  • model_config (dict[str, Any] | None): Optional model configuration. Defaults to None.
  • run_parallel (bool): Whether to run retrieval metrics in parallel. Defaults to True.
  • rule_book (LMBasedRetrievalRuleBook | None): The rule book for evaluation. If not provided, a default one is generated based on enabled metrics. Defaults to None.
  • rule_engine (LMBasedRetrievalRuleEngine | None): The rule engine for classification. If not provided, a new instance is created with the determined rule book. Defaults to None.
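
A minimal construction sketch. The import path and the model identifier are placeholders, not values confirmed by this reference:

    # Hypothetical import path and model id.
    from lm_based_retrieval_evaluator import LMBasedRetrievalEvaluator

    evaluator = LMBasedRetrievalEvaluator(
        model="openai/gpt-4o-mini",        # placeholder model id
        model_credentials="YOUR_API_KEY",  # required because model is given as a string
        run_parallel=True,
    )
    evaluator.required_fields  # union of fields required by the enabled metrics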

required_fields property

Return the union of required fields from all configured metrics.

LMBasedRetrievalRuleBook(good, bad) dataclass

Bundle of good and bad composite specs for LM-based retrieval.

Both are plain FloatMetricSpec objects, so you can build them with & / | / ~ or any other sutoppu combinator.

Attributes:
  • good (FloatMetricSpec): The good rule.
  • bad (FloatMetricSpec): The bad rule.
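
A sketch of a custom rule book built with sutoppu combinators. The metric names and the ">=" / "<" comparator strings are assumptions:

    precision_ok = FloatMetricSpec("contextual_precision", ">=", 0.8)
    recall_ok = FloatMetricSpec("contextual_recall", ">=", 0.8)
    precision_bad = FloatMetricSpec("contextual_precision", "<", 0.4)
    recall_bad = FloatMetricSpec("contextual_recall", "<", 0.4)

    custom_book = LMBasedRetrievalRuleBook(
        good=precision_ok & recall_ok,   # both metrics must clear the good threshold
        bad=precision_bad | recall_bad,  # either metric below the bad threshold
    )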

LMBasedRetrievalRuleEngine(rules)

Bases: BaseRuleEngine[LMBasedRetrievalRuleBook, FloatMetricSpec, str]

Classify metric dictionaries using an injected LMBasedRetrievalRuleBook.

Initialize the LMBasedRetrievalRuleEngine.
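
Continuing the sketch above, the rule book is injected through the engine's rules argument and both can then be handed to the evaluator:

    engine = LMBasedRetrievalRuleEngine(custom_book)
    evaluator = LMBasedRetrievalEvaluator(rule_book=custom_book, rule_engine=engine)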

RuleConfig(defaults, score_key, comparator, empty_fallback, combine_with_and) dataclass

Configuration for building a composite rule.

from_default_rule_book(enabled_metrics=None, metric_thresholds=None) staticmethod

Get the default rule engine.

Parameters:
  • enabled_metrics (list[str] | None): A list of metric names to include. If None, all are included. Defaults to None.
  • metric_thresholds (dict[str, dict[str, float]] | None): Optional per-metric threshold overrides. Keys are metric names and values are dictionaries with optional "good_score" and "bad_score" entries. Defaults to None.

Returns:
  • LMBasedRetrievalRuleEngine: The default rule engine.
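
For example (the metric names are assumptions; the threshold keys follow the parameter description above):

    engine = LMBasedRetrievalRuleEngine.from_default_rule_book(
        enabled_metrics=["contextual_precision", "contextual_recall"],
        metric_thresholds={
            "contextual_precision": {"good_score": 0.8, "bad_score": 0.3},
        },
    )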

get_adaptive_rule_book(available_metrics, metric_thresholds=None) staticmethod

Generate rule book based on available metrics.

Parameters:
  • available_metrics (list[str]): The list of available metrics. Required.
  • metric_thresholds (dict[str, dict[str, float]] | None): Optional per-metric threshold overrides. Keys are metric names and values are dictionaries with optional "good_score" and "bad_score" entries. Defaults to None. Note: this parameter is deprecated; prefer creating a custom rule_book directly.

Returns:
  • LMBasedRetrievalRuleBook: The adaptive rule book.
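
For example, restricting the rule book to the metric that is actually configured (the metric name is an assumption):

    book = LMBasedRetrievalRuleEngine.get_adaptive_rule_book(
        available_metrics=["contextual_precision"],
    )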

get_default_bad_rule(enabled_metrics=None, metric_thresholds=None) staticmethod

Get the bad rule.

Parameters:
  • enabled_metrics (list[str] | None): A list of metric names to include. If None, all are included. Defaults to None.
  • metric_thresholds (dict[str, dict[str, float]] | None): Optional per-metric threshold overrides. Keys are metric names and values are dictionaries with optional "bad_score" entries. Defaults to None.

Returns:
  • FloatMetricSpec: The bad rule.

get_default_good_rule(enabled_metrics=None, metric_thresholds=None) staticmethod

Get the good rule.

Parameters:
  • enabled_metrics (list[str] | None): A list of metric names to include. If None, all are included. Defaults to None.
  • metric_thresholds (dict[str, dict[str, float]] | None): Optional per-metric threshold overrides. Keys are metric names and values are dictionaries with optional "good_score" entries. Defaults to None.

Returns:
  • FloatMetricSpec: The good rule.
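
The two default rules can also be fetched separately and recombined into a rule book; the metric name and threshold below are assumptions:

    good = LMBasedRetrievalRuleEngine.get_default_good_rule(
        metric_thresholds={"contextual_recall": {"good_score": 0.9}},
    )
    bad = LMBasedRetrievalRuleEngine.get_default_bad_rule()
    book = LMBasedRetrievalRuleBook(good=good, bad=bad)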

get_default_rule_book(enabled_metrics=None, metric_thresholds=None) staticmethod

Get the default rule book.

Parameters:
  • enabled_metrics (list[str] | None): A list of metric names to include. If None, all are included. Defaults to None.
  • metric_thresholds (dict[str, dict[str, float]] | None): Optional per-metric threshold overrides. Keys are metric names and values are dictionaries with optional "good_score" and "bad_score" entries. Defaults to None. Note: this parameter is deprecated; prefer creating a custom rule_book directly.

Returns:
  • LMBasedRetrievalRuleBook: The default rule book.
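
For example, building the default rule book and injecting it into a fresh engine:

    book = LMBasedRetrievalRuleEngine.get_default_rule_book()
    engine = LMBasedRetrievalRuleEngine(book)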