LM-based retrieval evaluator

This evaluator focuses on retrieval quality for RAG-style pipelines using:
  • DeepEval contextual precision
  • DeepEval contextual recall

It applies a simple rule-based combiner over precision and recall to derive an overall retrieval rating and a global explanation.
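
As a rough illustration of how the combiner behaves (a minimal sketch only: the 0.7 cut-off and the exact mapping to ratings are assumptions, and the real classification is driven by the configurable rule book described below):

    def combine(precision: float, recall: float, threshold: float = 0.7) -> str:
        """Illustrative only: both metrics high -> "good", both low -> "bad",
        a mixed result -> "incomplete"."""
        if precision >= threshold and recall >= threshold:
            return "good"
        if precision < threshold and recall < threshold:
            return "bad"
        return "incomplete"

    combine(0.9, 0.8)  # "good"
    combine(0.9, 0.3)  # "incomplete"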

Authors

Christina Alexandra (christina.alexandra@gdplabs.id)

References

NONE

FloatMetricSpec(metric, comparator, threshold)

Bases: Specification

Atomic specification for float-based metrics such as precision >= 0.7.

Attributes:
  • metric (str): The metric to evaluate.
  • threshold (float): The threshold for the metric.

Initialize the FloatMetricSpec.

Parameters:
  • metric (str): The metric to evaluate. Required.
  • comparator (str): The comparator to use. Required.
  • threshold (float): The threshold for the metric. Required.

__repr__()

Get the representation of the FloatMetricSpec.

Returns:
  • str: The representation of the FloatMetricSpec.

is_satisfied_by(candidate)

Check if the candidate satisfies the metric specification.

Parameters:
  • candidate (Mapping[str, float]): The candidate to check. Required.

Returns:
  • bool: True if the candidate satisfies the metric specification, False otherwise.
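
A hedged usage sketch for FloatMetricSpec. The import path, the metric key "contextual_precision", and the ">=" comparator string are assumptions based on the "precision >= 0.7" example above, not values confirmed by this reference:

    # Hypothetical import path; adjust to wherever FloatMetricSpec lives in your package.
    from lm_based_retrieval_evaluator import FloatMetricSpec

    spec = FloatMetricSpec(metric="contextual_precision", comparator=">=", threshold=0.7)
    repr(spec)                                             # human-readable form via __repr__
    spec.is_satisfied_by({"contextual_precision": 0.82})   # True
    spec.is_satisfied_by({"contextual_precision": 0.55})   # False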

LMBasedRetrievalEvaluator(metrics=None, enabled_metrics=None, model=DefaultValues.MODEL, model_credentials=None, model_config=None, run_parallel=True, rule_book=None, rule_engine=None)

Bases: BaseEvaluator

Evaluator for LM-based retrieval quality in RAG pipelines.

This evaluator:
  • Runs a configurable set of retrieval metrics (by default: DeepEval contextual precision and contextual recall)
  • Combines their scores using a simple rule-based scheme to produce:
    • relevancy_rating (good / bad / incomplete)
    • score (aggregated retrieval score)
    • possible_issues (list of textual issues)
Default expected input:
  • query (str): The query to evaluate the metric.
  • expected_response (str): The expected response to evaluate the metric.
  • retrieved_context (str | list[str]): The retrieved context(s) to evaluate the metric. A single str is converted into a list with one element.
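
For example, a single evaluation record in this default shape (the field values are made up for illustration; the field names come from the list above):

    sample = {
        "query": "What is the capital of France?",
        "expected_response": "Paris is the capital of France.",
        "retrieved_context": [
            "Paris is the capital and most populous city of France.",
            "France is a country in Western Europe.",
        ],
    }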

Attributes:
  • name (str): The name of the evaluator.
  • metrics (list[BaseMetric]): The list of metrics to evaluate.
  • enabled_metrics (Sequence[type[BaseMetric] | str] | None): The list of metrics to enable.
  • model (str | ModelId | BaseLMInvoker): The model to use for the metrics.
  • model_credentials (str | None): The model credentials to use for the metrics.
  • model_config (dict[str, Any] | None): The model configuration to use for the metrics.
  • run_parallel (bool): Whether to run the metrics in parallel.
  • rule_book (LMBasedRetrievalRuleBook | None): The rule book for evaluation.
  • rule_engine (LMBasedRetrievalRuleEngine | None): The rule engine for classification.

Initialize the LM-based retrieval evaluator.

Parameters:
  • metrics (Sequence[BaseMetric] | None): Optional custom retrieval metric instances. If provided, these are used as the base pool and may override the default metrics by name. Defaults to None.
  • enabled_metrics (Sequence[type[BaseMetric] | str] | None): Optional subset of metrics to enable from the metric pool. Each entry can be either a metric class or its name. If None, all metrics from the pool are used. Defaults to None.
  • model (str | ModelId | BaseLMInvoker): Model for the default DeepEval metrics. Defaults to DefaultValues.MODEL.
  • model_credentials (str | None): Credentials for the model, required when model is a string. Defaults to None.
  • model_config (dict[str, Any] | None): Optional model configuration. Defaults to None.
  • run_parallel (bool): Whether to run retrieval metrics in parallel. Defaults to True.
  • rule_book (LMBasedRetrievalRuleBook | None): The rule book for evaluation. If not provided, a default one is generated based on enabled metrics. Defaults to None.
  • rule_engine (LMBasedRetrievalRuleEngine | None): The rule engine for classification. If not provided, a new instance is created with the determined rule book. Defaults to None.
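
A minimal construction sketch. The import path and the model identifier are placeholders, not values confirmed by this reference:

    # Hypothetical import path and model id.
    from lm_based_retrieval_evaluator import LMBasedRetrievalEvaluator

    evaluator = LMBasedRetrievalEvaluator(
        model="openai/gpt-4o-mini",        # placeholder model id
        model_credentials="YOUR_API_KEY",  # required because model is given as a string
        run_parallel=True,
    )
    evaluator.required_fields  # union of fields required by the enabled metrics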

required_fields property

Return the union of required fields from all configured metrics.

LMBasedRetrievalRuleBook(good, bad) dataclass

Bundle of good and bad composite specs for LM-based retrieval.

Both are plain FloatMetricSpec objects, so you can build them with & / | / ~ or any other sutoppu combinator.

Attributes:
  • good (FloatMetricSpec): The good rule.
  • bad (FloatMetricSpec): The bad rule.
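
A sketch of a custom rule book built with sutoppu combinators. The metric names and the ">=" / "<" comparator strings are assumptions:

    precision_ok = FloatMetricSpec("contextual_precision", ">=", 0.8)
    recall_ok = FloatMetricSpec("contextual_recall", ">=", 0.8)
    precision_bad = FloatMetricSpec("contextual_precision", "<", 0.4)
    recall_bad = FloatMetricSpec("contextual_recall", "<", 0.4)

    custom_book = LMBasedRetrievalRuleBook(
        good=precision_ok & recall_ok,   # both metrics must clear the good threshold
        bad=precision_bad | recall_bad,  # either metric below the bad threshold
    )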

LMBasedRetrievalRuleEngine(rules)

Bases: BaseRuleEngine[LMBasedRetrievalRuleBook, FloatMetricSpec, str]

Classify metric dictionaries using an injected LMBasedRetrievalRuleBook.

Initialize the LMBasedRetrievalRuleEngine.
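
Continuing the sketch above, the rule book is injected through the engine's rules argument and both can then be handed to the evaluator:

    engine = LMBasedRetrievalRuleEngine(custom_book)
    evaluator = LMBasedRetrievalEvaluator(rule_book=custom_book, rule_engine=engine)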

RuleConfig(defaults, score_key, comparator, empty_fallback, combine_with_and) dataclass

Configuration for building a composite rule.

from_default_rule_book(enabled_metrics=None, metric_thresholds=None) staticmethod

Get the default rule engine.

Parameters:
  • enabled_metrics (list[str] | None): A list of metric names to include. If None, all are included. Defaults to None.
  • metric_thresholds (dict[str, dict[str, float]] | None): Optional per-metric threshold overrides. Keys are metric names and values are dictionaries with optional "good_score" and "bad_score" entries. Defaults to None.

Returns:
  • LMBasedRetrievalRuleEngine: The default rule engine.
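
For example (the metric names are assumptions; the threshold keys follow the parameter description above):

    engine = LMBasedRetrievalRuleEngine.from_default_rule_book(
        enabled_metrics=["contextual_precision", "contextual_recall"],
        metric_thresholds={
            "contextual_precision": {"good_score": 0.8, "bad_score": 0.3},
        },
    )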

get_adaptive_rule_book(available_metrics, metric_thresholds=None) staticmethod

Generate rule book based on available metrics.

Parameters:
  • available_metrics (list[str]): The list of available metrics. Required.
  • metric_thresholds (dict[str, dict[str, float]] | None): Optional per-metric threshold overrides. Keys are metric names and values are dictionaries with optional "good_score" and "bad_score" entries. Defaults to None. Note: this parameter is deprecated; prefer creating a custom rule_book directly.

Returns:
  • LMBasedRetrievalRuleBook: The adaptive rule book.
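
For example, restricting the rule book to the metric that is actually configured (the metric name is an assumption):

    book = LMBasedRetrievalRuleEngine.get_adaptive_rule_book(
        available_metrics=["contextual_precision"],
    )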

get_default_bad_rule(enabled_metrics=None, metric_thresholds=None) staticmethod

Get the bad rule.

Parameters:
  • enabled_metrics (list[str] | None): A list of metric names to include. If None, all are included. Defaults to None.
  • metric_thresholds (dict[str, dict[str, float]] | None): Optional per-metric threshold overrides. Keys are metric names and values are dictionaries with optional "bad_score" entries. Defaults to None.

Returns:
  • FloatMetricSpec: The bad rule.

get_default_good_rule(enabled_metrics=None, metric_thresholds=None) staticmethod

Get the good rule.

Parameters:
  • enabled_metrics (list[str] | None): A list of metric names to include. If None, all are included. Defaults to None.
  • metric_thresholds (dict[str, dict[str, float]] | None): Optional per-metric threshold overrides. Keys are metric names and values are dictionaries with optional "good_score" entries. Defaults to None.

Returns:
  • FloatMetricSpec: The good rule.
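
The two default rules can also be fetched separately and recombined into a rule book; the metric name and threshold below are assumptions:

    good = LMBasedRetrievalRuleEngine.get_default_good_rule(
        metric_thresholds={"contextual_recall": {"good_score": 0.9}},
    )
    bad = LMBasedRetrievalRuleEngine.get_default_bad_rule()
    book = LMBasedRetrievalRuleBook(good=good, bad=bad)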

get_default_rule_book(enabled_metrics=None, metric_thresholds=None) staticmethod

Get the default rule book.

Parameters:
  • enabled_metrics (list[str] | None): A list of metric names to include. If None, all are included. Defaults to None.
  • metric_thresholds (dict[str, dict[str, float]] | None): Optional per-metric threshold overrides. Keys are metric names and values are dictionaries with optional "good_score" and "bad_score" entries. Defaults to None. Note: this parameter is deprecated; prefer creating a custom rule_book directly.

Returns:
  • LMBasedRetrievalRuleBook: The default rule book.
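
For example, building the default rule book and injecting it into a fresh engine:

    book = LMBasedRetrievalRuleEngine.get_default_rule_book()
    engine = LMBasedRetrievalRuleEngine(book)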