
Judge

Judge module for multiple LLM evaluation and ensemble calculation.

This module provides utilities for orchestrating multiple LLM judges and calculating ensemble statistics from their evaluation results.

EnsembleCalculator(ensemble_method=EnsembleMethod.MEDIAN, weights=None)

Calculates ensemble statistics from multiple LLM judge results.

This class handles the aggregation of evaluation results from multiple LLM judges using various ensemble methods and provides statistical measures of judge agreement.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `ensemble_method` | `EnsembleMethod` | The method to use for aggregating scores. |
| `weights` | `Optional[List[float]]` | Optional weights for weighted ensemble methods. |

Initialize the EnsembleCalculator.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `ensemble_method` | `EnsembleMethod` | The ensemble method to use. | `EnsembleMethod.MEDIAN` |
| `weights` | `Optional[List[float]]` | Weights for each judge. If `None`, each judge gets an equal weight of 1.0. | `None` |
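
A minimal construction sketch. The import path (`judge`) is an assumption, since this page does not name the package:

```python
# Assumed import path -- this page does not name the package.
from judge import EnsembleCalculator, EnsembleMethod

# Defaults: median aggregation, equal weight (1.0) per judge.
calculator = EnsembleCalculator()

# Rounded-average aggregation with explicit per-judge weights.
weighted_calculator = EnsembleCalculator(
    ensemble_method=EnsembleMethod.AVERAGE_ROUNDED,
    weights=[2.0, 1.0, 1.0],  # one weight per judge; need not sum to 1.0
)
```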

calculate_ensemble_result(judge_results)

Calculate ensemble result from multiple judge evaluations.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `judge_results` | `List[Dict[str, Any]]` | List of evaluation results from each judge. Each result should contain either a `'relevancy_rating'` key (GenerationEvaluator) with categorical values, or a `'score'` key (QTEvaluator) with numeric values. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Dict[str, Any]` | Ensemble result containing aggregated scores and statistics. |
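
A usage sketch continuing the `calculator` from above, with numeric results in the QTEvaluator shape. The exact keys of the returned statistics dict are not documented here, so the result is simply printed:

```python
# Numeric results in the QTEvaluator shape ('score' key per judge).
judge_results = [
    {"score": 4},  # judge 1
    {"score": 5},  # judge 2
    {"score": 3},  # judge 3
]

# With the default MEDIAN method the aggregated score is 4.
ensemble = calculator.calculate_ensemble_result(judge_results)
print(ensemble)  # Dict[str, Any] of aggregated scores and statistics
```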

EnsembleMethod

Bases: StrEnum

Enumeration of ensemble methods for aggregating judge results.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `MEDIAN` | `str` | Use the median of all judge scores. |
| `AVERAGE_ROUNDED` | `str` | Use the rounded average of all judge scores. |
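
To illustrate how the two methods can disagree, here is the equivalent arithmetic in plain Python. This is an illustration only, not the library's internals; note that Python's built-in `round()` uses banker's rounding, and the library's exact rounding rule is not specified on this page:

```python
from statistics import median

scores = [1, 4, 5]

print(median(scores))                    # MEDIAN          -> 4
print(round(sum(scores) / len(scores)))  # AVERAGE_ROUNDED -> 3 (10/3 ≈ 3.33)
```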

MultipleLLMAsJudge(judge_models, ensemble_method=EnsembleMethod.MEDIAN, weights=None, run_parallel=True)

Orchestrates multiple LLM judges for parallel evaluation and ensemble aggregation.

This class configures multiple judge models, runs their evaluations (in parallel by default) by instantiating the provided evaluator class for each judge, and aggregates the results using ensemble methods.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `judge_models` | `List[Dict[str, Any]]` | List of judge model configurations. |
| `ensemble_calculator` | `EnsembleCalculator` | Calculator for ensemble statistics. |
| `run_parallel` | `bool` | Whether to run judges in parallel. |

Initialize the MultipleLLMAsJudge.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `judge_models` | `List[Dict[str, Any]]` | List of judge model configurations. Each dict should contain `'provider_model_id'` (str), a model identifier (e.g. `'gpt-4o'`, `'claude-3-5-sonnet'`); `'model_credentials'` (str), API credentials for the model; and optionally `'model_config'` (`Dict[str, Any]`), additional model configuration. | *required* |
| `ensemble_method` | `EnsembleMethod` | Method for aggregating judge results. | `EnsembleMethod.MEDIAN` |
| `weights` | `Optional[List[float]]` | Weights for each judge when using weighted methods. Weights may be any positive values (they need not sum to 1.0) but must match the number of judges. If `None`, all judges are weighted equally. | `None` |
| `run_parallel` | `bool` | Whether to run judges in parallel. | `True` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `judge_models` is empty or invalid, or if `weights` are invalid. |
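
A construction sketch based on the parameters above. The import path and the use of environment variables for credentials are assumptions, and the `model_config` contents are illustrative:

```python
import os

# Assumed import path -- this page does not name the package.
from judge import MultipleLLMAsJudge, EnsembleMethod

judges = MultipleLLMAsJudge(
    judge_models=[
        {
            "provider_model_id": "gpt-4o",
            "model_credentials": os.environ["OPENAI_API_KEY"],
            "model_config": {"temperature": 0.0},  # optional, illustrative
        },
        {
            "provider_model_id": "claude-3-5-sonnet",
            "model_credentials": os.environ["ANTHROPIC_API_KEY"],
        },
    ],
    ensemble_method=EnsembleMethod.MEDIAN,
    weights=None,       # None -> equal weights for all judges
    run_parallel=True,  # run all judges concurrently
)
```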

evaluate(evaluator_class, data, **evaluator_kwargs) async

Run all judges in parallel and aggregate results.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `evaluator_class` | `type` | The evaluator class to instantiate for each judge. | *required* |
| `data` | `MetricInput` | The data to evaluate. | *required* |
| `**evaluator_kwargs` | | Additional keyword arguments to pass to the evaluator constructor. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `EvaluationOutput` | Ensemble evaluation result with individual judge results. |
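
A usage sketch continuing the `judges` instance from above. `QTEvaluator` and `MetricInput` are referenced elsewhere on this page, but their import paths and constructor arguments below are assumptions:

```python
import asyncio

# Assumed import paths -- adjust to the actual package layout.
from judge import MetricInput
from judge.evaluators import QTEvaluator

async def main() -> None:
    data = MetricInput(...)  # placeholder: fields are not documented here

    # Each judge gets its own QTEvaluator instance; results are then
    # aggregated with the configured ensemble method.
    result = await judges.evaluate(QTEvaluator, data)
    print(result)  # EvaluationOutput, including individual judge results

asyncio.run(main())
```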