Multiple LLM as Judge Implementation
This module provides a class for using multiple LLM models as judges to evaluate generation tasks in parallel and aggregate their results using ensemble methods.
MultipleLLMAsJudge(judge_models, ensemble_method=EnsembleMethod.MEDIAN, weights=None, run_parallel=True)
Orchestrates multiple LLM judges for parallel evaluation and ensemble aggregation.
This class configures multiple judge models, runs parallel evaluations using provided evaluator instances, and aggregates results using ensemble methods.
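To make the aggregation step concrete, here is a minimal, illustrative sketch of how median and weighted-mean aggregation over per-judge scores could work. The `aggregate` function and method names are hypothetical; the library's actual `EnsembleCalculator` may differ.

```python
from statistics import median

# Hypothetical sketch of ensemble aggregation over per-judge scores.
# Not the library's actual EnsembleCalculator implementation.
def aggregate(scores, method="median", weights=None):
    if method == "median":
        return median(scores)
    if method == "weighted_mean":
        # Weights need not sum to 1.0, so normalize by their total.
        total = sum(weights)
        return sum(s * w for s, w in zip(scores, weights)) / total
    raise ValueError(f"Unknown method: {method}")
```

The median is robust to a single outlier judge, which is why it is a common default for LLM-judge ensembles.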
Attributes:

| Name | Type | Description |
|---|---|---|
| judge_models | List[Dict[str, Any]] | List of judge model configurations. |
| ensemble_calculator | EnsembleCalculator | Calculator for ensemble statistics. |
| run_parallel | bool | Whether to run judges in parallel. |
Initialize the MultipleLLMAsJudge.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| judge_models | List[Dict[str, Any]] | List of judge model configurations. Each dict should contain: 'provider_model_id' (str), the model identifier (e.g., 'gpt-4o', 'claude-3-5-sonnet'); 'model_credentials' (str), the API credentials for the model; 'model_config' (Dict[str, Any], optional), additional model configuration. | required |
| ensemble_method | EnsembleMethod | Method for aggregating judge results. | MEDIAN |
| weights | Optional[List[float]] | Weights for each judge when using weighted methods. Any positive values are allowed (no requirement to sum to 1.0); must have the same length as the number of judges. None means equal weights for all judges. | None |
| run_parallel | bool | Whether to run judges in parallel. | True |
Raises:

| Type | Description |
|---|---|
| ValueError | If judge_models is empty or invalid, or if weights are invalid. |
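The constructor's documented constraints can be summarized with a small, self-contained sketch. The `validate` helper below is hypothetical (the real class performs its checks internally), but it mirrors the conditions under which a ValueError is described as being raised.

```python
from typing import Any, Dict, List, Optional

# Hypothetical helper mirroring the documented constructor validation;
# the real MultipleLLMAsJudge performs equivalent checks internally.
def validate(judge_models: List[Dict[str, Any]],
             weights: Optional[List[float]] = None) -> None:
    if not judge_models:
        raise ValueError("judge_models must not be empty")
    for cfg in judge_models:
        # Each config needs the two documented required keys.
        if "provider_model_id" not in cfg or "model_credentials" not in cfg:
            raise ValueError(
                "each judge config needs 'provider_model_id' "
                "and 'model_credentials'"
            )
    if weights is not None:
        if len(weights) != len(judge_models):
            raise ValueError("weights must match the number of judges")
        if any(w <= 0 for w in weights):
            raise ValueError("weights must be positive")
```

For example, `validate([{"provider_model_id": "gpt-4o", "model_credentials": "KEY"}], [2.0])` passes, while an empty `judge_models` list or a mismatched `weights` length raises ValueError.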
async evaluate(evaluator_class, data, **evaluator_kwargs)
Run all judges in parallel and aggregate results.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| evaluator_class | type | The evaluator class to instantiate for each judge. | required |
| data | MetricInput | The data to evaluate. | required |
| **evaluator_kwargs | | Additional keyword arguments to pass to the evaluator constructor. | {} |
Returns:

| Name | Type | Description |
|---|---|---|
| EvaluationOutput | EvaluationOutput | Ensemble evaluation result with individual judge results. |
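The parallel fan-out that `evaluate` describes can be sketched with `asyncio.gather`: one coroutine per judge, run concurrently, then aggregated. The `run_judges` and `fake_judge` names below are illustrative stand-ins, not the library's API.

```python
import asyncio
from statistics import median

# Illustrative sketch of parallel judge execution; stand-in for the
# library's evaluate(), which wraps real model calls per judge.
async def run_judges(judge_fns, run_parallel=True):
    if run_parallel:
        # Fan out all judge calls concurrently; gather preserves order.
        scores = await asyncio.gather(*(fn() for fn in judge_fns))
    else:
        # Sequential fallback, mirroring run_parallel=False.
        scores = [await fn() for fn in judge_fns]
    return median(scores), list(scores)

async def fake_judge(score):
    await asyncio.sleep(0)  # stands in for a real model call
    return score

ensemble, individual = asyncio.run(
    run_judges([lambda s=s: fake_judge(s) for s in (0.6, 0.9, 0.7)])
)
```

Because `asyncio.gather` returns results in submission order, each judge's individual score can be reported alongside the ensemble result, matching the documented return value.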