Ensemble calculator

Ensemble Calculator for Multiple LLM Judges.

This module provides utilities for aggregating results from multiple LLM judges using ensemble methods and calculating statistical measures of agreement.

Authors

Apri Dwi Rachmadi (apri.d.rachmadi@gdplabs.id)

EnsembleCalculator(ensemble_method=EnsembleMethod.MEDIAN, weights=None)

Calculates ensemble statistics from multiple LLM judge results.

This class handles the aggregation of evaluation results from multiple LLM judges using various ensemble methods and provides statistical measures of judge agreement.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| ensemble_method | EnsembleMethod | The method to use for aggregating scores. |
| weights | Optional[List[float]] | Optional weights for weighted ensemble methods. |

Initialize the EnsembleCalculator.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| ensemble_method | EnsembleMethod | The ensemble method to use. | MEDIAN |
| weights | Optional[List[float]] | Weights for each judge. If None, every judge gets an equal weight of 1.0. | None |
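The default-weight behaviour described above can be illustrated with a minimal sketch. The helper names `resolve_weights` and `weighted_average` are hypothetical and are not part of this module. The length check and the weighted-mean formula are also assumptions, since the documentation does not say how the weights enter the calculation.

```python
from typing import List, Optional


def resolve_weights(num_judges: int, weights: Optional[List[float]] = None) -> List[float]:
    """Return one weight per judge, falling back to 1.0 each when weights is None."""
    if weights is None:
        return [1.0] * num_judges
    # Assumption: a weight list must match the number of judges.
    if len(weights) != num_judges:
        raise ValueError("expected one weight per judge")
    return list(weights)


def weighted_average(scores: List[float], weights: Optional[List[float]] = None) -> float:
    """Weighted mean of judge scores (assumed formula, for illustration only)."""
    resolved = resolve_weights(len(scores), weights)
    total = sum(resolved)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(s * w for s, w in zip(scores, resolved)) / total
```

With equal weights this reduces to the plain mean, so passing `weights=None` and passing `[1.0, 1.0, 1.0]` for three judges give the same result.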

calculate_ensemble_result(judge_results)

Calculate ensemble result from multiple judge evaluations.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| judge_results | List[Dict[str, Any]] | List of evaluation results, one per judge. Each result should contain either a 'relevancy_rating' key (GenerationEvaluator) with categorical values or a 'score' key (QTEvaluator) with numeric values. | required |

Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | Ensemble result containing aggregated scores and statistics. |
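To make the input and output shapes concrete, here is a rough sketch of aggregating numeric 'score' results with the median. It does not call the real class: the function `sketch_ensemble_result` and every key in the returned dictionary (`ensemble_score`, `individual_scores`, `std_dev`, `num_judges`) are hypothetical, because the actual result keys are not documented here. The population standard deviation stands in for an agreement measure as an assumption.

```python
from statistics import median, pstdev
from typing import Any, Dict, List


def sketch_ensemble_result(judge_results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate numeric 'score' values from several judges (illustrative only)."""
    scores = [float(result["score"]) for result in judge_results]
    if not scores:
        raise ValueError("judge_results must not be empty")
    return {
        # Hypothetical keys; the real return structure may differ.
        "ensemble_score": median(scores),
        "individual_scores": scores,
        "std_dev": pstdev(scores),  # 0.0 means all judges agree exactly
        "num_judges": len(scores),
    }


result = sketch_ensemble_result([{"score": 3}, {"score": 4}, {"score": 4}])
```

Categorical 'relevancy_rating' values would first need a mapping to numbers. That mapping is not specified in this page, so it is left out of the sketch.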

EnsembleMethod

Bases: StrEnum

Enumeration of ensemble methods for aggregating judge results.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| MEDIAN | str | Use the median of all judge scores. |
| AVERAGE_ROUNDED | str | Use the rounded average of all judge scores. |
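The difference between the two methods can be sketched with the standard library. The `Method` enum below is a hypothetical stand-in for `EnsembleMethod` (its string values are assumed), and using Python's built-in `round` for AVERAGE_ROUNDED is also an assumption; the real rounding rule is not documented here.

```python
from enum import Enum
from statistics import median
from typing import List


class Method(str, Enum):
    """Hypothetical stand-in for EnsembleMethod; the string values are assumed."""

    MEDIAN = "median"
    AVERAGE_ROUNDED = "average_rounded"


def aggregate(scores: List[float], method: Method) -> float:
    """Combine judge scores with the chosen method (illustrative only)."""
    if not scores:
        raise ValueError("scores must not be empty")
    if method is Method.MEDIAN:
        return median(scores)
    if method is Method.AVERAGE_ROUNDED:
        # Built-in round() rounds exact halves to the nearest even integer.
        return round(sum(scores) / len(scores))
    raise ValueError(f"unsupported method: {method}")
```

For the scores 1, 2 and 5, the median is 2 while the rounded average is 3, which shows that the median is less sensitive to a single outlying judge.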