DeepEval GEval
DeepEval GEval Metric Integration.
DeepEvalGEvalMetric(name=None, evaluation_params=None, model=DefaultValues.MODEL, criteria=None, evaluation_steps=None, rubric=None, model_credentials=None, model_config=None, threshold=0.5, additional_context=None, batch_status_check_interval=DefaultValues.BATCH_STATUS_CHECK_INTERVAL, batch_max_iterations=DefaultValues.BATCH_MAX_ITERATIONS)
Bases: DeepEvalMetricFactory, PromptExtractionMixin
This class wraps DeepEval's GEval class and provides a unified interface for the DeepEval library.
GEval is a versatile evaluation metric framework that can be used to create custom evaluation metrics.
Available Fields
- query (str, optional): The query to evaluate the metric against. Similar to `input` in `LLMTestCaseParams`.
- generated_response (str | list[str], optional): The generated response to evaluate. Similar to `actual_output` in `LLMTestCaseParams`. If the generated response is a list, the responses are concatenated into a single string.
- expected_response (str | list[str], optional): The expected response to evaluate against. Similar to `expected_output` in `LLMTestCaseParams`. If the expected response is a list, the responses are concatenated into a single string.
- expected_retrieved_context (str | list[str], optional): The expected retrieved context to evaluate the metric. Similar to `context` in `LLMTestCaseParams`. If the expected retrieved context is a str, it will be converted to a list with a single element.
- retrieved_context (str | list[str], optional): The list of retrieved contexts to evaluate the metric. Similar to `retrieval_context` in `LLMTestCaseParams`. If the retrieved context is a str, it will be converted to a list with a single element.
Scoring
- 0.0-1.0 (continuous), or boolean, depending on the DeepEval GEval configuration.
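Several of the fields above are normalized before evaluation: list-valued responses are concatenated into a single string, and string-valued contexts are wrapped in a single-element list. A minimal sketch of that normalization, assuming a newline join separator (the separator and function names are illustrative, not the library's actual implementation):

```python
def normalize_response(value):
    """Concatenate list responses into a single string, as described for
    generated_response and expected_response. The newline separator is
    an assumption."""
    if isinstance(value, list):
        return "\n".join(value)
    return value


def normalize_context(value):
    """Wrap a plain string into a single-element list, as described for
    retrieved_context and expected_retrieved_context."""
    if isinstance(value, str):
        return [value]
    return value
```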
Initializes the DeepEvalGEvalMetric class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str \| None | The name of the metric. Defaults to None. Required if not provided via _defaults. | None |
| evaluation_params | list[LLMTestCaseParams] \| None | The evaluation parameters. Defaults to None. Required if not provided via _defaults. | None |
| model | str \| ModelId \| BaseLMInvoker | The model to use for the metric. Defaults to "openai/gpt-4.1". | MODEL |
| criteria | str \| None | The criteria to use for the metric. Defaults to None. | None |
| evaluation_steps | list[str] \| None | The evaluation steps to use for the metric. Defaults to None. | None |
| rubric | list[Rubric] \| None | The rubric to use for the metric. Defaults to None. | None |
| model_credentials | str \| None | The model credentials to use for the metric. Defaults to None. Required when model is a string. | None |
| model_config | dict[str, Any] \| None | The model config to use for the metric. Defaults to None. | None |
| threshold | float | The threshold to use for the metric. Defaults to 0.5. | 0.5 |
| additional_context | str \| None | Additional context such as few-shot examples. Defaults to None. | None |
| batch_status_check_interval | float | Time between batch status checks, in seconds. Defaults to 30.0. | BATCH_STATUS_CHECK_INTERVAL |
| batch_max_iterations | int | Maximum number of status check iterations before timeout. Defaults to 120. | BATCH_MAX_ITERATIONS |
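The batch_status_check_interval and batch_max_iterations parameters together bound how long batch evaluation waits before timing out (with the defaults, 120 checks at 30 seconds each, i.e. up to an hour). A minimal sketch of such a polling loop, assuming a hypothetical poll_status callable; the library's internal loop may differ:

```python
import time


def wait_for_batch(poll_status, interval=30.0, max_iterations=120):
    """Poll poll_status() until it reports completion, sleeping `interval`
    seconds between checks, and give up after `max_iterations` checks.
    The "completed" status string is an assumption for illustration."""
    for _ in range(max_iterations):
        if poll_status() == "completed":
            return True
        time.sleep(interval)
    return False  # timed out after roughly interval * max_iterations seconds
```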
evaluate(data)
async
Evaluate with custom prompt lifecycle support.
Overrides BaseMetric.evaluate() to add custom prompt application and state management. This ensures custom prompts are applied before evaluation and state is restored after.
For batch processing, checks for custom prompts and processes accordingly. Currently processes items individually; batch optimization may be added in the future.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput \| list[MetricInput] | Single data item or list of data items to evaluate. | required |
Returns:
| Type | Description |
|---|---|
| MetricOutput \| list[MetricOutput] | Evaluation results with scores namespaced by metric name. |
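"Namespaced by metric name" means each score is keyed under the metric's own name so results from several metrics can be merged without collisions. A hypothetical sketch of that shape; the actual MetricOutput structure may differ:

```python
def namespace_scores(metric_name, score, reason=None):
    # Hypothetical output shape: keying results by metric name lets outputs
    # from multiple metrics be merged into one dict without collisions.
    return {metric_name: {"score": score, "reason": reason}}
```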
get_custom_prompt_base_name()
Get the base name for custom prompt column lookup.
For GEval metrics, removes the 'geval_' prefix to align with CSV column conventions. This fixes Issue #3 by providing polymorphic naming for GEval metrics.
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The base name without the 'geval_' prefix (e.g., "completeness" instead of "geval_completeness"). |
Example
    metric.name = "geval_completeness"
    metric.get_custom_prompt_base_name() -> "completeness"

CSV columns expected:

- fewshot_completeness
- fewshot_completeness_mode
- evaluation_step_completeness
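The prefix stripping and the resulting CSV column names can be sketched as follows (a standalone illustration of the documented behavior, not the library's source; the expected_csv_columns helper is hypothetical):

```python
GEVAL_PREFIX = "geval_"


def get_custom_prompt_base_name(metric_name):
    """Strip the 'geval_' prefix so the name matches CSV column conventions."""
    if metric_name.startswith(GEVAL_PREFIX):
        return metric_name[len(GEVAL_PREFIX):]
    return metric_name


def expected_csv_columns(metric_name):
    """Derive the custom-prompt CSV column names from the base name."""
    base = get_custom_prompt_base_name(metric_name)
    return [f"fewshot_{base}", f"fewshot_{base}_mode", f"evaluation_step_{base}"]
```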
get_full_prompt(data)
Get the full prompt that DeepEval generates for this metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput | The metric input. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The complete prompt (system + user) as a string. |
GllmGEvalTemplate
Bases: GEvalTemplate
GEval template variant with reason before score.
generate_evaluation_results(evaluation_steps, test_case_content, parameters, rubric=None, score_range=(1, 3), _additional_context=None, multimodal=False)
staticmethod
Generate evaluation prompt with reason listed before score.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| evaluation_steps | str | Numbered evaluation steps used to judge the response. | required |
| test_case_content | str | Rendered test case content included in the prompt. | required |
| parameters | str | Evaluation parameter names referenced by the evaluator. | required |
| rubric | str \| None | Formatted rubric text to include in the prompt. Defaults to None. | None |
| score_range | tuple[int, int] | Inclusive score range for the evaluator. Defaults to (1, 3). | (1, 3) |
| _additional_context | str \| None | Additional context such as few-shot examples. Defaults to None. | None |
| multimodal | bool | Whether to include multimodal evaluation rules. Defaults to False. | False |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Full evaluation prompt string. |
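The distinguishing feature of this template is that the evaluator is asked to produce the reason before the score. A simplified sketch of such a prompt builder, with the same signature but illustrative wording (this is not DeepEval's actual template text):

```python
def generate_evaluation_results(evaluation_steps, test_case_content, parameters,
                                rubric=None, score_range=(1, 3),
                                _additional_context=None, multimodal=False):
    """Build an evaluation prompt that requests the reason before the score."""
    lo, hi = score_range
    parts = [
        "Evaluation Steps:",
        evaluation_steps,
        "",
        test_case_content,
        "",
        f"Parameters: {parameters}",
    ]
    if rubric:
        parts += ["", "Rubric:", rubric]
    if _additional_context:
        parts += ["", "Additional Context:", _additional_context]
    if multimodal:
        parts += ["", "Apply the evaluation steps to images as well as text."]
    # Reason is requested FIRST, then the score -- the variant this template exists for.
    parts += ["", ('Return JSON with "reason" first, then "score": '
                   f'{{"reason": "...", "score": <integer from {lo} to {hi}>}}')]
    return "\n".join(parts)
```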
MetricDefaults(name='', criteria=None, evaluation_steps=None, rubric=None, evaluation_params=None, additional_context=None)
dataclass
Metric defaults for DeepEval GEval.
PromptExtractionMixin
Mixin class that provides get_full_prompt functionality for metrics.
This mixin provides a standard interface for metrics that support prompt extraction. Metrics that inherit from this mixin should implement the get_full_prompt method.
get_full_prompt(data)
Get the full prompt that the metric generates.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | MetricInput | The metric input. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | The complete prompt as a string. |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | If the metric doesn't support prompt extraction. |
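The mixin contract described above can be sketched as follows: the base raises NotImplementedError, and subclasses override get_full_prompt. The EchoMetric subclass and its prompt wording are hypothetical:

```python
class PromptExtractionMixin:
    """Sketch of the mixin: metrics that support prompt extraction override
    get_full_prompt; otherwise the base implementation raises."""

    def get_full_prompt(self, data):
        raise NotImplementedError(
            f"{type(self).__name__} does not support prompt extraction."
        )


class EchoMetric(PromptExtractionMixin):
    # Hypothetical subclass demonstrating the override.
    def get_full_prompt(self, data):
        return f"Evaluate the following input:\n{data}"
```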