Utils
Utility modules for gllm-retrieval.
Fuseable
Bases: Protocol
Protocol for objects that can be fused using rank fusion algorithms.
Objects must have an id attribute for deduplication and an optional
score attribute that can be set with the fusion score.
id
property
Unique identifier for deduplication.
score
property
writable
Optional score attribute that can be set.
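A minimal sketch of what a conforming object might look like. `FuseableSketch` is a hypothetical local stand-in for the protocol described above, not the library's own definition, and `DemoChunk`, its `content` field, and the `str` type for `id` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class FuseableSketch(Protocol):
    """Hypothetical stand-in: an id for deduplication plus a settable score."""

    @property
    def id(self) -> str: ...

    @property
    def score(self) -> Optional[float]: ...

    @score.setter
    def score(self, value: Optional[float]) -> None: ...


@dataclass
class DemoChunk:
    """Illustrative document type; plain attributes satisfy the protocol."""

    id: str
    content: str
    score: Optional[float] = None


chunk = DemoChunk(id="doc-1", content="hello")
chunk.score = 0.5  # the score is writable, so a fusion routine could set it
is_fuseable = isinstance(chunk, FuseableSketch)  # structural check on attribute presence
```

Because the check is structural, any class exposing `id` and a writable `score` qualifies without inheriting from the protocol.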
format_sql_query(query)
Format the SQL query to ensure it is correctly structured.
Removes Markdown code-block fences from the SQL query and trims leading and trailing whitespace.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | The SQL query output from the language model. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | The formatted SQL query. |
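The behavior described above can be approximated as follows. This is an illustrative re-implementation, not the library's code: `strip_sql_fences` is a hypothetical name, and the sketch assumes the model wraps the query in a single multi-line fenced block with an optional language tag.

```python
import re

# Three backticks, built indirectly so this example nests cleanly inside Markdown.
FENCE = "`" * 3

# Matches a whole string of the form: fence + optional language tag, newline, body, fence.
_FENCED = re.compile(
    r"^" + FENCE + r"[A-Za-z]*[ \t]*\n(.*?)\n?" + FENCE + r"$",
    re.DOTALL,
)


def strip_sql_fences(query: str) -> str:
    """Return the query without a surrounding code fence or outer whitespace."""
    text = query.strip()
    match = _FENCED.match(text)
    if match:
        text = match.group(1)  # keep only the fenced body
    return text.strip()


raw_output = "\n" + FENCE + "sql\nSELECT id FROM users;\n" + FENCE + "\n"
cleaned = strip_sql_fences(raw_output)
```

Unfenced input passes through with only its outer whitespace trimmed.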
validate_query(query, dialect='postgres')
Validate that the given string is a valid SQL statement, using sqlglot.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | The SQL query to be validated. | required |
| `dialect` | `str` | The SQL dialect to be used for validation. Defaults to "postgres". | `'postgres'` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the query is not a valid SQL statement. |
weighted_reciprocal_rank(doc_lists, weights, rank_constant=60, set_scores=False)
Perform weighted Reciprocal Rank Fusion on multiple rank lists.
This function implements the Weighted Reciprocal Rank Fusion (RRF) algorithm, which combines multiple ranked document lists into a single ranked list. RRF is particularly effective for combining results from different retrieval strategies (e.g., filtered search and semantic search, or multiple retrievers).
The RRF score for each document is calculated as:
score = sum(weight_i / (rank_i + k)) for each list i
where rank_i is the document's rank in list i (1-based), and k is the rank constant.
Examples:

    from gllm_retrieval.utils.rank_fusion import weighted_reciprocal_rank

    filtered_results = [chunk1, chunk2, chunk3]  # Ranked by entity filtering
    semantic_results = [chunk2, chunk1, chunk4]  # Ranked by semantic similarity

    fused = weighted_reciprocal_rank(
        [filtered_results, semantic_results],
        weights=[0.2, 0.8],
        rank_constant=60
    )
    # Returns chunks ordered by combined RRF scores

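To make the formula concrete, the scores for the example above can be worked out by hand. The sketch below applies the stated formula directly to string IDs standing in for `chunk1` through `chunk4`; it does not call the library, and the variable names are illustrative.

```python
from collections import defaultdict

rank_constant = 60  # k in the formula

# (ranked IDs, weight) pairs mirroring the example: filtered 0.2, semantic 0.8
ranked_ids = [
    (["chunk1", "chunk2", "chunk3"], 0.2),
    (["chunk2", "chunk1", "chunk4"], 0.8),
]

scores = defaultdict(float)
for ids, weight in ranked_ids:
    for rank, doc_id in enumerate(ids, start=1):  # ranks are 1-based
        scores[doc_id] += weight / (rank + rank_constant)

order = sorted(scores, key=scores.get, reverse=True)
# chunk2: 0.2/62 + 0.8/61 ≈ 0.01634
# chunk1: 0.2/61 + 0.8/62 ≈ 0.01618
# chunk4: 0.8/63          ≈ 0.01270
# chunk3: 0.2/63          ≈ 0.00317
```

`chunk2` edges out `chunk1` because its first-place rank comes from the more heavily weighted list.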
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `doc_lists` | `list[list[FuseableT]]` | Ranked lists of Fuseable objects to merge. Must match the length of | required |
| `weights` | `list[float]` | Weights for each rank list. Higher weights give more importance to that retrieval source. | required |
| `rank_constant` | `int` | The rank constant (k) that controls the influence of rank position. Higher values reduce the impact of rank differences. Defaults to 60. | `60` |
| `set_scores` | `bool` | If True, sets the | `False` |
Returns:
| Type | Description |
|---|---|
| `list[FuseableT]` | The aggregated list of unique documents, sorted by weighted RRF score in descending order (highest first). |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the number of rank lists does not match the number of weights. |
Note
- Documents are deduplicated by their `id` field
- The `rank_constant` parameter controls the influence of rank position
- Higher `rank_constant` values reduce the impact of rank differences
- The result does not depend on the order of `doc_lists`, provided `weights` is reordered to match
- Lists may overlap (same ID in multiple lists); duplicates are merged
- Empty inner lists are handled gracefully
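The notes above can be checked against a small from-scratch sketch. `weighted_rrf_sketch` and `DemoChunk` are hypothetical names, not the library's implementation. Keeping the first-seen object for a duplicated ID, and having `set_scores` write the fused score to `score`, are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DemoChunk:
    id: str
    score: Optional[float] = None


def weighted_rrf_sketch(
    doc_lists: List[List[DemoChunk]],
    weights: List[float],
    rank_constant: int = 60,
    set_scores: bool = False,
) -> List[DemoChunk]:
    if len(doc_lists) != len(weights):
        raise ValueError("doc_lists and weights must have the same length")
    scores: Dict[str, float] = {}
    first_seen: Dict[str, DemoChunk] = {}
    for docs, weight in zip(doc_lists, weights):
        for rank, doc in enumerate(docs, start=1):  # empty lists add nothing
            scores[doc.id] = scores.get(doc.id, 0.0) + weight / (rank + rank_constant)
            first_seen.setdefault(doc.id, doc)  # assumption: keep first object per ID
    ordered_ids = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    fused = [first_seen[doc_id] for doc_id in ordered_ids]
    if set_scores:
        for doc in fused:
            doc.score = scores[doc.id]  # assumption: score receives the fused value
    return fused


c1, c2, c3, c4 = (DemoChunk(f"chunk{i}") for i in range(1, 5))
filtered, semantic = [c1, c2, c3], [c2, c1, c4]

fused_ids = [d.id for d in weighted_rrf_sketch([filtered, semantic], [0.2, 0.8], set_scores=True)]
# Swapping the lists together with their weights leaves the ranking unchanged.
swapped_ids = [d.id for d in weighted_rrf_sketch([semantic, filtered], [0.8, 0.2])]
# An empty inner list contributes no scores and no documents.
with_empty_ids = [d.id for d in weighted_rrf_sketch([filtered, []], [0.5, 0.5])]
```

Ties between equal scores fall back to first-seen order here, because Python's sort is stable; the library may resolve ties differently.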