Utils

Utility modules for gllm-retrieval.

Fuseable

Bases: Protocol

Protocol for objects that can be fused using rank fusion algorithms.

Objects must have an id attribute for deduplication and an optional score attribute that can be set with the fusion score.

id (property)

Unique identifier for deduplication.

score (property, writable)

Optional score attribute that can be set.
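As a rough illustration of what satisfying this protocol could look like, the sketch below defines a local stand-in protocol and a small dataclass that conforms to it. `FuseableLike` and `Chunk` are hypothetical names invented for this example, and the `str` and `Optional[float]` types are assumptions; the real `Fuseable` protocol may declare different types.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class FuseableLike(Protocol):
    """Local stand-in for the documented protocol (hypothetical, not the library class)."""

    @property
    def id(self) -> str: ...  # assumed to be a string; used for deduplication

    @property
    def score(self) -> Optional[float]: ...

    @score.setter
    def score(self, value: Optional[float]) -> None: ...


@dataclass
class Chunk:
    """Hypothetical document type: a plain dataclass already has the required shape."""

    id: str
    text: str
    score: Optional[float] = None  # writable, so a fusion score can be stored on it


chunk = Chunk(id="doc-1", text="hello")
chunk.score = 0.5  # a fusion routine could assign its score the same way
conforms = isinstance(chunk, FuseableLike)  # structural check: attributes are present
```

Because the check is structural, any class with an `id` and a settable `score` qualifies; no inheritance is needed.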

format_sql_query(query)

Format the SQL query to ensure it is correctly structured.

Removes Markdown code-block markers from the SQL query and trims any leading or trailing whitespace.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| query | str | The SQL query output from the language model. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| str | str | The formatted SQL query. |
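The cleanup described above can be approximated as follows. This is a minimal sketch of the idea, not the library's implementation: `strip_sql_fence` is a hypothetical name, and the exact fence patterns that `format_sql_query` recognizes are not documented here.

```python
import re

FENCE = "`" * 3  # three backticks, built programmatically to keep this example readable

# Matches an optional language tag after the opening fence and captures the body.
_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_sql_fence(text: str) -> str:
    """Return the text without a surrounding code fence, trimmed of whitespace."""
    match = _FENCE_RE.match(text)
    body = match.group(1) if match else text  # unfenced input passes through
    return body.strip()


fenced = FENCE + "sql\nSELECT id FROM users;\n" + FENCE
cleaned = strip_sql_fence(fenced)
plain = strip_sql_fence("   SELECT 1   ")
```

Input without a fence is only trimmed, so the helper is safe to apply to every model response.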

validate_query(query, dialect='postgres')

Validates that the given string is a valid SQL statement, using sqlglot.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| query | str | The SQL query to be validated. | required |
| dialect | str | The SQL dialect to be used for validation. | 'postgres' |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the query is not a valid SQL statement. |
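Since failure is signalled with `ValueError`, callers will usually wrap the call in a `try`/`except`. The sketch below shows one possible pattern. `checked_sql` and `toy_validator` are hypothetical names; `toy_validator` is a deliberately naive stand-in so the example runs without sqlglot, and in real code `validate_query` would be passed in its place.

```python
from typing import Callable, Optional


def checked_sql(
    query: str,
    validator: Callable[[str, str], None],
    dialect: str = "postgres",
) -> Optional[str]:
    """Return the query if the validator accepts it, otherwise None."""
    try:
        validator(query, dialect)
    except ValueError:
        return None  # caller can retry generation, log the failure, etc.
    return query


def toy_validator(query: str, dialect: str) -> None:
    """Hypothetical stand-in for validate_query; NOT a real SQL parser."""
    if not query.strip().upper().startswith(("SELECT", "WITH")):
        raise ValueError(f"not a valid {dialect} statement: {query!r}")


accepted = checked_sql("SELECT 1", toy_validator)
rejected = checked_sql("please list all users", toy_validator)
```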

weighted_reciprocal_rank(doc_lists, weights, rank_constant=60, set_scores=False)

Perform weighted Reciprocal Rank Fusion on multiple rank lists.

This function implements the Weighted Reciprocal Rank Fusion (RRF) algorithm, which combines multiple ranked document lists into a single ranked list. RRF is particularly effective for combining results from different retrieval strategies (e.g., filtered search and semantic search, or multiple retrievers).

The RRF score for each document is calculated as:

score = sum(weight_i / (rank_i + k)) over every list i that contains the document

where weight_i is the weight of list i, rank_i is the document's rank in list i (1-based), and k is the rank constant.
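To make the formula concrete, the sketch below evaluates it on plain lists of ids. `rrf_scores` is a hypothetical helper written for this illustration and is not part of gllm-retrieval; the string ids simply stand in for documents.

```python
from collections import defaultdict


def rrf_scores(rank_lists, weights, k=60):
    """Accumulate weight / (rank + k) per id, with ranks starting at 1."""
    scores = defaultdict(float)
    for ids, weight in zip(rank_lists, weights):
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] += weight / (rank + k)
    return dict(scores)


scores = rrf_scores(
    [["chunk1", "chunk2", "chunk3"], ["chunk2", "chunk1", "chunk4"]],
    weights=[0.2, 0.8],
)
# chunk1: 0.2/61 + 0.8/62, chunk2: 0.2/62 + 0.8/61, chunk3: 0.2/63, chunk4: 0.8/63
ranking = sorted(scores, key=scores.get, reverse=True)
# ranking is ["chunk2", "chunk1", "chunk4", "chunk3"]
```

With the heavier weight on the second list, chunk2 (first there) narrowly overtakes chunk1, and chunk4 outranks chunk3 even though both sit in third place in their respective lists.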

Examples:

from gllm_retrieval.utils.rank_fusion import weighted_reciprocal_rank

# chunk1 ... chunk4 are placeholders for Fuseable objects created elsewhere
filtered_results = [chunk1, chunk2, chunk3]  # Ranked by entity filtering
semantic_results = [chunk2, chunk1, chunk4]  # Ranked by semantic similarity
fused = weighted_reciprocal_rank(
    [filtered_results, semantic_results],
    weights=[0.2, 0.8],
    rank_constant=60
)
# Returns chunks ordered by combined RRF scores

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| doc_lists | list[list[FuseableT]] | Ranked lists of Fuseable objects to merge. Must match the length of weights. | required |
| weights | list[float] | Weights for each rank list. Higher weights give more importance to that retrieval source. | required |
| rank_constant | int | The rank constant (k) that controls the influence of rank position. Higher values reduce the impact of rank differences. | 60 |
| set_scores | bool | If True, sets the score attribute on each document to its RRF score. | False |

Returns:

| Type | Description |
| --- | --- |
| list[FuseableT] | The final aggregated list of unique documents sorted by their weighted RRF scores in descending order. Documents with higher scores appear first. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the number of rank lists doesn't match the number of weights. |

Note
  1. Documents are deduplicated by their id field
  2. The rank_constant parameter controls the influence of rank position
  3. Higher rank_constant values reduce the impact of rank differences
  4. The result does not depend on the order of doc_lists, provided weights is reordered to match
  5. Lists may overlap (same ID in multiple lists); duplicates are merged
  6. Empty inner lists are handled gracefully
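The behaviours listed in the note can be checked against a small stand-in implementation. The sketch below is an approximation written from this page's description, not the library's source: `fuse_sketch` and `Doc` are hypothetical names, and details such as tie-breaking are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    """Hypothetical document type with the id and score attributes Fuseable expects."""

    id: str
    score: Optional[float] = None


def fuse_sketch(doc_lists, weights, rank_constant=60, set_scores=False):
    """Approximate weighted RRF: dedupe by id, sum weight / (rank + k), sort descending."""
    if len(doc_lists) != len(weights):
        raise ValueError("doc_lists and weights must have the same length")
    scores, first_seen = {}, {}
    for docs, weight in zip(doc_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc.id] = scores.get(doc.id, 0.0) + weight / (rank + rank_constant)
            first_seen.setdefault(doc.id, doc)  # keep one object per id
    ordered = sorted(first_seen.values(), key=lambda d: scores[d.id], reverse=True)
    if set_scores:
        for doc in ordered:
            doc.score = scores[doc.id]
    return ordered


a, b, c, d = Doc("a"), Doc("b"), Doc("c"), Doc("d")
filtered, semantic = [a, b, c], [b, a, d]

# Overlapping lists plus an empty list: duplicates merge, the empty list adds nothing.
fused = fuse_sketch([filtered, semantic, []], [0.2, 0.8, 1.0], set_scores=True)
fused_ids = [doc.id for doc in fused]

# Swapping the lists together with their weights gives the same ranking.
swapped_ids = [doc.id for doc in fuse_sketch([semantic, filtered], [0.8, 0.2])]
```

Here `fused_ids` and `swapped_ids` are both `["b", "a", "d", "c"]`, and each fused document carries its score because `set_scores=True` was passed.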