Retriever
Module for retriever classes.
BM25Retriever(data_store)
Bases: BaseVectorRetriever
Retrieves documents from the Elasticsearch data store using BM25.
Examples:
retriever = BM25Retriever(data_store=elasticsearch_data_store)
results = await retriever.retrieve("search query", top_k=10)
Initialize the BM25Retriever.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_store | BaseVectorDataStore | The vector data store with BM25 capabilities. | required |
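For orientation, BM25 scores a document by summing, over the query terms, an inverse-document-frequency weight multiplied by a saturating term-frequency factor. The following is a minimal, self-contained sketch of the Okapi BM25 formula in plain Python. It does not show how BM25Retriever or Elasticsearch computes scores internally; the function name `bm25_scores`, the whitespace tokenizer, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    # Naive whitespace tokenization (an assumption; real analyzers do more).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # IDF variant with "+1" inside the log, which keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is length-normalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["the cat sat on the mat", "dogs chase cats", "machine learning with python"]
scores = bm25_scores("machine learning", docs)
best_index = max(range(len(docs)), key=scores.__getitem__)
```

Only the third document contains the query terms, so it is the only one with a non-zero score in this toy corpus.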
BasicSQLRetriever(sql_data_store, lm_request_processor=None, extract_func=None, max_retries=0, preprocess_query_func=None)
Bases: BaseSQLRetriever
Initializes a new instance of the BasicSQLRetriever class.
This class provides a straightforward implementation of the BaseSQLRetriever, using a single data store for document retrieval.
Examples:
from gllm_datastore.sql_data_store.sql_data_store import SQLDataStore
from gllm_retrieval.retriever.sql_retriever.basic_sql_retriever import BasicSQLRetriever
# Create a retriever with an existing data store
retriever = BasicSQLRetriever(sql_data_store)
# Retrieve the query result (a DataFrame unless return_query=True)
result = await retriever.retrieve("What is machine learning?")
Attributes:
| Name | Type | Description |
|---|---|---|
| sql_data_store | BaseSQLDataStore | The SQL database data store to be used. |
| lm_request_processor | LMRequestProcessor \| None | The LMRequestProcessor instance to be used for modifying a failed query. If not None, will modify the query upon failure. |
| extract_func | Callable[[str \| list[str] \| dict[str, str \| list[str]]], str \| list[str]] | A function to extract the modified query from the LM output. |
| max_retries | int | The maximum number of retries for the retrieval process. |
| preprocess_query_func | Callable[[str], str] \| None | A function to preprocess the query before execution. |
| logger | Logger | The logger instance to be used for logging. |
Initializes the BasicSQLRetriever object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sql_data_store | BaseSQLDataStore | The SQL database data store to be used. | required |
| lm_request_processor | LMRequestProcessor \| None | The LMRequestProcessor instance to be used for modifying a failed query. If not None, will modify the query upon failure. | None |
| extract_func | Callable[[str \| list[str] \| dict[str, str \| list[str]]], str \| list[str]] | A function to extract the transformed query from the output. Defaults to None, in which case a default extractor will be used. | None |
| max_retries | int | The maximum number of retries for the retrieval process. Defaults to 0. | 0 |
| preprocess_query_func | Callable[[str], str] \| None | A function to preprocess the query before execution. Defaults to None. | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If lm_request_processor is not provided when max_retries is greater than 0. |
retrieve(query, event_emitter=None, prompt_kwargs=None, return_query=False)
async
Retrieve data based on the query.
This method performs a retrieval operation using the configured data store.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str | The query string to retrieve documents. | required |
| event_emitter | EventEmitter \| None | The event emitter to emit events. Defaults to None. | None |
| prompt_kwargs | dict[str, Any] \| None | Additional keyword arguments for the prompt. Defaults to None. | None |
| return_query | bool | If True, returns a tuple of the executed query and the result. If False, returns only the result. Defaults to False. | False |
Returns:
| Type | Description |
|---|---|
| DataFrame \| tuple[str, DataFrame] | pd.DataFrame \| tuple[str, pd.DataFrame]: The result of the retrieval process. If return_query is True, returns a tuple of the executed query and the result. If return_query is False, returns only the result. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the retrieval process fails after the maximum number of retries. |
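The interaction between max_retries, the LM-based query repair, and return_query can be pictured as a simple retry loop. The sketch below is self-contained and uses stand-ins only: `run_query` and `repair_query` are hypothetical callables playing the roles of the SQL data store and the LMRequestProcessor, and rows are plain dictionaries rather than a DataFrame. It illustrates the documented behavior under those assumptions and is not the library's implementation.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable

Rows = list[dict]


async def retrieve_with_retries(
    query: str,
    run_query: Callable[[str], Awaitable[Rows]],  # hypothetical: executes the SQL
    repair_query: Callable[[str, str], Awaitable[str]] | None = None,  # hypothetical: LM-based fix
    max_retries: int = 0,
    return_query: bool = False,
) -> Rows | tuple[str, Rows]:
    # Mirrors the documented constructor check: retries need a repair step.
    if max_retries > 0 and repair_query is None:
        raise ValueError("repair_query is required when max_retries > 0")

    for attempt in range(max_retries + 1):
        try:
            rows = await run_query(query)
            return (query, rows) if return_query else rows
        except Exception as exc:
            if attempt == max_retries:
                raise ValueError(f"Retrieval failed after {max_retries} retries") from exc
            # Ask the stand-in repair step for a corrected query, then try again.
            query = await repair_query(query, str(exc))
    raise AssertionError("unreachable")


async def _demo() -> tuple[str, Rows]:
    async def run_query(sql: str) -> Rows:
        if "userz" in sql:
            raise RuntimeError("no such table: userz")
        return [{"count": 3}]

    async def repair_query(sql: str, error: str) -> str:
        return sql.replace("userz", "users")

    return await retrieve_with_retries(
        "SELECT COUNT(*) AS count FROM userz",
        run_query,
        repair_query,
        max_retries=1,
        return_query=True,
    )


executed_query, rows = asyncio.run(_demo())
```

With return_query=True the caller sees the query that actually ran, which is useful when the repair step has rewritten the original.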
BasicVectorRetriever(data_store)
Bases: BaseVectorRetriever
Initializes a new instance of the BasicVectorRetriever class.
This class provides a straightforward implementation of the BaseVectorRetriever, using a single data store for document retrieval.
Examples:
retriever = BasicVectorRetriever(data_store=elasticsearch_data_store)
results = await retriever.retrieve("search query", top_k=10)
Attributes:
| Name | Type | Description |
|---|---|---|
| data_store | BaseVectorDataStore | The data store used for retrieval operations. |
Initializes a new instance of the BasicVectorRetriever class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_store | BaseVectorDataStore | The data store to be used for retrieval operations. | required |
LightRAGRetriever(data_store)
Bases: BaseGraphRAGRetriever
Retriever implementation for LightRAG-based graph RAG.
This class provides a retriever interface for LightRAG, a graph-based Retrieval Augmented Generation system. It handles the conversion between LightRAG's raw output format and structured data objects, allowing for retrieval of document chunks, knowledge graph entities, and relationships based on natural language queries.
The retriever works with LightRAG data stores to perform graph-aware retrieval operations that leverage both semantic similarity and graph structure to find relevant information.
Examples:
from gllm_datastore.graph_data_store.light_rag_data_store import LightRAGDataStore
from gllm_retrieval.retriever.graph_retriever.light_rag_retriever import LightRAGRetriever
from gllm_retrieval.retriever.graph_retriever.constants import ReturnType
# Create a retriever with an existing data store
retriever = LightRAGRetriever(light_rag_data_store)
# Retrieve as chunks
chunks = await retriever.retrieve("What is machine learning?")
# Retrieve as dictionary with nodes and edges
result_dict = await retriever.retrieve(
    "What is machine learning?",
    return_type=ReturnType.DICT
)
# Retrieve as chunk contents (strings)
chunk_contents = await retriever.retrieve(
    "What is machine learning?",
    return_type=ReturnType.STRINGS
)
# Retrieve raw result from LightRAG
raw_result = await retriever.retrieve(
    "What is machine learning?",
    return_type=ReturnType.STRING
)
# Retrieve as synthesized response
result = await retriever.retrieve(
    "What is machine learning?",
    only_need_context=False
)
Attributes:
| Name | Type | Description |
|---|---|---|
| data_store | BaseLightRAGDataStore | The LightRAG data store used for retrieval operations. |
| _logger | Logger | Logger instance for this class. |
Initialize the LightRAGRetriever.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_store | BaseLightRAGDataStore | The LightRAG data store to use for retrieval operations. | required |
retrieve(query, retrieval_params=None, event_emitter=None, return_type=ReturnType.CHUNKS, only_need_context=None)
async
Retrieve information from LightRAG based on a natural language query.
This method queries the LightRAG data store and processes the results according to the specified return type. It can return the raw result, structured chunks, strings, or a dictionary containing nodes and edges from the knowledge graph.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str | The natural language query to retrieve information for. | required |
| retrieval_params | dict[str, Any] \| None | Optional dictionary of parameters to pass to the LightRAG query. Defaults to None. | None |
| event_emitter | EventEmitter \| None | Optional event emitter for tracking retrieval events. Defaults to None. | None |
| return_type | ReturnType | The type of result to return (CHUNKS, STRINGS, DICT, or STRING). Defaults to ReturnType.CHUNKS. | CHUNKS |
| only_need_context | bool \| None | Whether to only return the context (True) or the full result (False). If None, uses the value from retrieval_params or defaults to True. Defaults to None. | None |
Returns:
| Type | Description |
|---|---|
| str \| list[str] \| list[Chunk] \| dict[str, Any] | str \| list[str] \| list[Chunk] \| dict[str, Any]: Depends on the only_need_context and return_type parameters. With only_need_context=False, a response string synthesized by the LM invoker. With only_need_context=True: ReturnType.CHUNKS (default) returns a list of Chunk objects; ReturnType.STRINGS returns a list of content strings from chunks; ReturnType.DICT returns a dictionary representation of LightRAGRetrievalResult; ReturnType.STRING returns the raw result string from LightRAG without postprocessing. |
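The way only_need_context and return_type together determine the result shape can be summarized in a few lines. The sketch below is a self-contained illustration of the documented precedence (explicit argument first, then retrieval_params, then True). The `ReturnType` enum here is a stand-in with assumed member values, and the `"only_need_context"` key inside retrieval_params as well as the helper names are assumptions; this is not the library's code.

```python
from __future__ import annotations

from enum import Enum
from typing import Any


class ReturnType(str, Enum):
    # Stand-in for the library's ReturnType; the string values are assumed.
    CHUNKS = "chunks"
    STRINGS = "strings"
    DICT = "dict"
    STRING = "string"


def resolve_only_need_context(
    only_need_context: bool | None,
    retrieval_params: dict[str, Any] | None,
) -> bool:
    """Explicit argument wins, then retrieval_params, then the default True."""
    if only_need_context is not None:
        return only_need_context
    return bool((retrieval_params or {}).get("only_need_context", True))


def describe_result(
    only_need_context: bool | None = None,
    retrieval_params: dict[str, Any] | None = None,
    return_type: ReturnType = ReturnType.CHUNKS,
) -> str:
    """Name the result shape the documentation describes for these arguments."""
    if not resolve_only_need_context(only_need_context, retrieval_params):
        return "synthesized response (str)"
    return {
        ReturnType.CHUNKS: "list[Chunk]",
        ReturnType.STRINGS: "list[str]",
        ReturnType.DICT: "dict[str, Any]",
        ReturnType.STRING: "raw result (str)",
    }[return_type]


default_shape = describe_result()
synthesized_shape = describe_result(only_need_context=False, return_type=ReturnType.DICT)
```

Note that return_type only matters when the context path is taken; a synthesized response is always a string.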
LlamaIndexGraphRAGRetriever(data_store, property_graph_retriever=None, llama_index_llm=None, embed_model=None, vector_store=None, **kwargs)
Bases: BaseGraphRAGRetriever
A retriever class for querying a knowledge graph using the LlamaIndex framework.
Examples:
from gllm_datastore.graph_data_store.llama_index_graph_rag_data_store import LlamaIndexGraphRAGDataStore
from gllm_retrieval.retriever.graph_retriever.llama_index_graph_rag_retriever import LlamaIndexGraphRAGRetriever
from gllm_retrieval.retriever.graph_retriever.constants import ReturnType
# Create a retriever with an existing data store
retriever = LlamaIndexGraphRAGRetriever(llama_index_graph_rag_data_store)
# Retrieve as chunks
chunks = await retriever.retrieve("What is machine learning?")
# Retrieve as dictionary with nodes and edges
result_dict = await retriever.retrieve(
    "What is machine learning?",
    return_type=ReturnType.DICT
)
# Retrieve as chunk contents (strings)
chunk_contents = await retriever.retrieve(
    "What is machine learning?",
    return_type=ReturnType.STRINGS
)
# Retrieve raw result from LlamaIndexGraphRAG
raw_result = await retriever.retrieve(
    "What is machine learning?",
    return_type=ReturnType.STRING
)
# Retrieve as synthesized response
result = await retriever.retrieve(
    "What is machine learning?",
    only_need_context=False
)
Attributes:
| Name | Type | Description |
|---|---|---|
| _index | PropertyGraphIndex | The property graph index to use. |
| _graph_store | LlamaIndexGraphRAGDataStore \| PropertyGraphStore | The graph store to use. |
| _llm | BaseLLM \| None | The language model to use. |
| _embed_model | BaseEmbedding \| None | The embedding model to use. |
| _property_graph_retriever | PGRetriever | The property graph retriever to use. |
| _logger | Logger | The logger to use. |
| _default_return_type | ReturnType | The default return type for retrieve method. |
Initializes the LlamaIndexGraphRAGRetriever with the provided components.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_store | LlamaIndexGraphRAGDataStore \| PropertyGraphStore | The graph store to use. | required |
| property_graph_retriever | PGRetriever \| None | An existing retriever to use. | None |
| llama_index_llm | BaseLLM \| None | The language model to use for text-to-Cypher retrieval. | None |
| embed_model | BaseEmbedding \| None | The embedding model to use. | None |
| vector_store | BasePydanticVectorStore \| None | The vector store to use. | None |
| **kwargs | Any | Additional keyword arguments. Supported kwargs: default_return_type (ReturnType), the default return type for the retrieve method. Defaults to "chunks". | {} |
Raises:
| Type | Description |
|---|---|
| ValueError | If an invalid return type is provided. |
retrieve(query, retrieval_params=None, event_emitter=None, **kwargs)
async
Retrieves relevant documents for a given query.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str | The query string to search for. | required |
| retrieval_params | dict[str, Any] \| None | Additional retrieval parameters. | None |
| event_emitter | EventEmitter \| None | Event emitter for logging. | None |
| **kwargs | Any | Additional keyword arguments. Supported kwargs: return_type (ReturnType), the type of return value ("chunks" or "strings"). Defaults to the value set in the constructor. | {} |
Returns:
| Type | Description |
|---|---|
| str \| list[str] \| list[Chunk] \| dict[str, Any] | str \| list[str] \| list[Chunk] \| dict[str, Any]: The result of the retrieval process based on return_type. If return_type is "chunks", returns list[Chunk]. If return_type is "strings", returns list[str]. Returns an empty list on error. |
Raises:
| Type | Description |
|---|---|
| ValueError | If an invalid return type is provided. |
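The relationship between the constructor's default_return_type and the per-call return_type keyword can be sketched as follows. This is a self-contained illustration under stated assumptions: `ReturnType` is a stand-in enum limited to the two values the retrieve documentation names, and `ReturnTypeResolver` is a hypothetical helper, not a class from the library.

```python
from enum import Enum
from typing import Any


class ReturnType(str, Enum):
    # Stand-in enum; only the two values named in the retrieve docs are modeled.
    CHUNKS = "chunks"
    STRINGS = "strings"


class ReturnTypeResolver:
    """Hypothetical helper: constructor default plus per-call override."""

    def __init__(self, **kwargs: Any) -> None:
        self._default_return_type = self._validate(
            kwargs.get("default_return_type", ReturnType.CHUNKS)
        )

    @staticmethod
    def _validate(value: Any) -> ReturnType:
        # Accepts either an enum member or its string value.
        try:
            return ReturnType(value)
        except ValueError as exc:
            raise ValueError(f"Invalid return type: {value!r}") from exc

    def resolve(self, **kwargs: Any) -> ReturnType:
        # A per-call return_type takes precedence over the constructor default.
        return self._validate(kwargs.get("return_type", self._default_return_type))


resolver = ReturnTypeResolver(default_return_type="strings")
from_default = resolver.resolve()
from_override = resolver.resolve(return_type=ReturnType.CHUNKS)
```

Validating in both places matches the documentation, which lists ValueError for the constructor and for retrieve.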
PIIAwareRetriever(data_store, pii_resolver, weights=None, rank_constant=60, min_candidate=1, metadata_entities_field=METADATA_ENTITIES_FIELD)
Bases: BaseVectorRetriever
Privacy-preserving retriever with hybrid search and rank fusion.
This retriever handles PII in queries by:
1. Anonymizing queries before search operations
2. Executing hybrid search combining entity-filtered vector search with semantic search
3. Applying weighted Reciprocal Rank Fusion (RRF) to combine results
4. De-anonymizing retrieved chunks before returning to callers
Examples:
retriever = PIIAwareRetriever(
    data_store=data_store,
    pii_resolver=MetadataPIIResolver(),
    weights=[0.3, 0.7],
    rank_constant=60,
    min_candidate=1
)
results = await retriever.retrieve("What did Alice say?", top_k=10)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_store | BaseVectorDataStore | Data store for vector search operations. | required |
| pii_resolver | BasePIIResolver | PII resolver for anonymization/de-anonymization. | required |
| weights | list[float] \| None | Weights for [filtered, semantic] results. Defaults to [0.2, 0.8]. Weights must be positive and will be normalized to sum to 1.0. | None |
| rank_constant | int | RRF rank constant (k). Defaults to 60. | 60 |
| min_candidate | int | Minimum candidates per search method. Defaults to 1. | 1 |
| metadata_entities_field | str | Field name in metadata containing PII entities. Defaults to "metadata.entities". | METADATA_ENTITIES_FIELD |
Initialize the PIIAwareRetriever.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_store | BaseVectorDataStore | Data store for vector search operations. | required |
| pii_resolver | BasePIIResolver | PII resolver for anonymization/de-anonymization. | required |
| weights | list[float] \| None | Weights for [filtered, semantic] results. Defaults to [0.2, 0.8]. Must have exactly 2 elements with positive values. | None |
| rank_constant | int | RRF rank constant (k). Must be positive. Defaults to 60. | 60 |
| min_candidate | int | Minimum candidates per search method. Must be positive. Defaults to 1. | 1 |
| metadata_entities_field | str | Field name in metadata containing PII entities. Defaults to "metadata.entities". | METADATA_ENTITIES_FIELD |
Raises:
| Type | Description |
|---|---|
| ValueError | If pii_resolver is not a BasePIIResolver subclass. |
| ValueError | If weights does not have exactly 2 elements or contains non-positive values. |
| ValueError | If rank_constant or min_candidate are not positive integers. |
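The constraints on weights, rank_constant, and min_candidate described above can be expressed as a small validation step. The sketch below is a self-contained illustration: the helper names `normalize_weights` and `check_positive_int` are hypothetical, and the error messages are invented for the example. It shows one way to enforce the documented rules, not how the library does it.

```python
from __future__ import annotations


def normalize_weights(weights: list[float] | None = None) -> list[float]:
    """Validate [filtered, semantic] weights and scale them to sum to 1.0."""
    if weights is None:
        weights = [0.2, 0.8]  # documented default
    if len(weights) != 2 or any(w <= 0 for w in weights):
        raise ValueError("weights must contain exactly 2 positive values")
    total = sum(weights)
    return [w / total for w in weights]


def check_positive_int(name: str, value: int) -> int:
    """Reject anything that is not a strictly positive integer."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"{name} must be a positive integer")
    return value


normalized = normalize_weights([1, 3])
rank_constant = check_positive_int("rank_constant", 60)
```

Normalizing up front means callers can pass any positive ratio, such as `[1, 3]`, and the fusion step can assume the weights sum to 1.0.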
weighted_reciprocal_rank(doc_lists)
Perform weighted Reciprocal Rank Fusion on multiple rank lists.
This method implements the Weighted Reciprocal Rank Fusion (RRF) algorithm, which combines multiple ranked document lists into a single ranked list. RRF is particularly effective for combining results from different retrieval strategies (e.g., filtered search and semantic search).
The RRF score for each document is calculated as:
score = sum(weight_i / (rank_i + k)) for each list i
where rank_i is the document's rank in list i (1-based), and k is the rank constant.
Examples:
filtered_results = [chunk1, chunk2, chunk3] # Ranked by entity filtering
semantic_results = [chunk2, chunk1, chunk4] # Ranked by semantic similarity
fused = retriever.weighted_reciprocal_rank([filtered_results, semantic_results])
# Returns chunks ordered by combined RRF scores
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| doc_lists | list[list[Chunk]] | A list of rank lists to fuse. Must contain exactly 2 lists corresponding to [filtered, semantic] results. Each inner list should contain | required |
Returns:
| Type | Description |
|---|---|
| list[Chunk] | list[Chunk]: The final aggregated list of unique documents sorted by their weighted RRF scores in descending order. Documents with higher scores appear first. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the number of rank lists doesn't match the configured weights count. |
Note
- Documents are deduplicated by their id field
- The rank_constant parameter controls the influence of rank position
- Higher rank_constant values reduce the impact of rank differences
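To make the scoring formula and the notes above concrete, here is a minimal, self-contained sketch of weighted Reciprocal Rank Fusion. It follows the formula as documented (1-based ranks, deduplication by id), but the `Chunk` dataclass is a stand-in that models only `id` and `content`, and the free function `weighted_rrf` with explicit `weights` and `rank_constant` arguments is a hypothetical shape chosen for the example rather than the retriever's actual method.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    # Stand-in for the library's Chunk; only id and content are modeled.
    id: str
    content: str = ""


def weighted_rrf(
    doc_lists: list[list[Chunk]],
    weights: list[float],
    rank_constant: int = 60,
) -> list[Chunk]:
    """Fuse ranked lists with score = sum(weight_i / (rank_i + k))."""
    if len(doc_lists) != len(weights):
        raise ValueError("number of rank lists must match number of weights")

    scores: dict[str, float] = defaultdict(float)
    first_seen: dict[str, Chunk] = {}
    for docs, weight in zip(doc_lists, weights):
        for rank, doc in enumerate(docs, start=1):  # ranks are 1-based
            scores[doc.id] += weight / (rank + rank_constant)
            first_seen.setdefault(doc.id, doc)  # deduplicate by id

    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [first_seen[doc_id] for doc_id in ranked_ids]


c1, c2, c3, c4 = (Chunk(id=f"c{i}") for i in range(1, 5))
filtered_results = [c1, c2, c3]  # ranked by entity filtering
semantic_results = [c2, c1, c4]  # ranked by semantic similarity
fused = weighted_rrf([filtered_results, semantic_results], weights=[0.2, 0.8])
fused_ids = [chunk.id for chunk in fused]
```

With weights of 0.2 and 0.8, the semantic list dominates: `c2` (first in the semantic list) outranks `c1`, and `c4` (semantic only) outranks `c3` (filtered only).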