Hybrid
Elasticsearch implementation of hybrid search capability.
This module provides hybrid search that combines fulltext (BM25) and vector (dense) retrieval using weighted score boosting: the final score is a weighted sum of the BM25 and vector similarity scores.
ElasticsearchHybridCapability(index_name, client, config, encryption=None)
Elasticsearch implementation of HybridCapability using weighted score boosting.
Combines fulltext (BM25) and vector (dense) retrieval in a single query. The final score is sum(weight_i * score_i) over all configured search types.
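The weighted-sum formula above can be illustrated with a small sketch. The helper, scores, and weights below are hypothetical, not part of the capability's API; the weights match the example config further down (0.3 for fulltext, 0.7 for vector):

```python
# Sketch: combine per-branch relevance scores with their configured weights.
# Final score = sum(weight_i * score_i), as described above.

def combine_scores(branch_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Return the weighted sum of the scores of each configured search branch."""
    return sum(weights[name] * score for name, score in branch_scores.items())

# Hypothetical raw scores: a BM25 score for the "text" field and a cosine
# similarity for the "embedding" field, weighted 0.3 and 0.7 respectively.
final = combine_scores(
    {"text": 4.2, "embedding": 0.91},
    {"text": 0.3, "embedding": 0.7},
)
```

Because BM25 scores are unbounded while cosine similarity is roughly in [-1, 1], the weights effectively control how much each branch's score scale contributes to the ranking.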
Initialize the Elasticsearch hybrid capability.
Examples:

```python
from gllm_datastore.data_store.elasticsearch.hybrid import ElasticsearchHybridCapability
from gllm_datastore.core.capabilities.hybrid_capability import HybridSearchType, SearchConfig

config = [
    SearchConfig(HybridSearchType.FULLTEXT, field="text", weight=0.3),
    SearchConfig(HybridSearchType.VECTOR, field="embedding", weight=0.7, em_invoker=em_invoker),
]
capability = ElasticsearchHybridCapability(
    index_name="my_index",
    client=es_client,
    config=config,
)
```
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `index_name` | `str` | The name of the Elasticsearch index. | *required* |
| `client` | `AsyncElasticsearch` | The Elasticsearch client. | *required* |
| `config` | `list[SearchConfig]` | List of search configurations (FULLTEXT and/or VECTOR). | *required* |
| `encryption` | `EncryptionCapability \| None` | Encryption capability. Defaults to None. | `None` |
fulltext_configs
property
Return configs for FULLTEXT search type.
vector_configs
property
Return configs for VECTOR search type.
clear(**kwargs)
async
Clear all records from the datastore.
Examples:

```python
await hybrid_capability.clear()
```

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Extra arguments passed through to the Elasticsearch delete_by_query API (e.g. refresh, timeout). | `{}` |
create(chunks, **kwargs)
async
Create chunks with automatic generation of all configured search fields.
For each chunk: indexes text in each FULLTEXT field and generates dense embeddings for each VECTOR field using the configured em_invoker. When encryption is enabled, embeddings are generated from plaintext first, then chunks are encrypted, so embeddings represent the original content.
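The embed-then-encrypt ordering described above can be sketched as follows. The `embed` and `encrypt` callables, the document layout, and the toy implementations are illustrative stand-ins, not the capability's internals:

```python
# Sketch of the create() ordering when encryption is enabled:
# 1. generate embeddings from the plaintext content,
# 2. encrypt the content afterwards,
# so the stored vectors always represent the original text.

def prepare_documents(chunks, embed, encrypt, vector_field="embedding"):
    """Embed each chunk's plaintext content, then encrypt the content for storage."""
    docs = []
    for chunk_id, content in chunks:
        vector = embed(content)    # step 1: embed plaintext
        stored = encrypt(content)  # step 2: encrypt only after embedding
        docs.append({"id": chunk_id, "text": stored, vector_field: vector})
    return docs

docs = prepare_documents(
    [("1", "Machine learning basics")],
    embed=lambda text: [float(len(text))],  # toy embedding function
    encrypt=lambda text: text[::-1],        # toy stand-in for real encryption
)
```

Reversing the order would embed ciphertext, producing vectors unrelated to the original content; that is why the encryption step always runs last.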
Examples:

```python
from gllm_core.schema import Chunk

chunks = [
    Chunk(id="1", content="Machine learning basics", metadata={"source": "doc1"}),
    Chunk(id="2", content="Deep learning networks", metadata={"source": "doc2"}),
]
await hybrid_capability.create(chunks)
```

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `chunks` | `list[Chunk]` | List of chunks to create and index. Each chunk's content is indexed in every FULLTEXT config field and embedded for every VECTOR config field via the configured em_invoker. | *required* |
| `**kwargs` | `Any` | Extra arguments passed through to the Elasticsearch bulk API (e.g. refresh, timeout). | `{}` |
create_from_vector(chunks, dense_vectors=None, **kwargs)
async
Create chunks with pre-computed vectors for vector fields.
Field names in dense_vectors must match VECTOR config fields. FULLTEXT fields are filled from chunk content. When encryption is enabled, vectors must be from plaintext; chunks are encrypted after aligning vectors, so embeddings are never from ciphertext.
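The alignment constraint above (keys match VECTOR config fields, vectors in chunk order) can be sketched as a small validation step. The helper and data shapes are hypothetical, not the capability's actual implementation:

```python
# Sketch: validate that pre-computed vectors line up with the configured
# VECTOR fields and with the chunks being indexed.

def check_vector_alignment(chunks, dense_vectors, vector_fields):
    """Raise ValueError if dense_vectors keys or chunk order do not match."""
    unknown = set(dense_vectors) - set(vector_fields)
    if unknown:
        raise ValueError(f"Unknown vector fields: {sorted(unknown)}")
    chunk_ids = [c["id"] for c in chunks]
    for field, pairs in dense_vectors.items():
        ids = [chunk["id"] for chunk, _vector in pairs]
        if ids != chunk_ids:
            raise ValueError(f"Vectors for {field!r} are not in chunk order")

chunks = [{"id": "1"}, {"id": "2"}]
vectors = {"embedding": [(chunks[0], [0.1, 0.2, 0.3]), (chunks[1], [0.4, 0.5, 0.6])]}
check_vector_alignment(chunks, vectors, vector_fields={"embedding"})  # passes silently
```

A key that does not match any VECTOR config field, or a vector list out of chunk order, would silently corrupt the index mapping, which is why the keys-must-match rule matters.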
Examples:

```python
from gllm_core.schema import Chunk
from gllm_inference.schema import Vector

chunks = [
    Chunk(id="1", content="ML basics", metadata={"source": "doc1"}),
    Chunk(id="2", content="DL networks", metadata={"source": "doc2"}),
]
dense_vectors = {
    "embedding": [
        (chunks[0], Vector([0.1, 0.2, 0.3])),
        (chunks[1], Vector([0.4, 0.5, 0.6])),
    ],
}
await hybrid_capability.create_from_vector(chunks, dense_vectors=dense_vectors)
```

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `chunks` | `list[Chunk]` | Chunks to index. FULLTEXT config fields are populated from each chunk's content; VECTOR fields come from dense_vectors when provided for that field. | *required* |
| `dense_vectors` | `dict[str, list[tuple[Chunk, Vector]]] \| None` | Map from VECTOR config field name to a list of (chunk, vector) tuples in chunk order. Keys must match VECTOR config field names. Defaults to None. | `None` |
| `**kwargs` | `Any` | Extra arguments passed through to the Elasticsearch bulk API (e.g. refresh, timeout). | `{}` |
delete(filters=None, **kwargs)
async
Delete records from the datastore.
Examples:

```python
from gllm_datastore.core.filters import filter as F

await hybrid_capability.delete(filters=F.eq("metadata.source", "doc1"))
await hybrid_capability.delete(filters=F.and_(F.eq("metadata.status", "draft"), F.eq("id", "c1")))
```

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `filters` | `FilterClause \| QueryFilter \| None` | Filter to select which records to delete. The call is a no-op when None, to avoid an accidental full-index delete. Defaults to None. | `None` |
| `**kwargs` | `Any` | Extra arguments passed through to the Elasticsearch delete_by_query API (e.g. refresh, timeout). | `{}` |
retrieve(query, filters=None, options=None, **kwargs)
async
Retrieve using hybrid search with weighted score (BM25 + vector).
Combines fulltext and vector scores as weighted sum.
Examples:

```python
from gllm_datastore.core.filters import QueryOptions
from gllm_datastore.core.filters import filter as F

results = await hybrid_capability.retrieve(
    "machine learning",
    options=QueryOptions(limit=10),
)
results_with_filter = await hybrid_capability.retrieve(
    "neural networks",
    filters=F.eq("metadata.source", "doc1"),
    options=QueryOptions(limit=5),
)
```

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | Search text used for both the fulltext (BM25) and vector branches; query vectors are generated from this text via each VECTOR config's em_invoker. | *required* |
| `filters` | `FilterClause \| QueryFilter \| None` | Filter restricting which documents are considered. Use FilterClause for a single condition or QueryFilter for combined conditions. Defaults to None. | `None` |
| `options` | `QueryOptions \| None` | Limit, sort (order_by, order_desc), offset, and include_fields. Defaults to None (limit uses DEFAULT_TOP_K). | `None` |
| `**kwargs` | `Any` | Extra arguments passed through to the underlying search. | `{}` |

Returns:

| Type | Description |
|---|---|
| `list[Chunk]` | Chunks ordered by combined relevance score. |
retrieve_by_vector(query=None, dense_vectors=None, filters=None, options=None)
async
Hybrid search using optional query text and/or pre-computed vectors.
Builds a bool should query: each FULLTEXT contributes weight * BM25 score, each VECTOR contributes weight * cosine similarity score. Final score is the sum of these contributions (weighted score boosting).
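A bool should query of the kind described above might look roughly like the following. This is a hand-written approximation of the Elasticsearch query DSL, not the exact body the capability emits; the field names, weights, and the use of a script_score clause for the vector branch are illustrative assumptions:

```python
# Approximate shape of a weighted-boosting hybrid query: each FULLTEXT config
# contributes a boosted match clause, each VECTOR config a boosted similarity
# clause, and Elasticsearch sums the matching "should" clause scores.

def build_hybrid_query(query_text, query_vector):
    return {
        "bool": {
            "should": [
                # FULLTEXT branch: BM25 score scaled by its weight via boost.
                {"match": {"text": {"query": query_text, "boost": 0.3}}},
                # VECTOR branch: cosine similarity scaled by its weight,
                # written here as a script_score clause for illustration.
                {
                    "script_score": {
                        "query": {"match_all": {}},
                        "script": {
                            "source": "0.7 * (cosineSimilarity(params.v, 'embedding') + 1.0)",
                            "params": {"v": query_vector},
                        },
                    }
                },
            ]
        }
    }

body = build_hybrid_query("machine learning", [0.1, 0.2, 0.3])
```

Since `should` clauses are additive for scoring, a document matching both branches receives the sum of the two weighted contributions, which realizes the weighted score boosting described above.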
Examples:

```python
from gllm_datastore.core.filters import QueryOptions
from gllm_datastore.core.filters import filter as F

results = await hybrid_capability.retrieve_by_vector(
    query="machine learning",
    dense_vectors={"embedding": [0.1, 0.2, 0.3]},
    options=QueryOptions(limit=10),
)
results_multi_vector = await hybrid_capability.retrieve_by_vector(
    query="AI",
    dense_vectors={"embedding_a": vec_a, "embedding_b": vec_b},
    options=QueryOptions(limit=5),
)
```

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str \| None` | Search text for the fulltext (BM25) clause only. Omit or set to None when using only vector search. Defaults to None. | `None` |
| `dense_vectors` | `dict[str, Vector] \| None` | Map from VECTOR config field name to query vector. Each VECTOR config uses dense_vectors[field]. Keys must match VECTOR config field names. Defaults to None. | `None` |
| `filters` | `FilterClause \| QueryFilter \| None` | Filter restricting which documents are considered. Defaults to None. | `None` |
| `options` | `QueryOptions \| None` | Limit, sort (order_by, order_desc), offset, and include_fields. Defaults to None (limit uses DEFAULT_TOP_K). | `None` |

Returns:

| Type | Description |
|---|---|
| `list[Chunk]` | Chunks ordered by combined score. |
update(update_values, filters=None, **kwargs)
async
Update existing records in the datastore.
- Vector configs: for each VECTOR config whose field is being updated, embeddings are generated from the plaintext source (the content/text/fulltext field). Embeddings must be produced before encryption so that the model sees plaintext.
- Encryption: all update values (content and metadata) are encrypted using the encryption config; vector fields are not encrypted. This step runs after enrichment so embeddings are never generated from encrypted text.
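The two steps above (re-embed from plaintext, then encrypt) can be sketched as follows. The helper, field names, and toy `embed`/`encrypt` callables are hypothetical, not the method's internals:

```python
# Sketch of the update() enrichment ordering:
# 1. re-embed vector fields from the plaintext text/content being written,
# 2. encrypt text/content afterwards, leaving vector fields untouched.

def enrich_update_values(update_values, embed, encrypt, vector_field="embedding"):
    """Return update values with re-embedded vectors and encrypted text fields."""
    values = dict(update_values)
    text = values.get("text") or values.get("content")
    if text is not None:
        values[vector_field] = embed(text)      # step 1: embed plaintext
    for key in ("text", "content"):
        if key in values:
            values[key] = encrypt(values[key])  # step 2: encrypt after embedding
    return values

values = enrich_update_values(
    {"text": "Updated content"},
    embed=lambda t: [float(len(t))],  # toy embedding function
    encrypt=lambda t: t[::-1],        # toy stand-in for real encryption
)
```

Note that a metadata-only update (no text or content key) would skip the embedding step entirely, which matches the behavior described for `update_values` below.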
Examples:

```python
from gllm_datastore.core.filters import filter as F

await hybrid_capability.update(
    update_values={"text": "Updated content"},
    filters=F.eq("metadata.source", "doc1"),
)
await hybrid_capability.update(
    update_values={"metadata": {"status": "published"}},
    filters=F.eq("id", "chunk_1"),
)
```

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `update_values` | `dict[str, Any]` | Fields to update (e.g. text, content, metadata). When text or content is present, vector fields from the hybrid config are re-embedded and included automatically. | *required* |
| `filters` | `FilterClause \| QueryFilter \| None` | Filter to select which records to update. No filter means no documents match. Defaults to None. | `None` |
| `**kwargs` | `Any` | Extra arguments passed through to the Elasticsearch update_by_query API (e.g. refresh, timeout). | `{}` |