Fulltext

Elasticsearch implementation of fulltext search and CRUD capability.

Authors

Kadek Denaya (kadek.d.r.diana@gdplabs.id)

ElasticsearchFulltextCapability(index_name, client, query_field='text')

Elasticsearch implementation of FulltextCapability protocol.

This class provides document CRUD operations and flexible querying using Elasticsearch.

Attributes:

Name Type Description
index_name str

The name of the Elasticsearch index.

client AsyncElasticsearch

AsyncElasticsearch client.

query_field str

The field name to use for text content.

Initialize the Elasticsearch fulltext capability.

Parameters:

Name Type Description Default
index_name str

The name of the Elasticsearch index.

required
client AsyncElasticsearch

The Elasticsearch client.

required
query_field str

The field name to use for text content. Defaults to "text".

'text'

clear() async

Clear all records from the datastore.

create(data, **kwargs) async

Create new records in the datastore.

Parameters:

Name Type Description Default
data Chunk | list[Chunk]

Data to create (single item or collection).

required
**kwargs Any

Backend-specific parameters forwarded to Elasticsearch bulk API.

{}

Raises:

Type Description
ValueError

If data structure is invalid.
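
The parameter description above says extra keyword arguments are forwarded to the Elasticsearch bulk API. As a rough illustration only, the sketch below shows how a single item or a list might be normalized into bulk index actions. `SketchChunk` and `build_bulk_actions` are hypothetical names, and the document layout (`metadata` as a sub-object) is an assumption. The action keys (`_op_type`, `_index`, `_id`, `_source`) follow the dict shape accepted by the `elasticsearch.helpers` bulk helpers.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchChunk:
    """Hypothetical stand-in for the library's Chunk type."""

    id: str
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def build_bulk_actions(data, index_name: str, query_field: str = "text") -> list[dict[str, Any]]:
    """Normalize one chunk or a list of chunks into bulk index actions."""
    chunks = data if isinstance(data, list) else [data]
    actions = []
    for chunk in chunks:
        if not isinstance(chunk, SketchChunk):
            # Mirrors the documented ValueError for invalid input structures.
            raise ValueError(f"Expected a chunk, got {type(chunk).__name__}")
        actions.append(
            {
                "_op_type": "index",
                "_index": index_name,
                "_id": chunk.id,
                # Assumed layout: text under query_field, metadata alongside it.
                "_source": {query_field: chunk.content, "metadata": chunk.metadata},
            }
        )
    return actions
```

A list built this way could then be handed to a bulk helper together with any backend-specific keyword arguments.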

delete(filters=None) async

Deletes records from the data store based on filters.

Parameters:

Name Type Description Default
filters QueryFilter | None

Filters to select records for deletion. Defaults to None.

None

delete_by_id(id_) async

Deletes records from the data store based on IDs.

Parameters:

Name Type Description Default
id_ str | list[str]

ID or list of IDs to delete.

required

get_size() async

Returns the total number of documents in the index.

Returns:

Name Type Description
int int

The total number of documents.

retrieve(strategy=SupportedQueryMethods.BY_FIELD, query=None, filters=None, options=None, **kwargs) async

Retrieve records from the datastore using various strategies.

This is the main retrieval method. It supports the following search strategies:

- BY_FIELD: Filter by metadata fields
- BM25: Keyword-based search using the BM25 algorithm
- AUTOCOMPLETE: Autocomplete suggestions
- AUTOSUGGEST: Autosuggest functionality
- SHINGLES: Shingle-based search

Parameters:

Name Type Description Default
strategy SupportedQueryMethods

The retrieval strategy to use. Defaults to BY_FIELD.

BY_FIELD
query str | None

The query string for text-based strategies. Required for BM25, AUTOCOMPLETE, AUTOSUGGEST, SHINGLES.

None
filters QueryFilter | None

Query filters to apply. Defaults to None.

None
options QueryOptions | None

Query options (sorting, pagination, etc.). Defaults to None.

None
**kwargs Any

Additional strategy-specific parameters.

{}

Returns:

Type Description
list[Chunk] | list[str]

list[Chunk] | list[str]: Query results. Returns list[Chunk] for BY_FIELD and BM25, list[str] for AUTOCOMPLETE, AUTOSUGGEST, and SHINGLES.

Example
# Field-based retrieval
results = await data_store.retrieve(
    strategy=SupportedQueryMethods.BY_FIELD,
    filters=QueryFilter(conditions={"category": "AI"})
)

# BM25 keyword search
results = await data_store.retrieve(
    strategy=SupportedQueryMethods.BM25,
    query="machine learning",
    options=QueryOptions(limit=10)
)

# Autocomplete suggestions
suggestions = await data_store.retrieve(
    strategy=SupportedQueryMethods.AUTOCOMPLETE,
    query="artificial",
    field="title"
)
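
To make the strategy-based routing described for `retrieve` more concrete, here is a minimal, self-contained sketch of how such a dispatcher could be structured. Everything in it is hypothetical: `SketchQueryMethods` only mirrors the documented member names (the string values are assumed), `SketchStore` returns canned data instead of calling Elasticsearch, and raising `ValueError` for a missing query is an assumed behaviour, not one confirmed by this page.

```python
import asyncio
from enum import Enum


class SketchQueryMethods(str, Enum):
    """Hypothetical mirror of SupportedQueryMethods; the member values are assumed."""

    BY_FIELD = "by_field"
    BM25 = "bm25"
    AUTOCOMPLETE = "autocomplete"
    AUTOSUGGEST = "autosuggest"
    SHINGLES = "shingles"


class SketchStore:
    """Stand-in store whose handlers return canned data."""

    async def retrieve_by_field(self, filters=None, options=None, **kwargs):
        return ["chunk-from-filter"]

    async def retrieve_bm25(self, query, filters=None, options=None, **kwargs):
        return [f"chunk-for:{query}"]

    async def retrieve_autocomplete(self, query, field, **kwargs):
        return [f"{query}-suggestion-on-{field}"]

    async def retrieve(self, strategy=SketchQueryMethods.BY_FIELD, query=None,
                       filters=None, options=None, **kwargs):
        if strategy is SketchQueryMethods.BY_FIELD:
            return await self.retrieve_by_field(filters=filters, options=options, **kwargs)
        if query is None:
            # Assumed behaviour: text-based strategies reject a missing query.
            raise ValueError(f"query is required for strategy {strategy.value!r}")
        if strategy is SketchQueryMethods.BM25:
            return await self.retrieve_bm25(query, filters=filters, options=options, **kwargs)
        if strategy is SketchQueryMethods.AUTOCOMPLETE:
            # Strategy-specific parameters such as `field` travel via **kwargs.
            return await self.retrieve_autocomplete(query, **kwargs)
        raise NotImplementedError(f"{strategy.value} is omitted from this sketch")
```

The point of the sketch is that strategy-specific parameters (such as `field` for autocomplete) pass through `**kwargs`, which is why the autocomplete example above supplies `field="title"` directly to `retrieve`.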

retrieve_autocomplete(query, field, size=20, fuzzy_tolerance=1, min_prefix_length=3, filter_query=None) async

Provides suggestions based on a prefix query for a specific field.

Parameters:

Name Type Description Default
query str

The query string.

required
field str

The field name for autocomplete.

required
size int

The number of suggestions to retrieve. Defaults to 20.

20
fuzzy_tolerance int

The level of fuzziness for suggestions. Defaults to 1.

1
min_prefix_length int

The minimum prefix length to trigger fuzzy matching. Defaults to 3.

3
filter_query dict[str, Any] | None

The filter query. Defaults to None.

None

Returns:

Type Description
list[str]

list[str]: A list of suggestions.
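
The parameters above resemble the options of Elasticsearch's completion suggester, so one plausible request body is sketched below. This is an assumption rather than the library's confirmed implementation: the function names and the suggestion name `"autocomplete"` are invented for illustration, mapping `min_prefix_length` onto the suggester's fuzzy `min_length` option is a guess based on the parameter description, and `filter_query` is left out because its handling is not documented here.

```python
from typing import Any


def build_autocomplete_body(query: str, field: str, size: int = 20,
                            fuzzy_tolerance: int = 1,
                            min_prefix_length: int = 3) -> dict[str, Any]:
    """Build a completion-suggester request body (hypothetical mapping)."""
    return {
        "suggest": {
            "autocomplete": {
                "prefix": query,
                "completion": {
                    "field": field,
                    "size": size,
                    "skip_duplicates": True,
                    "fuzzy": {
                        "fuzziness": fuzzy_tolerance,
                        # Inputs shorter than this are matched without fuzziness.
                        "min_length": min_prefix_length,
                    },
                },
            }
        }
    }


def extract_suggestions(response: dict[str, Any], name: str = "autocomplete") -> list[str]:
    """Collect the suggested texts from a suggester response."""
    suggestions = []
    for entry in response.get("suggest", {}).get(name, []):
        for option in entry.get("options", []):
            suggestions.append(option["text"])
    return suggestions
```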

retrieve_autosuggest(query, search_fields, autocomplete_field, size=20, min_length=3, filter_query=None) async

Generates suggestions across multiple fields using a multi_match query to broaden the search criteria.

Parameters:

Name Type Description Default
query str

The query string.

required
search_fields list[str]

The fields to search.

required
autocomplete_field str

The field name for autocomplete.

required
size int

The number of suggestions to retrieve. Defaults to 20.

20
min_length int

The minimum length of the query. Defaults to 3.

3
filter_query dict[str, Any] | None

The filter query. Defaults to None.

None

Returns:

Type Description
list[str]

list[str]: A list of suggestions.
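
Since the description mentions a multi_match query, the sketch below shows what such a request body could look like. It is a hedged illustration only: `build_autosuggest_body` is a hypothetical helper, and the choice of the `bool_prefix` multi_match type, the `_source` restriction to `autocomplete_field`, and the placement of `filter_query` in a bool `filter` clause are all assumptions.

```python
from typing import Any, Optional


def build_autosuggest_body(query: str, search_fields: list[str],
                           autocomplete_field: str, size: int = 20,
                           filter_query: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Build a multi_match search body spanning several fields (hypothetical)."""
    bool_query: dict[str, Any] = {
        "must": [
            {
                "multi_match": {
                    "query": query,
                    "fields": list(search_fields),
                    # Assumed type: treats the last term as a prefix.
                    "type": "bool_prefix",
                }
            }
        ]
    }
    if filter_query is not None:
        bool_query["filter"] = [filter_query]
    return {
        "size": size,
        # Only the field used to display suggestions is fetched.
        "_source": [autocomplete_field],
        "query": {"bool": bool_query},
    }
```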

retrieve_bm25(query, filters=None, options=None, k1=None, b=None) async

Queries the Elasticsearch data store using BM25 algorithm for keyword-based search.

Parameters:

Name Type Description Default
query str

The query string.

required
filters QueryFilter | None

Optional metadata filter to apply to the search. For example, QueryFilter(conditions={"category": "AI", "source": ["doc1", "doc2"]}). Defaults to None.

None
options QueryOptions | None

Query options including fields, limit, order_by, etc. For example, QueryOptions(fields=["title", "content"], limit=10, order_by="score", order_desc=True). If fields is None, defaults to ["text"]. For multiple fields, uses multi_match query. Defaults to None.

None
k1 float | None

BM25 parameter controlling term frequency saturation. Higher values let term frequency keep contributing to the score for longer before returns diminish. Typical values: 1.2-2.0. If None, the Elasticsearch default (1.2) is used. Defaults to None.

None
b float | None

BM25 parameter controlling document length normalization: 0.0 disables length normalization, 1.0 applies full normalization. A typical value is 0.75. If None, the Elasticsearch default (0.75) is used. Defaults to None.

None
Example
# Basic BM25 query on the 'text' field
results = await data_store.retrieve_bm25("machine learning")

# BM25 query on specific fields with query options
results = await data_store.retrieve_bm25(
    "natural language",
    options=QueryOptions(fields=["title", "abstract"], limit=5)
)

# BM25 query with filters
results = await data_store.retrieve_bm25(
    "deep learning",
    filters=QueryFilter(conditions={"category": "AI", "status": "published"})
)

# BM25 query with custom BM25 parameters for more aggressive term frequency weighting
results = await data_store.retrieve_bm25(
    "artificial intelligence",
    k1=2.0,
    b=0.5
)

# BM25 query with fields, filters, and options
results = await data_store.retrieve_bm25(
    "data science applications",
    filters=QueryFilter(conditions={"author_id": "user123", "publication_year": [2022, 2023]}),
    options=QueryOptions(fields=["content", "tags"], limit=10, order_by="score", order_desc=True),
    k1=1.5,
    b=0.9
)

Returns:

Type Description
list[Chunk]

list[Chunk]: A list of Chunk objects representing the retrieved documents.
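
To give a feel for what the `k1` and `b` parameters do, here is a small standalone sketch of the textbook BM25 score for a single query term. It is not the code Elasticsearch runs (Lucene's scorer differs in details such as constant factors), and `bm25_term_score` with its argument names is purely illustrative.

```python
import math


def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    n_docs: int, doc_freq: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook BM25 contribution of one term to one document's score."""
    # Rarer terms (small doc_freq) receive a larger inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b blends between no length normalization (0.0) and full normalization (1.0).
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # k1 sets how quickly repeated occurrences of the term stop adding score.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


def saturation_ratio(k1: float) -> float:
    """Score gain from 1 to 10 occurrences, with length normalization disabled."""
    high = bm25_term_score(10, 100, 100, 1000, 10, k1=k1, b=0.0)
    low = bm25_term_score(1, 100, 100, 1000, 10, k1=k1, b=0.0)
    return high / low
```

With these definitions, `saturation_ratio(2.0)` is larger than `saturation_ratio(1.2)`, which is the "more impact before diminishing returns" effect described for `k1`. With `b=0.0` the document length has no influence on the score, while with `b=1.0` a longer document scores lower for the same term frequency.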

retrieve_by_field(filters=None, options=None, **kwargs) async

Retrieve records from the datastore based on metadata field filtering.

This method filters and returns stored chunks based on metadata values rather than text content. It is particularly useful for structured lookups, such as retrieving all chunks from a certain source, tagged with a specific label, or authored by a particular user.

Parameters:

Name Type Description Default
filters QueryFilter | None

Query filters to apply. Defaults to None.

None
options QueryOptions | None

Query options (sorting, pagination, etc.). Defaults to None.

None
**kwargs Any

Backend-specific parameters.

{}

Returns:

Type Description
list[Chunk]

list[Chunk]: The filtered results as Chunk objects.
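
Metadata filtering of this kind is commonly expressed in Elasticsearch as `term` and `terms` clauses inside a bool `filter`. The sketch below shows one way a conditions dictionary such as `{"category": "AI", "source": ["doc1", "doc2"]}` could be translated. How `QueryFilter` is actually converted is not stated on this page, so `conditions_to_es_query` and the flat field names it uses are assumptions.

```python
from typing import Any


def conditions_to_es_query(conditions: dict[str, Any]) -> dict[str, Any]:
    """Translate exact-match conditions into an Elasticsearch query (hypothetical)."""
    if not conditions:
        # No conditions: match every document.
        return {"match_all": {}}
    clauses = []
    for field_name, value in conditions.items():
        if isinstance(value, (list, tuple, set)):
            # A collection means "field equals any of these values".
            clauses.append({"terms": {field_name: list(value)}})
        else:
            clauses.append({"term": {field_name: value}})
    # Filter context: clauses must all match and do not affect scoring.
    return {"bool": {"filter": clauses}}
```

Filter clauses are not scored, which suits lookups by metadata where relevance ranking is not needed.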

retrieve_shingles(query, field, size=20, min_length=3, max_length=30, filter_query=None) async

Searches using shingles for prefix and fuzzy matching.

Parameters:

Name Type Description Default
query str

The query string.

required
field str

The field name for autocomplete.

required
size int

The number of suggestions to retrieve. Defaults to 20.

20
min_length int

The minimum length of the query. Queries shorter than this limit will return an empty list. Defaults to 3.

3
max_length int

The maximum length of the query. Queries exceeding this limit will return an empty list. Defaults to 30.

30
filter_query dict[str, Any] | None

The filter query. Defaults to None.

None

Returns:

Type Description
list[str]

list[str]: A list of suggestions.
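
For readers unfamiliar with the term, a shingle is a word-level n-gram: a run of adjacent tokens indexed as one unit. The sketch below generates shingles from a string and shows the kind of length guard the `min_length` and `max_length` parameters describe. Both helpers are hypothetical, the shingle sizes are arbitrary choices, and measuring query length in characters is an assumption, since the page does not state the unit.

```python
def word_shingles(text: str, min_size: int = 2, max_size: int = 3) -> list[str]:
    """Return all word n-grams of text with sizes from min_size to max_size."""
    tokens = text.lower().split()
    shingles = []
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            shingles.append(" ".join(tokens[start:start + size]))
    return shingles


def query_within_bounds(query: str, min_length: int = 3, max_length: int = 30) -> bool:
    """Check the query length (assumed to be in characters) against both limits."""
    return min_length <= len(query) <= max_length
```

A caller following the documented behaviour would return an empty list whenever `query_within_bounds` is false and only run the search otherwise.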

update(update_values, filters=None) async

Update existing records in the datastore.

Parameters:

Name Type Description Default
update_values dict

Values to update.

required
filters QueryFilter | None

Filters to select records to update. Defaults to None.

None

SupportedQueryMethods

Bases: StrEnum

Supported query methods for Elasticsearch fulltext capability.