Vector Data Store

Modules containing data store implementations to be used in Gen AI applications.

ChromaVectorDataStore(collection_name, embedding=None, client_type=ChromaClientType.MEMORY, persist_directory=None, host=None, port=None, headers=None, num_candidates=DEFAULT_NUM_CANDIDATES, **kwargs)

Bases: BaseVectorDataStore, CacheCompatibleMixin

Datastore for interacting with ChromaDB.

This class provides methods to interact with ChromaDB for vector storage and retrieval using the langchain-chroma integration.

Attributes:

Name Type Description
vector_store Chroma

The langchain Chroma vector store instance.

collection_name str

The name of the ChromaDB collection to use.

num_candidates int

The maximum number of candidates to consider during search.

Initialize the ChromaDB vector data store with langchain-chroma.

Parameters:

Name Type Description Default
collection_name str

Name of the collection to use in ChromaDB.

required
embedding BaseEMInvoker | Embeddings | None

The embedding model to perform vectorization. Defaults to None.

None
client_type ChromaClientType

Type of ChromaDB client to use. Defaults to ChromaClientType.MEMORY.

MEMORY
persist_directory str | None

Directory to persist vector store data. Required for PERSISTENT client type. Defaults to None.

None
host str | None

Host address for ChromaDB server. Required for HTTP client type. Defaults to None.

None
port int | None

Port for ChromaDB server. Required for HTTP client type. Defaults to None.

None
headers dict | None

Headers for ChromaDB server. Used for HTTP client type. Defaults to None.

None
num_candidates int

Maximum number of candidates to consider during search. Defaults to DEFAULT_NUM_CANDIDATES.

DEFAULT_NUM_CANDIDATES
**kwargs Any

Additional parameters for Chroma initialization.

{}
Note

num_candidates (int, optional): Caps the number of results considered during a search. An index with more documents needs a higher value so that every document is considered. This works around a bug in Chroma's search algorithm, discussed in this issue: https://github.com/langchain-ai/langchain/issues/1946
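The parameter table above ties some arguments to a specific client type. A minimal sketch of that rule as a pre-flight check, assuming a stand-in enum named ClientType (the real ChromaClientType member values are not shown here) and a hypothetical helper missing_args:

```python
from enum import Enum


class ClientType(Enum):
    """Stand-in for ChromaClientType; the member values here are assumed."""
    MEMORY = "memory"
    PERSISTENT = "persistent"
    HTTP = "http"


# Arguments the parameter table marks as required for each client type.
REQUIRED_ARGS = {
    ClientType.MEMORY: (),
    ClientType.PERSISTENT: ("persist_directory",),
    ClientType.HTTP: ("host", "port"),
}


def missing_args(client_type, **given):
    """Return the names of required arguments that are absent or None."""
    return [name for name in REQUIRED_ARGS[client_type] if given.get(name) is None]


print(missing_args(ClientType.MEMORY))
print(missing_args(ClientType.PERSISTENT, persist_directory="./chroma_data"))
print(missing_args(ClientType.HTTP, host="localhost"))
```

Running a check like this before constructing the store surfaces a forgotten host or persist_directory early. Whether the class itself validates these arguments is not stated in this reference.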

add_chunks(chunks, **kwargs) async

Add chunks to the vector data store.

Parameters:

Name Type Description Default
chunks Chunk | list[Chunk]

A single chunk or list of chunks to add.

required
**kwargs

Additional keyword arguments for the add operation.

{}

Returns:

Type Description
list[str]

list[str]: List of IDs of the added chunks.

clear() async

Clear all entries in the storage.

Raises:

Type Description
NotImplementedError

Currently, app-level eviction is not supported for ChromaVectorDataStore.
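Because the eviction-related methods raise NotImplementedError, calling code may want to treat eviction as optional. A minimal sketch, using a hypothetical StubStore that only mimics the documented behaviour (it is not the real class):

```python
import asyncio


class StubStore:
    """Hypothetical stand-in that mimics the documented clear() behaviour."""

    async def clear(self):
        raise NotImplementedError("app-level eviction is not supported")


async def try_clear(store):
    """Return True if the store was cleared, False if clearing is unsupported."""
    try:
        await store.clear()
        return True
    except NotImplementedError:
        return False


print(asyncio.run(try_clear(StubStore())))
```

The same pattern applies to the other eviction methods documented as raising NotImplementedError.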

delete_chunks(where=None, where_document=None, **kwargs) async

Delete chunks from the vector data store.

Parameters:

Name Type Description Default
where Where | None

A Where type dict used to filter the deletion by metadata. E.g. {"source": "mydoc"}. Defaults to None.

None
where_document WhereDocument | None

A WhereDocument type dict used to filter the deletion by document content. E.g. {"$contains": "hello"}. Defaults to None.

None
**kwargs Any

Additional keyword arguments for the delete operation.

{}
Note

If no filter criteria are provided, all chunks in the collection will be deleted. Use with caution.
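Since an unfiltered call deletes the whole collection, it can help to build filters explicitly and refuse empty ones. A minimal sketch with two hypothetical helpers, combine_and and require_filter; the dict shapes follow Chroma's filter syntax (`$and` for metadata, `$contains` for document text), and wrapping a single condition in `$and` is avoided because Chroma expects at least two expressions there:

```python
def combine_and(*conditions):
    """Combine metadata conditions into one Chroma-style `where` dict."""
    conditions = [c for c in conditions if c]
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}


def require_filter(where=None, where_document=None):
    """Raise instead of letting an unfiltered delete wipe the collection."""
    if not where and not where_document:
        raise ValueError("refusing to delete without a filter")
    return {"where": where, "where_document": where_document}


where = combine_and({"source": "mydoc"}, {"page": {"$gte": 3}})
delete_kwargs = require_filter(where=where, where_document={"$contains": "hello"})
print(delete_kwargs)
# The result could then be passed on, e.g.:
# await store.delete_chunks(**delete_kwargs)  # `store` is assumed to exist
```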

delete_chunks_by_ids(ids, **kwargs) async

Delete chunks from the vector data store by IDs.

Parameters:

Name Type Description Default
ids str | list[str]

A single ID or a list of IDs to delete.

required
**kwargs Any

Additional keyword arguments.

{}
Note

If no IDs are provided, no chunks will be deleted.

delete_entries_by_key(key) async

Delete entries by key.

Parameters:

Name Type Description Default
key str

The key to delete entries for.

required

Raises:

Type Description
NotImplementedError

Currently, app-level eviction is not supported for ChromaVectorDataStore.

delete_expired_entries(now, max_size=10000) async

Delete expired entries (for TTL eviction).

Parameters:

Name Type Description Default
now datetime

The current datetime for comparison.

required
max_size int

The maximum number of entries to return. Defaults to 10000.

10000

Raises:

Type Description
NotImplementedError

Currently, app-level eviction is not supported for ChromaVectorDataStore.

delete_least_frequently_used_entries(num_entries) async

Delete least frequently used entries (for LFU eviction).

Parameters:

Name Type Description Default
num_entries int

Number of entries to delete.

required

Raises:

Type Description
NotImplementedError

Currently, app-level eviction is not supported for ChromaVectorDataStore.

delete_least_recently_used_entries(num_entries) async

Delete least recently used entries (for LRU eviction).

Parameters:

Name Type Description Default
num_entries int

Number of entries to delete.

required

Raises:

Type Description
NotImplementedError

Currently, app-level eviction is not supported for ChromaVectorDataStore.

exact_match(key, metadata=None) async

Find chunks that exactly match the given key.

This method searches for documents with the exact original_key in metadata.

Parameters:

Name Type Description Default
key str

The key to match.

required
metadata dict[str, Any] | None

Optional metadata filter to apply to the search. For example, {"key": "value"}. Defaults to None.

None

Returns:

Name Type Description
Any Any | None

The value stored with the exact key match, or None if no match is found.

fuzzy_match(key, max_distance=2, metadata=None) async

Find chunks that approximately match the given key using fuzzy matching.

Parameters:

Name Type Description Default
key str

The key to match.

required
max_distance int

Maximum allowed Levenshtein distance for fuzzy matching. Higher values are more lenient. Defaults to 2.

2
metadata dict[str, Any] | None

Optional metadata filter to apply to the search. For example, {"key": "value"}. Defaults to None.

None

Returns:

Name Type Description
Any Any | None

The value with the closest fuzzy match to the key, or None if no match meets the threshold.
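To make max_distance concrete, here is a minimal, self-contained sketch of Levenshtein distance and a threshold lookup. The function names are hypothetical, and this is an illustration of the metric rather than the data store's own implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_within(key, candidates, max_distance=2):
    """Return the candidate nearest to key, or None if none is close enough."""
    best = min(candidates, key=lambda c: levenshtein(key, c), default=None)
    if best is None or levenshtein(key, best) > max_distance:
        return None
    return best


print(closest_within("colour", ["color", "collar"]))
print(closest_within("kitten", ["sitting"]))
```

With max_distance=2, "colour" matches "color" (one edit away), while "kitten" and "sitting" are three edits apart and yield None.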

query(query, top_k=DEFAULT_TOP_K, retrieval_params=None) async

Query the vector data store for similar chunks with similarity scores.

Parameters:

Name Type Description Default
query str

The query string to find similar chunks for.

required
top_k int

Maximum number of results to return. Defaults to DEFAULT_TOP_K.

DEFAULT_TOP_K
retrieval_params dict[str, Any] | None

Additional parameters for retrieval. Defaults to None. Supported keys:
- filter (Where, optional): A Where type dict used to filter the retrieval by metadata keys. E.g. {"$and": [{"color": "red"}, {"price": {"$gte": 4.20}}]}.
- where_document (WhereDocument, optional): A WhereDocument type dict used to filter the retrieval by document content. E.g. {"$contains": "hello"}.

None

Returns:

Type Description
list[Chunk]

list[Chunk]: A list of Chunk objects matching the query, with similarity scores.
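A minimal sketch of assembling retrieval_params for this method. The helper build_retrieval_params is hypothetical; only the two keys named in the parameter description are used, and unset keys are omitted:

```python
def build_retrieval_params(metadata_filter=None, where_document=None):
    """Build a retrieval_params dict, leaving out any key that is not set."""
    params = {}
    if metadata_filter:
        params["filter"] = metadata_filter
    if where_document:
        params["where_document"] = where_document
    return params or None


params = build_retrieval_params(
    metadata_filter={"$and": [{"color": "red"}, {"price": {"$gte": 4.20}}]},
    where_document={"$contains": "hello"},
)
print(params)
# Possible use, assuming `store` is an initialized data store:
# chunks = await store.query("red shoes", top_k=5, retrieval_params=params)
```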

query_by_id(id) async

Retrieve chunks by their IDs.

Parameters:

Name Type Description Default
id str | list[str]

A single ID or a list of IDs to retrieve.

required

Returns:

Type Description
list[Chunk]

list[Chunk]: A list of retrieved Chunk objects.

semantic_match(key, min_similarity=0.2, metadata=None) async

Find chunks that semantically match the given key using vector similarity.

Parameters:

Name Type Description Default
key str

The key to match.

required
min_similarity float

Minimum similarity score for semantic matching (higher values are more strict). Ranges from 0 to 1. Defaults to 0.2.

0.2
metadata dict[str, Any] | None

Optional metadata filter to apply to the search. For example, {"key": "value"}. Defaults to None.

None

Returns:

Name Type Description
Any Any | None

The semantically closest value, or None if no match meets the min_similarity.
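The effect of min_similarity can be illustrated with plain cosine similarity. This is a sketch under an assumption: the reference does not say which similarity measure the store uses or how it converts Chroma distances, so cosine similarity and the helper names below are illustrative only:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def best_match(query_vec, stored, min_similarity=0.2):
    """Return the stored value most similar to query_vec, or None below the threshold.

    `stored` maps each value to its embedding vector.
    """
    scored = [(cosine_similarity(query_vec, vec), value) for value, vec in stored.items()]
    if not scored:
        return None
    score, value = max(scored)
    return value if score >= min_similarity else None


stored = {"cached answer A": [1.0, 0.0], "cached answer B": [0.0, 1.0]}
print(best_match([1.0, 0.1], stored))
print(best_match([1.0, 0.0], {"cached answer B": [0.0, 1.0]}))
```

Raising min_similarity makes the lookup stricter: fewer keys match, but the ones that do are closer to the query.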

ElasticsearchVectorDataStore(index_name, embedding=None, connection=None, url=None, cloud_id=None, user=None, api_key=None, password=None, vector_query_field='vector', query_field='text', distance_strategy=None, strategy=None, request_timeout=DEFAULT_REQUEST_TIMEOUT)

Bases: BaseVectorDataStore, CacheCompatibleMixin

DataStore for interacting with Elasticsearch.

This class provides methods for executing queries and retrieving documents from Elasticsearch. It relies on LangChain's ElasticsearchStore for vector operations and for managing the underlying Elasticsearch client.

Attributes:

Name Type Description
vector_store ElasticsearchStore

The ElasticsearchStore instance for vector operations.

index_name str

The name of the Elasticsearch index.

logger Logger

The logger object.

Initializes an instance of the ElasticsearchVectorDataStore class.

Parameters:

Name Type Description Default
index_name str

The name of the Elasticsearch index.

required
embedding BaseEMInvoker | Embeddings | None

The embedding model to perform vectorization. Defaults to None.

None
connection Any | None

The Elasticsearch connection object. Defaults to None.

None
url str | None

The URL of the Elasticsearch server. Defaults to None.

None
cloud_id str | None

The cloud ID of the Elasticsearch cluster. Defaults to None.

None
user str | None

The username for authentication. Defaults to None.

None
api_key str | None

The API key for authentication. Defaults to None.

None
password str | None

The password for authentication. Defaults to None.

None
vector_query_field str

The field name for vector queries. Defaults to "vector".

'vector'
query_field str

The field name for text queries. Defaults to "text".

'text'
distance_strategy str | None

The distance strategy for retrieval. Defaults to None.

None
strategy Any | None

The retrieval strategy. Defaults to None, in which case DenseVectorStrategy() is used.

None
request_timeout int

The request timeout. Defaults to DEFAULT_REQUEST_TIMEOUT.

DEFAULT_REQUEST_TIMEOUT

Raises:

Type Description
TypeError

If embedding is not an instance of BaseEMInvoker or Embeddings.

add_chunks(chunk, **kwargs) async

Adds a chunk or a list of chunks to the data store.

Parameters:

Name Type Description Default
chunk Chunk | list[Chunk]

The chunk or list of chunks to add.

required
kwargs Any

Additional keyword arguments.

{}

Returns:

Type Description
list[str]

list[str]: A list of unique identifiers (IDs) assigned to the added chunks.

add_embeddings(text_embeddings, metadatas=None, ids=None, **kwargs) async

Adds text embeddings to the data store.

Parameters:

Name Type Description Default
text_embeddings list[tuple[str, list[float]]]

Pairs of string and embedding to add to the store.

required
metadatas list[dict]

Optional list of metadatas associated with the texts. Defaults to None.

None
ids list[str]

Optional list of unique IDs. Defaults to None.

None
kwargs Any

Additional keyword arguments.

{}

Returns:

Type Description
list[str]

list[str]: A list of unique identifiers (IDs) assigned to the added embeddings.
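The text_embeddings argument is a list of (text, vector) pairs. A minimal sketch of preparing that list from separate inputs, with a hypothetical helper pair_embeddings that also checks lengths and dimensions:

```python
def pair_embeddings(texts, vectors):
    """Zip texts with their vectors into list[tuple[str, list[float]]]."""
    if len(texts) != len(vectors):
        raise ValueError("texts and vectors must have the same length")
    if len({len(v) for v in vectors}) > 1:
        raise ValueError("embeddings have inconsistent dimensions")
    return [(text, [float(x) for x in vec]) for text, vec in zip(texts, vectors)]


pairs = pair_embeddings(["first doc", "second doc"], [[1, 2], [3, 4]])
print(pairs)
# Possible use, assuming `store` is an initialized data store:
# ids = await store.add_embeddings(pairs, metadatas=[{"source": "a"}, {"source": "b"}])
```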

autocomplete(query, field, size=20, fuzzy_tolerance=1, min_prefix_length=3, filter_query=None) async

Provides suggestions based on a prefix query for a specific field.

Parameters:

Name Type Description Default
query str

The query string.

required
field str

The field name for autocomplete.

required
size int

The number of suggestions to retrieve. Defaults to 20.

20
fuzzy_tolerance int

The level of fuzziness for suggestions. Defaults to 1.

1
min_prefix_length int

The minimum prefix length to trigger fuzzy matching. Defaults to 3.

3
filter_query dict[str, Any] | None

The filter query. Defaults to None.

None

Returns:

Type Description
list[str]

list[str]: A list of suggestions.
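For orientation, a request body with similar knobs could look like the sketch below. This is an assumption-heavy illustration: the reference does not show the query this method sends, and mapping fuzzy_tolerance to `fuzziness` and min_prefix_length to `prefix_length` is a guess. The helper autocomplete_body is hypothetical; the `match`, `bool`, and `filter` constructs are standard Elasticsearch query DSL.

```python
def autocomplete_body(query, field, size=20, fuzzy_tolerance=1,
                      min_prefix_length=3, filter_query=None):
    """Build an illustrative Elasticsearch search body for fuzzy suggestions."""
    match = {
        "match": {
            field: {
                "query": query,
                "fuzziness": fuzzy_tolerance,        # assumed mapping
                "prefix_length": min_prefix_length,  # assumed mapping
            }
        }
    }
    bool_query = {"must": [match]}
    if filter_query:
        bool_query["filter"] = [filter_query]
    return {"size": size, "query": {"bool": bool_query}}


body = autocomplete_body("elast", "title", size=5,
                         filter_query={"term": {"lang": "en"}})
print(body)
```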

autosuggest(query, search_fields, autocomplete_field, size=20, min_length=3, filter_query=None) async

Generates suggestions across multiple fields using a multi_match query to broaden the search criteria.

Parameters:

Name Type Description Default
query str

The query string.

required
search_fields list[str]

The fields to search.

required
autocomplete_field str

The field name for autocomplete.

required
size int

The number of suggestions to retrieve. Defaults to 20.

20
min_length int

The minimum length of the query. Defaults to 3.

3
filter_query dict[str, Any] | None

The filter query. Defaults to None.

None

Returns:

Type Description
list[str]

list[str]: A list of suggestions.
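The description mentions a multi_match query. A minimal sketch of such a body follows, with several assumptions labeled: the helper autosuggest_body is hypothetical, the `bool_prefix` type is one possible choice, returning None for queries shorter than min_length is a guess at the intended behaviour, and limiting `_source` to autocomplete_field is illustrative.

```python
def autosuggest_body(query, search_fields, autocomplete_field,
                     size=20, min_length=3, filter_query=None):
    """Build an illustrative multi_match search body, or None for short queries."""
    if len(query) < min_length:  # assumed behaviour for short queries
        return None
    multi_match = {
        "multi_match": {
            "query": query,
            "fields": list(search_fields),
            "type": "bool_prefix",  # assumed match type
        }
    }
    bool_query = {"must": [multi_match]}
    if filter_query:
        bool_query["filter"] = [filter_query]
    return {
        "size": size,
        "_source": [autocomplete_field],
        "query": {"bool": bool_query},
    }


body = autosuggest_body("vect", ["title", "summary"], "title", size=10)
print(body)
print(autosuggest_body("ve", ["title"], "title"))
```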

clear() async

Clear all entries in the storage.

Raises:

Type Description
NotImplementedError

Currently, app-level eviction is not supported for ElasticsearchVectorDataStore.

delete_chunks(query, **kwargs) async

Deletes chunks from the data store based on a query.

Parameters:

Name Type Description Default
query dict[str, Any]

Query to match documents for deletion.

required
kwargs Any

Additional keyword arguments.

{}

Returns:

Type Description
None

None

delete_chunks_by_ids(ids, **kwargs) async

Deletes chunks from the data store based on IDs.

Parameters:

Name Type Description Default
ids str | list[str]

A single ID or a list of IDs to delete.

required
kwargs Any

Additional keyword arguments.

{}

delete_entries_by_key(key) async

Delete entries by key.

Parameters:

Name Type Description Default
key str

The key to delete entries for.

required

Raises:

Type Description
NotImplementedError

Currently, app-level eviction is not supported for ElasticsearchVectorDataStore.

delete_expired_entries(now, max_size=10000) async

Delete expired entries (for TTL eviction).

Parameters:

Name Type Description Default
now datetime

The current datetime for comparison.

required
max_size int

The maximum number of entries to return. Defaults to 10000.

10000

Returns:

Type Description
None

None

delete_least_frequently_used_entries(num_entries) async

Delete least frequently used entries (for LFU eviction).

Parameters:

Name Type Description Default
num_entries int

Number of entries to delete.

required

Returns:

Type Description
None

None

delete_least_recently_used_entries(num_entries) async

Delete least recently used entries (for LRU eviction).

Parameters:

Name Type Description Default
num_entries int

Number of entries to delete.

required

Returns:

Type Description
None

None

exact_match(key, metadata=None) async

Find chunks that exactly match the given key.

Parameters:

Name Type Description Default
key str

The key to match.

required
metadata dict[str, Any] | None

Optional metadata filter to apply to the search. For example, {"key": "value"}. Defaults to None.

None

Returns:

Name Type Description
Any Any | None

The value stored with the exact key, or None if no match is found.

fuzzy_match(key, max_distance=2, metadata=None) async

Find chunks that approximately match the given key using fuzzy matching.

Parameters:

Name Type Description Default
key str

The key to match.

required
max_distance int

The maximum distance for fuzzy matching. Defaults to 2. Ranges from 0 to 2. Higher values are more lenient.

2
metadata dict[str, Any] | None

Optional metadata filter to apply to the search. For example, {"key": "value"}. Defaults to None.

None

Returns:

Name Type Description
Any Any | None

The value with the closest fuzzy match, or None if no match meets the threshold.

query(query, top_k=DEFAULT_TOP_K, retrieval_params=None) async

Queries the Elasticsearch data store and includes similarity scores.

Parameters:

Name Type Description Default
query str

The query string.

required
top_k int

The number of top results to retrieve. Defaults to DEFAULT_TOP_K.

DEFAULT_TOP_K
retrieval_params dict[str, Any] | None

Additional retrieval parameters. Defaults to None.

None

Returns:

Type Description
list[Chunk]

list[Chunk]: A list of Chunk objects representing the retrieved documents with similarity scores.

query_by_id(id_) async

Queries the data store by ID and returns a list of Chunk objects.

Parameters:

Name Type Description Default
id_ str | list[str]

A single ID or a list of IDs to query.

required

Returns:

Type Description
list[Chunk]

A list of Chunk objects representing the queried documents.

Note

This method is not implemented yet, because ElasticsearchStore does not yet implement the get_by_ids method.

semantic_match(key, min_similarity=0.8, metadata=None) async

Find chunks that semantically match the given key using vector similarity.

Parameters:

Name Type Description Default
key str

The key to match.

required
min_similarity float

Minimum similarity score for semantic matching (higher values are more strict). Ranges from 0 to 1. Defaults to 0.8.

0.8
metadata dict[str, Any] | None

Optional metadata filter to apply to the search. For example, {"key": "value"}. Defaults to None.

None

Returns:

Name Type Description
Any Any | None

The semantically closest value, or None if no match meets the min_similarity threshold.

shingles(query, field, size=20, min_length=3, filter_query=None) async

Searches using shingles for prefix and fuzzy matching.

Parameters:

Name Type Description Default
query str

The query string.

required
field str

The field name for autocomplete.

required
size int

The number of suggestions to retrieve. Defaults to 20.

20
min_length int

The minimum length of the query. Defaults to 3.

3
filter_query dict[str, Any] | None

The filter query. Defaults to None.

None

Returns:

Type Description
list[str]

list[str]: A list of suggestions.
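For readers unfamiliar with the term, a shingle is a word-level n-gram. A minimal sketch of generating shingles in plain Python, using a hypothetical helper word_shingles; the index analyzer settings this method depends on are not described in this reference, so the sizes below are illustrative:

```python
def word_shingles(text, min_size=2, max_size=3):
    """Return word n-grams of the text for every size from min_size to max_size."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_size, max_size + 1)
        for i in range(len(tokens) - n + 1)
    ]


print(word_shingles("Quick brown fox"))
```

Indexing such phrases lets a query like "brown fo" match inside multi-word values rather than only at the start of a field.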