Vector Data Store
Modules containing data store implementations to be used in Gen AI applications.
ChromaVectorDataStore(collection_name, embedding=None, client_type=ChromaClientType.MEMORY, persist_directory=None, host=None, port=None, headers=None, num_candidates=DEFAULT_NUM_CANDIDATES, **kwargs)
Bases: BaseVectorDataStore, CacheCompatibleMixin
Datastore for interacting with ChromaDB.
This class provides methods to interact with ChromaDB for vector storage and retrieval using the langchain-chroma integration.
Attributes:
Name | Type | Description |
---|---|---|
vector_store | Chroma | The langchain Chroma vector store instance. |
collection_name | str | The name of the ChromaDB collection to use. |
num_candidates | int | The maximum number of candidates to consider during search. |
Initialize the ChromaDB vector data store with langchain-chroma.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
collection_name | str | Name of the collection to use in ChromaDB. | required |
embedding | BaseEMInvoker \| Embeddings \| None | The embedding model to perform vectorization. Defaults to None. | None |
client_type | ChromaClientType | Type of ChromaDB client to use. Defaults to ChromaClientType.MEMORY. | MEMORY |
persist_directory | str \| None | Directory to persist vector store data. Required for PERSISTENT client type. Defaults to None. | None |
host | str \| None | Host address for ChromaDB server. Required for HTTP client type. Defaults to None. | None |
port | int \| None | Port for ChromaDB server. Required for HTTP client type. Defaults to None. | None |
headers | dict \| None | Headers for ChromaDB server. Used for HTTP client type. Defaults to None. | None |
num_candidates | int | Maximum number of candidates to consider during search. Defaults to DEFAULT_NUM_CANDIDATES. | DEFAULT_NUM_CANDIDATES |
**kwargs | Any | Additional parameters for Chroma initialization. | {} |
Note
num_candidates (int, optional): Caps the number of results considered during a search. An index with more documents needs a higher value for all of its documents to be considered. This works around a bug in Chroma's search algorithm, discussed in this issue: https://github.com/langchain-ai/langchain/issues/1946
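The parameter table above ties some arguments to a client type: persist_directory is required for PERSISTENT, and host and port are required for HTTP. A minimal sketch of that rule follows. The enum below is a local stand-in (its member values are assumptions), and check_client_args is a hypothetical helper, not part of the library; how the real constructor reacts to a missing argument is not stated here.

```python
from enum import Enum


class ChromaClientType(Enum):
    """Local stand-in for the library's enum; the member values are assumptions."""
    MEMORY = "memory"
    PERSISTENT = "persistent"
    HTTP = "http"


def check_client_args(client_type, persist_directory=None, host=None, port=None):
    """Hypothetical helper: list the documented requirements that are unmet."""
    missing = []
    if client_type is ChromaClientType.PERSISTENT and persist_directory is None:
        missing.append("persist_directory")
    if client_type is ChromaClientType.HTTP:
        if host is None:
            missing.append("host")
        if port is None:
            missing.append("port")
    return missing


print(check_client_args(ChromaClientType.MEMORY))
print(check_client_args(ChromaClientType.PERSISTENT))
print(check_client_args(ChromaClientType.HTTP, host="localhost"))
```

Running a check like this before constructing the store surfaces a configuration mistake early, independent of how the constructor itself reports it.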
add_chunks(chunks, **kwargs)
async
clear()
async
Clear all entries in the storage.
Raises:
Type | Description |
---|---|
NotImplementedError | Currently, app-level eviction is not supported for ChromaVectorDataStore. |
delete_chunks(where=None, where_document=None, **kwargs)
async
Delete chunks from the vector data store.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
where | Where \| None | A Where type dict used to filter the deletion by metadata. E.g. | None |
where_document | WhereDocument \| None | A WhereDocument type dict used to filter the deletion by the document content. E.g. | None |
**kwargs | Any | Additional keyword arguments for the delete operation. | {} |
Note
If no filter criteria are provided, all chunks in the collection will be deleted. Use with caution.
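Because an unfiltered call deletes the whole collection, it can help to build the filters through a small guard. The sketch below uses Chroma's $eq and $contains filter operators. build_delete_filters and the allow_delete_all flag are hypothetical names, and the "source" metadata key is only illustrative.

```python
def build_delete_filters(where=None, where_document=None, allow_delete_all=False):
    """Hypothetical guard: refuse an unfiltered delete unless explicitly allowed."""
    if not where and not where_document and not allow_delete_all:
        raise ValueError(
            "refusing to delete every chunk: pass a filter or allow_delete_all=True"
        )
    return {"where": where, "where_document": where_document}


# Metadata filter using Chroma's $eq operator (the "source" key is illustrative).
by_metadata = build_delete_filters(where={"source": {"$eq": "report.pdf"}})

# Content filter using Chroma's $contains operator.
by_content = build_delete_filters(where_document={"$contains": "draft"})

print(by_metadata)
print(by_content)
```

Assuming store is a ChromaVectorDataStore instance, the resulting dict can be unpacked into the call, for example await store.delete_chunks(**by_metadata).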
delete_chunks_by_ids(ids, **kwargs)
async
Delete chunks from the vector data store by IDs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ids | str \| list[str] | A single ID or a list of IDs to delete. | required |
**kwargs | Any | Additional keyword arguments. | {} |
Note
If no IDs are provided, no chunks will be deleted.
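The ids argument accepts either one ID or a list, and an empty input deletes nothing. One way to handle both shapes is to normalize to a list first. This is a sketch only: normalize_ids is a hypothetical helper, and the library's internal handling may differ.

```python
def normalize_ids(ids):
    """Hypothetical helper: coerce a single ID or a list of IDs into a list."""
    if ids is None:
        return []
    if isinstance(ids, str):
        return [ids]
    return list(ids)


print(normalize_ids("chunk-1"))
print(normalize_ids(["chunk-1", "chunk-2"]))
print(normalize_ids([]))
```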
delete_entries_by_key(key)
async
Delete entries by key.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key | str | The key to delete entries for. | required |
Raises:
Type | Description |
---|---|
NotImplementedError | Currently, app-level eviction is not supported for ChromaVectorDataStore. |
delete_expired_entries(now, max_size=10000)
async
Delete expired entries (for TTL eviction).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
now | datetime | The current datetime for comparison. | required |
max_size | int | The maximum number of entries to return. Defaults to 10000. | 10000 |
Raises:
Type | Description |
---|---|
NotImplementedError | Currently, app-level eviction is not supported for ChromaVectorDataStore. |
delete_least_frequently_used_entries(num_entries)
async
Delete least frequently used entries (for LFU eviction).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
num_entries | int | Number of entries to return. | required |
Raises:
Type | Description |
---|---|
NotImplementedError | Currently, app-level eviction is not supported for ChromaVectorDataStore. |
delete_least_recently_used_entries(num_entries)
async
Delete least recently used entries (for LRU eviction).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
num_entries | int | Number of entries to return. | required |
Raises:
Type | Description |
---|---|
NotImplementedError | Currently, app-level eviction is not supported for ChromaVectorDataStore. |
exact_match(key, metadata=None)
async
Find chunks that exactly match the given key.
This method searches for documents with the exact original_key in metadata.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key | str | The key to match. | required |
metadata | dict[str, Any] \| None | Optional metadata filter to apply to the search. For example, | None |
Returns:
Name | Type | Description |
---|---|---|
Any | Any \| None | The value stored with the exact key match, or None if no match is found. |
fuzzy_match(key, max_distance=2, metadata=None)
async
Find chunks that approximately match the given key using fuzzy matching.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key | str | The key to match. | required |
max_distance | int | Maximum allowed Levenshtein distance for fuzzy matching. Higher values are more lenient. Defaults to 2. | 2 |
metadata | dict[str, Any] \| None | Optional metadata filter to apply to the search. For example, | None |
Returns:
Name | Type | Description |
---|---|---|
Any | Any \| None | The value with the closest fuzzy match to the key, or None if no match meets the threshold. |
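To make max_distance concrete, the sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence and picks the nearest stored key within the threshold. It illustrates the metric only. levenshtein and closest_key are hypothetical helpers, and the store's own matching may be implemented differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def closest_key(key, stored_keys, max_distance=2):
    """Return the stored key nearest to key, or None if none is within max_distance."""
    best = min(stored_keys, key=lambda k: levenshtein(key, k), default=None)
    if best is None or levenshtein(key, best) > max_distance:
        return None
    return best


print(levenshtein("kitten", "sitting"))
print(closest_key("colour", ["color", "cooler"]))
```

With max_distance=2, "colour" still resolves to "color" (one deletion), while an unrelated key returns None.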
query(query, top_k=DEFAULT_TOP_K, retrieval_params=None)
async
Query the vector data store for similar chunks with similarity scores.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | The query string to find similar chunks for. | required |
top_k | int | Maximum number of results to return. Defaults to DEFAULT_TOP_K. | DEFAULT_TOP_K |
retrieval_params | dict[str, Any] \| None | Additional parameters for retrieval. Supported key: filter (Where, optional), a Where type dict used to filter the retrieval by the metadata keys. E.g. | None |
Returns:
Type | Description |
---|---|
list[Chunk] | A list of Chunk objects matching the query, with similarity scores. |
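Since query is a coroutine, it has to be awaited. The sketch below shows the calling pattern and the shape of retrieval_params against a stand-in class. StubStore is not the real data store: it does plain substring matching instead of vector search, returns tuples instead of Chunk objects, and its top_k default of 4 is a placeholder, because the value of DEFAULT_TOP_K is not given here.

```python
import asyncio


class StubStore:
    """Stand-in with the documented query signature; it is not the real class."""

    def __init__(self, rows):
        self.rows = rows  # list of (text, metadata) pairs

    async def query(self, query, top_k=4, retrieval_params=None):
        # Read the optional metadata filter from retrieval_params["filter"].
        wanted = (retrieval_params or {}).get("filter") or {}
        hits = [
            (text, meta)
            for text, meta in self.rows
            if query.lower() in text.lower()
            and all(meta.get(k) == v for k, v in wanted.items())
        ]
        return hits[:top_k]


async def main():
    store = StubStore([
        ("Chroma stores vectors", {"source": "a.md"}),
        ("Chroma supports filters", {"source": "b.md"}),
    ])
    return await store.query(
        "chroma", top_k=5, retrieval_params={"filter": {"source": "b.md"}}
    )


results = asyncio.run(main())
print(results)
```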
query_by_id(id)
async
Retrieve chunks by their IDs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
id | str \| list[str] | A single ID or a list of IDs to retrieve. | required |
Returns:
Type | Description |
---|---|
list[Chunk] | A list of retrieved Chunk objects. |
semantic_match(key, min_similarity=0.2, metadata=None)
async
Find chunks that semantically match the given key using vector similarity.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key | str | The key to match. | required |
min_similarity | float | Minimum similarity score for semantic matching (higher values are more strict). Ranges from 0 to 1. Defaults to 0.2. | 0.2 |
metadata | dict[str, Any] \| None | Optional metadata filter to apply to the search. For example, | None |
Returns:
Name | Type | Description |
---|---|---|
Any | Any \| None | The semantically closest value, or None if no match meets the min_similarity threshold. |
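The effect of min_similarity can be sketched with cosine similarity over toy vectors. This assumes the score is a cosine similarity. The metric the store actually uses, and how it maps scores into the 0 to 1 range, is not stated here. cosine_similarity and best_semantic_match are hypothetical helpers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def best_semantic_match(query_vec, candidates, min_similarity=0.2):
    """candidates maps a stored value to its embedding vector."""
    scored = [
        (cosine_similarity(query_vec, vec), value)
        for value, vec in candidates.items()
    ]
    if not scored:
        return None
    score, value = max(scored)
    # Reject the best candidate when it falls below the threshold.
    return value if score >= min_similarity else None


candidates = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
print(best_semantic_match([0.9, 0.1], candidates))
print(best_semantic_match([0.9, 0.1], candidates, min_similarity=0.999))
```

Raising the threshold turns a near match into no match, which is the trade-off the parameter controls.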
ElasticsearchVectorDataStore(index_name, embedding=None, connection=None, url=None, cloud_id=None, user=None, api_key=None, password=None, vector_query_field='vector', query_field='text', distance_strategy=None, strategy=None, request_timeout=DEFAULT_REQUEST_TIMEOUT)
Bases: BaseVectorDataStore, CacheCompatibleMixin
DataStore for interacting with Elasticsearch.
This class provides methods for executing queries and retrieving documents from Elasticsearch. It relies on LangChain's ElasticsearchStore for vector operations and for managing the underlying Elasticsearch client.
Attributes:
Name | Type | Description |
---|---|---|
vector_store | ElasticsearchStore | The ElasticsearchStore instance for vector operations. |
index_name | str | The name of the Elasticsearch index. |
logger | Logger | The logger object. |
Initializes an instance of the ElasticsearchVectorDataStore class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index_name | str | The name of the Elasticsearch index. | required |
embedding | BaseEMInvoker \| Embeddings \| None | The embedding model to perform vectorization. Defaults to None. | None |
connection | Any \| None | The Elasticsearch connection object. Defaults to None. | None |
url | str \| None | The URL of the Elasticsearch server. Defaults to None. | None |
cloud_id | str \| None | The cloud ID of the Elasticsearch cluster. Defaults to None. | None |
user | str \| None | The username for authentication. Defaults to None. | None |
api_key | str \| None | The API key for authentication. Defaults to None. | None |
password | str \| None | The password for authentication. Defaults to None. | None |
vector_query_field | str | The field name for vector queries. Defaults to "vector". | 'vector' |
query_field | str | The field name for text queries. Defaults to "text". | 'text' |
distance_strategy | str \| None | The distance strategy for retrieval. Defaults to None. | None |
strategy | Any \| None | The retrieval strategy. Defaults to None, in which case DenseVectorStrategy() is used. | None |
request_timeout | int | The request timeout. Defaults to DEFAULT_REQUEST_TIMEOUT. | DEFAULT_REQUEST_TIMEOUT |
Raises:
Type | Description |
---|---|
TypeError
|
If |
add_chunks(chunk, **kwargs)
async
Adds a chunk or a list of chunks to the data store.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk | Chunk \| list[Chunk] | The chunk or list of chunks to add. | required |
kwargs | Any | Additional keyword arguments. | {} |
Returns:
Type | Description |
---|---|
list[str] | A list of unique identifiers (IDs) assigned to the added chunks. |
add_embeddings(text_embeddings, metadatas=None, ids=None, **kwargs)
async
Adds text embeddings to the data store.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text_embeddings | list[tuple[str, list[float]]] | Pairs of string and embedding to add to the store. | required |
metadatas | list[dict] | Optional list of metadatas associated with the texts. Defaults to None. | None |
ids | list[str] | Optional list of unique IDs. Defaults to None. | None |
kwargs | Any | Additional keyword arguments. | {} |
Returns:
Type | Description |
---|---|
list[str] | A list of unique identifiers (IDs) assigned to the added embeddings. |
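The text_embeddings argument is a list of (text, vector) pairs, with metadatas and ids as optional parallel lists. A minimal sketch of assembling those arguments follows. build_embedding_payload is a hypothetical helper, and the length checks are an assumption about sensible input, not documented library behavior.

```python
def build_embedding_payload(texts, vectors, metadatas=None, ids=None):
    """Hypothetical helper: pair texts with vectors and check optional lists align."""
    if len(texts) != len(vectors):
        raise ValueError("texts and vectors must have the same length")
    for name, extra in (("metadatas", metadatas), ("ids", ids)):
        if extra is not None and len(extra) != len(texts):
            raise ValueError(f"{name} must have one entry per text")
    return {
        "text_embeddings": list(zip(texts, vectors)),
        "metadatas": metadatas,
        "ids": ids,
    }


payload = build_embedding_payload(
    texts=["first passage", "second passage"],
    vectors=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    metadatas=[{"page": 1}, {"page": 2}],
    ids=["doc-1", "doc-2"],
)
print(payload["text_embeddings"][0])
```

Assuming store is an ElasticsearchVectorDataStore instance, the dict keys match the documented parameter names, so it can be passed as await store.add_embeddings(**payload).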
autocomplete(query, field, size=20, fuzzy_tolerance=1, min_prefix_length=3, filter_query=None)
async
Provides suggestions based on a prefix query for a specific field.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | The query string. | required |
field | str | The field name for autocomplete. | required |
size | int | The number of suggestions to retrieve. Defaults to 20. | 20 |
fuzzy_tolerance | int | The level of fuzziness for suggestions. Defaults to 1. | 1 |
min_prefix_length | int | The minimum prefix length to trigger fuzzy matching. Defaults to 3. | 3 |
filter_query | dict[str, Any] \| None | The filter query. Defaults to None. | None |
Returns:
Type | Description |
---|---|
list[str] | A list of suggestions. |
autosuggest(query, search_fields, autocomplete_field, size=20, min_length=3, filter_query=None)
async
Generates suggestions across multiple fields using a multi_match query to broaden the search criteria.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | The query string. | required |
search_fields | list[str] | The fields to search for. | required |
autocomplete_field | str | The field name for autocomplete. | required |
size | int | The number of suggestions to retrieve. Defaults to 20. | 20 |
min_length | int | The minimum length of the query. Defaults to 3. | 3 |
filter_query | dict[str, Any] \| None | The filter query. Defaults to None. | None |
Returns:
Type | Description |
---|---|
list[str] | A list of suggestions. |
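For orientation, here is what a multi_match request body across several fields can look like in the Elasticsearch query DSL. This is a sketch under assumptions, not the body this method actually sends: autosuggest_body is a hypothetical builder, the bool_prefix type and the _source restriction are illustrative choices, and returning None for queries shorter than min_length is an assumed behavior.

```python
def autosuggest_body(query, search_fields, autocomplete_field,
                     size=20, min_length=3, filter_query=None):
    """Hypothetical request-body builder; returns None for too-short queries."""
    if len(query) < min_length:
        return None
    multi_match = {
        "multi_match": {
            "query": query,
            "fields": search_fields,
            "type": "bool_prefix",  # treat the last term as a prefix
        }
    }
    bool_query = {"must": [multi_match]}
    if filter_query is not None:
        bool_query["filter"] = [filter_query]
    return {
        "size": size,
        "_source": [autocomplete_field],  # fetch only the suggestion field
        "query": {"bool": bool_query},
    }


body = autosuggest_body("lap", ["title", "brand"], "title", size=5)
print(body)
```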
clear()
async
Clear all entries in the storage.
Raises:
Type | Description |
---|---|
NotImplementedError | Currently, app-level eviction is not supported for ElasticsearchVectorDataStore. |
delete_chunks(query, **kwargs)
async
Deletes chunks from the data store based on a query.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | dict[str, Any] | Query to match documents for deletion. | required |
kwargs | Any | Additional keyword arguments. | {} |
Returns:
Type | Description |
---|---|
None | None |
delete_chunks_by_ids(ids, **kwargs)
async
Deletes chunks from the data store based on IDs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ids | str \| list[str] | A single ID or a list of IDs to delete. | required |
kwargs | Any | Additional keyword arguments. | {} |
delete_entries_by_key(key)
async
Delete entries by key.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key | str | The key to delete entries for. | required |
Raises:
Type | Description |
---|---|
NotImplementedError | Currently, app-level eviction is not supported for ElasticsearchVectorDataStore. |
delete_expired_entries(now, max_size=10000)
async
Delete expired entries (for TTL eviction).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
now | datetime | The current datetime for comparison. | required |
max_size | int | The maximum number of entries to return. Defaults to 10000. | 10000 |
Returns:
Type | Description |
---|---|
None | None |
delete_least_frequently_used_entries(num_entries)
async
Delete least frequently used entries (for LFU eviction).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
num_entries | int | Number of entries to return. | required |
Returns:
Type | Description |
---|---|
None | None |
delete_least_recently_used_entries(num_entries)
async
Delete least recently used entries (for LRU eviction).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
num_entries | int | Number of entries to return. | required |
Returns:
Type | Description |
---|---|
None | None |
exact_match(key, metadata=None)
async
Find chunks that exactly match the given key.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key | str | The key to match. | required |
metadata | dict[str, Any] \| None | Optional metadata filter to apply to the search. For example, | None |
Returns:
Name | Type | Description |
---|---|---|
Any | Any \| None | The value stored with the exact key, or None if no match is found. |
fuzzy_match(key, max_distance=2, metadata=None)
async
Find chunks that approximately match the given key using fuzzy matching.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key | str | The key to match. | required |
max_distance | int | The maximum distance for fuzzy matching. Defaults to 2. Ranges from 0 to 2. Higher values are more lenient. | 2 |
metadata | dict[str, Any] \| None | Optional metadata filter to apply to the search. For example, | None |
Returns:
Name | Type | Description |
---|---|---|
Any | Any \| None | The value with the closest fuzzy match, or None if no match meets the threshold. |
query(query, top_k=DEFAULT_TOP_K, retrieval_params=None)
async
Queries the Elasticsearch data store and includes similarity scores.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | The query string. | required |
top_k | int | The number of top results to retrieve. Defaults to DEFAULT_TOP_K. | DEFAULT_TOP_K |
retrieval_params | dict[str, Any] \| None | Additional retrieval parameters. Defaults to None. | None |
Returns:
Type | Description |
---|---|
list[Chunk] | A list of Chunk objects representing the retrieved documents with similarity scores. |
query_by_id(id_)
async
Queries the data store by ID and returns a list of Chunk objects.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
id_ | str \| list[str] | The ID of the document to query. | required |
Returns:
Type | Description |
---|---|
list[Chunk] | A list of Chunk objects representing the queried documents. |
Note
This method is not implemented yet, because ElasticsearchStore does not yet implement the get_by_ids method.
semantic_match(key, min_similarity=0.8, metadata=None)
async
Find chunks that semantically match the given key using vector similarity.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key | str | The key to match. | required |
min_similarity | float | Minimum similarity score for semantic matching (higher values are more strict). Ranges from 0 to 1. Defaults to 0.8. | 0.8 |
metadata | dict[str, Any] \| None | Optional metadata filter to apply to the search. For example, | None |
Returns:
Name | Type | Description |
---|---|---|
Any | Any \| None | The semantically closest value, or None if no match meets the min_similarity threshold. |
shingles(query, field, size=20, min_length=3, filter_query=None)
async
Searches using shingles for prefix and fuzzy matching.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | The query string. | required |
field | str | The field name for autocomplete. | required |
size | int | The number of suggestions to retrieve. Defaults to 20. | 20 |
min_length | int | The minimum length of the query. Defaults to 3. | 3 |
filter_query | dict[str, Any] \| None | The filter query. Defaults to None. | None |
Returns:
Type | Description |
---|---|
list[str] | A list of suggestions. |
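A shingle is a word-level n-gram, which is what lets a partial phrase match the middle of a longer field value. The sketch below generates shingles in plain Python to show the idea. In Elasticsearch this is normally done by a shingle token filter at index time. word_shingles is a hypothetical helper, and the sizes 2 and 3 are illustrative, not the settings this method uses.

```python
def word_shingles(text, min_size=2, max_size=3):
    """Return word n-grams of text for every size from min_size to max_size."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(min_size, max_size + 1)
        for i in range(len(words) - n + 1)
    ]


print(word_shingles("quick brown fox jumps"))
```

A query such as "brown fox" then matches one of the indexed shingles directly, even though the stored text does not start with it.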