Vector
Vector indexer module.
VectorDBIndexer(data_store_map=None, em_invoker_map=None, cache_size=DEFAULT_CACHE_SIZE, retryable_exceptions=None)
Bases: BaseIndexer
Index elements into a vector datastore capability.
Initialize the indexer with mappings for vector DB capabilities and embeddings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data_store_map` | `dict[str, Type[BaseDataStore]] \| None` | Mapping of `db_engine` strings to `BaseDataStore` classes. If not provided, uses `DEFAULT_DATA_STORE_MAP`, which includes `"chroma"`, `"elasticsearch"`, and `"opensearch"`. | `None` |
| `em_invoker_map` | `dict[str, Type[BaseEMInvoker]] \| None` | Mapping of provider strings to embedding classes (`BaseEMInvoker` subclasses). If not provided, uses `DEFAULT_EM_INVOKER_MAP`, which includes `"azure-openai"`, `"bedrock"`, `"google"`, `"openai"`, and `"voyage"`. | `None` |
| `cache_size` | `int` | Maximum number of vector capability instances to cache using an LRU policy. | `DEFAULT_CACHE_SIZE` (128) |
| `retryable_exceptions` | `tuple[type[Exception], ...] \| None` | Exception types to retry on during batch processing. If not provided, uses `DEFAULT_RETRYABLE_EXCEPTIONS`. | `None` |
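The engine-map plus LRU-cache design can be pictured with a minimal stdlib sketch. The class names and the `get_data_store` helper below are illustrative stand-ins, not the real `BaseDataStore` API; only the `db_engine` keys and the cache size mirror the defaults documented above:

```python
from functools import lru_cache

# Illustrative stand-ins for the real BaseDataStore subclasses.
class ChromaDataStore: ...
class ElasticsearchDataStore: ...

# Keys are db_engine strings, mirroring DEFAULT_DATA_STORE_MAP.
DATA_STORE_MAP = {
    "chroma": ChromaDataStore,
    "elasticsearch": ElasticsearchDataStore,
}

@lru_cache(maxsize=128)  # mirrors cache_size=DEFAULT_CACHE_SIZE
def get_data_store(db_engine: str):
    try:
        cls = DATA_STORE_MAP[db_engine]
    except KeyError:
        raise ValueError(f"Unsupported db_engine: {db_engine!r}")
    return cls()

# Repeated lookups for the same engine reuse the cached instance.
assert get_data_store("chroma") is get_data_store("chroma")
```

Passing a custom `data_store_map` to the constructor replaces this default mapping wholesale, so unsupported engines fail fast with a lookup error.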
delete(**kwargs)
Delete documents from the vector capability based on the file ID.
This method validates that file_id is present in kwargs and delegates to delete_file_chunks.
Kwargs
- `file_id` (str): The ID of the file(s) to be deleted.
- `vectorizer_kwargs` (dict[str, Any]): Vectorizer configuration containing a `"model"` key in the format `"provider/model_name"` (e.g., `"openai/text-embedding-ada-002"`).
- `db_engine` (str): The database engine to use (e.g., `"chroma"`, `"elasticsearch"`, `"opensearch"`).
- `db_config` (dict[str, Any]): Vector DB configuration.

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status and error message: `success` (bool) is True if deletion succeeded, False otherwise; `error_message` (str) holds the error message if deletion failed, and is an empty string otherwise. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `file_id` is not provided in `kwargs`. |
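A quick sketch of how such kwargs are typically assembled, and how the `"provider/model_name"` convention splits apart. The dict keys come from the Kwargs list above; the concrete values (`"doc-123"`, the collection name) are illustrative:

```python
from typing import Any

delete_kwargs: dict[str, Any] = {
    "file_id": "doc-123",
    "vectorizer_kwargs": {"model": "openai/text-embedding-ada-002"},
    "db_engine": "chroma",
    "db_config": {"collection": "my_collection"},
}

# Split on the first slash only, so model names containing "/" stay intact.
provider, model_name = delete_kwargs["vectorizer_kwargs"]["model"].split("/", 1)
assert provider == "openai"
assert model_name == "text-embedding-ada-002"
```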
delete_chunk(chunk_id, file_id, **kwargs)
Delete a single chunk by chunk ID and file ID.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_id` | `str` | The ID of the chunk to delete. | required |
| `file_id` | `str` | The ID of the file the chunk belongs to. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status and error message. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |
delete_file_chunks(file_id, **kwargs)
Delete all chunks for a specific file.
- No index: treated as success (nothing to delete).
- Index exists, no matching chunks: success.
- Index exists, matching chunks: success if delete succeeds, otherwise failed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_id` | `str` | The ID of the file whose chunks should be deleted. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Kwargs
- `vectorizer_kwargs` (dict[str, Any]): Vectorizer configuration containing a `"model"` key in the format `"provider/model_name"` (e.g., `"openai/text-embedding-ada-002"`).
- `db_engine` (str): The database engine to use (e.g., `"chroma"`, `"elasticsearch"`, `"opensearch"`).
- `db_config` (dict[str, Any]): Vector DB configuration.

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status and error message. |
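The three outcome cases listed above can be condensed into a small decision function. The `delete_outcome` helper and its boolean inputs are illustrative stand-ins for the real index lookup and delete calls:

```python
from typing import Any

def delete_outcome(index_exists: bool, matched: int, delete_ok: bool) -> dict[str, Any]:
    """Mirror the documented semantics: a missing index or zero matches still counts as success."""
    if not index_exists or matched == 0:
        return {"success": True, "error_message": ""}
    if delete_ok:
        return {"success": True, "error_message": ""}
    return {"success": False, "error_message": "failed to delete chunks"}

assert delete_outcome(index_exists=False, matched=0, delete_ok=False)["success"]
assert delete_outcome(index_exists=True, matched=0, delete_ok=False)["success"]
assert not delete_outcome(index_exists=True, matched=5, delete_ok=False)["success"]
```

Treating "nothing to delete" as success keeps the operation idempotent: calling it twice for the same `file_id` succeeds both times.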
get_chunk(chunk_id, file_id, **kwargs)
Get a single chunk by chunk ID and file ID.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_id` | `str` | The ID of the chunk to retrieve. | required |
| `file_id` | `str` | The ID of the file the chunk belongs to. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any] \| None` | The chunk data, or None if not found. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |
get_file_chunks(file_id, page=0, size=20, **kwargs)
Get chunks for a specific file with pagination support.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_id` | `str` | The ID of the file to get chunks from. | required |
| `page` | `int` | The page number (0-indexed). | `0` |
| `size` | `int` | The number of chunks per page. | `20` |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with the chunks list, total count, and pagination info. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |
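The 0-indexed `page`/`size` contract can be sketched over an in-memory list; the `paginate` helper and the response keys beyond `chunks` and `total` are illustrative, not the exact response schema:

```python
from typing import Any

def paginate(chunks: list[dict[str, Any]], page: int = 0, size: int = 20) -> dict[str, Any]:
    """0-indexed pagination: page 0 returns chunks[0:size], page 1 returns chunks[size:2*size], ..."""
    start = page * size
    return {
        "chunks": chunks[start : start + size],
        "total": len(chunks),
        "page": page,
        "size": size,
    }

data = [{"chunk_id": i} for i in range(45)]
result = paginate(data, page=2, size=20)
assert result["total"] == 45
assert len(result["chunks"]) == 5  # the last page holds the remainder
```

Note that a page past the end yields an empty `chunks` list rather than an error, which simplifies "fetch until empty" loops in callers.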
index(elements, **kwargs)
Index elements into the configured vector capability.
This method validates that file_id is present in kwargs and delegates to index_file_chunks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `list[dict[str, Any]]` | Parsed elements containing text and metadata. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Kwargs
- `file_id` (str): The ID of the file these chunks belong to.
- `vectorizer_kwargs` (dict[str, Any]): Vectorizer configuration containing a `"model"` key in the format `"provider/model_name"` (e.g., `"openai/text-embedding-ada-002"`).
- `db_engine` (str): The database engine to use (e.g., `"chroma"`, `"elasticsearch"`, `"opensearch"`).
- `db_config` (dict[str, Any]): Vector DB configuration.
- `batch_size` (int, optional): The number of chunks to process in each batch. Defaults to 100.
- `max_retries` (int, optional): The maximum number of retry attempts for failed batches. Defaults to 3.

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status, error message, and total count: `success` (bool) is True if indexing succeeded, False otherwise; `error_message` (str) holds the error message if indexing failed, and is an empty string otherwise; `total` (int) is the total number of chunks indexed. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `file_id` is not provided in `kwargs`. |
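The interaction of `batch_size`, `max_retries`, and `retryable_exceptions` can be sketched as follows. The `index_in_batches` helper and the injected `index_batch` callable are hypothetical stand-ins for the internal batch loop, not the real implementation:

```python
from typing import Any, Callable

def index_in_batches(
    elements: list[dict[str, Any]],
    index_batch: Callable[[list[dict[str, Any]]], None],
    batch_size: int = 100,
    max_retries: int = 3,
    retryable_exceptions: tuple[type[Exception], ...] = (ConnectionError,),
) -> dict[str, Any]:
    """Index elements batch by batch, retrying a failed batch up to max_retries times."""
    total = 0
    for start in range(0, len(elements), batch_size):
        batch = elements[start : start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                index_batch(batch)
                total += len(batch)
                break
            except retryable_exceptions:
                if attempt == max_retries:
                    return {"success": False, "error_message": "batch failed", "total": total}
    return {"success": True, "error_message": "", "total": total}

# Simulate a backend whose first call fails transiently.
calls = {"n": 0}
def flaky(batch: list[dict[str, Any]]) -> None:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient")

result = index_in_batches([{"text": str(i)} for i in range(250)], flaky, batch_size=100)
assert result == {"success": True, "error_message": "", "total": 250}
```

Only exceptions in `retryable_exceptions` trigger a retry; anything else propagates, which is the usual way to distinguish transient backend hiccups from genuine bugs.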
index_chunk(element, **kwargs)
Index a single chunk.
This method only indexes the chunk. It does NOT update the metadata of neighboring chunks (previous_chunk/next_chunk). The caller is responsible for maintaining chunk relationships by updating adjacent chunks' metadata separately.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `element` | `dict[str, Any]` | The chunk to be indexed. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status, error message, and chunk_id. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |
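Because `index_chunk` leaves neighboring chunks untouched, a caller inserting a chunk between two existing ones must patch the `previous_chunk`/`next_chunk` links itself. A minimal sketch over plain metadata dicts (the two key names come from the docstring above; the helper and persistence step are illustrative):

```python
from typing import Any

def link_new_chunk(
    prev_meta: dict[str, Any],
    new_meta: dict[str, Any],
    next_meta: dict[str, Any],
) -> None:
    """Wire a freshly indexed chunk into the previous_chunk/next_chunk chain."""
    new_meta["previous_chunk"] = prev_meta["chunk_id"]
    new_meta["next_chunk"] = next_meta["chunk_id"]
    # The caller must also persist these two neighbor updates,
    # e.g. via update_chunk_metadata(), since index_chunk will not.
    prev_meta["next_chunk"] = new_meta["chunk_id"]
    next_meta["previous_chunk"] = new_meta["chunk_id"]

a = {"chunk_id": "a", "next_chunk": "c"}
b = {"chunk_id": "b"}
c = {"chunk_id": "c", "previous_chunk": "a"}
link_new_chunk(a, b, c)
assert a["next_chunk"] == "b" and c["previous_chunk"] == "b"
```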
index_file_chunks(elements, file_id, **kwargs)
Index chunks for a specific file.
This method indexes chunks for a file. The indexer is responsible for deleting any existing chunks for the file_id before indexing the new chunks to ensure consistency.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `list[dict[str, Any]]` | The chunks to be indexed. | required |
| `file_id` | `str` | The ID of the file these chunks belong to. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Kwargs
- `vectorizer_kwargs` (dict[str, Any]): Vectorizer configuration containing a `"model"` key in the format `"provider/model_name"` (e.g., `"openai/text-embedding-ada-002"`).
- `db_engine` (str): The database engine to use (e.g., `"chroma"`, `"elasticsearch"`, `"opensearch"`).
- `db_config` (dict[str, Any]): Vector DB configuration.
- `batch_size` (int, optional): The number of chunks to process in each batch. Defaults to 100.
- `max_retries` (int, optional): The maximum number of retry attempts for failed batches. Defaults to 3.

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status, error message, and total count. |
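The delete-before-index contract can be sketched as a two-step flow. Both injected callables and the in-memory `store` are hypothetical stand-ins for the real datastore operations:

```python
from typing import Any, Callable

def reindex_file(
    file_id: str,
    elements: list[dict[str, Any]],
    delete_file_chunks: Callable[[str], dict[str, Any]],
    index_chunks: Callable[[str, list[dict[str, Any]]], dict[str, Any]],
) -> dict[str, Any]:
    """Delete any existing chunks for file_id first, then index the new ones."""
    result = delete_file_chunks(file_id)
    if not result["success"]:
        return result  # do not index on top of stale chunks
    return index_chunks(file_id, elements)

store: dict[str, list[dict[str, Any]]] = {"f1": [{"text": "old"}]}
def fake_delete(fid: str) -> dict[str, Any]:
    store.pop(fid, None)
    return {"success": True, "error_message": ""}
def fake_index(fid: str, els: list[dict[str, Any]]) -> dict[str, Any]:
    store[fid] = els
    return {"success": True, "error_message": "", "total": len(els)}

out = reindex_file("f1", [{"text": "new-1"}, {"text": "new-2"}], fake_delete, fake_index)
assert out["total"] == 2 and store["f1"][0]["text"] == "new-1"
```

Deleting first means re-indexing the same `file_id` never leaves a mix of old and new chunks behind, at the cost of a brief window where the file has no chunks at all.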
update_chunk(element, **kwargs)
Update a chunk by chunk ID.
This method updates both the text content and metadata of a chunk. When text content is updated, the chunk should be re-processed through data generators and re-indexed with updated vector embeddings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `element` | `dict[str, Any]` | The updated chunk data. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status, error message, and chunk_id. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |
update_chunk_metadata(chunk_id, file_id, metadata, **kwargs)
Update metadata for a specific chunk.
This method patches new metadata into the existing chunk metadata. Existing metadata fields will be overwritten, and new fields will be added. System-managed metadata fields (file_id, chunk_id, etc.) should be preserved and not overwritten.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_id` | `str` | The ID of the chunk to update. | required |
| `file_id` | `str` | The ID of the file the chunk belongs to. | required |
| `metadata` | `dict[str, Any]` | The metadata fields to update. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status and error message. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |
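The patch semantics described above (overwrite existing fields, add new ones, never touch system-managed fields) can be sketched with a plain dict merge. The exact set of protected fields here is illustrative; the docstring names only `file_id` and `chunk_id` explicitly:

```python
from typing import Any

SYSTEM_FIELDS = {"file_id", "chunk_id"}  # illustrative; the real protected set may be larger

def patch_metadata(existing: dict[str, Any], updates: dict[str, Any]) -> dict[str, Any]:
    """Merge updates into existing metadata, refusing to overwrite system-managed fields."""
    merged = dict(existing)
    merged.update({k: v for k, v in updates.items() if k not in SYSTEM_FIELDS})
    return merged

existing = {"file_id": "f1", "chunk_id": "c1", "source": "old.pdf"}
patched = patch_metadata(existing, {"source": "new.pdf", "chunk_id": "evil", "page": 3})
assert patched["source"] == "new.pdf"   # existing field overwritten
assert patched["page"] == 3             # new field added
assert patched["chunk_id"] == "c1"      # system-managed field preserved
```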