Vector
Vector indexer module.
VectorDBIndexer(data_store_map=None, em_invoker_map=None, cache_size=DEFAULT_CACHE_SIZE, retryable_exceptions=None)
Bases: BaseIndexer
Index elements into a vector datastore capability.
Initialize the indexer with mappings for vector DB capabilities and embeddings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data_store_map` | `dict[str, Type[BaseDataStore]] \| None` | Mapping of `db_engine` strings to `BaseDataStore` classes. If not provided, uses `DEFAULT_DATA_STORE_MAP`, which includes `"chroma"`, `"elasticsearch"`, and `"opensearch"`. | `None` |
| `em_invoker_map` | `dict[str, Type[BaseEMInvoker]] \| None` | Mapping of provider strings to embedding classes (`BaseEMInvoker` subclasses). If not provided, uses `DEFAULT_EM_INVOKER_MAP`, which includes `"azure-openai"`, `"bedrock"`, `"google"`, `"openai"`, and `"voyage"`. | `None` |
| `cache_size` | `int` | Maximum number of vector capability instances to cache using an LRU policy. | `DEFAULT_CACHE_SIZE` (128) |
| `retryable_exceptions` | `tuple[type[Exception], ...] \| None` | Exception types to retry on during batch processing. If not provided, uses `DEFAULT_RETRYABLE_EXCEPTIONS`. | `None` |
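The engine-map plus LRU-cache design can be pictured with a minimal stdlib sketch. The class names and the `get_data_store` helper below are illustrative stand-ins, not the real `BaseDataStore` API; only the `db_engine` keys and the cache size mirror the defaults documented above:

```python
from functools import lru_cache

# Illustrative stand-ins for the real BaseDataStore subclasses.
class ChromaDataStore: ...
class ElasticsearchDataStore: ...

# Keys are db_engine strings, mirroring DEFAULT_DATA_STORE_MAP.
DATA_STORE_MAP = {
    "chroma": ChromaDataStore,
    "elasticsearch": ElasticsearchDataStore,
}

@lru_cache(maxsize=128)  # mirrors cache_size=DEFAULT_CACHE_SIZE
def get_data_store(db_engine: str):
    try:
        cls = DATA_STORE_MAP[db_engine]
    except KeyError:
        raise ValueError(f"Unsupported db_engine: {db_engine!r}")
    return cls()

# Repeated lookups for the same engine reuse the cached instance.
assert get_data_store("chroma") is get_data_store("chroma")
```

Passing a custom `data_store_map` to the constructor replaces this default mapping wholesale, so unsupported engines fail fast with a lookup error.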
delete(**kwargs)
Delete documents from the vector capability based on the file ID.
This method validates that file_id is present in kwargs and delegates to delete_file_chunks.
Kwargs
- `file_id` (str): The ID of the file(s) to be deleted.
- `vectorizer_kwargs` (dict[str, Any]): Vectorizer configuration containing a `"model"` key in the format `"provider/model_name"` (e.g., `"openai/text-embedding-ada-002"`).
- `db_engine` (str): The database engine to use (e.g., `"chroma"`, `"elasticsearch"`, `"opensearch"`).
- `db_config` (dict[str, Any]): Vector DB configuration.

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status and error message: `success` (bool) is True if deletion succeeded, False otherwise; `error_message` (str) holds the error message if deletion failed, and is an empty string otherwise. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `file_id` is not provided in `kwargs`. |
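A quick sketch of how such kwargs are typically assembled, and how the `"provider/model_name"` convention splits apart. The dict keys come from the Kwargs list above; the concrete values (`"doc-123"`, the collection name) are illustrative:

```python
from typing import Any

delete_kwargs: dict[str, Any] = {
    "file_id": "doc-123",
    "vectorizer_kwargs": {"model": "openai/text-embedding-ada-002"},
    "db_engine": "chroma",
    "db_config": {"collection": "my_collection"},
}

# Split on the first slash only, so model names containing "/" stay intact.
provider, model_name = delete_kwargs["vectorizer_kwargs"]["model"].split("/", 1)
assert provider == "openai"
assert model_name == "text-embedding-ada-002"
```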
delete_chunk(chunk_id, file_id, **kwargs)
Delete a single chunk by chunk ID and file ID.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_id` | `str` | The ID of the chunk to delete. | required |
| `file_id` | `str` | The ID of the file the chunk belongs to. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status and error message. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |
delete_file_chunks(file_id, **kwargs)
Delete all chunks for a specific file.
- No index: treated as success (nothing to delete).
- Index exists, no matching chunks: success.
- Index exists, matching chunks: success if delete succeeds, otherwise failed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_id` | `str` | The ID of the file whose chunks should be deleted. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Kwargs
- `vectorizer_kwargs` (dict[str, Any]): Vectorizer configuration containing a `"model"` key in the format `"provider/model_name"` (e.g., `"openai/text-embedding-ada-002"`).
- `db_engine` (str): The database engine to use (e.g., `"chroma"`, `"elasticsearch"`, `"opensearch"`).
- `db_config` (dict[str, Any]): Vector DB configuration.

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status and error message. |
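The three outcome cases listed above can be condensed into a small decision function. The `delete_outcome` helper and its boolean inputs are illustrative stand-ins for the real index lookup and delete calls:

```python
from typing import Any

def delete_outcome(index_exists: bool, matched: int, delete_ok: bool) -> dict[str, Any]:
    """Mirror the documented semantics: a missing index or zero matches still counts as success."""
    if not index_exists or matched == 0:
        return {"success": True, "error_message": ""}
    if delete_ok:
        return {"success": True, "error_message": ""}
    return {"success": False, "error_message": "failed to delete chunks"}

assert delete_outcome(index_exists=False, matched=0, delete_ok=False)["success"]
assert delete_outcome(index_exists=True, matched=0, delete_ok=False)["success"]
assert not delete_outcome(index_exists=True, matched=5, delete_ok=False)["success"]
```

Treating "nothing to delete" as success keeps the operation idempotent: calling it twice for the same `file_id` succeeds both times.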
get_chunk(chunk_id, file_id, **kwargs)
Get a single chunk by chunk ID and file ID.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_id` | `str` | The ID of the chunk to retrieve. | required |
| `file_id` | `str` | The ID of the file the chunk belongs to. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any] \| None` | The chunk data, or None if not found. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |
get_file_chunks(file_id, page=0, size=20, **kwargs)
Get chunks for a specific file with pagination support.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_id` | `str` | The ID of the file to get chunks from. | required |
| `page` | `int` | The page number (0-indexed). | `0` |
| `size` | `int` | The number of chunks per page. | `20` |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with the chunks list, total count, and pagination info. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |
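The 0-indexed `page`/`size` contract can be sketched over an in-memory list; the `paginate` helper and the response keys beyond `chunks` and `total` are illustrative, not the exact response schema:

```python
from typing import Any

def paginate(chunks: list[dict[str, Any]], page: int = 0, size: int = 20) -> dict[str, Any]:
    """0-indexed pagination: page 0 returns chunks[0:size], page 1 returns chunks[size:2*size], ..."""
    start = page * size
    return {
        "chunks": chunks[start : start + size],
        "total": len(chunks),
        "page": page,
        "size": size,
    }

data = [{"chunk_id": i} for i in range(45)]
result = paginate(data, page=2, size=20)
assert result["total"] == 45
assert len(result["chunks"]) == 5  # the last page holds the remainder
```

Note that a page past the end yields an empty `chunks` list rather than an error, which simplifies "fetch until empty" loops in callers.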
index(elements, **kwargs)
Index elements into the configured vector capability.
This method validates that file_id is present in kwargs and delegates to index_file_chunks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `list[dict[str, Any]]` | Parsed elements containing text and metadata. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Kwargs
- `file_id` (str): The ID of the file these chunks belong to.
- `vectorizer_kwargs` (dict[str, Any]): Vectorizer configuration containing a `"model"` key in the format `"provider/model_name"` (e.g., `"openai/text-embedding-ada-002"`).
- `db_engine` (str): The database engine to use (e.g., `"chroma"`, `"elasticsearch"`, `"opensearch"`).
- `db_config` (dict[str, Any]): Vector DB configuration.
- `batch_size` (int, optional): The number of chunks to process in each batch. Defaults to 100.
- `max_retries` (int, optional): The maximum number of retry attempts for failed batches. Defaults to 3.

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status, error message, and total count: `success` (bool) is True if indexing succeeded, False otherwise; `error_message` (str) holds the error message if indexing failed, and is an empty string otherwise; `total` (int) is the total number of chunks indexed. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `file_id` is not provided in `kwargs`. |
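The interaction of `batch_size`, `max_retries`, and `retryable_exceptions` can be sketched as follows. The `index_in_batches` helper and the injected `index_batch` callable are hypothetical stand-ins for the internal batch loop, not the real implementation:

```python
from typing import Any, Callable

def index_in_batches(
    elements: list[dict[str, Any]],
    index_batch: Callable[[list[dict[str, Any]]], None],
    batch_size: int = 100,
    max_retries: int = 3,
    retryable_exceptions: tuple[type[Exception], ...] = (ConnectionError,),
) -> dict[str, Any]:
    """Index elements batch by batch, retrying a failed batch up to max_retries times."""
    total = 0
    for start in range(0, len(elements), batch_size):
        batch = elements[start : start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                index_batch(batch)
                total += len(batch)
                break
            except retryable_exceptions:
                if attempt == max_retries:
                    return {"success": False, "error_message": "batch failed", "total": total}
    return {"success": True, "error_message": "", "total": total}

# Simulate a backend whose first call fails transiently.
calls = {"n": 0}
def flaky(batch: list[dict[str, Any]]) -> None:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient")

result = index_in_batches([{"text": str(i)} for i in range(250)], flaky, batch_size=100)
assert result == {"success": True, "error_message": "", "total": 250}
```

Only exceptions in `retryable_exceptions` trigger a retry; anything else propagates, which is the usual way to distinguish transient backend hiccups from genuine bugs.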
index_chunk(element, **kwargs)
Index a single chunk.
This method only indexes the chunk. It does NOT update the metadata of neighboring chunks (previous_chunk/next_chunk). The caller is responsible for maintaining chunk relationships by updating adjacent chunks' metadata separately.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `element` | `dict[str, Any]` | The chunk to be indexed. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status, error message, and chunk_id. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |
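Because `index_chunk` leaves neighboring chunks untouched, a caller inserting a chunk between two existing ones must patch the `previous_chunk`/`next_chunk` links itself. A minimal sketch over plain metadata dicts (the two key names come from the docstring above; the helper and persistence step are illustrative):

```python
from typing import Any

def link_new_chunk(
    prev_meta: dict[str, Any],
    new_meta: dict[str, Any],
    next_meta: dict[str, Any],
) -> None:
    """Wire a freshly indexed chunk into the previous_chunk/next_chunk chain."""
    new_meta["previous_chunk"] = prev_meta["chunk_id"]
    new_meta["next_chunk"] = next_meta["chunk_id"]
    # The caller must also persist these two neighbor updates,
    # e.g. via update_chunk_metadata(), since index_chunk will not.
    prev_meta["next_chunk"] = new_meta["chunk_id"]
    next_meta["previous_chunk"] = new_meta["chunk_id"]

a = {"chunk_id": "a", "next_chunk": "c"}
b = {"chunk_id": "b"}
c = {"chunk_id": "c", "previous_chunk": "a"}
link_new_chunk(a, b, c)
assert a["next_chunk"] == "b" and c["previous_chunk"] == "b"
```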
index_file_chunks(elements, file_id, **kwargs)
Index chunks for a specific file.
This method indexes chunks for a file. The indexer is responsible for deleting any existing chunks for the file_id before indexing the new chunks to ensure consistency.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `list[dict[str, Any]]` | The chunks to be indexed. | required |
| `file_id` | `str` | The ID of the file these chunks belong to. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Kwargs
- `vectorizer_kwargs` (dict[str, Any]): Vectorizer configuration containing a `"model"` key in the format `"provider/model_name"` (e.g., `"openai/text-embedding-ada-002"`).
- `db_engine` (str): The database engine to use (e.g., `"chroma"`, `"elasticsearch"`, `"opensearch"`).
- `db_config` (dict[str, Any]): Vector DB configuration.
- `batch_size` (int, optional): The number of chunks to process in each batch. Defaults to 100.
- `max_retries` (int, optional): The maximum number of retry attempts for failed batches. Defaults to 3.

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status, error message, and total count. |
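The delete-before-index contract can be sketched as a two-step flow. Both injected callables and the in-memory `store` are hypothetical stand-ins for the real datastore operations:

```python
from typing import Any, Callable

def reindex_file(
    file_id: str,
    elements: list[dict[str, Any]],
    delete_file_chunks: Callable[[str], dict[str, Any]],
    index_chunks: Callable[[str, list[dict[str, Any]]], dict[str, Any]],
) -> dict[str, Any]:
    """Delete any existing chunks for file_id first, then index the new ones."""
    result = delete_file_chunks(file_id)
    if not result["success"]:
        return result  # do not index on top of stale chunks
    return index_chunks(file_id, elements)

store: dict[str, list[dict[str, Any]]] = {"f1": [{"text": "old"}]}
def fake_delete(fid: str) -> dict[str, Any]:
    store.pop(fid, None)
    return {"success": True, "error_message": ""}
def fake_index(fid: str, els: list[dict[str, Any]]) -> dict[str, Any]:
    store[fid] = els
    return {"success": True, "error_message": "", "total": len(els)}

out = reindex_file("f1", [{"text": "new-1"}, {"text": "new-2"}], fake_delete, fake_index)
assert out["total"] == 2 and store["f1"][0]["text"] == "new-1"
```

Deleting first means re-indexing the same `file_id` never leaves a mix of old and new chunks behind, at the cost of a brief window where the file has no chunks at all.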
update_chunk(element, **kwargs)
Update a chunk by chunk ID.
This method updates both the text content and metadata of a chunk. When text content is updated, the chunk should be re-processed through data generators and re-indexed with updated vector embeddings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `element` | `dict[str, Any]` | The updated chunk data. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status, error message, and chunk_id. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |
update_chunk_metadata(chunk_id, file_id, metadata, **kwargs)
Update metadata for a specific chunk.
This method patches new metadata into the existing chunk metadata. Existing metadata fields will be overwritten, and new fields will be added. System-managed metadata fields (file_id, chunk_id, etc.) should be preserved and not overwritten.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_id` | `str` | The ID of the chunk to update. | required |
| `file_id` | `str` | The ID of the file the chunk belongs to. | required |
| `metadata` | `dict[str, Any]` | The metadata fields to update. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Response with success status and error message. |

Raises:

| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented. |
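The patch semantics described above (overwrite existing fields, add new ones, never touch system-managed fields) can be sketched with a plain dict merge. The exact set of protected fields here is illustrative; the docstring names only `file_id` and `chunk_id` explicitly:

```python
from typing import Any

SYSTEM_FIELDS = {"file_id", "chunk_id"}  # illustrative; the real protected set may be larger

def patch_metadata(existing: dict[str, Any], updates: dict[str, Any]) -> dict[str, Any]:
    """Merge updates into existing metadata, refusing to overwrite system-managed fields."""
    merged = dict(existing)
    merged.update({k: v for k, v in updates.items() if k not in SYSTEM_FIELDS})
    return merged

existing = {"file_id": "f1", "chunk_id": "c1", "source": "old.pdf"}
patched = patch_metadata(existing, {"source": "new.pdf", "chunk_id": "evil", "page": 3})
assert patched["source"] == "new.pdf"   # existing field overwritten
assert patched["page"] == 3             # new field added
assert patched["chunk_id"] == "c1"      # system-managed field preserved
```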