Vector

Vector indexer module.

VectorDBIndexer(data_store_map=None, em_invoker_map=None, cache_size=DEFAULT_CACHE_SIZE, retryable_exceptions=None)

Bases: BaseIndexer

Index elements into a vector datastore capability.

Initialize the indexer with mappings for vector DB capabilities and embeddings.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data_store_map` | `dict[str, Type[BaseDataStore]] \| None` | Mapping of db_engine strings to `BaseDataStore` classes. If not provided, uses `DEFAULT_DATA_STORE_MAP`, which includes "chroma", "elasticsearch", and "opensearch". | `None` |
| `em_invoker_map` | `dict[str, Type[BaseEMInvoker]] \| None` | Mapping of provider strings to embedding classes (`BaseEMInvoker` subclasses). If not provided, uses `DEFAULT_EM_INVOKER_MAP`, which includes "azure-openai", "bedrock", "google", "openai", and "voyage". | `None` |
| `cache_size` | `int` | Maximum number of vector capability instances to cache using an LRU policy. | `DEFAULT_CACHE_SIZE` (128) |
| `retryable_exceptions` | `tuple[type[Exception], ...] \| None` | Tuple of exception types to retry on during batch processing. If not provided, uses `DEFAULT_RETRYABLE_EXCEPTIONS`. | `None` |
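The capability cache follows a standard LRU policy: once `cache_size` instances are cached, the least recently used one is evicted. The sketch below only illustrates that behavior with the standard library; `make_capability` is a hypothetical stand-in, not part of this module.

```python
from functools import lru_cache

# Illustrative only: a tiny factory showing how an LRU policy evicts entries.
# The real cache defaults to DEFAULT_CACHE_SIZE (128); we use 2 to show eviction.
@lru_cache(maxsize=2)
def make_capability(db_engine: str, config_key: str) -> tuple:
    # Stand-in for constructing a vector capability instance.
    return (db_engine, config_key)

make_capability("chroma", "a")         # miss
make_capability("elasticsearch", "b")  # miss
make_capability("chroma", "a")         # hit: already cached
make_capability("opensearch", "c")     # miss; evicts the least recently used entry
print(make_capability.cache_info().hits)  # 1
```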

delete(**kwargs)

Delete documents from the vector capability based on the file ID.

This method validates that file_id is present in kwargs and delegates to delete_file_chunks.

Kwargs

- `file_id` (str): The ID of the file(s) to be deleted.
- `vectorizer_kwargs` (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
- `db_engine` (str): The database engine to use (e.g., "chroma", "elasticsearch", "opensearch").
- `db_config` (dict[str, Any]): Vector DB configuration.

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | Response with success status and error message: `success` (bool) is True if deletion succeeded, False otherwise; `error_message` (str) holds the error message if deletion failed, and is an empty string otherwise. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `file_id` is not provided in kwargs. |
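The kwargs contract above can be sketched as plain data. This is illustrative only: the `split_model` helper is hypothetical, and the values are invented; only the key names and the "provider/model_name" format come from the documentation.

```python
from typing import Any

def split_model(vectorizer_kwargs: dict[str, Any]) -> tuple[str, str]:
    """Split the documented "provider/model_name" format into its two parts."""
    provider, model_name = vectorizer_kwargs["model"].split("/", 1)
    return provider, model_name

# Example kwargs matching the documented contract for delete().
kwargs = {
    "file_id": "doc-123",
    "vectorizer_kwargs": {"model": "openai/text-embedding-ada-002"},
    "db_engine": "chroma",
    "db_config": {"collection": "my_docs"},  # invented config shape
}
print(split_model(kwargs["vectorizer_kwargs"]))  # ('openai', 'text-embedding-ada-002')
```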

delete_chunk(chunk_id, file_id, **kwargs)

Delete a single chunk by chunk ID and file ID.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `chunk_id` | `str` | The ID of the chunk to delete. | *required* |
| `file_id` | `str` | The ID of the file the chunk belongs to. | *required* |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | Response with success status and error message. |

Raises:

| Type | Description |
| --- | --- |
| `NotImplementedError` | This method is not yet implemented. |

delete_file_chunks(file_id, **kwargs)

Delete all chunks for a specific file.

  • No index: treated as success (nothing to delete).
  • Index exists, no matching chunks: success.
  • Index exists, matching chunks: success if delete succeeds, otherwise failed.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `file_id` | `str` | The ID of the file whose chunks should be deleted. | *required* |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Kwargs

- `vectorizer_kwargs` (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
- `db_engine` (str): The database engine to use (e.g., "chroma", "elasticsearch", "opensearch").
- `db_config` (dict[str, Any]): Vector DB configuration.

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | Response with success status and error message. |
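The three documented outcomes can be summarized as a small decision sketch. The function and its boolean inputs are hypothetical illustrations of the rules listed above, not library code.

```python
def outcome(index_exists: bool, matching_chunks: bool, delete_ok: bool) -> bool:
    """Map the documented delete_file_chunks outcomes to a success flag."""
    if not index_exists:
        return True   # no index: treated as success (nothing to delete)
    if not matching_chunks:
        return True   # index exists but no chunks match this file_id: success
    return delete_ok  # matching chunks: success only if the deletion succeeds

print(outcome(index_exists=False, matching_chunks=False, delete_ok=False))  # True
print(outcome(index_exists=True, matching_chunks=True, delete_ok=False))    # False
```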

get_chunk(chunk_id, file_id, **kwargs)

Get a single chunk by chunk ID and file ID.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `chunk_id` | `str` | The ID of the chunk to retrieve. | *required* |
| `file_id` | `str` | The ID of the file the chunk belongs to. | *required* |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any] \| None` | The chunk data, or None if not found. |

Raises:

| Type | Description |
| --- | --- |
| `NotImplementedError` | This method is not yet implemented. |

get_file_chunks(file_id, page=0, size=20, **kwargs)

Get chunks for a specific file with pagination support.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `file_id` | `str` | The ID of the file to get chunks from. | *required* |
| `page` | `int` | The page number (0-indexed). | `0` |
| `size` | `int` | The number of chunks per page. | `20` |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | Response with chunks list, total count, and pagination info. |

Raises:

| Type | Description |
| --- | --- |
| `NotImplementedError` | This method is not yet implemented. |
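Since this method is not yet implemented, the sketch below only illustrates the documented 0-indexed pagination contract; the `page_bounds` helper and the sample data are invented.

```python
def page_bounds(page: int = 0, size: int = 20) -> tuple[int, int]:
    """Translate the documented 0-indexed (page, size) pair into slice bounds."""
    start = page * size
    return start, start + size

chunks = [f"chunk-{i}" for i in range(45)]  # made-up data: 45 chunks total
start, end = page_bounds(page=2, size=20)
print(chunks[start:end])  # the last, partial page: chunk-40 through chunk-44
```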

index(elements, **kwargs)

Index elements into the configured vector capability.

This method validates that file_id is present in kwargs and delegates to index_file_chunks.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `elements` | `list[dict[str, Any]]` | Parsed elements containing text and metadata. | *required* |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Kwargs

- `file_id` (str): The ID of the file these chunks belong to.
- `vectorizer_kwargs` (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
- `db_engine` (str): The database engine to use (e.g., "chroma", "elasticsearch", "opensearch").
- `db_config` (dict[str, Any]): Vector DB configuration.
- `batch_size` (int, optional): The number of chunks to process in each batch. Defaults to 100.
- `max_retries` (int, optional): The maximum number of retry attempts for failed batches. Defaults to 3.

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | Response with success status, error message, and total count: `success` (bool) is True if indexing succeeded, False otherwise; `error_message` (str) holds the error message if indexing failed, and is an empty string otherwise; `total` (int) is the total number of chunks indexed. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `file_id` is not provided in kwargs. |
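A full kwargs dict for index(), plus the documented `ValueError` on a missing `file_id`, might look like this. The validation helper and all values are illustrative; only the key names and defaults come from the documentation above.

```python
def validate_index_kwargs(kwargs: dict) -> None:
    # index() raises ValueError when file_id is missing, per the docs above.
    if "file_id" not in kwargs:
        raise ValueError("file_id is required")

# Invented example values following the documented key names and defaults.
kwargs = {
    "file_id": "doc-123",
    "vectorizer_kwargs": {"model": "openai/text-embedding-ada-002"},
    "db_engine": "elasticsearch",
    "db_config": {"hosts": ["http://localhost:9200"]},  # hypothetical config
    "batch_size": 100,  # documented default
    "max_retries": 3,   # documented default
}
validate_index_kwargs(kwargs)  # passes silently

try:
    validate_index_kwargs({"db_engine": "chroma"})
except ValueError as exc:
    print(exc)  # file_id is required
```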

index_chunk(element, **kwargs)

Index a single chunk.

This method only indexes the chunk. It does NOT update the metadata of neighboring chunks (previous_chunk/next_chunk). The caller is responsible for maintaining chunk relationships by updating adjacent chunks' metadata separately.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `element` | `dict[str, Any]` | The chunk to be indexed. | *required* |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | Response with success status, error message, and chunk_id. |

Raises:

| Type | Description |
| --- | --- |
| `NotImplementedError` | This method is not yet implemented. |

index_file_chunks(elements, file_id, **kwargs)

Index chunks for a specific file.

This method indexes chunks for a file. The indexer is responsible for deleting any existing chunks for the file_id before indexing the new chunks to ensure consistency.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `elements` | `list[dict[str, Any]]` | The chunks to be indexed. | *required* |
| `file_id` | `str` | The ID of the file these chunks belong to. | *required* |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Kwargs

- `vectorizer_kwargs` (dict[str, Any]): Vectorizer configuration containing a "model" key in the format "provider/model_name" (e.g., "openai/text-embedding-ada-002").
- `db_engine` (str): The database engine to use (e.g., "chroma", "elasticsearch", "opensearch").
- `db_config` (dict[str, Any]): Vector DB configuration.
- `batch_size` (int, optional): The number of chunks to process in each batch. Defaults to 100.
- `max_retries` (int, optional): The maximum number of retry attempts for failed batches. Defaults to 3.

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | Response with success status, error message, and total count. |
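The `batch_size` and `max_retries` knobs imply a batching loop with per-batch retry. The sketch below shows one plausible shape for that loop under those documented knobs; `send_batch` and the retry policy are hypothetical stand-ins, not the library's actual implementation.

```python
def send_batch(batch: list) -> None:
    """Stand-in for the real per-batch indexing call."""
    pass

def index_in_batches(elements: list, batch_size: int = 100, max_retries: int = 3) -> int:
    """Index elements in batches, retrying each failed batch up to max_retries times."""
    total = 0
    for start in range(0, len(elements), batch_size):
        batch = elements[start:start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                send_batch(batch)
                break
            except RuntimeError:       # stand-in for the retryable exception types
                if attempt == max_retries:
                    raise              # exhausted retries: propagate the failure
        total += len(batch)
    return total

print(index_in_batches(list(range(250)), batch_size=100))  # 250 (3 batches)
```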

update_chunk(element, **kwargs)

Update a chunk by chunk ID.

This method updates both the text content and metadata of a chunk. When text content is updated, the chunk should be re-processed through data generators and re-indexed with updated vector embeddings.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `element` | `dict[str, Any]` | The updated chunk data. | *required* |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | Response with success status, error message, and chunk_id. |

Raises:

| Type | Description |
| --- | --- |
| `NotImplementedError` | This method is not yet implemented. |

update_chunk_metadata(chunk_id, file_id, metadata, **kwargs)

Update metadata for a specific chunk.

This method patches new metadata into the existing chunk metadata. Existing metadata fields will be overwritten, and new fields will be added. System-managed metadata fields (file_id, chunk_id, etc.) should be preserved and not overwritten.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `chunk_id` | `str` | The ID of the chunk to update. | *required* |
| `file_id` | `str` | The ID of the file the chunk belongs to. | *required* |
| `metadata` | `dict[str, Any]` | The metadata fields to update. | *required* |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | Response with success status and error message. |

Raises:

| Type | Description |
| --- | --- |
| `NotImplementedError` | This method is not yet implemented. |
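The documented patch semantics (overwrite existing fields, add new ones, preserve system-managed keys) can be sketched with plain dict operations. This is illustrative only: the method is not yet implemented, `patch_metadata` is hypothetical, and the real set of system-managed keys may differ.

```python
SYSTEM_KEYS = {"file_id", "chunk_id"}  # illustrative; the real set may differ

def patch_metadata(existing: dict, updates: dict) -> dict:
    """Patch updates into existing metadata, preserving system-managed keys."""
    safe_updates = {k: v for k, v in updates.items() if k not in SYSTEM_KEYS}
    return {**existing, **safe_updates}

current = {"file_id": "doc-1", "chunk_id": "c-9", "page": 1, "lang": "en"}
patched = patch_metadata(current, {"page": 2, "file_id": "evil", "author": "a"})
print(patched)  # page overwritten, author added, file_id and chunk_id preserved
```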