Indexer

Document Processing Orchestrator Indexer Package.

Modules:

Name Description
BaseIndexer

Abstract base class for indexing documents.

BaseIndexer

Bases: ABC

An abstract base class for document indexers.

This class defines the structure for managing document chunks in a database. Subclasses are expected to implement methods to handle file-level and chunk-level indexing operations such as creating, reading, updating, and deleting chunks.

Methods:

Name Description
index_file_chunks

Abstract method to index chunks for a specific file.

get_file_chunks

Abstract method to get chunks for a specific file.

delete_file_chunks

Abstract method to delete all chunks for a specific file.

index_chunks

Abstract method to index multiple chunks.

index_chunk

Abstract method to index a single chunk.

get_chunk

Abstract method to get a single chunk.

update_chunk

Abstract method to update a chunk.

update_chunk_metadata

Abstract method to update metadata for a specific chunk.

delete_chunk

Abstract method to delete a single chunk.

delete_chunk(chunk_id, file_id, **kwargs) abstractmethod

Delete a single chunk by chunk ID and file ID.

Parameters:

Name Type Description Default
chunk_id str

The ID of the chunk to delete.

required
file_id str

The ID of the file the chunk belongs to.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

dict[str, Any]: The response from the deletion process. Should include:

1. success (bool): True if deletion succeeded, False otherwise.
2. error_message (str): Error message if deletion failed, empty string otherwise.
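The response contract for delete_chunk can be sketched with an in-memory dict standing in for the database. The store layout, the `(file_id, chunk_id)` key, and the choice to report a missing chunk as a failure are assumptions made for illustration; the documentation does not say how a missing chunk should be handled.

```python
from typing import Any

# Hypothetical in-memory store keyed by (file_id, chunk_id).
Store = dict[tuple[str, str], dict[str, Any]]


def delete_chunk(store: Store, chunk_id: str, file_id: str, **kwargs: Any) -> dict[str, Any]:
    """Delete one chunk and report the outcome in the documented response shape."""
    key = (file_id, chunk_id)
    if key not in store:
        # Assumption: a missing chunk is reported as a failure rather than ignored.
        return {
            "success": False,
            "error_message": f"chunk {chunk_id!r} not found in file {file_id!r}",
        }
    del store[key]
    return {"success": True, "error_message": ""}


store: Store = {
    ("file-1", "c1"): {"text": "hello", "metadata": {"file_id": "file-1", "chunk_id": "c1"}},
}
first = delete_chunk(store, "c1", "file-1")   # chunk exists, so it is removed
second = delete_chunk(store, "c1", "file-1")  # chunk is already gone
```

Callers can branch on `success` and surface `error_message` without having to catch backend-specific exceptions.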

delete_file_chunks(file_id, **kwargs) abstractmethod

Delete all chunks for a specific file.

Parameters:

Name Type Description Default
file_id str

The ID of the file whose chunks should be deleted.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

dict[str, Any]: The response from the deletion process. Should include:

1. success (bool): True if deletion succeeded, False otherwise.
2. error_message (str): Error message if deletion failed, empty string otherwise.

get_chunk(chunk_id, file_id, **kwargs) abstractmethod

Get a single chunk by chunk ID and file ID.

Parameters:

Name Type Description Default
chunk_id str

The ID of the chunk to retrieve.

required
file_id str

The ID of the file the chunk belongs to.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any] | None

dict[str, Any] | None: The chunk data following the Element structure with 'text' and 'metadata' keys, or None if the chunk is not found.
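As a minimal sketch of the lookup contract, the function below returns an Element-shaped dict or None. The in-memory store and its `(file_id, chunk_id)` key are hypothetical stand-ins for a real backend, and the `order` metadata field in the sample data is only illustrative.

```python
from typing import Any, Optional

# Hypothetical in-memory store keyed by (file_id, chunk_id).
Store = dict[tuple[str, str], dict[str, Any]]


def get_chunk(store: Store, chunk_id: str, file_id: str, **kwargs: Any) -> Optional[dict[str, Any]]:
    """Return the chunk as an Element-style dict, or None if it is not found."""
    return store.get((file_id, chunk_id))


store: Store = {
    ("file-1", "c1"): {
        "text": "First paragraph.",
        "metadata": {"file_id": "file-1", "chunk_id": "c1", "order": 0},
    },
}
found = get_chunk(store, "c1", "file-1")
missing = get_chunk(store, "c2", "file-1")
```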

get_file_chunks(file_id, page=0, size=20, **kwargs) abstractmethod

Get chunks for a specific file with pagination support.

Parameters:

Name Type Description Default
file_id str

The ID of the file to get chunks from.

required
page int

The page number (0-indexed). Defaults to 0.

0
size int

The number of chunks per page. Defaults to 20.

20
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

dict[str, Any]: Response containing:

1. chunks (list[dict[str, Any]]): List of chunks (elements) with text, structure, and metadata.
2. pagination (dict[str, Any]): Pagination metadata with:
   - page (int): Current page number.
   - size (int): Number of items per page.
   - total_chunks (int): Total number of chunks for the file.
   - total_pages (int): Total number of pages.
   - has_next (bool): Whether there is a next page.
   - has_previous (bool): Whether there is a previous page.

Note

Chunks should be sorted by their metadata.order field (position within the file).
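The pagination fields can be derived from the total chunk count, the 0-indexed page, and the page size. The helper below is a sketch over an in-memory list; the name `paginate_chunks` and the input validation are assumptions, not part of the documented interface.

```python
import math
from typing import Any


def paginate_chunks(chunks: list[dict[str, Any]], page: int = 0, size: int = 20) -> dict[str, Any]:
    """Build a get_file_chunks-style response from an in-memory list of chunks."""
    # Assumption: invalid paging arguments are rejected up front.
    if page < 0 or size <= 0:
        raise ValueError("page must be >= 0 and size must be > 0")
    # Sort by metadata.order so pages follow the chunks' position in the file.
    ordered = sorted(chunks, key=lambda chunk: chunk["metadata"]["order"])
    total_chunks = len(ordered)
    total_pages = math.ceil(total_chunks / size)
    start = page * size
    return {
        "chunks": ordered[start:start + size],
        "pagination": {
            "page": page,
            "size": size,
            "total_chunks": total_chunks,
            "total_pages": total_pages,
            "has_next": page + 1 < total_pages,
            "has_previous": page > 0,
        },
    }


# Hypothetical data: 45 chunks supplied in reverse order.
sample = [{"text": f"chunk {i}", "metadata": {"order": i}} for i in reversed(range(45))]
result = paginate_chunks(sample, page=2, size=20)
```

With 45 chunks and a page size of 20, page 2 is the last page and holds the 5 chunks with order 40 through 44.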

index_chunk(element, **kwargs) abstractmethod

Index a single chunk.

Note: This method only indexes the chunk. It does NOT update the metadata of neighboring chunks (previous_chunk/next_chunk). The caller is responsible for maintaining chunk relationships by updating adjacent chunks' metadata separately.

Parameters:

Name Type Description Default
element dict[str, Any]

The chunk to be indexed. Should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

dict[str, Any]: The response from the indexing process. Should include:

1. success (bool): True if indexing succeeded, False otherwise.
2. error_message (str): Error message if indexing failed, empty string otherwise.
3. chunk_id (str): The ID of the indexed chunk.
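Because index_chunk leaves neighboring chunks alone, the caller has to relink them. The sketch below shows one way to do that when appending a chunk to the end of a file. It assumes that `previous_chunk` and `next_chunk` hold chunk IDs, and it uses a hypothetical in-memory store plus a hypothetical caller-side helper, `append_chunk`; none of these details are confirmed by the documentation.

```python
from typing import Any

# Hypothetical in-memory store keyed by (file_id, chunk_id); it stands in for a real database.
store: dict[tuple[str, str], dict[str, Any]] = {}


def index_chunk(element: dict[str, Any], **kwargs: Any) -> dict[str, Any]:
    """Index one chunk; neighboring chunks are deliberately left untouched."""
    meta = element.get("metadata", {})
    chunk_id, file_id = meta.get("chunk_id", ""), meta.get("file_id", "")
    if not chunk_id or not file_id:
        message = "metadata must include 'file_id' and 'chunk_id'"
        return {"success": False, "error_message": message, "chunk_id": chunk_id}
    store[(file_id, chunk_id)] = element
    return {"success": True, "error_message": "", "chunk_id": chunk_id}


def update_chunk_metadata(chunk_id: str, file_id: str, metadata: dict[str, Any]) -> dict[str, Any]:
    """Patch metadata on an existing chunk; fails if the chunk is not found."""
    chunk = store.get((file_id, chunk_id))
    if chunk is None:
        return {"success": False, "error_message": f"chunk {chunk_id!r} not found"}
    chunk["metadata"].update(metadata)
    return {"success": True, "error_message": ""}


def append_chunk(element: dict[str, Any], previous_chunk_id: str) -> dict[str, Any]:
    """Caller-side helper: index a new last chunk, then relink its predecessor."""
    result = index_chunk(element)
    if result["success"]:
        file_id = element["metadata"]["file_id"]
        # Assumption: 'next_chunk' stores the neighbor's chunk ID.
        link = update_chunk_metadata(previous_chunk_id, file_id, {"next_chunk": result["chunk_id"]})
        if not link["success"]:
            return {**result, "success": False, "error_message": link["error_message"]}
    return result


index_chunk({
    "text": "Intro.",
    "metadata": {"file_id": "f1", "chunk_id": "c1", "previous_chunk": None, "next_chunk": None},
})
outcome = append_chunk(
    {
        "text": "Body.",
        "metadata": {"file_id": "f1", "chunk_id": "c2", "previous_chunk": "c1", "next_chunk": None},
    },
    previous_chunk_id="c1",
)
```

Inserting a chunk in the middle of a file would need a second metadata update for the following chunk's `previous_chunk` field.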

index_chunks(elements, **kwargs) abstractmethod

Index multiple chunks.

This method enables indexing multiple chunks in a single operation without requiring file replacement semantics (i.e., it inserts or overwrites the provided chunks directly without first deleting existing chunks). The chunks provided can belong to multiple different files.

Parameters:

Name Type Description Default
elements list[dict[str, Any]]

The chunks to be indexed. Each dict should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

dict[str, Any]: The response from the indexing process. Should include:

1. success (bool): True if indexing succeeded, False otherwise.
2. error_message (str): Error message if indexing failed, empty string otherwise.
3. total (int): The total number of chunks indexed.
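The insert-or-overwrite behavior of index_chunks can be sketched as follows. The in-memory store, its `(file_id, chunk_id)` key, and the validate-before-write step are assumptions for illustration; a real backend would typically use its own bulk write operation.

```python
from typing import Any

# Hypothetical in-memory store keyed by (file_id, chunk_id).
Store = dict[tuple[str, str], dict[str, Any]]


def index_chunks(store: Store, elements: list[dict[str, Any]], **kwargs: Any) -> dict[str, Any]:
    """Insert or overwrite the given chunks without deleting anything else."""
    # Assumption: validate the whole batch first so a bad element cannot cause a partial write.
    for element in elements:
        meta = element.get("metadata", {})
        if not meta.get("file_id") or not meta.get("chunk_id"):
            message = "metadata must include 'file_id' and 'chunk_id'"
            return {"success": False, "error_message": message, "total": 0}
    for element in elements:
        meta = element["metadata"]
        store[(meta["file_id"], meta["chunk_id"])] = element
    return {"success": True, "error_message": "", "total": len(elements)}


def make_element(file_id: str, chunk_id: str, text: str) -> dict[str, Any]:
    """Build an Element-style dict for the example data."""
    return {"text": text, "metadata": {"file_id": file_id, "chunk_id": chunk_id}}


store: Store = {
    ("f1", "c1"): make_element("f1", "c1", "old"),
    ("f1", "c2"): make_element("f1", "c2", "keep"),
}
# The batch overwrites f1/c1 and adds a chunk for a second file; f1/c2 is not touched.
result = index_chunks(store, [make_element("f1", "c1", "new"), make_element("f2", "c1", "other")])
```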

index_file_chunks(elements, file_id, **kwargs) abstractmethod

Index chunks for a specific file.

This method replaces a file's chunks. The indexer is responsible for deleting any existing chunks for the file_id before indexing the new ones, so that the file ends up with exactly the provided set of chunks.

Parameters:

Name Type Description Default
elements list[dict[str, Any]]

The chunks to be indexed. Each dict should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'.

required
file_id str

The ID of the file these chunks belong to.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

dict[str, Any]: The response from the indexing process. Should include:

1. success (bool): True if indexing succeeded, False otherwise.
2. error_message (str): Error message if indexing failed, empty string otherwise.
3. total (int): The total number of chunks indexed.
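The delete-then-index semantics of index_file_chunks can be sketched over an in-memory dict. The store layout is hypothetical, and two details are assumptions rather than documented behavior: each element's `file_id` is required to match the `file_id` argument, and validation runs before deletion so that a bad batch does not wipe the existing chunks.

```python
from typing import Any

# Hypothetical in-memory store keyed by (file_id, chunk_id).
Store = dict[tuple[str, str], dict[str, Any]]


def index_file_chunks(
    store: Store, elements: list[dict[str, Any]], file_id: str, **kwargs: Any
) -> dict[str, Any]:
    """Replace all chunks of one file with the provided elements."""
    # Assumption: reject the batch before deleting anything if an element is malformed.
    for element in elements:
        meta = element.get("metadata", {})
        if meta.get("file_id") != file_id or not meta.get("chunk_id"):
            message = f"every element needs a 'chunk_id' and file_id {file_id!r}"
            return {"success": False, "error_message": message, "total": 0}
    # Delete the file's existing chunks, leaving other files untouched.
    for key in [key for key in store if key[0] == file_id]:
        del store[key]
    for element in elements:
        store[(file_id, element["metadata"]["chunk_id"])] = element
    return {"success": True, "error_message": "", "total": len(elements)}


def make_element(file_id: str, chunk_id: str, text: str) -> dict[str, Any]:
    """Build an Element-style dict for the example data."""
    return {"text": text, "metadata": {"file_id": file_id, "chunk_id": chunk_id}}


store: Store = {
    ("f1", "c1"): make_element("f1", "c1", "stale one"),
    ("f1", "c2"): make_element("f1", "c2", "stale two"),
    ("f2", "c1"): make_element("f2", "c1", "other file"),
}
result = index_file_chunks(store, [make_element("f1", "c3", "fresh")], "f1")
rejected = index_file_chunks(store, [make_element("f9", "c1", "wrong file")], "f1")
```

In a real database the delete and the insert would ideally run in one transaction, where the backend supports it, so that readers never see a file with no chunks.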

update_chunk(element, **kwargs) abstractmethod

Update a chunk by chunk ID.

This method updates both the text content and metadata of a chunk.

Fails if the chunk is not found.

Parameters:

Name Type Description Default
element dict[str, Any]

The updated chunk data. Should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

dict[str, Any]: The response from the update process. Should include:

1. success (bool): True if update succeeded, False otherwise.
2. error_message (str): Error message if update failed, empty string otherwise.
3. chunk_id (str): The ID of the updated chunk.
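A minimal sketch of update_chunk, which differs from index_chunk in that it refuses to create a chunk that does not already exist. The in-memory store is hypothetical, and replacing the stored metadata wholesale (rather than merging it) is an assumption; the documentation only states that both text and metadata are updated.

```python
from typing import Any

# Hypothetical in-memory store keyed by (file_id, chunk_id).
Store = dict[tuple[str, str], dict[str, Any]]


def update_chunk(store: Store, element: dict[str, Any], **kwargs: Any) -> dict[str, Any]:
    """Update text and metadata of an existing chunk; fail if it is not found."""
    meta = element.get("metadata", {})
    chunk_id, file_id = meta.get("chunk_id", ""), meta.get("file_id", "")
    key = (file_id, chunk_id)
    if key not in store:
        message = f"chunk {chunk_id!r} not found in file {file_id!r}"
        return {"success": False, "error_message": message, "chunk_id": chunk_id}
    # Assumption: the stored metadata is replaced as a whole, not merged.
    store[key] = {"text": element.get("text", ""), "metadata": dict(meta)}
    return {"success": True, "error_message": "", "chunk_id": chunk_id}


store: Store = {
    ("f1", "c1"): {"text": "draft", "metadata": {"file_id": "f1", "chunk_id": "c1", "order": 0}},
}
updated = update_chunk(
    store, {"text": "final", "metadata": {"file_id": "f1", "chunk_id": "c1", "order": 0}}
)
not_found = update_chunk(store, {"text": "x", "metadata": {"file_id": "f1", "chunk_id": "c9"}})
```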

update_chunk_metadata(chunk_id, file_id, metadata, **kwargs) abstractmethod

Update metadata for a specific chunk.

This method patches the provided metadata into the chunk's existing metadata. Fields present in the patch overwrite the existing values, new fields are added, and all other existing fields are left unchanged.

Fails if the chunk is not found.

Parameters:

Name Type Description Default
chunk_id str

The ID of the chunk to update.

required
file_id str

The ID of the file the chunk belongs to.

required
metadata dict[str, Any]

The metadata fields to update. Only the provided fields will be updated; other existing metadata will remain unchanged.

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Type Description
dict[str, Any]

dict[str, Any]: The response from the update process. Should include:

1. success (bool): True if update succeeded, False otherwise.
2. error_message (str): Error message if update failed, empty string otherwise.
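The patch semantics of update_chunk_metadata amount to a dictionary merge in which the provided fields win. The sketch below uses a hypothetical in-memory store and assumes a shallow merge, so a nested dict in the patch replaces the existing nested value as a whole; the `order`, `lang`, and `reviewed` fields are illustrative only.

```python
from typing import Any

# Hypothetical in-memory store keyed by (file_id, chunk_id).
Store = dict[tuple[str, str], dict[str, Any]]


def update_chunk_metadata(
    store: Store, chunk_id: str, file_id: str, metadata: dict[str, Any], **kwargs: Any
) -> dict[str, Any]:
    """Patch metadata fields on an existing chunk; fail if the chunk is not found."""
    chunk = store.get((file_id, chunk_id))
    if chunk is None:
        message = f"chunk {chunk_id!r} not found in file {file_id!r}"
        return {"success": False, "error_message": message}
    # Shallow merge (assumption): patched fields overwrite, new fields are added,
    # and fields absent from the patch keep their existing values.
    chunk["metadata"] = {**chunk["metadata"], **metadata}
    return {"success": True, "error_message": ""}


store: Store = {
    ("f1", "c1"): {
        "text": "Hello.",
        "metadata": {"file_id": "f1", "chunk_id": "c1", "order": 3, "lang": "en"},
    },
}
patched = update_chunk_metadata(store, "c1", "f1", {"lang": "de", "reviewed": True})
not_found = update_chunk_metadata(store, "c9", "f1", {"lang": "de"})
```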