Indexer
Document Processing Orchestrator Indexer Package.
Modules:
| Name | Description |
|---|---|
| BaseIndexer | Abstract base class for indexing documents. |
BaseIndexer
Bases: ABC
An abstract base class for document indexers.
This class defines the structure for managing document chunks in a database. Subclasses are expected to implement methods to handle file-level and chunk-level indexing operations such as creating, reading, updating, and deleting chunks.
Methods:
| Name | Description |
|---|---|
| index_file_chunks | Abstract method to index chunks for a specific file. |
| get_file_chunks | Abstract method to get chunks for a specific file. |
| delete_file_chunks | Abstract method to delete all chunks for a specific file. |
| index_chunks | Abstract method to index multiple chunks. |
| index_chunk | Abstract method to index a single chunk. |
| get_chunk | Abstract method to get a single chunk. |
| update_chunk | Abstract method to update a chunk. |
| update_chunk_metadata | Abstract method to update metadata for a specific chunk. |
| delete_chunk | Abstract method to delete a single chunk. |
delete_chunk(chunk_id, file_id, **kwargs)
abstractmethod
Delete a single chunk by chunk ID and file ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_id | str | The ID of the chunk to delete. | required |
| file_id | str | The ID of the file the chunk belongs to. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | The response from the deletion process. Should include: 1. success (bool): True if deletion succeeded, False otherwise. 2. error_message (str): Error message if deletion failed, empty string otherwise. |
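The response contract above can be illustrated with a minimal sketch. It is written as a free function over a hypothetical in-memory dict (`_chunks`) rather than as a real `BaseIndexer` subclass, and it assumes that a missing chunk is reported through `success=False` instead of an exception; the source does not specify that behavior.

```python
from typing import Any

# Hypothetical in-memory store keyed by (file_id, chunk_id); a stand-in for a real database.
_chunks: dict[tuple[str, str], dict[str, Any]] = {
    ("file-1", "chunk-1"): {
        "text": "hello",
        "metadata": {"file_id": "file-1", "chunk_id": "chunk-1"},
    },
}


def delete_chunk(chunk_id: str, file_id: str, **kwargs: Any) -> dict[str, Any]:
    """Delete one chunk and report the outcome in the documented response shape."""
    removed = _chunks.pop((file_id, chunk_id), None)
    if removed is None:
        # Assumption: a missing chunk is reported as a failure instead of raising.
        return {
            "success": False,
            "error_message": f"Chunk {chunk_id!r} not found in file {file_id!r}.",
        }
    return {"success": True, "error_message": ""}
```

Callers can then branch on `success` and surface `error_message` without catching backend-specific exceptions.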
delete_file_chunks(file_id, **kwargs)
abstractmethod
Delete all chunks for a specific file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_id | str | The ID of the file whose chunks should be deleted. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | The response from the deletion process. Should include: 1. success (bool): True if deletion succeeded, False otherwise. 2. error_message (str): Error message if deletion failed, empty string otherwise. |
get_chunk(chunk_id, file_id, **kwargs)
abstractmethod
Get a single chunk by chunk ID and file ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_id | str | The ID of the chunk to retrieve. | required |
| file_id | str | The ID of the file the chunk belongs to. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] \| None | The chunk data following the Element structure with 'text' and 'metadata' keys, or None if the chunk is not found. |
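Because a missing chunk is signalled by `None` rather than by an error response, callers need an explicit check. A minimal sketch, using a hypothetical in-memory dict in place of a real backend:

```python
from typing import Any, Optional

# Hypothetical in-memory store keyed by (file_id, chunk_id).
_chunks: dict[tuple[str, str], dict[str, Any]] = {
    ("file-1", "chunk-1"): {
        "text": "hello",
        "metadata": {"file_id": "file-1", "chunk_id": "chunk-1"},
    },
}


def get_chunk(chunk_id: str, file_id: str, **kwargs: Any) -> Optional[dict[str, Any]]:
    """Return the stored element, or None when no such chunk exists."""
    return _chunks.get((file_id, chunk_id))


# Caller side: handle the None case before touching 'text' or 'metadata'.
chunk = get_chunk("chunk-1", "file-1")
text = chunk["text"] if chunk is not None else ""
```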
get_file_chunks(file_id, page=0, size=20, **kwargs)
abstractmethod
Get chunks for a specific file with pagination support.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_id | str | The ID of the file to get chunks from. | required |
| page | int | The page number (0-indexed). Defaults to 0. | 0 |
| size | int | The number of chunks per page. Defaults to 20. | 20 |
| **kwargs | Any | Additional keyword arguments for customization. | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Response containing: 1. chunks (list[dict[str, Any]]): List of chunks (elements) with text, structure, and metadata. 2. pagination (dict[str, Any]): Pagination metadata with: page (int): Current page number; size (int): Number of items per page; total_chunks (int): Total number of chunks for the file; total_pages (int): Total number of pages; has_next (bool): Whether there is a next page; has_previous (bool): Whether there is a previous page. |
Note
Chunks should be sorted by their metadata.order field (position within the file).
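The pagination fields can be derived from the chunk count alone. The sketch below shows one way to build the documented response from an already-fetched list of chunks; `paginate_chunks` is a hypothetical helper, not part of the package, and rejecting a negative `page` or non-positive `size` with `ValueError` is an assumption.

```python
import math
from typing import Any


def paginate_chunks(
    chunks: list[dict[str, Any]], page: int = 0, size: int = 20
) -> dict[str, Any]:
    """Sort chunks by metadata.order and return one 0-indexed page plus metadata."""
    if page < 0 or size <= 0:
        # Assumption: invalid paging arguments are rejected up front.
        raise ValueError("page must be >= 0 and size must be > 0")
    ordered = sorted(chunks, key=lambda chunk: chunk["metadata"]["order"])
    total_chunks = len(ordered)
    total_pages = math.ceil(total_chunks / size)
    start = page * size
    return {
        "chunks": ordered[start:start + size],
        "pagination": {
            "page": page,
            "size": size,
            "total_chunks": total_chunks,
            "total_pages": total_pages,
            "has_next": page + 1 < total_pages,
            "has_previous": page > 0,
        },
    }
```

A real indexer would normally push the sort, offset, and limit into the database query instead of slicing in memory; the arithmetic for the pagination fields stays the same.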
index_chunk(element, **kwargs)
abstractmethod
Index a single chunk.
Note: This method only indexes the chunk. It does NOT update the metadata of neighboring chunks (previous_chunk/next_chunk). The caller is responsible for maintaining chunk relationships by updating adjacent chunks' metadata separately.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| element | dict[str, Any] | The chunk to be indexed. Should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | The response from the indexing process. Should include: 1. success (bool): True if indexing succeeded, False otherwise. 2. error_message (str): Error message if indexing failed, empty string otherwise. 3. chunk_id (str): The ID of the indexed chunk. |
index_chunks(elements, **kwargs)
abstractmethod
Index multiple chunks.
This method enables indexing multiple chunks in a single operation without requiring file replacement semantics (i.e., it inserts or overwrites the provided chunks directly without first deleting existing chunks). The chunks provided can belong to multiple different files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| elements | list[dict[str, Any]] | The chunks to be indexed. Each dict should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | The response from the indexing process. Should include: 1. success (bool): True if indexing succeeded, False otherwise. 2. error_message (str): Error message if indexing failed, empty string otherwise. 3. total (int): The total number of chunks indexed. |
index_file_chunks(elements, file_id, **kwargs)
abstractmethod
Index chunks for a specific file.
The indexer is responsible for deleting any existing chunks for the file_id before indexing the new ones, so that the file's chunks are completely replaced by the new set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| elements | list[dict[str, Any]] | The chunks to be indexed. Each dict should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'. | required |
| file_id | str | The ID of the file these chunks belong to. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | The response from the indexing process. Should include: 1. success (bool): True if indexing succeeded, False otherwise. 2. error_message (str): Error message if indexing failed, empty string otherwise. 3. total (int): The total number of chunks indexed. |
update_chunk(element, **kwargs)
abstractmethod
Update a chunk by chunk ID.
This method updates both the text content and metadata of a chunk.
Fails if the chunk is not found.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| element | dict[str, Any] | The updated chunk data. Should follow the Element structure with 'text' and 'metadata' keys. Metadata must include 'file_id' and 'chunk_id'. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | The response from the update process. Should include: 1. success (bool): True if update succeeded, False otherwise. 2. error_message (str): Error message if update failed, empty string otherwise. 3. chunk_id (str): The ID of the updated chunk. |
update_chunk_metadata(chunk_id, file_id, metadata, **kwargs)
abstractmethod
Update metadata for a specific chunk.
This method patches new metadata into the existing chunk metadata: provided fields overwrite existing fields of the same name, new fields are added, and all other fields are left unchanged.
Fails if the chunk is not found.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunk_id | str | The ID of the chunk to update. | required |
| file_id | str | The ID of the file the chunk belongs to. | required |
| metadata | dict[str, Any] | The metadata fields to update. Only the provided fields will be updated; other existing metadata will remain unchanged. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | The response from the update process. Should include: 1. success (bool): True if update succeeded, False otherwise. 2. error_message (str): Error message if update failed, empty string otherwise. |