Chunk Processor
Modules concerning the chunk processors used in Gen AI applications.
DedupeChunkProcessor
Bases: BaseChunkProcessor
A chunk processor that removes duplicate chunks.
The DedupeChunkProcessor class provides functionality for processing a list of chunks by removing duplicates.
Duplicates are determined by the chunk's ID and content, so each chunk in the final list is unique in both its ID and its content.
process_chunks(chunks)
async
Processes a list of chunks by removing duplicate chunks.
This method eliminates duplicates based on each chunk's ID and a hash of its content. It compares chunk contents through their SHA-256 hashes, which saves memory compared to storing the full content. When chunks with duplicate content are dropped, their metadata is stored in a metadata entry on the retained chunk.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunks | list[Chunk] | A list of Chunk objects to be processed. | required |
Returns:
Type | Description |
---|---|
list[Chunk] | A list of unique Chunk objects with duplicates removed. |
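The deduplication described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the library's implementation: the `Chunk` dataclass is a hypothetical stand-in for the real type, `dedupe_chunks` is an invented helper name, and the `duplicate_metadata` key is an assumption, since the actual metadata key is not given here.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Chunk:
    """Hypothetical stand-in for the library's Chunk type."""

    id: str
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def dedupe_chunks(chunks: list[Chunk]) -> list[Chunk]:
    """Keep the first chunk seen for each ID and for each content hash."""
    seen_ids: set[str] = set()
    kept_by_hash: dict[str, Chunk] = {}
    unique: list[Chunk] = []

    for chunk in chunks:
        if chunk.id in seen_ids:
            continue  # a chunk with this ID was already processed
        seen_ids.add(chunk.id)

        # Store only the fixed-size digest instead of the full content.
        digest = hashlib.sha256(chunk.content.encode("utf-8")).hexdigest()
        kept = kept_by_hash.get(digest)
        if kept is not None:
            # Same content under a different ID: record its metadata on the
            # retained chunk. "duplicate_metadata" is an assumed key name.
            kept.metadata.setdefault("duplicate_metadata", []).append(chunk.metadata)
            continue

        kept_by_hash[digest] = chunk
        unique.append(chunk)

    return unique
```

Keeping the first occurrence preserves the input order of the retained chunks; whether the real processor makes the same choice is not stated in this reference.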
MergingChunkProcessor(prev_chunk_id_metadata=DefaultChunkMetadata.PREV_CHUNK_ID, next_chunk_id_metadata=DefaultChunkMetadata.NEXT_CHUNK_ID, id_merger_func=MergerMethod.concatenate(), content_merger_func=MergerMethod.merge_overlapping_strings(), metadata_merger=None)
Bases: BaseChunkProcessor
A chunk processor that gathers and merges together adjacent chunks.
The MergingChunkProcessor class identifies related chunks based on their adjacent chunk IDs, merges the related chunks, and outputs a list of combined chunks.
When merging chunks and their content, it handles overlaps and common prefixes to ensure smooth merging.
Attributes:
Name | Type | Description |
---|---|---|
prev_chunk_id_metadata | str | The metadata key for the previous chunk ID. |
next_chunk_id_metadata | str | The metadata key for the next chunk ID. |
id_merger_func | Callable[[list[str]], str] | The function used to merge the IDs of merged chunks. The function should receive a list of IDs of the chunks that are being merged and the output will be used as the ID of the merged chunk. |
content_merger_func | Callable[[list[str]], str] | The function used to merge the content of merged chunks. The function should receive a list of contents of the chunks that are being merged and the output will be used as the content of the merged chunk. |
metadata_merger | ChunkMetadataMerger | The metadata merger used to merge the metadata of merged chunks. The merger should receive a list of metadata of the chunks that are being merged and the output will be used as the metadata of the merged chunk. |
Initializes a new instance of the MergingChunkProcessor class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prev_chunk_id_metadata | str | The metadata key for the previous chunk ID. | PREV_CHUNK_ID |
next_chunk_id_metadata | str | The metadata key for the next chunk ID. | NEXT_CHUNK_ID |
id_merger_func | Callable[[list[str]], str] | The function used to merge the IDs of merged chunks. The function should receive a list of IDs of the chunks that are being merged and the output will be used as the ID of the merged chunk. | concatenate() |
content_merger_func | Callable[[list[str]], str] | The function used to merge the content of merged chunks. The function should receive a list of contents of the chunks that are being merged and the output will be used as the content of the merged chunk. | merge_overlapping_strings() |
metadata_merger | ChunkMetadataMerger \| None | The metadata merger used to merge the metadata of merged chunks. The merger should receive a list of metadata of the chunks that are being merged and the output will be used as the metadata of the merged chunk. If None, a default merger is used. | None |
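Both merger functions share the shape `Callable[[list[str]], str]`, so any function that takes a list of strings and returns one string can be passed in. The sketch below shows what two such callables might look like. The names `join_ids` and `merge_overlaps` are hypothetical, the `-` separator is an assumption, and the overlap logic only handles a suffix of one string matching a prefix of the next; it is not a reproduction of `MergerMethod.concatenate()` or `MergerMethod.merge_overlapping_strings()`.

```python
def join_ids(ids: list[str]) -> str:
    """Join chunk IDs into one ID. The "-" separator is an assumed choice."""
    return "-".join(ids)


def merge_overlaps(contents: list[str]) -> str:
    """Concatenate strings, dropping text repeated at each boundary."""
    merged = ""
    for text in contents:
        # Find the longest suffix of `merged` that is also a prefix of `text`.
        overlap = 0
        for size in range(min(len(merged), len(text)), 0, -1):
            if merged.endswith(text[:size]):
                overlap = size
                break
        # Append only the part of `text` that is not already present.
        merged += text[overlap:]
    return merged
```

Under these assumptions, `merge_overlaps(["The quick brown", "brown fox jumps"])` yields `"The quick brown fox jumps"`, and strings with no shared boundary text are simply concatenated.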
process_chunks(chunks)
async
Processes a list of chunks by gathering and merging related chunks.
This method iterates through the provided list of chunks, gathers related chunks, and merges them.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunks | list[Chunk] | The list of chunks to be processed. | required |
Returns:
Type | Description |
---|---|
list[Chunk] | A list of merged chunks, where related chunks are combined into a single chunk. |
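One way to picture the gather-and-merge step is to follow the previous/next links until each chain of adjacent chunks is collected, then merge every chain into one chunk. The sketch below is synchronous and illustrative only: `Chunk` is a hypothetical stand-in, `merge_adjacent` is an invented name, the key strings `prev_chunk_id` and `next_chunk_id` are assumed values, and the metadata handling (combining dictionaries and dropping the link keys) is a simplification rather than the behavior of the real `metadata_merger`. It also assumes the links are consistent, meaning that if A points to B as its next chunk, B points back to A.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Chunk:
    """Hypothetical stand-in for the library's Chunk type."""

    id: str
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def merge_adjacent(
    chunks: list[Chunk],
    prev_key: str = "prev_chunk_id",  # assumed key name
    next_key: str = "next_chunk_id",  # assumed key name
    id_merger: Callable[[list[str]], str] = "-".join,
    content_merger: Callable[[list[str]], str] = " ".join,
) -> list[Chunk]:
    """Group chunks linked by prev/next IDs and merge each group."""
    by_id = {chunk.id: chunk for chunk in chunks}
    visited: set[str] = set()
    merged: list[Chunk] = []

    for chunk in chunks:
        if chunk.id in visited:
            continue

        # Walk backwards to the first chunk of this chain.
        head = chunk
        seen_back = {head.id}
        while True:
            prev_id = head.metadata.get(prev_key)
            if prev_id not in by_id or prev_id in seen_back or prev_id in visited:
                break
            head = by_id[prev_id]
            seen_back.add(head.id)

        # Walk forwards from the head, collecting the whole chain.
        group: list[Chunk] = []
        node = head
        while node is not None and node.id not in visited:
            visited.add(node.id)
            group.append(node)
            node = by_id.get(node.metadata.get(next_key))

        # Simplified metadata merge: later chunks override earlier ones,
        # and the link keys are dropped from the merged chunk.
        metadata: dict[str, Any] = {}
        for member in group:
            for key, value in member.metadata.items():
                if key not in (prev_key, next_key):
                    metadata[key] = value

        merged.append(
            Chunk(
                id=id_merger([member.id for member in group]),
                content=content_merger([member.content for member in group]),
                metadata=metadata,
            )
        )

    return merged
```

Chunks without links pass through as single-member groups, and the `visited` and `seen_back` sets keep a cyclic link from looping forever.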