Chunk Processor

Modules concerning the chunk processors used in Gen AI applications.

DedupeChunkProcessor

Bases: BaseChunkProcessor

A chunk processor that removes duplicate chunks.

The DedupeChunkProcessor class removes duplicates from a list of chunks. Duplicates are identified by chunk ID and by content, so every chunk in the final list has both a unique ID and unique content.

process_chunks(chunks) async

Processes a list of chunks by removing duplicate chunks.

This method eliminates duplicates based on each chunk's ID and content hash. Contents are compared through their SHA-256 hashes, which uses less memory than storing the full content of every chunk seen. When chunks with duplicate content are dropped, their metadata is stored in an additional metadata entry on the retained chunk.

Parameters:

chunks (list[Chunk]): A list of Chunk objects to be processed. Required.

Returns:

list[Chunk]: A list of unique Chunk objects with duplicates removed.
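
The deduplication described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the `Chunk` dataclass is a hypothetical stand-in for the real type, and the metadata key `DUPLICATES_KEY` is an invented name, since this reference does not state which key the processor uses.

```python
import asyncio
import hashlib
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Chunk:
    """Hypothetical stand-in for the library's Chunk type."""
    id: str
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


# Hypothetical key name; the real key is not given in this reference.
DUPLICATES_KEY = "duplicate_chunks_metadata"


async def dedupe_chunks(chunks: list[Chunk]) -> list[Chunk]:
    seen_ids: set[str] = set()
    kept_by_hash: dict[str, Chunk] = {}
    unique: list[Chunk] = []
    for chunk in chunks:
        if chunk.id in seen_ids:
            continue  # same ID as an earlier chunk: drop it
        seen_ids.add(chunk.id)
        # Store only the fixed-size digest, not the full content.
        digest = hashlib.sha256(chunk.content.encode("utf-8")).hexdigest()
        kept = kept_by_hash.get(digest)
        if kept is not None:
            # Same content as a retained chunk: keep its metadata there, drop the chunk.
            kept.metadata.setdefault(DUPLICATES_KEY, []).append(dict(chunk.metadata))
            continue
        kept_by_hash[digest] = chunk
        unique.append(chunk)
    return unique


chunks = [
    Chunk("a", "hello", {"source": "doc1"}),
    Chunk("a", "hello again", {"source": "doc1"}),  # duplicate ID
    Chunk("b", "hello", {"source": "doc2"}),        # duplicate content
    Chunk("c", "world", {"source": "doc3"}),
]
result = asyncio.run(dedupe_chunks(chunks))
print([c.id for c in result])
```

In this sketch the first chunk with a given ID or content wins; whether the real processor keeps the first or the last occurrence is not stated here.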

MergingChunkProcessor(prev_chunk_id_metadata=DefaultChunkMetadata.PREV_CHUNK_ID, next_chunk_id_metadata=DefaultChunkMetadata.NEXT_CHUNK_ID, id_merger_func=MergerMethod.concatenate(), content_merger_func=MergerMethod.merge_overlapping_strings(), metadata_merger=None)

Bases: BaseChunkProcessor

A chunk processor that gathers and merges together adjacent chunks.

The MergingChunkProcessor class identifies related chunks based on their adjacent chunk IDs, merges the related chunks, and outputs a list of combined chunks. When merging chunks and their content, it handles overlaps and common prefixes to ensure smooth merging.
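
To illustrate what overlap handling during a merge can look like, here is a small sketch that joins strings while dropping text repeated at the seam. It is an assumption about the general technique only; the function name `merge_overlapping` is hypothetical, and the actual behaviour of MergerMethod.merge_overlapping_strings() (including its treatment of common prefixes) may differ.

```python
def merge_overlapping(parts: list[str]) -> str:
    """Join strings, removing the longest suffix/prefix overlap at each seam."""
    merged = ""
    for part in parts:
        overlap = 0
        # Try the longest possible overlap first and shrink until one matches.
        for size in range(min(len(merged), len(part)), 0, -1):
            if merged.endswith(part[:size]):
                overlap = size
                break
        merged += part[overlap:]
    return merged


merged_text = merge_overlapping(["The quick brown", "brown fox jumps", "jumps over"])
print(merged_text)
```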

Attributes:

prev_chunk_id_metadata (str): The metadata key for the previous chunk ID.

next_chunk_id_metadata (str): The metadata key for the next chunk ID.

id_merger_func (Callable[[list[str]], str]): The function used to merge the IDs of merged chunks. It receives the list of IDs of the chunks being merged, and its output is used as the ID of the merged chunk.

content_merger_func (Callable[[list[str]], str]): The function used to merge the content of merged chunks. It receives the list of contents of the chunks being merged, and its output is used as the content of the merged chunk.

metadata_merger (ChunkMetadataMerger): The metadata merger used to merge the metadata of merged chunks. It receives the list of metadata of the chunks being merged, and its output is used as the metadata of the merged chunk.

Initializes a new instance of the MergingChunkProcessor class.

Parameters:

prev_chunk_id_metadata (str): The metadata key for the previous chunk ID. Defaults to DefaultChunkMetadata.PREV_CHUNK_ID.

next_chunk_id_metadata (str): The metadata key for the next chunk ID. Defaults to DefaultChunkMetadata.NEXT_CHUNK_ID.

id_merger_func (Callable[[list[str]], str]): The function used to merge the IDs of merged chunks. It receives the list of IDs of the chunks being merged, and its output is used as the ID of the merged chunk. Defaults to MergerMethod.concatenate().

content_merger_func (Callable[[list[str]], str]): The function used to merge the content of merged chunks. It receives the list of contents of the chunks being merged, and its output is used as the content of the merged chunk. Defaults to MergerMethod.merge_overlapping_strings().

metadata_merger (ChunkMetadataMerger | None): The metadata merger used to merge the metadata of merged chunks. It receives the list of metadata of the chunks being merged, and its output is used as the metadata of the merged chunk. Defaults to None, in which case a default ChunkMetadataMerger() is used.
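
Both merger functions share the signature Callable[[list[str]], str], so any plain function of that shape could in principle be supplied as id_merger_func or content_merger_func. The two functions below are hypothetical examples written for illustration; they are not part of the library.

```python
from typing import Callable


def join_ids(ids: list[str]) -> str:
    """Build a merged ID that keeps every source ID visible."""
    return "+".join(ids)


def join_paragraphs(contents: list[str]) -> str:
    """Join non-empty contents as separate paragraphs, ignoring overlaps."""
    return "\n\n".join(c.strip() for c in contents if c.strip())


# Both satisfy the documented callable type.
id_merger: Callable[[list[str]], str] = join_ids
content_merger: Callable[[list[str]], str] = join_paragraphs

print(id_merger(["c1", "c2"]))
print(content_merger([" first ", "", "second"]))
```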

process_chunks(chunks) async

Processes a list of chunks by gathering and merging related chunks.

This method iterates through the provided list of chunks, gathers related chunks, and merges them.

Parameters:

chunks (list[Chunk]): The list of chunks to be processed. Required.

Returns:

list[Chunk]: A list of merged chunks, where related chunks are combined into a single chunk.
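
The gathering step can be pictured as following previous/next links until a whole run of adjacent chunks is collected. The sketch below shows one way to do that under stated assumptions: `Chunk` is a hypothetical stand-in, and the key names `PREV_KEY` and `NEXT_KEY` are invented, since the actual values of DefaultChunkMetadata.PREV_CHUNK_ID and NEXT_CHUNK_ID are not given here. It is not the library's algorithm.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical metadata keys standing in for the configurable prev/next keys.
PREV_KEY = "prev_id"
NEXT_KEY = "next_id"


@dataclass
class Chunk:
    """Hypothetical stand-in for the library's Chunk type."""
    id: str
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def gather_chains(chunks: list[Chunk]) -> list[list[Chunk]]:
    by_id = {c.id: c for c in chunks}
    # Build one "next" map from both link directions, ignoring links
    # that point at chunks missing from the input.
    next_of: dict[str, str] = {}
    for c in chunks:
        nxt = c.metadata.get(NEXT_KEY)
        if nxt in by_id:
            next_of.setdefault(c.id, nxt)
        prv = c.metadata.get(PREV_KEY)
        if prv in by_id:
            next_of.setdefault(prv, c.id)
    has_prev = set(next_of.values())
    # Start from chain heads; the trailing full pass catches cyclic links.
    starts = [c for c in chunks if c.id not in has_prev] + list(chunks)
    visited: set[str] = set()
    chains: list[list[Chunk]] = []
    for start in starts:
        chain: list[Chunk] = []
        node_id = start.id
        while node_id is not None and node_id not in visited:
            visited.add(node_id)
            chain.append(by_id[node_id])
            node_id = next_of.get(node_id)
        if chain:
            chains.append(chain)
    return chains


chunks = [
    Chunk("c2", "second", {PREV_KEY: "c1", NEXT_KEY: "c3"}),
    Chunk("c1", "first", {NEXT_KEY: "c2"}),
    Chunk("c3", "third", {PREV_KEY: "c2"}),
    Chunk("x1", "unrelated"),
]
chain_ids = [[c.id for c in chain] for chain in gather_chains(chunks)]
print(chain_ids)
```

Each resulting chain would then be passed to the ID, content, and metadata mergers to produce one combined chunk.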