Chunk Processor

Modules concerning the chunk processors used in Gen AI applications.

`DedupeChunkProcessor`

Bases: BaseChunkProcessor

A chunk processor that removes duplicate chunks.

The DedupeChunkProcessor class provides functionality for processing a list of chunks by removing duplicates. The duplicates are determined based on the chunk's ID and content. It ensures that each chunk in the final list is unique, both in terms of its ID and its content.

Examples:

```python
import asyncio

from gllm_core.schema import Chunk
from gllm_retrieval.chunk_processor import DedupeChunkProcessor

chunks = [
    Chunk(id="chunk-1", content="Jakarta, Indonesia", metadata={"source": "source-1"}),
    Chunk(id="chunk-2", content="Kuala Lumpur, Malaysia", metadata={"source": "source-2"}),
    Chunk(id="chunk-3", content="Bangkok, Thailand", metadata={"source": "source-3"}),
    Chunk(id="chunk-1", content="Jakarta, Indonesia", metadata={"source": "source-1"}),
    Chunk(id="chunk-4", content="Kuala Lumpur, Malaysia", metadata={"source": "source-2"}),
]

processor = DedupeChunkProcessor()
result = asyncio.run(processor.process_chunks(chunks))
print(result)
```

Attributes: None

`process_chunks(chunks)` `async`

Processes a list of chunks by removing duplicate chunks.

This method removes duplicates based on each chunk's ID and a hash of its content. Contents are compared through their SHA-256 digests, which uses less memory than storing the full content of every chunk seen. When chunks with duplicate content are dropped, their metadata is stored in a metadata entry on the retained chunk.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `chunks` | `list[Chunk]` | A list of Chunk objects to be processed. | required |

Returns:

| Type | Description |
| --- | --- |
| `list[Chunk]` | A list of unique Chunk objects with duplicates removed. |
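The deduplication rule described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `SimpleChunk` is a hypothetical stand-in for `Chunk`, the `"duplicates"` metadata key is an assumed name, and the choice to check the ID before the content hash is also an assumption.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for gllm_core.schema.Chunk."""

    id: str
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def dedupe(chunks: list[SimpleChunk]) -> list[SimpleChunk]:
    """Keep the first chunk seen for each ID and for each content hash."""
    seen_ids: set[str] = set()
    kept_by_hash: dict[str, SimpleChunk] = {}
    result: list[SimpleChunk] = []
    for chunk in chunks:
        # Assumption: a repeated ID is dropped without further bookkeeping.
        if chunk.id in seen_ids:
            continue
        seen_ids.add(chunk.id)

        # Store only the 64-character digest, not the full content.
        digest = hashlib.sha256(chunk.content.encode("utf-8")).hexdigest()
        kept = kept_by_hash.get(digest)
        if kept is not None:
            # "duplicates" is an assumed key name for the dropped chunk's metadata.
            kept.metadata.setdefault("duplicates", []).append(chunk.metadata)
            continue

        kept_by_hash[digest] = chunk
        result.append(chunk)
    return result


chunks = [
    SimpleChunk("chunk-1", "Jakarta, Indonesia", {"source": "source-1"}),
    SimpleChunk("chunk-2", "Kuala Lumpur, Malaysia", {"source": "source-2"}),
    SimpleChunk("chunk-1", "Jakarta, Indonesia", {"source": "source-1"}),
    SimpleChunk("chunk-4", "Kuala Lumpur, Malaysia", {"source": "source-2"}),
]
print([chunk.id for chunk in dedupe(chunks)])
```

In this sketch the second `chunk-1` is dropped by ID, and `chunk-4` is dropped by content hash with its metadata recorded on `chunk-2`.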

`MergingChunkProcessor(prev_chunk_id_metadata=DefaultChunkMetadata.PREV_CHUNK_ID, next_chunk_id_metadata=DefaultChunkMetadata.NEXT_CHUNK_ID, id_merger_func=MergerMethod.concatenate(), content_merger_func=MergerMethod.merge_overlapping_strings(), metadata_merger=None)`

Bases: BaseChunkProcessor

A chunk processor that gathers and merges together adjacent chunks.

The MergingChunkProcessor class identifies related chunks based on their adjacent chunk IDs, merges the related chunks, and outputs a list of combined chunks. When merging chunks and their content, it handles overlaps and common prefixes to ensure smooth merging.
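To illustrate what merging overlapping content means, here is a minimal sketch of a suffix/prefix overlap merge. The function name `merge_overlapping` is hypothetical and this is not the code behind `MergerMethod.merge_overlapping_strings()`; it only handles the overlap case, and it assumes that strings with no overlap are simply concatenated.

```python
def merge_overlapping(parts: list[str]) -> str:
    """Join strings left to right, collapsing the longest suffix/prefix overlap."""
    if not parts:
        return ""
    merged = parts[0]
    for part in parts[1:]:
        overlap = 0
        # Try the longest possible overlap first and shrink until one matches.
        for size in range(min(len(merged), len(part)), 0, -1):
            if merged.endswith(part[:size]):
                overlap = size
                break
        # Append only the portion of `part` that is not already in `merged`.
        merged += part[overlap:]
    return merged


print(merge_overlapping(["Hello World!", "World! It is beautiful", "beautiful today, isn't it?"]))
# → Hello World! It is beautiful today, isn't it?
```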

Examples:

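```python
import asyncio

from gllm_core.schema import Chunk
from gllm_retrieval.chunk_processor import MergingChunkProcessor

# Illustrative metadata keys, passed to the processor explicitly so the example
# does not depend on the values of DefaultChunkMetadata.
prev_id_key = "previous_chunk_id"
next_id_key = "next_chunk_id"

chunks = [
    Chunk(
        id="chunk1",
        content="Hello World!",
        metadata={prev_id_key: "chunk0", next_id_key: "chunk2"},
    ),
    Chunk(
        id="chunk3",
        content="beautiful today, isn't it?",
        metadata={prev_id_key: "chunk2", next_id_key: "chunk4"},
    ),
    Chunk(
        id="chunk2",
        content="World! It is beautiful",
        metadata={prev_id_key: "chunk1", next_id_key: "chunk3"},
    ),
]

processor = MergingChunkProcessor(
    prev_chunk_id_metadata=prev_id_key,
    next_chunk_id_metadata=next_id_key,
)
result = asyncio.run(processor.process_chunks(chunks))
print(result)
```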

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `prev_chunk_id_metadata` | `str` | The metadata key for the previous chunk ID. |
| `next_chunk_id_metadata` | `str` | The metadata key for the next chunk ID. |
| `id_merger_func` | `Callable[[list[str]], str]` | The function used to merge the IDs of merged chunks. It receives the list of IDs of the chunks being merged, and its output is used as the ID of the merged chunk. |
| `content_merger_func` | `Callable[[list[str]], str]` | The function used to merge the content of merged chunks. It receives the list of contents of the chunks being merged, and its output is used as the content of the merged chunk. |
| `metadata_merger` | `ChunkMetadataMerger` | The metadata merger used to merge the metadata of merged chunks. It receives the list of metadata of the chunks being merged, and its output is used as the metadata of the merged chunk. |

Initializes a new instance of the MergingChunkProcessor class.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prev_chunk_id_metadata` | `str` | The metadata key for the previous chunk ID. Defaults to `DefaultChunkMetadata.PREV_CHUNK_ID`. | `PREV_CHUNK_ID` |
| `next_chunk_id_metadata` | `str` | The metadata key for the next chunk ID. Defaults to `DefaultChunkMetadata.NEXT_CHUNK_ID`. | `NEXT_CHUNK_ID` |
| `id_merger_func` | `Callable[[list[str]], str]` | The function used to merge the IDs of merged chunks. It receives the list of IDs of the chunks being merged, and its output is used as the ID of the merged chunk. Defaults to `MergerMethod.concatenate()`. | `concatenate()` |
| `content_merger_func` | `Callable[[list[str]], str]` | The function used to merge the content of merged chunks. It receives the list of contents of the chunks being merged, and its output is used as the content of the merged chunk. Defaults to `MergerMethod.merge_overlapping_strings()`. | `merge_overlapping_strings()` |
| `metadata_merger` | `ChunkMetadataMerger \| None` | The metadata merger used to merge the metadata of merged chunks. It receives the list of metadata of the chunks being merged, and its output is used as the metadata of the merged chunk. Defaults to None, in which case a default `ChunkMetadataMerger()` is used. | `None` |
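Because `id_merger_func` and `content_merger_func` are typed as `Callable[[list[str]], str]`, any function with that shape should be usable in their place. The two functions below are hypothetical examples written for illustration, not part of the library, and the commented-out constructor call assumes only the parameter names listed in the table.

```python
def join_ids_with_plus(ids: list[str]) -> str:
    """Hypothetical ID merger: 'chunk1+chunk2+chunk3'."""
    return "+".join(ids)


def join_contents_with_newline(contents: list[str]) -> str:
    """Hypothetical content merger: one stripped chunk per line, no overlap handling."""
    return "\n".join(content.strip() for content in contents)


# Assumed usage, based on the constructor parameters documented above:
# processor = MergingChunkProcessor(
#     id_merger_func=join_ids_with_plus,
#     content_merger_func=join_contents_with_newline,
# )

print(join_ids_with_plus(["chunk1", "chunk2"]))
print(join_contents_with_newline(["Hello World! ", " It is beautiful"]))
```

A custom content merger like this one gives up the overlap handling of the default, so it fits best when the chunks are known not to overlap.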

`process_chunks(chunks)` `async`

Processes a list of chunks by gathering and merging related chunks.

This method iterates through the provided list of chunks, gathers related chunks, and merges them.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `chunks` | `list[Chunk]` | The list of chunks to be processed. | required |

Returns:

| Type | Description |
| --- | --- |
| `list[Chunk]` | A list of merged chunks, where related chunks are combined into a single chunk. |
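One way to picture the gathering step is as following previous/next links to build ordered groups of chunk IDs. The sketch below is an assumption about how such grouping could work, not the library's algorithm: `gather_chains` is a hypothetical helper, and it takes a plain mapping of chunk ID to `(previous ID, next ID)` instead of `Chunk` objects.

```python
from typing import Optional

Links = dict[str, tuple[Optional[str], Optional[str]]]


def gather_chains(links: Links) -> list[list[str]]:
    """Group chunk IDs into ordered chains by following next-chunk links."""
    chains: list[list[str]] = []
    visited: set[str] = set()
    for chunk_id, (prev_id, _next_id) in links.items():
        # A chain starts at a chunk whose previous chunk is not in the input.
        if prev_id in links:
            continue
        chain: list[str] = []
        current: Optional[str] = chunk_id
        # Walk forward until the next chunk is missing or was already used.
        while current is not None and current in links and current not in visited:
            visited.add(current)
            chain.append(current)
            current = links[current][1]
        if chain:
            chains.append(chain)
    # Limitation of this sketch: a pure cycle has no starting chunk and is skipped.
    return chains


links = {
    "chunk1": ("chunk0", "chunk2"),
    "chunk3": ("chunk2", "chunk4"),
    "chunk2": ("chunk1", "chunk3"),
}
print(gather_chains(links))
# → [['chunk1', 'chunk2', 'chunk3']]
```

Each resulting group would then be handed to the ID, content, and metadata mergers to produce one merged chunk.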
