
Chunk Processor

Modules concerning the chunk processors used in Gen AI applications.

ChunkMetadataMerger(merger_func_map=None, default_merger_func=None, retained_keys=None)

A helper class to merge metadata from multiple chunks.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `merger_func_map` | `dict[str, Callable[[list[Any]], Any]]` | A mapping of metadata keys to merger functions. |
| `default_merger_func` | `Callable[[list[Any]], Any]` | The default merger function for metadata keys that are not present in `merger_func_map`. |
| `retained_keys` | `set[str] \| None` | The keys that should be retained in the merged metadata. If `None`, all intersection keys are retained. |

Initializes a new instance of the ChunkMetadataMerger class.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `merger_func_map` | `dict[str, Callable[[list[Any]], Any]] \| None` | A mapping of metadata keys to merger functions. Defaults to `None`, in which case a default merger map is used. The default merger map (1) picks the first value of the `PREV_CHUNK_ID` key and (2) picks the last value of the `NEXT_CHUNK_ID` key. | `None` |
| `default_merger_func` | `Callable[[list[Any]], Any] \| None` | The default merger for metadata keys that are not present in `merger_func_map`. Defaults to `None`, in which case a default merger that picks the first value is used. | `None` |
| `retained_keys` | `set[str] \| None` | The keys that should be retained in the merged metadata. Defaults to `None`, in which case all intersection keys are retained. | `None` |

merge(metadatas)

Merges metadata from multiple chunks.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `metadatas` | `list[dict[str, Any]]` | The metadata to merge. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | The merged metadata. |
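The merging rules described above can be pictured with a small, self-contained sketch. This is not the library's implementation: `merge_metadata_sketch` is a hypothetical stand-in, and it assumes that "intersection keys" means keys present in every input dict and that `retained_keys` further narrows that set.

```python
from __future__ import annotations

from typing import Any, Callable

Merger = Callable[[list[Any]], Any]


def merge_metadata_sketch(
    metadatas: list[dict[str, Any]],
    merger_func_map: dict[str, Merger] | None = None,
    default_merger_func: Merger | None = None,
    retained_keys: set[str] | None = None,
) -> dict[str, Any]:
    """Hypothetical illustration of per-key metadata merging (not the library code)."""
    if not metadatas:
        return {}
    merger_func_map = merger_func_map or {}
    # Assumption: the fallback merger picks the first value, as the docs describe.
    default_merger_func = default_merger_func or (lambda values: values[0])

    # Assumption: only keys present in every metadata dict are merged.
    common = set(metadatas[0])
    for metadata in metadatas[1:]:
        common &= set(metadata)
    if retained_keys is not None:
        common &= retained_keys

    merged: dict[str, Any] = {}
    for key in sorted(common):
        values = [metadata[key] for metadata in metadatas]
        merged[key] = merger_func_map.get(key, default_merger_func)(values)
    return merged


merged = merge_metadata_sketch(
    [
        {"source": "a.pdf", "page": 1, "prev": "c0", "next": "c2"},
        {"source": "a.pdf", "page": 2, "prev": "c1", "next": "c3", "extra": True},
    ],
    merger_func_map={"prev": lambda v: v[0], "next": lambda v: v[-1], "page": min},
)
print(merged)
```

Here `extra` is dropped because it appears in only one input, `page` uses a custom merger, and `source` falls back to the first-value default.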

DedupeChunkProcessor

Bases: BaseChunkProcessor

A chunk processor that removes duplicate chunks.

The DedupeChunkProcessor class provides functionality for processing a list of chunks by removing duplicates. The duplicates are determined based on the chunk's ID and content. It ensures that each chunk in the final list is unique, both in terms of its ID and its content.

Examples:

```python
import asyncio

from gllm_core.schema import Chunk
from gllm_retrieval.chunk_processor import DedupeChunkProcessor

chunks = [
    Chunk(id="chunk-1", content="Jakarta, Indonesia", metadata={"source": "source-1"}),
    Chunk(id="chunk-2", content="Kuala Lumpur, Malaysia", metadata={"source": "source-2"}),
    Chunk(id="chunk-3", content="Bangkok, Thailand", metadata={"source": "source-3"}),
    # Repeats the ID and content of the first chunk.
    Chunk(id="chunk-1", content="Jakarta, Indonesia", metadata={"source": "source-1"}),
    # New ID, but repeats the content of "chunk-2".
    Chunk(id="chunk-4", content="Kuala Lumpur, Malaysia", metadata={"source": "source-2"}),
]

processor = DedupeChunkProcessor()
result = asyncio.run(processor.process_chunks(chunks))
print(result)
```

Attributes: None
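To make the "unique by ID and by content" rule concrete, here is a minimal sketch that runs without the library. `SketchChunk` and `dedupe_sketch` are hypothetical names, and keeping the first occurrence of each duplicate is an assumption; the actual processor may choose differently.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchChunk:
    """Hypothetical stand-in for gllm_core.schema.Chunk."""

    id: str
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def dedupe_sketch(chunks: list[SketchChunk]) -> list[SketchChunk]:
    """Keep a chunk only if neither its ID nor its content has been seen."""
    seen_ids: set[str] = set()
    seen_contents: set[str] = set()
    unique: list[SketchChunk] = []
    for chunk in chunks:
        if chunk.id in seen_ids or chunk.content in seen_contents:
            continue  # Assumption: the first occurrence wins.
        seen_ids.add(chunk.id)
        seen_contents.add(chunk.content)
        unique.append(chunk)
    return unique


sample = [
    SketchChunk("chunk-1", "Jakarta, Indonesia"),
    SketchChunk("chunk-2", "Kuala Lumpur, Malaysia"),
    SketchChunk("chunk-3", "Bangkok, Thailand"),
    SketchChunk("chunk-1", "Jakarta, Indonesia"),
    SketchChunk("chunk-4", "Kuala Lumpur, Malaysia"),
]
kept_ids = [chunk.id for chunk in dedupe_sketch(sample)]
print(kept_ids)  # ['chunk-1', 'chunk-2', 'chunk-3']
```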

MergingChunkProcessor(prev_chunk_id_metadata=DefaultChunkMetadata.PREV_CHUNK_ID, next_chunk_id_metadata=DefaultChunkMetadata.NEXT_CHUNK_ID, id_merger_func=concatenate('-'), content_merger_func=merge_overlapping_strings('\n'), metadata_merger=None)

Bases: BaseChunkProcessor

A chunk processor that gathers and merges together adjacent chunks.

The MergingChunkProcessor class identifies related chunks based on their adjacent chunk IDs, merges the related chunks, and outputs a list of combined chunks. When merging chunks and their content, it handles overlaps and common prefixes to ensure smooth merging.

Examples:

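import asyncio

from gllm_core.schema import Chunk
from gllm_retrieval.chunk_processor import MergingChunkProcessor

# Illustrative metadata key names; they are passed to the processor explicitly below.
prev_id_key = "previous_chunk_id"
next_id_key = "next_chunk_id"

chunks = [
    Chunk(
        id="chunk1",
        content="Hello World!",
        metadata={prev_id_key: "chunk0", next_id_key: "chunk2"},
    ),
    Chunk(
        id="chunk3",
        content="beautiful today, isn't it?",
        metadata={prev_id_key: "chunk2", next_id_key: "chunk4"},
    ),
    Chunk(
        id="chunk2",
        content="World! It is beautiful",
        metadata={prev_id_key: "chunk1", next_id_key: "chunk3"},
    ),
]

# Tell the processor which metadata keys link each chunk to its neighbours.
processor = MergingChunkProcessor(
    prev_chunk_id_metadata=prev_id_key,
    next_chunk_id_metadata=next_id_key,
)
result = asyncio.run(processor.process_chunks(chunks))
print(result)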

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `prev_chunk_id_metadata` | `str` | The metadata key for the previous chunk ID. |
| `next_chunk_id_metadata` | `str` | The metadata key for the next chunk ID. |
| `id_merger_func` | `Callable[[list[str]], str]` | The function used to merge the IDs of merged chunks. It receives the list of IDs of the chunks being merged, and its output is used as the ID of the merged chunk. |
| `content_merger_func` | `Callable[[list[str]], str]` | The function used to merge the content of merged chunks. It receives the list of contents of the chunks being merged, and its output is used as the content of the merged chunk. |
| `metadata_merger` | `ChunkMetadataMerger` | The metadata merger used to merge the metadata of merged chunks. It receives the list of metadata of the chunks being merged, and its output is used as the metadata of the merged chunk. |

Initializes a new instance of the MergingChunkProcessor class.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prev_chunk_id_metadata` | `str` | The metadata key for the previous chunk ID. Defaults to `DefaultChunkMetadata.PREV_CHUNK_ID`. | `PREV_CHUNK_ID` |
| `next_chunk_id_metadata` | `str` | The metadata key for the next chunk ID. Defaults to `DefaultChunkMetadata.NEXT_CHUNK_ID`. | `NEXT_CHUNK_ID` |
| `id_merger_func` | `Callable[[list[str]], str]` | The function used to merge the IDs of merged chunks. It receives the list of IDs of the chunks being merged, and its output is used as the ID of the merged chunk. Defaults to `concatenate("-")`. | `concatenate('-')` |
| `content_merger_func` | `Callable[[list[str]], str]` | The function used to merge the content of merged chunks. It receives the list of contents of the chunks being merged, and its output is used as the content of the merged chunk. Defaults to `merge_overlapping_strings("\n")`. | `merge_overlapping_strings('\n')` |
| `metadata_merger` | `ChunkMetadataMerger \| None` | The metadata merger used to merge the metadata of merged chunks. It receives the list of metadata of the chunks being merged, and its output is used as the metadata of the merged chunk. Defaults to `None`, in which case a default `ChunkMetadataMerger()` is used. | `None` |
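How adjacent chunks might be gathered before merging can be sketched as follows. This is an assumption-laden illustration, not the processor's actual algorithm: `group_adjacent_sketch` is a hypothetical helper, chunks are modelled as plain dicts, and chains are built by following the previous/next ID links among the chunks that are actually present in the input.

```python
from typing import Any


def group_adjacent_sketch(
    chunks: list[dict[str, Any]], prev_key: str, next_key: str
) -> list[list[str]]:
    """Group chunk IDs into ordered chains by following prev/next links."""
    by_id = {chunk["id"]: chunk for chunk in chunks}
    visited: set[str] = set()
    groups: list[list[str]] = []

    for chunk in chunks:
        if chunk["id"] in visited:
            continue
        # Walk backwards to the head of the chain this chunk belongs to.
        head = chunk
        seen = {head["id"]}
        while True:
            prev_id = head["metadata"].get(prev_key)
            if prev_id not in by_id or prev_id in seen or prev_id in visited:
                break
            head = by_id[prev_id]
            seen.add(prev_id)
        # Walk forwards from the head, collecting IDs in reading order.
        group: list[str] = []
        current = head
        while current is not None and current["id"] not in visited:
            visited.add(current["id"])
            group.append(current["id"])
            current = by_id.get(current["metadata"].get(next_key))
        groups.append(group)
    return groups


sample = [
    {"id": "chunk1", "metadata": {"prev": "chunk0", "next": "chunk2"}},
    {"id": "chunk3", "metadata": {"prev": "chunk2", "next": "chunk4"}},
    {"id": "chunk2", "metadata": {"prev": "chunk1", "next": "chunk3"}},
    {"id": "solo", "metadata": {}},
]
groups = group_adjacent_sketch(sample, "prev", "next")
print(groups)  # [['chunk1', 'chunk2', 'chunk3'], ['solo']]
print(["-".join(group) for group in groups])  # IDs joined the way concatenate("-") suggests
```

Each group would then be passed to the ID, content, and metadata mergers to produce one combined chunk.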

concatenate(delimiter='-')

Creates a function that concatenates a list of values with a delimiter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `delimiter` | `str` | The delimiter to use when concatenating the values. Defaults to `"-"`. | `'-'` |

Returns:

| Type | Description |
| --- | --- |
| `Callable[[list[Any]], str]` | A function that concatenates a list of values with the delimiter. |
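The factory pattern described here (a function that returns a configured merger function) can be sketched in a few lines. `concatenate_sketch` is a hypothetical re-creation for illustration; converting each value with `str()` is an assumption about how non-string values are handled.

```python
from typing import Any, Callable


def concatenate_sketch(delimiter: str = "-") -> Callable[[list[Any]], str]:
    """Return a function that joins a list of values with `delimiter`."""

    def _concatenate(values: list[Any]) -> str:
        # Assumption: values are stringified before joining.
        return delimiter.join(str(value) for value in values)

    return _concatenate


join_ids = concatenate_sketch("-")
print(join_ids(["chunk1", "chunk2", "chunk3"]))  # chunk1-chunk2-chunk3
```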

merge_overlapping_strings(delimiter='\n', min_overlap=1, max_window=200)

Creates a function that merges a list of strings, handling common prefixes and overlaps.

The created function will:

1. Identify and remove any common prefix shared by the strings.
2. Process each pair of adjacent strings to remove the text where the end of one overlaps the start of the next.
3. Join the cleaned strings together, with the common prefix restored at the beginning.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `delimiter` | `str` | The delimiter to use when merging the values. Defaults to `"\n"`. | `'\n'` |
| `min_overlap` | `int` | Minimum overlap length to consider valid. Defaults to 1. | `1` |
| `max_window` | `int` | Maximum window size to search for overlaps. Defaults to 200. | `200` |

Returns:

| Type | Description |
| --- | --- |
| `Callable[[list[str]], str]` | A function that merges a list of strings, handling common prefixes and overlaps. |
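The three documented steps (strip the common prefix, trim pairwise overlaps, join) can be sketched as below. This is a hypothetical re-creation, not the library's code: `merge_overlapping_strings_sketch` assumes an overlap means a suffix of one string equal to a prefix of the next, searches the longest candidate first within `max_window` characters, and skips empty pieces when joining.

```python
import os
from typing import Callable


def merge_overlapping_strings_sketch(
    delimiter: str = "\n", min_overlap: int = 1, max_window: int = 200
) -> Callable[[list[str]], str]:
    """Return a merger that removes a shared prefix and pairwise overlaps."""

    def _overlap(left: str, right: str) -> int:
        # Longest suffix of `left` that is also a prefix of `right`.
        limit = min(len(left), len(right), max_window)
        for size in range(limit, max(min_overlap, 1) - 1, -1):
            if left.endswith(right[:size]):
                return size
        return 0

    def _merge(strings: list[str]) -> str:
        if not strings:
            return ""
        if len(strings) == 1:
            return strings[0]
        # Step 1: character-wise common prefix shared by all strings.
        prefix = os.path.commonprefix(strings)
        bodies = [text[len(prefix):] for text in strings]
        # Step 2: drop the part of each string that repeats the end of its predecessor.
        cleaned = [bodies[0]]
        for previous, current in zip(bodies, bodies[1:]):
            cleaned.append(current[_overlap(previous, current):])
        # Step 3: join the non-empty pieces and restore the prefix.
        return prefix + delimiter.join(part for part in cleaned if part)

    return _merge


merge = merge_overlapping_strings_sketch(delimiter="")
print(merge(["Hello World!", "World! It is beautiful", "beautiful today, isn't it?"]))
# Hello World! It is beautiful today, isn't it?
```

An empty delimiter is used in the usage line so the trimmed pieces read as one sentence; with the documented default of `"\n"`, each piece would land on its own line.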

pick_first(values)

Picks the first value from a list of values.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `values` | `list[Any]` | The values to pick from. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Any` | The first value from the list. |

pick_last(values)

Picks the last value from a list of values.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `values` | `list[Any]` | The values to pick from. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Any` | The last value from the list. |
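A short sketch of why these two pickers suit the previous/next chunk ID keys: when several adjacent chunks are merged, the combined chunk's predecessor is the first chunk's predecessor and its successor is the last chunk's successor. The `_sketch` functions are hypothetical stand-ins, and their behavior on an empty list (raising `IndexError`) is an assumption.

```python
from typing import Any


def pick_first_sketch(values: list[Any]) -> Any:
    """Return the first value; raises IndexError on an empty list."""
    return values[0]


def pick_last_sketch(values: list[Any]) -> Any:
    """Return the last value; raises IndexError on an empty list."""
    return values[-1]


# Previous/next IDs of three adjacent chunks: chunk1, chunk2, chunk3.
prev_ids = ["chunk0", "chunk1", "chunk2"]
next_ids = ["chunk2", "chunk3", "chunk4"]

merged_prev = pick_first_sketch(prev_ids)
merged_next = pick_last_sketch(next_ids)
print(merged_prev, merged_next)  # chunk0 chunk4
```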