Chunk Processor
Modules concerning the chunk processors used in Gen AI applications.
ChunkMetadataMerger(merger_func_map=None, default_merger_func=None, retained_keys=None)
A helper class to merge metadata from multiple chunks.
Attributes:
| Name | Type | Description |
|---|---|---|
| merger_func_map | dict[str, Callable[[list[Any]], Any]] | A mapping of metadata keys to merger functions. |
| default_merger_func | Callable[[list[Any]], Any] | The default merger function for metadata keys that are not present in the merger_func_map. |
| retained_keys | set[str] \| None | The keys that should be retained in the merged metadata. If None, all intersection keys are retained. |
Initializes a new instance of the ChunkMetadataMerger class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| merger_func_map | dict[str, Callable[[list[Any]], Any]] \| None | A mapping of metadata keys to merger functions. Defaults to None, in which case a default merger map is used. The default merger map: (1) picks the first value of the PREV_CHUNK_ID key; (2) picks the last value of the NEXT_CHUNK_ID key. | None |
| default_merger_func | Callable[[list[Any]], Any] \| None | The default merger for metadata keys that are not present in the merger_func_map. Defaults to None, in which case a default merger that picks the first value is used. | None |
| retained_keys | set[str] \| None | The keys that should be retained in the merged metadata. Defaults to None, in which case all intersection keys are retained. | None |
merge(metadatas)
Merges metadata from multiple chunks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadatas | list[dict[str, Any]] | The metadata to merge. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | The merged metadata. |
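The merging rules described for ChunkMetadataMerger can be illustrated with a small standalone sketch. It does not import the library: `merge_metadata_sketch` and the metadata keys `"prev"` and `"next"` are hypothetical names, and the sketch assumes that "intersection keys" means the keys present in every metadata dict. The real class may differ in details such as key ordering or handling of empty input.

```python
from __future__ import annotations

from typing import Any, Callable

Merger = Callable[[list], Any]


def merge_metadata_sketch(
    metadatas: list[dict[str, Any]],
    merger_func_map: dict[str, Merger],
    default_merger_func: Merger,
    retained_keys: set[str] | None = None,
) -> dict[str, Any]:
    """Illustrative only: merge metadata dicts key by key."""
    if not metadatas:
        return {}
    # Assumption: "intersection keys" are the keys present in every dict.
    common = set(metadatas[0])
    for metadata in metadatas[1:]:
        common &= set(metadata)
    # When retained_keys is given, keep only those of the common keys.
    if retained_keys is not None:
        common &= retained_keys
    merged: dict[str, Any] = {}
    for key in sorted(common):
        # Use the key-specific merger if one exists, else the default.
        merger = merger_func_map.get(key, default_merger_func)
        merged[key] = merger([metadata[key] for metadata in metadatas])
    return merged


merged = merge_metadata_sketch(
    [
        {"prev": "c0", "next": "c2", "source": "a.txt", "page": 1},
        {"prev": "c1", "next": "c3", "source": "a.txt"},
    ],
    merger_func_map={"prev": lambda v: v[0], "next": lambda v: v[-1]},
    default_merger_func=lambda v: v[0],
)
print(merged)
```

In this sketch the `page` key is dropped because it is missing from the second dict, the first `prev` and the last `next` are kept, and `source` falls back to the pick-first default.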
DedupeChunkProcessor
Bases: BaseChunkProcessor
A chunk processor that removes duplicate chunks.
The DedupeChunkProcessor class processes a list of chunks by removing duplicates.
Duplicates are determined by the chunk's ID and content, so each chunk in the final list
is unique in both its ID and its content.
Examples:
```python
import asyncio

from gllm_core.schema import Chunk
from gllm_retrieval.chunk_processor import DedupeChunkProcessor

chunks = [
    Chunk(id="chunk-1", content="Jakarta, Indonesia", metadata={"source": "source-1"}),
    Chunk(id="chunk-2", content="Kuala Lumpur, Malaysia", metadata={"source": "source-2"}),
    Chunk(id="chunk-3", content="Bangkok, Thailand", metadata={"source": "source-3"}),
    Chunk(id="chunk-1", content="Jakarta, Indonesia", metadata={"source": "source-1"}),
    Chunk(id="chunk-4", content="Kuala Lumpur, Malaysia", metadata={"source": "source-2"}),
]

# process_chunks is awaited here via asyncio.run.
processor = DedupeChunkProcessor()
result = asyncio.run(processor.process_chunks(chunks))
print(result)
```
Attributes: None
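The deduplication rule described above (unique by ID and by content) can be sketched without the library. `ChunkStandIn` and `dedupe_sketch` are hypothetical names used only for illustration, and the sketch assumes that the first occurrence is kept and input order is preserved; the actual processor is asynchronous and may behave differently in those respects.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ChunkStandIn:
    """Hypothetical stand-in for gllm_core.schema.Chunk."""

    id: str
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def dedupe_sketch(chunks: list[ChunkStandIn]) -> list[ChunkStandIn]:
    """Drop any chunk whose ID or content has already been seen."""
    seen_ids: set[str] = set()
    seen_contents: set[str] = set()
    unique: list[ChunkStandIn] = []
    for chunk in chunks:
        # Assumption: a repeated ID or a repeated content both count as duplicates.
        if chunk.id in seen_ids or chunk.content in seen_contents:
            continue
        seen_ids.add(chunk.id)
        seen_contents.add(chunk.content)
        unique.append(chunk)
    return unique


chunks = [
    ChunkStandIn("chunk-1", "Jakarta, Indonesia"),
    ChunkStandIn("chunk-2", "Kuala Lumpur, Malaysia"),
    ChunkStandIn("chunk-3", "Bangkok, Thailand"),
    ChunkStandIn("chunk-1", "Jakarta, Indonesia"),
    ChunkStandIn("chunk-4", "Kuala Lumpur, Malaysia"),
]
unique_ids = [chunk.id for chunk in dedupe_sketch(chunks)]
print(unique_ids)
```

Under these assumptions the second `chunk-1` is dropped for its repeated ID and `chunk-4` is dropped for repeating the content of `chunk-2`.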
MergingChunkProcessor(prev_chunk_id_metadata=DefaultChunkMetadata.PREV_CHUNK_ID, next_chunk_id_metadata=DefaultChunkMetadata.NEXT_CHUNK_ID, id_merger_func=concatenate('-'), content_merger_func=merge_overlapping_strings('\n'), metadata_merger=None)
Bases: BaseChunkProcessor
A chunk processor that gathers and merges together adjacent chunks.
The MergingChunkProcessor class identifies related chunks based on their adjacent chunk IDs,
merges the related chunks, and outputs a list of combined chunks. When merging chunks and their content,
it handles overlaps and common prefixes to ensure smooth merging.
Examples:
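```python
import asyncio

from gllm_core.schema import Chunk
from gllm_retrieval.chunk_processor import MergingChunkProcessor

# Illustrative metadata key names; they are passed to the processor below.
prev_id_key = "prev_chunk_id"
next_id_key = "next_chunk_id"

chunks = [
    Chunk(
        id="chunk1",
        content="Hello World!",
        metadata={prev_id_key: "chunk0", next_id_key: "chunk2"},
    ),
    Chunk(
        id="chunk3",
        content="beautiful today, isn't it?",
        metadata={prev_id_key: "chunk2", next_id_key: "chunk4"},
    ),
    Chunk(
        id="chunk2",
        content="World! It is beautiful",
        metadata={prev_id_key: "chunk1", next_id_key: "chunk3"},
    ),
]

# Tell the processor which metadata keys hold the adjacent chunk IDs.
processor = MergingChunkProcessor(
    prev_chunk_id_metadata=prev_id_key,
    next_chunk_id_metadata=next_id_key,
)
result = asyncio.run(processor.process_chunks(chunks))
print(result)
```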
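One way to picture how adjacent chunks are gathered before merging is to follow the next-chunk links starting from each chunk whose previous chunk is absent from the input. The sketch below is an assumption about the general approach, not the library's implementation: `group_adjacent_sketch`, the plain-dict chunk representation, and the `"prev"` and `"next"` keys are all hypothetical.

```python
from __future__ import annotations

from typing import Any


def group_adjacent_sketch(
    chunks: list[dict[str, Any]], prev_key: str = "prev", next_key: str = "next"
) -> list[list[str]]:
    """Return runs of chunk IDs linked through their next-chunk metadata."""
    by_id = {chunk["id"]: chunk for chunk in chunks}
    runs: list[list[str]] = []
    for chunk in chunks:
        # A run starts at a chunk whose previous chunk is not in the input.
        if chunk["metadata"].get(prev_key) in by_id:
            continue
        run: list[str] = []
        visited: set[str] = set()
        current = chunk
        # Follow next-chunk links; the visited set guards against cycles.
        while current is not None and current["id"] not in visited:
            visited.add(current["id"])
            run.append(current["id"])
            current = by_id.get(current["metadata"].get(next_key))
        runs.append(run)
    return runs


runs = group_adjacent_sketch(
    [
        {"id": "chunk1", "metadata": {"prev": "chunk0", "next": "chunk2"}},
        {"id": "chunk3", "metadata": {"prev": "chunk2", "next": "chunk4"}},
        {"id": "chunk2", "metadata": {"prev": "chunk1", "next": "chunk3"}},
    ]
)
print(runs)
```

Here the three chunks arrive out of order but end up in a single run ordered `chunk1`, `chunk2`, `chunk3`, which is the order in which their IDs, contents, and metadata would then be handed to the merger functions.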
Attributes:
| Name | Type | Description |
|---|---|---|
| prev_chunk_id_metadata | str | The metadata key for the previous chunk ID. |
| next_chunk_id_metadata | str | The metadata key for the next chunk ID. |
| id_merger_func | Callable[[list[str]], str] | The function used to merge the IDs of merged chunks. The function should receive a list of IDs of the chunks that are being merged and the output will be used as the ID of the merged chunk. |
| content_merger_func | Callable[[list[str]], str] | The function used to merge the content of merged chunks. The function should receive a list of contents of the chunks that are being merged and the output will be used as the content of the merged chunk. |
| metadata_merger | ChunkMetadataMerger | The metadata merger used to merge the metadata of merged chunks. The merger should receive a list of metadata of the chunks that are being merged and the output will be used as the metadata of the merged chunk. |
Initializes a new instance of the MergingChunkProcessor class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| prev_chunk_id_metadata | str | The metadata key for the previous chunk ID. Defaults to DefaultChunkMetadata.PREV_CHUNK_ID. | PREV_CHUNK_ID |
| next_chunk_id_metadata | str | The metadata key for the next chunk ID. Defaults to DefaultChunkMetadata.NEXT_CHUNK_ID. | NEXT_CHUNK_ID |
| id_merger_func | Callable[[list[str]], str] | The function used to merge the IDs of merged chunks. The function should receive a list of IDs of the chunks that are being merged and the output will be used as the ID of the merged chunk. Defaults to concatenate('-'). | concatenate('-') |
| content_merger_func | Callable[[list[str]], str] | The function used to merge the content of merged chunks. The function should receive a list of contents of the chunks that are being merged and the output will be used as the content of the merged chunk. Defaults to merge_overlapping_strings('\n'). | merge_overlapping_strings('\n') |
| metadata_merger | ChunkMetadataMerger \| None | The metadata merger used to merge the metadata of merged chunks. The merger should receive a list of metadata of the chunks that are being merged and the output will be used as the metadata of the merged chunk. Defaults to None, in which case a default metadata merger is used. | None |
concatenate(delimiter='-')
Creates a function that concatenates a list of values with a delimiter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| delimiter | str | The delimiter to use when concatenating the values. Defaults to "-". | '-' |
Returns:
| Type | Description |
|---|---|
| Callable[[list[Any]], str] | A function that concatenates a list of values with the delimiter. |
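The factory pattern described here, a function that returns a configured merger function, can be sketched as follows. `concatenate_sketch` is a hypothetical stand-in, and converting each value with `str()` is an assumption made so that non-string values can be joined; the library function may handle such values differently.

```python
from typing import Any, Callable


def concatenate_sketch(delimiter: str = "-") -> Callable[[list], str]:
    """Return a function that joins a list of values with the delimiter."""

    def merge(values: list) -> str:
        # Assumption: non-string values are converted with str() before joining.
        return delimiter.join(str(value) for value in values)

    return merge


merge_ids = concatenate_sketch("-")
merged_id = merge_ids(["chunk1", "chunk2", "chunk3"])
print(merged_id)  # chunk1-chunk2-chunk3
```

Returning a closure lets the delimiter be fixed once, so the result matches the Callable[[list[str]], str] shape expected by id_merger_func.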
merge_overlapping_strings(delimiter='\n', min_overlap=1, max_window=200)
Creates a function that merges a list of strings, handling common prefixes and overlaps.
The created function will:
1. Identify and remove any common prefix shared by the strings.
2. Process each pair of adjacent strings to remove overlapping strings.
3. Join the cleaned strings together, including the common prefix at the beginning.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| delimiter | str | The delimiter to use when merging the values. Defaults to "\n". | '\n' |
| min_overlap | int | Minimum overlap length to consider valid. Defaults to 1. | 1 |
| max_window | int | Maximum window size to search for overlaps. Defaults to 200. | 200 |
Returns:
| Type | Description |
|---|---|
| Callable[[list[str]], str] | A function that merges a list of strings, handling common prefixes and overlaps. |
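The three steps listed above can be approximated with a short standalone sketch. It is a reading of the description, not the library's code: `merge_overlapping_strings_sketch` is a hypothetical name, an overlap is assumed to mean the longest suffix of one string that is also a prefix of the next, the prefix is assumed to be prepended without a delimiter, and empty pieces are assumed to be skipped when joining.

```python
from __future__ import annotations

import os
from typing import Callable


def merge_overlapping_strings_sketch(
    delimiter: str = "\n", min_overlap: int = 1, max_window: int = 200
) -> Callable[[list], str]:
    """Return a function that merges strings, removing shared prefixes and overlaps."""

    def overlap_length(left: str, right: str) -> int:
        # Longest suffix of `left` that is also a prefix of `right`,
        # searched within max_window and no shorter than min_overlap.
        limit = min(len(left), len(right), max_window)
        for size in range(limit, max(min_overlap, 1) - 1, -1):
            if left.endswith(right[:size]):
                return size
        return 0

    def merge(strings: list[str]) -> str:
        if not strings:
            return ""
        # Step 1: strip the prefix shared by all strings.
        prefix = os.path.commonprefix(strings) if len(strings) > 1 else ""
        bodies = [text[len(prefix):] for text in strings]
        # Step 2: drop from each string the part repeating the end of the previous one.
        cleaned = [bodies[0]]
        for left, right in zip(bodies, bodies[1:]):
            cleaned.append(right[overlap_length(left, right):])
        # Step 3: join the non-empty pieces and put the shared prefix back in front.
        return prefix + delimiter.join(piece for piece in cleaned if piece)

    return merge


parts = ["Hello World!", "World! It is beautiful", "beautiful today, isn't it?"]
sentence = merge_overlapping_strings_sketch(delimiter="")(parts)
print(sentence)
```

With an empty delimiter, the repeated "World!" and "beautiful" fragments are removed and the three pieces read as one sentence. With the default "\n" delimiter the same pieces would be separated by newlines instead.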
pick_first(values)
Picks the first value from a list of values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| values | list[Any] | The values to pick from. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Any | Any | The first value from the list. |
pick_last(values)
Picks the last value from a list of values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| values | list[Any] | The values to pick from. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Any | Any | The last value from the list. |