Data Generator
Document Processing Orchestrator Data Generator Package.
This module provides various data generator implementations for different types of data processing.
BaseDataGenerator
Bases: ABC
Base class for data generators.
generate(elements, **kwargs)
abstractmethod
Generates data for a list of chunks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `Any` | The elements used to generate data or metadata, ideally formatted as List[Dict]. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |
Returns:
| Name | Type | Description |
|---|---|---|
| `Any` | `Any` | The generated data, ideally formatted as List[Dict]. Each dictionary in the list should follow the structure of the 'Element' model to ensure consistency and ease of use across Document Processing Orchestrator. |
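The contract above can be illustrated with a minimal sketch. It defines a stand-in `BaseDataGenerator` that only mirrors the documented signature (it is not the package's class, whose import path is not shown here), plus a hypothetical `WordCountDataGenerator` subclass with a hypothetical `text_key` keyword argument.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseDataGenerator(ABC):
    """Stand-in that mirrors the documented interface (assumption, not the real class)."""

    @abstractmethod
    def generate(self, elements: Any, **kwargs: Any) -> Any:
        """Generate data for the given elements."""


class WordCountDataGenerator(BaseDataGenerator):
    """Hypothetical generator that adds a word count to each element."""

    def generate(self, elements: list[dict[str, Any]], **kwargs: Any) -> list[dict[str, Any]]:
        # "text_key" is a hypothetical kwarg used only for this illustration.
        text_key = kwargs.get("text_key", "text")
        # Build new dictionaries so the caller's input is not mutated.
        return [
            {**element, "word_count": len(str(element.get(text_key, "")).split())}
            for element in elements
        ]


result = WordCountDataGenerator().generate([{"text": "hello brave new world"}])
```

Because `generate` is abstract, instantiating the base class directly raises `TypeError`; only subclasses that implement `generate` can be created.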
ImageCaptionDataGenerator(image_to_caption)
Bases: BaseDataGenerator
Data generator for creating captions from images using BaseImageToCaption.
Initialize the ImageCaptionDataGenerator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `image_to_caption` | `BaseImageToCaption` | The image-to-caption converter instance. | required |
generate(elements, **kwargs)
Generates captions by processing images in the input elements.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `list[dict[str, Any]]` | List of dictionaries containing image data. Each dictionary should have an 'image_source' key with the image location. | required |
| `**kwargs` | `Any` | Additional keyword arguments for the image captioning process. | `{}` |
Kwargs:
| Name | Type | Description | Default |
|---|---|---|---|
| `image_format_func` | `Callable[[str, Element], str]` | Function to format the caption text. | `None` |
| `element_processing_limit` | `int` | The maximum number of elements to process at a time. | `100` |
| `use_image_text_as_context` | `bool` | Whether to use the image text as context. If set to False, image_description is used instead. | `False` |
Returns:
| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of dictionaries containing the processed image data. Each dictionary will contain the original data. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If elements don't contain required image information. |
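The input and error behavior described above can be sketched as follows. This is not the package's implementation: `generate_captions`, the `caption_image` callable, and the `"caption"` output key are hypothetical, and treating `element_processing_limit` as a batch size is an assumption based on its description.

```python
from typing import Any, Callable


def generate_captions(
    elements: list[dict[str, Any]],
    caption_image: Callable[[str], str],
    element_processing_limit: int = 100,
) -> list[dict[str, Any]]:
    """Caption elements in batches, keeping each element's original data."""
    # Fail early if any element lacks the documented 'image_source' key.
    for index, element in enumerate(elements):
        if "image_source" not in element:
            raise ValueError(f"Element {index} has no 'image_source' key.")

    results: list[dict[str, Any]] = []
    # Walk the input in slices of at most element_processing_limit items.
    for start in range(0, len(elements), element_processing_limit):
        batch = elements[start : start + element_processing_limit]
        for element in batch:
            caption = caption_image(element["image_source"])
            # "caption" is a hypothetical output key; original keys are preserved.
            results.append({**element, "caption": caption})
    return results


def fake_captioner(image_source: str) -> str:
    """Stand-in for a real image-to-caption converter."""
    return f"caption for {image_source}"


captioned = generate_captions(
    [{"image_source": "a.png"}, {"image_source": "b.png"}],
    fake_captioner,
    element_processing_limit=1,
)
```

Validating all elements before captioning means a malformed element is reported before any model call is made.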
MultiModelImageCaptionDataGenerator(model_api_keys=None)
Bases: BaseDataGenerator
Multi-model image captioning data generator with lazy initialization.
This class extends BaseDataGenerator to provide a data generator for image captioning that supports multiple models with lazy initialization, to avoid API key validation during pipeline initialization.
Key Features:
1. Supports multiple models in a single instance.
2. Lazy initialization to avoid API key validation during initialization.
3. Dynamic model selection at runtime.
Initialize the MultiModelImageCaptionDataGenerator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_api_keys` | `dict[str, str] \| None` | Dictionary mapping model IDs to their API keys. Defaults to None, in which case no API keys are passed during LMInvoker initialization. | `None` |
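The lazy initialization pattern described for this class can be sketched roughly as below. The names `FakeInvoker`, `LazyMultiModelGenerator`, and `_get_invoker` are hypothetical stand-ins; the real class's internals and the LMInvoker API are not documented here.

```python
from typing import Optional


class FakeInvoker:
    """Hypothetical stand-in for a model client; records how it was built."""

    def __init__(self, model_id: str, api_key: Optional[str]) -> None:
        self.model_id = model_id
        self.api_key = api_key


class LazyMultiModelGenerator:
    """Sketch of per-model lazy client creation (assumption, not the real class)."""

    def __init__(self, model_api_keys: Optional[dict[str, str]] = None) -> None:
        # Only store the keys; no client is built (or key validated) here.
        self._model_api_keys = dict(model_api_keys or {})
        self._invokers: dict[str, FakeInvoker] = {}

    def _get_invoker(self, model_id: str) -> FakeInvoker:
        # Build the client on first use and reuse it on later calls.
        if model_id not in self._invokers:
            api_key = self._model_api_keys.get(model_id)
            self._invokers[model_id] = FakeInvoker(model_id, api_key)
        return self._invokers[model_id]


generator = LazyMultiModelGenerator({"model-a": "key-a"})
created_before_use = len(generator._invokers)
first = generator._get_invoker("model-a")
second = generator._get_invoker("model-a")
```

Deferring client creation to the first call keeps construction cheap and lets one instance serve whichever model ID is requested at runtime.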
generate(elements, **kwargs)
Generate captions for elements with image structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `list[dict[str, Any]]` | List of dictionaries containing elements to be processed. | required |
| `**kwargs` | `Any` | Additional keyword arguments for the image captioning process. | `{}` |
Kwargs:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | The ID of the model to use for image captioning. | `DEFAULT_MODEL_ID` ("google/gemini-2.5-flash") |
| `system_prompt` | `str` | The system prompt to use for image captioning. | `DEFAULT_SYSTEM_PROMPT` |
| `user_prompt` | `str` | The user prompt to use for image captioning. | `DEFAULT_USER_PROMPT` |
Returns:
| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of dictionaries containing the processed image data. Each dictionary will contain the original data. |