Data Generator
Document Processing Orchestrator Data Generator Package.
This module provides various data generator implementations for different types of data processing.
BaseDataGenerator
Bases: ABC
Base class for data generators.
generate(elements, **kwargs)
abstractmethod
Generates data for a list of chunks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `Any` | The elements to be used for generating data or metadata, ideally formatted as List[Dict]. | required |
| `**kwargs` | `Any` | Additional keyword arguments for customization. | `{}` |
Returns:
| Name | Type | Description |
|---|---|---|
| `Any` | `Any` | The generated data, ideally formatted as List[Dict]. Each dictionary in the list should follow the structure of the 'Element' model, to ensure consistency and ease of use across Document Processing Orchestrator. |
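As a minimal sketch of the contract above, the following defines a stand-in abstract base class with the same `generate(elements, **kwargs)` signature and a toy subclass. The stand-in class, `CharCountDataGenerator`, the element keys (`text`, `structure`, `metadata`), and the `char_count` field are illustrative assumptions, not part of the package; real code would import `BaseDataGenerator` from the package instead of redefining it.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseDataGenerator(ABC):
    """Stand-in mirroring the documented interface (assumption, not the package's class)."""

    @abstractmethod
    def generate(self, elements: Any, **kwargs: Any) -> Any:
        """Generate data for the given elements."""


class CharCountDataGenerator(BaseDataGenerator):
    """Hypothetical generator that records the text length of each element."""

    def generate(self, elements: list[dict[str, Any]], **kwargs: Any) -> list[dict[str, Any]]:
        results = []
        for element in elements:
            # Copy so the caller's input dictionaries are left untouched.
            updated = dict(element)
            metadata = dict(updated.get("metadata", {}))
            metadata["char_count"] = len(updated.get("text", ""))
            updated["metadata"] = metadata
            results.append(updated)
        return results


elements = [{"text": "hello", "structure": "paragraph"}]
generated = CharCountDataGenerator().generate(elements)
```

Returning a list of dictionaries keeps the output in the same shape as the input, so generators written this way can be chained.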
ImageCaptionDataGenerator(image_to_caption)
Bases: BaseDataGenerator
Data generator for creating captions from images using BaseImageToCaption.
Initialize the ImageCaptionDataGenerator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `image_to_caption` | `BaseImageToCaption` | The image-to-caption converter instance. | required |
generate(elements, **kwargs)
Generates captions by processing images in the input elements.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `list[dict[str, Any]]` | List of dictionaries containing image data. Each dictionary should have an 'image_source' key with the image location. | required |
| `**kwargs` | `Any` | Additional keyword arguments for the image captioning process. | `{}` |
Kwargs
image_format_func (Callable[[str, Element], str], optional): Function to format the caption text. Defaults to None.
element_processing_limit (int, optional): The maximum number of elements to process at a time. Defaults to 100.
use_image_text_as_context (bool, optional): Whether to use the image text as context. If False, image_description is used instead. Defaults to False.
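One plausible reading of `element_processing_limit` is that elements are handled in batches of at most that many items; how the generator actually applies the limit is not stated here. The sketch below shows that batching pattern only. The helper name `iter_batches` and the sample file names are hypothetical.

```python
from typing import Any, Iterator


def iter_batches(elements: list[dict[str, Any]], limit: int = 100) -> Iterator[list[dict[str, Any]]]:
    """Yield consecutive slices of at most `limit` elements (hypothetical helper)."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    for start in range(0, len(elements), limit):
        yield elements[start:start + limit]


# 250 placeholder elements, split with the documented default limit of 100.
elements = [{"image_source": f"img_{i}.png"} for i in range(250)]
batch_sizes = [len(batch) for batch in iter_batches(elements, limit=100)]
```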
Returns:
| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of dictionaries containing the processed image data. Each dictionary will contain the original data. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the elements do not contain the required image information. |
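To illustrate the input shape and the error condition described above, here is a small sketch of a pre-flight check a caller could run before calling `generate`. It assumes that "required image information" means a non-empty `image_source` value; the function `check_image_elements` and its error message are hypothetical and do not reproduce the generator's own validation.

```python
from typing import Any


def check_image_elements(elements: list[dict[str, Any]]) -> None:
    """Raise ValueError if any element lacks a usable 'image_source' (hypothetical check)."""
    for index, element in enumerate(elements):
        if not element.get("image_source"):
            raise ValueError(f"Element at index {index} has no 'image_source'")


# A well-formed element passes silently.
check_image_elements([{"image_source": "figures/chart.png"}])

# An element without 'image_source' is rejected.
try:
    check_image_elements([{"text": "no image here"}])
except ValueError as error:
    message = str(error)
```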
LLMTextRewriteDataGenerator(model_api_keys=None, default_model_id=DEFAULT_MODEL_ID, default_system_prompt=DEFAULT_SYSTEM_PROMPT, default_user_prompt=DEFAULT_USER_PROMPT, default_text_rewrite_enabled=True, default_structures_to_rewrite=None, default_hyperparameters=None, default_retry_config=None, default_include_media_images_as_context=True, max_concurrent_requests=DEFAULT_MAX_CONCURRENT_REQUESTS)
Bases: BaseDataGenerator
LLM-powered element text rewrite data generator with lazy initialization and batching.
This generator rewrites the text field of document elements using an LLM while preserving element
structure and metadata. It supports dynamic model selection, configuration-based processor caching,
concurrent batch processing, and optional image context for multimodal models.
Designed primarily for document normalization tasks such as OCR cleanup, formatting correction, and layout refinement.
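The phrase "configuration-based processor caching" can be pictured as follows: a processor is built the first time a given configuration is requested and reused afterwards. This is a sketch of the general pattern, not the package's implementation. `ProcessorCache`, `fake_factory`, and the choice of a JSON string as cache key are assumptions, and the key scheme only works for JSON-serializable hyperparameters.

```python
import json
from typing import Any, Callable


class ProcessorCache:
    """Hypothetical cache that builds one processor per distinct configuration."""

    def __init__(self, factory: Callable[[str, dict[str, Any]], Any]) -> None:
        self._factory = factory
        self._processors: dict[str, Any] = {}

    def get(self, model_id: str, hyperparameters: dict[str, Any]) -> Any:
        # Serialize with sorted keys so equal configs map to the same cache key.
        key = json.dumps({"model_id": model_id, "hyperparameters": hyperparameters}, sort_keys=True)
        if key not in self._processors:
            # Built lazily: nothing is constructed until a config is first requested.
            self._processors[key] = self._factory(model_id, hyperparameters)
        return self._processors[key]


created = []


def fake_factory(model_id: str, hyperparameters: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for real processor construction; records each build."""
    created.append(model_id)
    return {"model_id": model_id, **hyperparameters}


cache = ProcessorCache(fake_factory)
first = cache.get("model-a", {"temperature": 0.0})
second = cache.get("model-a", {"temperature": 0.0})  # same config: reused
third = cache.get("model-a", {"temperature": 0.7})   # new config: built again
```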
Initialize the LLMTextRewriteDataGenerator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_api_keys` | `dict[str, str] \| None` | Dictionary mapping model IDs to their API keys. Defaults to None, in which case no API keys are passed during LMInvoker initialization. | `None` |
| `default_model_id` | `str` | Default model ID used when | `DEFAULT_MODEL_ID` |
| `default_system_prompt` | `str` | Default system prompt used when | `DEFAULT_SYSTEM_PROMPT` |
| `default_user_prompt` | `str` | Default user prompt used when | `DEFAULT_USER_PROMPT` |
| `default_text_rewrite_enabled` | `bool` | Default value for | `True` |
| `default_structures_to_rewrite` | `list[str] \| None` | Default value for | `None` |
| `default_hyperparameters` | `dict[str, Any] \| None` | Default hyperparameters passed to the LMInvoker when | `None` |
| `default_retry_config` | `dict[str, Any] \| None` | Default retry configuration passed to the LMInvoker when | `None` |
| `default_include_media_images_as_context` | `bool` | Default value for | `True` |
| `max_concurrent_requests` | `int` | Default maximum number of concurrent LLM requests per generate() call when | `DEFAULT_MAX_CONCURRENT_REQUESTS` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If max_concurrent_requests is less than 1. |
generate(elements, **kwargs)
Rewrite element.text for elements matching structures_to_rewrite using an LLM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `list[dict[str, Any]]` | List of dictionaries containing elements to be processed. | required |
| `**kwargs` | `Any` | Additional keyword arguments for the LLM text rewrite process. | `{}` |
Kwargs
text_rewrite_enabled (bool, optional): Whether to enable LLM text rewriting. Defaults to True or the value configured in init.
structures_to_rewrite (list[str], optional): List of element structure types whose text field should be rewritten. Elements with structures not in this list are passed through unchanged. Defaults to the value configured in init (applies rewriting to all structures if None).
model_id (str, optional): The ID of the model to use for text rewriting. Defaults to DEFAULT_MODEL_ID or the value configured in init.
system_prompt (str, optional): The system prompt to use for text rewriting. Defaults to DEFAULT_SYSTEM_PROMPT or the value configured in init.
user_prompt (str, optional): The user prompt template for text rewriting. Must contain a {text} placeholder. Defaults to DEFAULT_USER_PROMPT or the value configured in init.
default_hyperparameters (dict[str, Any], optional): Additional hyperparameters passed to the LMInvoker configuration. Defaults to {} or the value configured in init.
retry_config (dict[str, Any], optional): Retry configuration passed to the LMInvoker. Defaults to {} or the value configured in init.
include_media_images_as_context (bool, optional): Whether to attach associated media images from element.metadata.media as visual context when invoking the LLM. Defaults to True or the value configured in init.
max_concurrent_requests (int, optional): Maximum number of LLM requests in flight at once for this generate() call. Defaults to the value configured in init.
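The kwargs above follow a "per-call value, else init-time default" pattern. The sketch below shows one way such resolution could work, together with filling the `{text}` placeholder. Everything in it is illustrative: `INIT_DEFAULTS` holds placeholder values rather than the package's real defaults, `resolve_options` is a hypothetical helper, and raising `ValueError` for a missing `{text}` placeholder is an assumption.

```python
from typing import Any

# Placeholder values for illustration only; the package's real defaults are not shown here.
INIT_DEFAULTS: dict[str, Any] = {
    "text_rewrite_enabled": True,
    "structures_to_rewrite": None,
    "user_prompt": "Clean up this text:\n{text}",
    "max_concurrent_requests": 4,
}


def resolve_options(defaults: dict[str, Any], **kwargs: Any) -> dict[str, Any]:
    """Overlay per-call kwargs on the init-time defaults (hypothetical helper)."""
    options = {**defaults, **kwargs}
    if "{text}" not in options["user_prompt"]:
        raise ValueError("user_prompt must contain a {text} placeholder")
    if options["max_concurrent_requests"] < 1:
        raise ValueError("max_concurrent_requests must be at least 1")
    return options


options = resolve_options(INIT_DEFAULTS, structures_to_rewrite=["table"], max_concurrent_requests=2)
# str.replace avoids str.format errors when the element text itself contains braces.
prompt = options["user_prompt"].replace("{text}", "Tot4l: 1O0 {units}")
```

Substituting with `str.replace` rather than `str.format` is a deliberate choice in this sketch: OCR output often contains stray braces that `format` would try to interpret as fields.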
Returns:
| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of dictionaries with text rewritten for matching elements. Non-matching elements are returned unchanged. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If max_concurrent_requests is less than 1. |
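The behavior described above (rewrite matching elements, pass the rest through, cap the number of requests in flight) can be sketched with `asyncio.Semaphore`. This illustrates the general technique and is not the generator's actual code. `fake_rewrite` stands in for the LLM call, and the `structure` and `text` keys are assumed element fields.

```python
import asyncio
from typing import Any, Optional


async def fake_rewrite(text: str) -> str:
    """Stand-in for an LLM call; it only normalizes whitespace."""
    await asyncio.sleep(0)
    return " ".join(text.split())


async def rewrite_elements(
    elements: list[dict[str, Any]],
    structures_to_rewrite: Optional[list[str]] = None,
    max_concurrent_requests: int = 4,
) -> list[dict[str, Any]]:
    if max_concurrent_requests < 1:
        raise ValueError("max_concurrent_requests must be at least 1")
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def process(element: dict[str, Any]) -> dict[str, Any]:
        matches = structures_to_rewrite is None or element.get("structure") in structures_to_rewrite
        if not matches:
            return element  # non-matching elements pass through unchanged
        async with semaphore:  # caps the number of rewrites in flight
            return {**element, "text": await fake_rewrite(element["text"])}

    # gather returns results in input order, so element order is preserved.
    return list(await asyncio.gather(*(process(e) for e in elements)))


elements = [
    {"structure": "paragraph", "text": "Hello   wor ld"},
    {"structure": "table", "text": "a  |  b"},
]
rewritten = asyncio.run(rewrite_elements(elements, structures_to_rewrite=["paragraph"]))
```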
MultiModelImageCaptionDataGenerator(model_api_keys=None)
Bases: BaseDataGenerator
Multi-model image captioning data generator with lazy initialization.
This class extends BaseDataGenerator to provide a data generator for image captioning that supports multiple models with lazy initialization, to avoid API key validation during pipeline initialization.
Key Features:
1. Supports multiple models in a single instance.
2. Lazy initialization to avoid API key validation during initialization.
3. Dynamic model selection at runtime.
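Lazy initialization with runtime model selection can be sketched as a registry that stores API keys at construction time but builds each per-model object only when that model is first requested. `LazyCaptioners` and the dictionary it returns are hypothetical stand-ins; the real class builds LMInvoker-based captioners whose construction is not shown in this reference.

```python
from typing import Any, Optional


class LazyCaptioners:
    """Hypothetical registry that defers per-model setup until first use."""

    def __init__(self, model_api_keys: Optional[dict[str, str]] = None) -> None:
        # Only store the keys here; nothing is validated or constructed yet.
        self._api_keys = model_api_keys or {}
        self._captioners: dict[str, dict[str, Any]] = {}

    def get(self, model_id: str) -> dict[str, Any]:
        if model_id not in self._captioners:
            # First request for this model: build it now, with its key if one was given.
            self._captioners[model_id] = {"model_id": model_id, "api_key": self._api_keys.get(model_id)}
        return self._captioners[model_id]


registry = LazyCaptioners({"model-a": "key-a"})
initialized_before = len(registry._captioners)  # nothing built at construction time
captioner_a = registry.get("model-a")
captioner_b = registry.get("model-b")  # no key configured for this model
```

Because nothing is built in the constructor, a pipeline holding such a registry can be assembled without any credentials being checked until a model is actually used.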
Initialize the MultiModelImageCaptionDataGenerator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_api_keys` | `dict[str, str] \| None` | Dictionary mapping model IDs to their API keys. Defaults to None, in which case no API keys are passed during LMInvoker initialization. | `None` |
generate(elements, **kwargs)
Generate captions for elements with image structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `elements` | `list[dict[str, Any]]` | List of dictionaries containing elements to be processed. | required |
| `**kwargs` | `Any` | Additional keyword arguments for the image captioning process. | `{}` |
Kwargs
model_id (str, optional): The ID of the model to use for image captioning. Defaults to DEFAULT_MODEL_ID, which is "google/gemini-2.5-flash".
system_prompt (str, optional): The system prompt to use for image captioning. Defaults to DEFAULT_SYSTEM_PROMPT.
user_prompt (str, optional): The user prompt to use for image captioning. Defaults to DEFAULT_USER_PROMPT.
default_hyperparameters (dict[str, Any]): Additional hyperparameters passed to the LMInvoker configuration. Defaults to {}.
retry_config (dict[str, Any], optional): The retry config to use for the LM invoker. If not provided, the default retry config is used. Defaults to {}.
Returns:
| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of dictionaries containing the processed image data. Each dictionary will contain the original data. |