Data Generator

Document Processing Orchestrator Data Generator Package.

This module provides various data generator implementations for different types of data processing.

BaseDataGenerator

Bases: ABC

Base class for data generator.

generate(elements, **kwargs) abstractmethod

Generates data for a list of chunks.

Parameters:

Name Type Description Default
elements Any

The elements to be used for generating data or metadata, ideally formatted as List[Dict].

required
**kwargs Any

Additional keyword arguments for customization.

{}

Returns:

Name Type Description
Any Any

The generated data, ideally formatted as List[Dict]. Each dictionary in the list should follow the structure of the 'Element' model, to ensure consistency and ease of use across the Document Processing Orchestrator.
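As an illustration of the contract above, here is a minimal sketch of a custom generator. The `BaseDataGenerator` class in the snippet is a local stand-in that mirrors the documented signature, not the package's own class. `WordCountDataGenerator`, the `metadata_key` kwarg, and the `text` / `metadata` / `structure` keys are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseDataGenerator(ABC):
    """Local stand-in mirroring the documented interface (not the package's class)."""

    @abstractmethod
    def generate(self, elements: Any, **kwargs: Any) -> Any:
        """Generate data for the given elements."""


class WordCountDataGenerator(BaseDataGenerator):
    """Hypothetical generator that adds a word count to each element's metadata."""

    def generate(self, elements: list[dict[str, Any]], **kwargs: Any) -> list[dict[str, Any]]:
        # "metadata_key" is a hypothetical kwarg used only in this sketch.
        key = kwargs.get("metadata_key", "word_count")
        results = []
        for element in elements:
            enriched = dict(element)  # shallow copy so the input dict is not mutated
            metadata = dict(enriched.get("metadata") or {})
            metadata[key] = len(str(enriched.get("text", "")).split())
            enriched["metadata"] = metadata
            results.append(enriched)
        return results


sample_elements = [{"text": "Quarterly revenue grew", "structure": "paragraph"}]
word_count_output = WordCountDataGenerator().generate(sample_elements)
```

Returning new dictionaries rather than mutating the input keeps the generator safe to chain with other pipeline steps that still hold references to the original elements.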

ImageCaptionDataGenerator(image_to_caption)

Bases: BaseDataGenerator

Data generator for creating captions from images using BaseImageToCaption.

Initialize the ImageCaptionDataGenerator.

Parameters:

Name Type Description Default
image_to_caption BaseImageToCaption

The image to caption converter instance.

required

generate(elements, **kwargs)

Generates captions by processing images in the input elements.

Parameters:

Name Type Description Default
elements list[dict[str, Any]]

List of dictionaries containing image data. Each dictionary should have an 'image_source' key with the image location.

required
**kwargs Any

Additional keyword arguments for the image captioning process.

{}
Kwargs

image_format_func (Callable[[str, Element], str], optional): Function to format the caption text. Defaults to None.

element_processing_limit (int, optional): The maximum number of elements to process at a time. Defaults to 100.

use_image_text_as_context (bool, optional): Whether to use the image text as context. If set to False, will use image_description instead. Defaults to False.

Returns:

Type Description
list[dict[str, Any]]

list[dict[str, Any]]: List of dictionaries containing the processed image data. Each dictionary will contain the original data.

Raises:

Type Description
ValueError

If elements don't contain required image information.
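The input check and the `element_processing_limit` kwarg described above can be pictured roughly as follows. This is a sketch under assumptions, not the generator's actual code: `validate_image_elements` and `iter_batches` are hypothetical helpers, and the reading of `element_processing_limit` as a batch size is an interpretation of "at a time".

```python
from typing import Any, Iterator


def validate_image_elements(elements: list[dict[str, Any]]) -> None:
    """Raise ValueError if any element lacks a usable 'image_source' value."""
    missing = [i for i, element in enumerate(elements) if not element.get("image_source")]
    if missing:
        raise ValueError(f"Elements at positions {missing} have no 'image_source'.")


def iter_batches(elements: list[dict[str, Any]], limit: int = 100) -> Iterator[list[dict[str, Any]]]:
    """Yield consecutive slices of at most `limit` elements (assumed batch semantics)."""
    if limit < 1:
        raise ValueError("limit must be at least 1.")
    for start in range(0, len(elements), limit):
        yield elements[start:start + limit]


image_elements = [{"image_source": f"img_{i}.png"} for i in range(5)]
validate_image_elements(image_elements)
batch_sizes = [len(batch) for batch in iter_batches(image_elements, limit=2)]
```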

LLMTextRewriteDataGenerator(model_api_keys=None, default_model_id=DEFAULT_MODEL_ID, default_system_prompt=DEFAULT_SYSTEM_PROMPT, default_user_prompt=DEFAULT_USER_PROMPT, default_text_rewrite_enabled=True, default_structures_to_rewrite=None, default_hyperparameters=None, default_retry_config=None, default_include_media_images_as_context=True, max_concurrent_requests=DEFAULT_MAX_CONCURRENT_REQUESTS)

Bases: BaseDataGenerator

LLM-powered element text rewrite data generator with lazy initialization and batching.

This generator rewrites the text field of document elements using an LLM while preserving element structure and metadata. It supports dynamic model selection, configuration-based processor caching, concurrent batch processing, and optional image context for multimodal models.

Designed primarily for document normalization tasks such as OCR cleanup, formatting correction, and layout refinement.

Initialize the LLMTextRewriteDataGenerator.

Parameters:

Name Type Description Default
model_api_keys dict[str, str] | None

Dictionary mapping model IDs to their API keys. Defaults to None, in which case no API keys are passed during LMInvoker initialization.

None
default_model_id str

Default model ID used when model_id is not provided to generate(). Defaults to DEFAULT_MODEL_ID.

DEFAULT_MODEL_ID
default_system_prompt str

Default system prompt used when system_prompt is not provided to generate(). Defaults to DEFAULT_SYSTEM_PROMPT.

DEFAULT_SYSTEM_PROMPT
default_user_prompt str

Default user prompt used when user_prompt is not provided to generate(). Defaults to DEFAULT_USER_PROMPT.

DEFAULT_USER_PROMPT
default_text_rewrite_enabled bool

Default value for text_rewrite_enabled in generate(). Defaults to True.

True
default_structures_to_rewrite list[str] | None

Default value for structures_to_rewrite in generate(). If None, all structures are eligible for rewriting. Defaults to None.

None
default_hyperparameters dict[str, Any] | None

Default hyperparameters passed to the LMInvoker when default_hyperparameters is not provided in generate(). Defaults to None.

None
default_retry_config dict[str, Any] | None

Default retry configuration passed to the LMInvoker when retry_config is not provided in generate(). Defaults to None.

None
default_include_media_images_as_context bool

Default value for include_media_images_as_context in generate(). Defaults to True.

True
max_concurrent_requests int

Default maximum number of concurrent LLM requests per generate() call when max_concurrent_requests is not provided to generate(). Defaults to DEFAULT_MAX_CONCURRENT_REQUESTS.

DEFAULT_MAX_CONCURRENT_REQUESTS

Raises:

Type Description
ValueError

If max_concurrent_requests is less than 1.

generate(elements, **kwargs)

Rewrite element.text for elements matching structures_to_rewrite using an LLM.

Parameters:

Name Type Description Default
elements list[dict[str, Any]]

List of dictionaries containing elements to be processed.

required
**kwargs Any

Additional keyword arguments for the LLM text rewrite process.

{}
Kwargs

text_rewrite_enabled (bool, optional): Whether to enable LLM text rewriting. Defaults to True or the value configured in init.

structures_to_rewrite (list[str], optional): List of element structure types whose text field should be rewritten. Elements with structures not in this list are passed through unchanged. Defaults to the value configured in init (applies rewriting to all structures if None).

model_id (str, optional): The ID of the model to use for text rewriting. Defaults to DEFAULT_MODEL_ID or the value configured in init.

system_prompt (str, optional): The system prompt to use for text rewriting. Defaults to DEFAULT_SYSTEM_PROMPT or the value configured in init.

user_prompt (str, optional): The user prompt template for text rewriting. Must contain a {text} placeholder. Defaults to DEFAULT_USER_PROMPT or the value configured in init.

default_hyperparameters (dict[str, Any], optional): Additional hyperparameters passed to the LMInvoker configuration. Defaults to {} or the value configured in init.

retry_config (dict[str, Any], optional): Retry configuration passed to the LMInvoker. Defaults to {} or the value configured in init.

include_media_images_as_context (bool, optional): Whether to attach associated media images from element.metadata.media as visual context when invoking the LLM. Defaults to True or the value configured in init.

max_concurrent_requests (int, optional): Maximum number of LLM requests in flight at once for this generate() call. Defaults to the value configured in init.

Returns:

Type Description
list[dict[str, Any]]

list[dict[str, Any]]: List of dictionaries with text rewritten for matching elements. Non-matching elements are returned unchanged.

Raises:

Type Description
ValueError

If max_concurrent_requests is less than 1.

MultiModelImageCaptionDataGenerator(model_api_keys=None)

Bases: BaseDataGenerator

Multi-model image captioning data generator with lazy initialization.

This class extends BaseDataGenerator to provide a data generator for image captioning that supports multiple models with lazy initialization, to avoid API key validation during pipeline initialization.

Key Features:

1. Supports multiple models in a single instance.

2. Lazy initialization to avoid API key validation during initialization.

3. Dynamic model selection at runtime.

Initialize the MultiModelImageCaptionDataGenerator.

Parameters:

Name Type Description Default
model_api_keys dict[str, str] | None

Dictionary mapping model IDs to their API keys. Defaults to None, in which case no API keys are passed during LMInvoker initialization.

None

generate(elements, **kwargs)

Generate captions for elements with image structure.

Parameters:

Name Type Description Default
elements list[dict[str, Any]]

List of dictionaries containing elements to be processed.

required
**kwargs Any

Additional keyword arguments for the image captioning process.

{}
Kwargs

model_id (str, optional): The ID of the model to use for image captioning. Defaults to DEFAULT_MODEL_ID ("google/gemini-2.5-flash").

system_prompt (str, optional): The system prompt to use for image captioning. Defaults to DEFAULT_SYSTEM_PROMPT.

user_prompt (str, optional): The user prompt to use for image captioning. Defaults to DEFAULT_USER_PROMPT.

default_hyperparameters (dict[str, Any], optional): Additional hyperparameters passed to the LMInvoker configuration. Defaults to {}.

retry_config (dict[str, Any], optional): The retry configuration passed to the LMInvoker. If not provided, the default retry configuration is used. Defaults to {}.

Returns:

Type Description
list[dict[str, Any]]

list[dict[str, Any]]: List of dictionaries containing the processed image data. Each dictionary will contain the original data.