Data Generator

Document Processing Orchestrator Data Generator Package.

This module provides various data generator implementations for different types of data processing.

Authors

Devita (devita1@gdplabs.id), Yanfa Adi Putra (yanfa.a.putra@gdplabs.id)

`BaseDataGenerator`

Bases: ABC

Base class for data generator.

`generate(elements, **kwargs)` `abstractmethod`

Generates data for a list of chunks.

Parameters:

Name	Type	Description	Default
`elements`	`Any`	The elements used to generate data or metadata, ideally formatted as List[Dict].	required
`**kwargs`	`Any`	Additional keyword arguments for customization.	`{}`

Returns:

Name	Type	Description
`Any`	`Any`	The generated data, ideally formatted as List[Dict]. Each dictionary in the list should ideally follow the structure of the 'Element' model, to ensure consistency and ease of use across the Document Processing Orchestrator.
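The contract above can be illustrated with a minimal sketch of a custom generator. The `BaseDataGenerator` defined here is a local stand-in that only mirrors the documented interface, since the package's import path is not shown on this page. `WordCountDataGenerator`, the `text_key` keyword argument, and the `word_count` field are hypothetical names used only for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseDataGenerator(ABC):
    """Local stand-in mirroring the documented interface (not the package's class)."""

    @abstractmethod
    def generate(self, elements: Any, **kwargs: Any) -> Any:
        """Generate data for the given elements."""


class WordCountDataGenerator(BaseDataGenerator):
    """Hypothetical generator that adds a word count to each element."""

    def generate(self, elements: list[dict[str, Any]], **kwargs: Any) -> list[dict[str, Any]]:
        # 'text_key' is an assumed keyword argument, not part of the documented API.
        text_key = kwargs.get("text_key", "text")
        # Copy each element so the input list is left unmodified.
        return [
            {**element, "word_count": len(str(element.get(text_key, "")).split())}
            for element in elements
        ]


result = WordCountDataGenerator().generate([{"text": "hello brave new world"}])
```

Returning a new list of dictionaries keeps the output in the recommended List[Dict] shape, so it can be passed on to another generator.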

`ImageCaptionDataGenerator(image_to_caption)`

Bases: BaseDataGenerator

Data generator for creating captions from images using BaseImageToCaption.

Initialize the ImageCaptionDataGenerator.

Parameters:

Name	Type	Description	Default
`image_to_caption`	`BaseImageToCaption`	The image to caption converter instance.	required

`generate(elements, **kwargs)`

Generates captions by processing images in the input elements.

Parameters:

Name	Type	Description	Default
`elements`	`list[dict[str, Any]]`	List of dictionaries containing image data. Each dictionary should have an 'image_source' key with the image location.	required
`**kwargs`	`Any`	Additional keyword arguments for the image captioning process.	`{}`

Kwargs

- image_format_func (Callable[[str, Element], str], optional): Function to format the caption text. Defaults to None.
- element_processing_limit (int, optional): The maximum number of elements to process at a time. Defaults to 100.
- use_image_text_as_context (bool, optional): Whether to use the image text as context. If set to False, will use image_description instead. Defaults to False.

Returns:

Type	Description
`list[dict[str, Any]]`	List of dictionaries containing the processed image data. Each dictionary will contain the original data.

Raises:

Type	Description
`ValueError`	If elements don't contain required image information.
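The input and error behavior described above can be pictured with a small stand-in function. This is a sketch under stated assumptions, not the package's implementation: `caption_elements`, the `captioner` callable, and the `caption` output key are hypothetical, and treating `element_processing_limit` as a batch size is only one reading of "maximum number of elements to process at a time".

```python
from typing import Any, Callable


def caption_elements(
    elements: list[dict[str, Any]],
    captioner: Callable[[str], str],
    element_processing_limit: int = 100,
) -> list[dict[str, Any]]:
    """Hypothetical stand-in for the documented generate() behavior."""
    # Fail early if any element lacks the documented 'image_source' key.
    for index, element in enumerate(elements):
        if not element.get("image_source"):
            raise ValueError(f"Element {index} has no 'image_source'")

    results: list[dict[str, Any]] = []
    # Assumption: the limit is applied as a batch size.
    for start in range(0, len(elements), element_processing_limit):
        batch = elements[start:start + element_processing_limit]
        for element in batch:
            # Keep the original data and add the caption under an assumed key.
            results.append({**element, "caption": captioner(element["image_source"])})
    return results


captioned = caption_elements(
    [{"image_source": "a.png", "page": 1}],
    captioner=lambda source: f"caption for {source}",
)
```

In the real class, the captioning step is delegated to the `BaseImageToCaption` instance passed to the constructor; the lambda here only stands in for that call.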

`MultiModelImageCaptionDataGenerator(model_api_keys=None)`

Bases: BaseDataGenerator

Multi-model image captioning data generator with lazy initialization.

This class extends BaseDataGenerator to provide a data generator for image captioning that supports multiple models with lazy initialization, to avoid API key validation during pipeline initialization.

Key Features:

1. Supports multiple models in a single instance.
2. Lazy initialization to avoid API key validation during initialization.
3. Dynamic model selection at runtime.

Initialize the MultiModelImageCaptionDataGenerator.

Parameters:

Name	Type	Description	Default
`model_api_keys`	`dict[str, str] \| None`	Dictionary mapping model IDs to their API keys. Defaults to None, in which case no API keys are passed during LMInvoker initialization.	`None`

`generate(elements, **kwargs)`

Generate captions for elements with image structure.

Parameters:

Name	Type	Description	Default
`elements`	`list[dict[str, Any]]`	List of dictionaries containing elements to be processed.	required
`**kwargs`	`Any`	Additional keyword arguments for the image captioning process.	`{}`

Kwargs

- model_id (str, optional): The ID of the model to use for image captioning. Defaults to DEFAULT_MODEL_ID, which is "google/gemini-2.5-flash".
- system_prompt (str, optional): The system prompt to use for image captioning. Defaults to DEFAULT_SYSTEM_PROMPT.
- user_prompt (str, optional): The user prompt to use for image captioning. Defaults to DEFAULT_USER_PROMPT.

Returns:

Type	Description
`list[dict[str, Any]]`	List of dictionaries containing the processed image data. Each dictionary will contain the original data.
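The lazy initialization and runtime model selection described for this class can be sketched as a small cache that builds one invoker per model ID on first use. This is an illustration of the general pattern, not the package's internals: `LazyInvokerCache`, its `get` method, and `fake_factory` are hypothetical, and the factory stands in for whatever constructs the real LMInvoker.

```python
from typing import Any, Callable, Optional

DEFAULT_MODEL_ID = "google/gemini-2.5-flash"  # default model ID named in the docs above


class LazyInvokerCache:
    """Hypothetical helper: creates one invoker per model ID on first request."""

    def __init__(
        self,
        factory: Callable[[str, Optional[str]], Any],
        model_api_keys: Optional[dict[str, str]] = None,
    ) -> None:
        self._factory = factory
        self._model_api_keys = model_api_keys or {}
        self._invokers: dict[str, Any] = {}  # nothing is built at construction time

    def get(self, model_id: str = DEFAULT_MODEL_ID) -> Any:
        # Build the invoker only when this model is first requested,
        # so API keys are not touched during pipeline initialization.
        if model_id not in self._invokers:
            api_key = self._model_api_keys.get(model_id)  # None if no key was given
            self._invokers[model_id] = self._factory(model_id, api_key)
        return self._invokers[model_id]


factory_calls: list[tuple[str, Optional[str]]] = []


def fake_factory(model_id: str, api_key: Optional[str]) -> object:
    """Stand-in for real invoker construction; records each call."""
    factory_calls.append((model_id, api_key))
    return object()


cache = LazyInvokerCache(fake_factory, {DEFAULT_MODEL_ID: "key-1"})
calls_before_use = list(factory_calls)  # still empty: nothing built yet
first = cache.get()
second = cache.get()  # served from the cache, no second factory call
```

Deferring construction this way means a missing or invalid key for one model only surfaces when that model is actually selected.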
