Data Generator

Document Processing Orchestrator Data Generator Package.

This module provides various data generator implementations for different types of data processing.

Authors

Devita (devita1@gdplabs.id)
Yanfa Adi Putra (yanfa.a.putra@gdplabs.id)

BaseDataGenerator

Bases: ABC

Base class for data generators.

generate(elements, **kwargs) abstractmethod

Generates data for a list of chunks.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| elements | Any | The elements used to generate data or metadata, ideally formatted as List[Dict]. | required |
| **kwargs | Any | Additional keyword arguments for customization. | {} |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| Any | Any | The generated data, ideally formatted as List[Dict]. Each dictionary in the list should follow the structure of the 'Element' model, to ensure consistency and ease of use across the Document Processing Orchestrator. |
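A minimal sketch of how a concrete generator might implement this interface. The `BaseDataGenerator` defined here is a local stand-in that mirrors the documented signature, not the package's own class. `WordCountDataGenerator`, the `text_key` keyword argument, and the `text` and `metadata` keys are hypothetical and chosen only for illustration, since the actual structure of 'Element' is not described in this section.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseDataGenerator(ABC):
    """Local stand-in mirroring the documented interface (assumption, not the package class)."""

    @abstractmethod
    def generate(self, elements: Any, **kwargs: Any) -> Any:
        """Generate data for a list of elements."""


class WordCountDataGenerator(BaseDataGenerator):
    """Hypothetical generator that records a word count in each element's metadata."""

    def generate(self, elements: list[dict[str, Any]], **kwargs: Any) -> list[dict[str, Any]]:
        # 'text_key' is a hypothetical keyword argument used only in this sketch.
        text_key = kwargs.get("text_key", "text")
        results = []
        for element in elements:
            enriched = dict(element)  # shallow copy so the input dict is not mutated
            metadata = dict(enriched.get("metadata", {}))
            metadata["word_count"] = len(str(enriched.get(text_key, "")).split())
            enriched["metadata"] = metadata
            results.append(enriched)
        return results


# The 'text' and 'metadata' keys are assumed for illustration.
elements = [{"text": "hello data generator", "metadata": {}}]
generated = WordCountDataGenerator().generate(elements)
```

Because `generate` is abstract, instantiating the base class directly raises `TypeError`; only subclasses that implement `generate` can be created.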

ImageCaptionDataGenerator(image_to_caption)

Bases: BaseDataGenerator

Data generator for creating captions from images using BaseImageToCaption.

Initialize the ImageCaptionDataGenerator.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| image_to_caption | BaseImageToCaption | The image to caption converter instance. | required |

generate(elements, **kwargs)

Generates captions by processing images in the input elements.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| elements | list[dict[str, Any]] | List of dictionaries containing image data. Each dictionary should have an 'image_source' key with the image location. | required |
| **kwargs | Any | Additional keyword arguments for the image captioning process. | {} |
Kwargs:

image_format_func (Callable[[str, Element], str], optional): Function to format the caption text. Defaults to None.
element_processing_limit (int, optional): The maximum number of elements to process at a time. Defaults to 100.

Returns:

| Type | Description |
| --- | --- |
| list[dict[str, Any]] | List of dictionaries containing the processed image data. Each dictionary will contain the original data. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If elements don't contain required image information. |
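The contract documented above can be illustrated with a rough, self-contained sketch. Nothing here is the package's implementation: `FakeImageToCaption`, its `caption` method, `SketchImageCaptionDataGenerator`, and the `caption` output key are all hypothetical stand-ins. The sketch also assumes that `image_format_func` receives the caption text and the element, and that `element_processing_limit` acts as a batch size.

```python
from typing import Any


class FakeImageToCaption:
    """Hypothetical stand-in for a BaseImageToCaption implementation."""

    def caption(self, image_source: str) -> str:  # method name is an assumption
        return f"caption for {image_source}"


class SketchImageCaptionDataGenerator:
    """Illustrative re-creation of the documented contract, not the package's code."""

    def __init__(self, image_to_caption: FakeImageToCaption) -> None:
        self.image_to_caption = image_to_caption

    def generate(self, elements: list[dict[str, Any]], **kwargs: Any) -> list[dict[str, Any]]:
        format_func = kwargs.get("image_format_func")  # documented default: None
        limit = kwargs.get("element_processing_limit", 100)  # documented default: 100

        # Documented behavior: raise ValueError when image information is missing.
        for element in elements:
            if "image_source" not in element:
                raise ValueError("Each element needs an 'image_source' key.")

        results = []
        # Assumption: the limit is applied as a batch size over the input list.
        for start in range(0, len(elements), limit):
            for element in elements[start:start + limit]:
                text = self.image_to_caption.caption(element["image_source"])
                if format_func is not None:
                    text = format_func(text, element)
                # Keep the original data and attach the caption (key name assumed).
                results.append({**element, "caption": text})
        return results


generator = SketchImageCaptionDataGenerator(FakeImageToCaption())
captioned = generator.generate(
    [{"image_source": "a.png"}, {"image_source": "b.png"}],
    image_format_func=lambda text, element: text.upper(),
    element_processing_limit=1,
)
```

Each returned dictionary keeps the keys of its input element, which matches the documented statement that the output contains the original data.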