Schema
Unified schema exports for gllm_multimodal.
AudioTranscript
Bases: BaseModel
A class representing an audio transcript.
An audio transcript is a textual record of spoken content from audio or video sources, including timing information and optional language identification. It provides a structured way to store and manage transcribed audio data.
Attributes:
| Name | Type | Description |
|---|---|---|
| `text` | `str` | The text of the transcript. |
| `start_time` | `float` | The start time of the transcript in seconds. |
| `end_time` | `float` | The end time of the transcript in seconds. |
| `lang_id` | `str \| None` | The language ID of the transcript. |
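A minimal sketch of the shape described above, using a stdlib dataclass as a stand-in (the real class is a pydantic `BaseModel` in gllm_multimodal, which also validates field types):

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in mirroring the documented fields; the real model lives in
# gllm_multimodal and is validated by pydantic.
@dataclass
class AudioTranscript:
    text: str                      # the transcribed text
    start_time: float              # start of the spoken span, in seconds
    end_time: float                # end of the spoken span, in seconds
    lang_id: Optional[str] = None  # optional language identifier, e.g. "en"

segment = AudioTranscript(text="Hello, world.", start_time=0.0, end_time=1.4, lang_id="en")
duration = segment.end_time - segment.start_time
```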
Caption
Bases: BaseModel
Result class for image captioning operations.
This class extends ImageToTextResult to provide a structured format for image captioning results, supporting:

- Multiple caption types (one-liner, detailed, domain-specific)
- Caption count tracking
- Metadata storage for processing details
Attributes:
| Name | Type | Description |
|---|---|---|
| `image_one_liner` | `str` | Brief, single-sentence summary of the image. Defaults to empty string if not provided. |
| `image_description` | `str` | Detailed, multi-sentence description of the image. Defaults to empty string if not provided. |
| `domain_knowledge` | `str` | Domain-specific interpretation or context. Defaults to empty string if not provided. |
| `number_of_captions` | `int` | Total number of distinct captions generated. Defaults to 0 if no captions are generated. |
| `image_metadata` | `dict[str, Any]` | Additional information about the image, such as image location. |
| `attachments_context` | `list[Attachment]` | Optional list of external context objects (files, bytes, or pre-processed inputs) that can enrich captioning results. Bytes are automatically converted into Attachment objects via `Attachment.from_bytes`. |
| `output_schema` | `str` | Output schema. Defaults to empty string if not provided. |
| `schema_description` | `str` | Schema description. Defaults to empty string if not provided. |
| `language` | `str` | Language of the captions. Defaults to "Indonesian" if not provided. |
handle_none_attachments(attachments_value)
Normalize and validate attachments_context.
This method ensures that the attachments_context field is always a list of Attachment objects. It handles multiple input cases:

- None -> returns an empty list
- list[bytes] -> converts each item into an Attachment via `Attachment.from_bytes`
- list[Attachment] -> keeps the list as-is
- list of mixed types -> normalizes supported types, raises an error on unsupported types
- any other type -> raises TypeError
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `attachments_value` | `Any` | Input value provided to `attachments_context`. | required |
Returns:
| Type | Description |
|---|---|
| `Any` | `list[Attachment]`: A normalized list of `Attachment` objects. |
Raises:
| Type | Description |
|---|---|
| `TypeError` | If an unsupported type is provided (e.g., `str`, `dict`). |
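The normalization rules above can be sketched as a plain function. The `Attachment` class here is a minimal stand-in for gllm_multimodal's own type, kept only so the example is self-contained:

```python
from typing import Any

class Attachment:
    """Minimal stand-in for gllm_multimodal's Attachment class."""
    def __init__(self, data: bytes):
        self.data = data

    @classmethod
    def from_bytes(cls, raw: bytes) -> "Attachment":
        return cls(raw)

def normalize_attachments(value: Any) -> list:
    """Mirror the input cases handled by handle_none_attachments."""
    if value is None:
        return []  # None -> empty list
    if not isinstance(value, list):
        # str, dict, or any other non-list input is rejected outright
        raise TypeError(f"Unsupported attachments_context type: {type(value).__name__}")
    normalized = []
    for item in value:
        if isinstance(item, Attachment):
            normalized.append(item)                         # keep as-is
        elif isinstance(item, bytes):
            normalized.append(Attachment.from_bytes(item))  # bytes -> Attachment
        else:
            raise TypeError(f"Unsupported attachment item type: {type(item).__name__}")
    return normalized
```

Mixed lists of bytes and Attachment objects are therefore accepted, while a bare string or dict raises `TypeError`.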
handle_none_metadata(metadata_value)
classmethod
Handle None values for image_metadata by using empty dict.
handle_none_number_of_captions(caption_value)
classmethod
Handle None values for number_of_captions by using default.
handle_none_values(str_value)
classmethod
Handle None values by converting them to default values.
CaptionResult
Bases: Caption
Result of a caption operation.
Attributes:
| Name | Type | Description |
|---|---|---|
| `captions` | `str \| list[str] \| dict[str, Any]` | The caption result. |
Keyframe
Bases: BaseModel
Represents a keyframe extracted from a video segment.
Attributes:
| Name | Type | Description |
|---|---|---|
| `time_offset` | `float` | Time within the segment where the keyframe occurs. |
| `caption` | `str \| None` | Text description of this specific keyframe. |
Mermaid
Bases: BaseModel
Additional metadata for Mermaid diagram generation.
Attributes:
| Name | Type | Description |
|---|---|---|
| `diagram_type` | `str` | Type of the diagram to be generated. |
| `context` | `str` | Additional context used to generate the Mermaid diagram. |
Segment
Bases: BaseModel
Represents a video segment with its captions, transcripts, and keyframes.
Attributes:
| Name | Type | Description |
|---|---|---|
| `start_time` | `float \| None` | The segment's starting time in seconds. |
| `end_time` | `float \| None` | The segment's ending time in seconds. |
| `transcripts` | `list[AudioTranscript]` | Optional list of transcripts for the segment. |
| `segment_caption` | `list[str]` | The single, rich description of the segment's action/plot. |
| `keyframes` | `list[Keyframe]` | Optional list of keyframes extracted from the segment. |
ensure_caption()
Ensure the segment has a caption, falling back to keyframes/transcripts if needed.
ensure_keyframes()
Ensure all keyframe time offsets are non-negative.
ensure_transcripts()
Ensure all transcript time offsets are non-negative.
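One plausible reading of the non-negativity checks is that negative offsets are clamped to zero; this is an assumption for illustration only, as the real validators may instead raise a validation error:

```python
def clamp_time_offsets(offsets: list[float]) -> list[float]:
    # Assumed behavior: negative time offsets are clamped to 0.0.
    # The actual ensure_keyframes/ensure_transcripts validators in
    # gllm_multimodal may reject such values instead.
    return [max(0.0, t) for t in offsets]
```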
TextResult
Bases: BaseModel
Base class for all image-to-text operation results.
This class provides the foundation for structured results from any image-to-text operation, including:

- Image Captioning
- Scene Text Detection
Attributes:
| Name | Type | Description |
|---|---|---|
| `text` | `str` | The extracted or generated text from the image. This is the primary output of any image-to-text operation. May be empty if the operation fails or no text is found. |
| `metadata` | `dict[str, Any] \| BaseModel` | Additional metadata from the conversion process. |
VideoCaptionMetadata
Bases: BaseModel
Metadata for video captioning results.
Attributes:
| Name | Type | Description |
|---|---|---|
| `video_summary` | `str` | A high-level summary of the entire video's plot, topic, or main events. |
| `segments` | `list[Segment]` | List of video segments with their captions and metadata. |
ensure_segment_end_time_greater_than_start_time()
Ensure each segment's end time is greater than its start time.
If the end time is equal to or lower than the start time, the next segment's start time minus 1 is used as the end time.
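The end-time rule above can be sketched as follows; segments are plain dicts here for illustration, whereas the real validator operates on `Segment` models:

```python
def fix_segment_end_times(segments: list[dict]) -> list[dict]:
    """If a segment's end_time is less than or equal to its start_time,
    fall back to the next segment's start_time minus 1 (the rule
    documented for ensure_segment_end_time_greater_than_start_time)."""
    for i, seg in enumerate(segments):
        bad_end = seg["end_time"] is not None and seg["end_time"] <= seg["start_time"]
        if bad_end and i + 1 < len(segments):
            seg["end_time"] = segments[i + 1]["start_time"] - 1
    return segments

segments = [
    {"start_time": 0.0, "end_time": 0.0},    # invalid: end <= start
    {"start_time": 10.0, "end_time": 18.0},  # valid, left untouched
]
fix_segment_end_times(segments)
# first segment's end_time becomes 10.0 - 1 = 9.0
```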