
Schema

Unified schema exports for gllm_multimodal.

AudioTranscript

Bases: BaseModel

A class representing an audio transcript.

An audio transcript is a textual record of spoken content from audio or video sources, including timing information and optional language identification. It provides a structured way to store and manage transcribed audio data.

Attributes:

Name Type Description
text str

The text of the transcript.

start_time float

The start time of the transcript in seconds.

end_time float

The end time of the transcript in seconds.

lang_id str | None

The language ID of the transcript.

Caption

Bases: BaseModel

Result class for image captioning operations.

This class provides a structured format for image captioning results, supporting:

  • Multiple caption types (one-liner, detailed, domain-specific)
  • Caption count tracking
  • Metadata storage for processing details

Attributes:

Name Type Description
image_one_liner str

Brief, single-sentence summary of the image. Defaults to empty string if not provided.

image_description str

Detailed, multi-sentence description of the image. Defaults to empty string if not provided.

domain_knowledge str

Domain-specific interpretation or context. Defaults to empty string if not provided.

number_of_captions int

Total number of distinct captions generated. Defaults to 0 if no captions are generated.

image_metadata dict[str, Any]

Additional information about the image such as image location.

attachments_context list[Attachment]

Optional list of external context objects (files, bytes, or pre-processed inputs) that can enrich captioning results. Bytes are automatically converted into Attachment objects via Attachment.from_bytes.

output_schema str

Output schema. Defaults to empty string if not provided.

schema_description str

Schema description. Defaults to empty string if not provided.

language str

Language of the captions. Defaults to "Indonesian" if not provided.
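The documented defaults can be sketched as follows (a dataclass stand-in, not the real Pydantic model; field names follow the attribute list above):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class CaptionSketch:
    """Stand-in mirroring the documented Caption defaults."""
    image_one_liner: str = ""
    image_description: str = ""
    domain_knowledge: str = ""
    number_of_captions: int = 0
    image_metadata: dict[str, Any] = field(default_factory=dict)
    output_schema: str = ""
    schema_description: str = ""
    language: str = "Indonesian"  # documented default language

c = CaptionSketch(image_one_liner="A cat sleeping on a sofa.")
```

Note that every string field defaults to an empty string rather than None, which matches the None-handling validators described below.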

handle_none_attachments(attachments_value)

Normalize and validate attachments_context.

This method ensures that the attachments_context field is always a list of Attachment objects. It handles multiple input cases:

  • None -> returns an empty list
  • list[bytes] -> converts each item into an Attachment via Attachment.from_bytes
  • list[Attachment] -> keeps as-is
  • list[mixed] -> normalizes supported types, raises error on unsupported types
  • any other type -> raises TypeError

Parameters:

Name Type Description Default
attachments_value Any

Input value provided to attachments_context.

required

Returns:

Type Description
Any

list[Attachment]: A normalized list of Attachment objects.

Raises:

Type Description
TypeError

If an unsupported type is provided (e.g., str, dict).
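The normalization rules above can be sketched as a plain-Python function (the `Attachment` class here is a hypothetical stand-in for the real one; only its `from_bytes` constructor is taken from the documentation):

```python
from typing import Any

class Attachment:
    """Hypothetical stand-in for the real Attachment class."""
    def __init__(self, data: bytes):
        self.data = data

    @classmethod
    def from_bytes(cls, raw: bytes) -> "Attachment":
        return cls(raw)

def normalize_attachments(value: Any) -> list[Attachment]:
    """Plain-Python sketch of the documented normalization rules."""
    if value is None:
        return []                                   # None -> empty list
    if not isinstance(value, list):
        raise TypeError(f"Unsupported type: {type(value).__name__}")
    normalized = []
    for item in value:
        if isinstance(item, Attachment):
            normalized.append(item)                 # keep as-is
        elif isinstance(item, bytes):
            normalized.append(Attachment.from_bytes(item))
        else:
            raise TypeError(f"Unsupported item type: {type(item).__name__}")
    return normalized
```

Because the validator always returns a list of Attachment objects, downstream code can iterate over attachments_context without re-checking element types.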

handle_none_metadata(metadata_value) classmethod

Handle None values for image_metadata by using an empty dict.

handle_none_number_of_captions(caption_value) classmethod

Handle None values for number_of_captions by using the default.

handle_none_values(str_value) classmethod

Handle None values by converting them to default values.

CaptionResult

Bases: Caption

Result of a caption operation.

Attributes:

Name Type Description
captions str | list[str] | dict[str, Any]

The caption result.

Keyframe

Bases: BaseModel

Represents a keyframe extracted from a video segment.

Attributes:

Name Type Description
time_offset float

Time within the segment where the keyframe occurs.

caption str | None

Text description of this specific keyframe.

Mermaid

Bases: BaseModel

Additional metadata for Mermaid diagram generation.

Attributes:

Name Type Description
diagram_type str

Type of the diagram to be generated.

context str

Additional context used to generate the Mermaid diagram.

Segment

Bases: BaseModel

Represents a video segment with its captions, transcripts, and keyframes.

Attributes:

Name Type Description
start_time float | None

The segment's starting time in seconds.

end_time float | None

The segment's ending time in seconds.

transcripts list[AudioTranscript]

Optional list of transcripts for the segment.

segment_caption list[str]

Rich description of the segment's action or plot.

keyframes list[Keyframe]

Optional list of keyframes extracted from the segment.

ensure_caption()

Ensure the segment has a caption, falling back to keyframe captions or transcripts if needed.

ensure_keyframes()

Ensure all keyframe time offsets are non-negative.

ensure_transcripts()

Ensure all transcript time offsets are non-negative.
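The three validators above can be sketched over plain dicts (the fallback order — keyframe captions first, then transcript text — and the clamp-to-zero correction are assumptions; the docstrings only state the guarantees, not the exact repair):

```python
def clamp_offsets(items: list[dict], key: str) -> list[dict]:
    """Clamp negative time values to 0.0 (exact correction is an assumption)."""
    for item in items:
        if item[key] < 0:
            item[key] = 0.0
    return items

def ensure_caption(segment: dict) -> dict:
    """If the segment has no caption, fall back to keyframe captions,
    then to transcript text (order is an assumption)."""
    if not segment.get("segment_caption"):
        captions = [kf["caption"] for kf in segment.get("keyframes", []) if kf.get("caption")]
        segment["segment_caption"] = captions or [t["text"] for t in segment.get("transcripts", [])]
    return segment

seg = {
    "segment_caption": [],
    "keyframes": [{"time_offset": -0.5, "caption": "A door opens."}],
    "transcripts": [{"time_offset": 1.0, "text": "Hello?"}],
}
clamp_offsets(seg["keyframes"], "time_offset")
ensure_caption(seg)
```

Together these validators guarantee that any validated Segment has a usable caption and only non-negative offsets, even when the upstream captioning step produced partial output.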

TextResult

Bases: BaseModel

Base class for all image-to-text operation results.

This class provides the foundation for structured results from any image-to-text operation, including:

  • Image Captioning
  • Scene Text Detection

Attributes:

Name Type Description
text str

The extracted or generated text from the image. This is the primary output of any image-to-text operation. May be empty if the operation fails or no text is found.

metadata dict[str, Any] | BaseModel

Additional metadata from the conversion process.

VideoCaptionMetadata

Bases: BaseModel

Metadata for video captioning results.

Attributes:

Name Type Description
video_summary str

A high-level summary of the entire video's plot, topic, or main events.

segments list[Segment]

List of video segments with their captions and metadata.

ensure_segment_end_time_greater_than_start_time()

Ensure segment end time is greater than start time.

If the end time is less than or equal to the start time, the next segment's start time minus 1 is used as the end time.
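The repair rule above can be sketched over plain dicts (units are assumed to be seconds, matching the Segment fields; how the final segment is handled when it has no successor is not documented and is left unchanged here):

```python
def fix_segment_end_times(segments: list[dict]) -> list[dict]:
    """Sketch of the documented rule: when end_time <= start_time,
    use the next segment's start_time minus 1 as the end_time."""
    for i, seg in enumerate(segments):
        start, end = seg.get("start_time"), seg.get("end_time")
        if start is not None and end is not None and end <= start and i + 1 < len(segments):
            seg["end_time"] = segments[i + 1]["start_time"] - 1
    return segments

segments = [
    {"start_time": 0.0, "end_time": 0.0},   # invalid: end <= start
    {"start_time": 12.0, "end_time": 20.0},
]
fix_segment_end_times(segments)
```

This keeps segment boundaries monotonically ordered even when the captioning model emits degenerate timestamps.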

VideoCaptionResult

Bases: VideoCaptionMetadata

Backward-compatible model alias for the video caption result payload.