Schema
Unified schema exports for gllm_multimodal.
AudioTranscript
Bases: BaseModel
A class representing an audio transcript.
An audio transcript is a textual record of spoken content from audio or video sources, including timing information and optional language identification. It provides a structured way to store and manage transcribed audio data.
Attributes:
| Name | Type | Description |
|---|---|---|
| `text` | `str` | The text of the transcript. |
| `start_time` | `float` | The start time of the transcript in seconds. |
| `end_time` | `float` | The end time of the transcript in seconds. |
| `lang_id` | `str \| None` | The language ID of the transcript. |
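A minimal sketch of the shape described above, using a stdlib dataclass as a stand-in (the real class is a pydantic `BaseModel` in gllm_multimodal, which also validates field types):

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in mirroring the documented fields; the real model lives in
# gllm_multimodal and is validated by pydantic.
@dataclass
class AudioTranscript:
    text: str                      # the transcribed text
    start_time: float              # start of the spoken span, in seconds
    end_time: float                # end of the spoken span, in seconds
    lang_id: Optional[str] = None  # optional language identifier, e.g. "en"

segment = AudioTranscript(text="Hello, world.", start_time=0.0, end_time=1.4, lang_id="en")
duration = segment.end_time - segment.start_time
```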
Caption
Bases: BaseModel
Result class for image captioning operations.
This class extends ImageToTextResult to provide a structured format for image captioning results, supporting:

- Multiple caption types (one-liner, detailed, domain-specific)
- Caption count tracking
- Metadata storage for processing details
Attributes:
| Name | Type | Description |
|---|---|---|
| `image_one_liner` | `str` | Brief, single-sentence summary of the image. Defaults to empty string if not provided. |
| `image_description` | `str` | Detailed, multi-sentence description of the image. Defaults to empty string if not provided. |
| `domain_knowledge` | `str` | Domain-specific interpretation or context. Defaults to empty string if not provided. |
| `number_of_captions` | `int` | Total number of distinct captions generated. Defaults to 0 if no captions are generated. |
| `image_metadata` | `dict[str, Any]` | Additional information about the image, such as image location. |
| `attachments_context` | `list[Attachment]` | Optional list of external context objects (files, bytes, or pre-processed inputs) that can enrich captioning results. Bytes are automatically converted into Attachment objects via `Attachment.from_bytes`. |
| `output_schema` | `str` | Output schema. Defaults to empty string if not provided. |
| `schema_description` | `str` | Schema description. Defaults to empty string if not provided. |
| `language` | `str` | Language of the captions. Defaults to "Indonesian" if not provided. |
handle_none_attachments(attachments_value)
Normalize and validate attachments_context.
This method ensures that the attachments_context field is always a list of Attachment objects. It handles multiple input cases:

- None -> returns an empty list
- list[bytes] -> converts each item into an Attachment via `Attachment.from_bytes`
- list[Attachment] -> keeps the list as-is
- list of mixed types -> normalizes supported types, raises an error on unsupported types
- any other type -> raises TypeError
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `attachments_value` | `Any` | Input value provided to `attachments_context`. | required |
Returns:
| Type | Description |
|---|---|
| `Any` | `list[Attachment]`: A normalized list of `Attachment` objects. |
Raises:
| Type | Description |
|---|---|
| `TypeError` | If an unsupported type is provided (e.g., `str`, `dict`). |
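The normalization rules above can be sketched as a plain function. The `Attachment` class here is a minimal stand-in for gllm_multimodal's own type, kept only so the example is self-contained:

```python
from typing import Any

class Attachment:
    """Minimal stand-in for gllm_multimodal's Attachment class."""
    def __init__(self, data: bytes):
        self.data = data

    @classmethod
    def from_bytes(cls, raw: bytes) -> "Attachment":
        return cls(raw)

def normalize_attachments(value: Any) -> list:
    """Mirror the input cases handled by handle_none_attachments."""
    if value is None:
        return []  # None -> empty list
    if not isinstance(value, list):
        # str, dict, or any other non-list input is rejected outright
        raise TypeError(f"Unsupported attachments_context type: {type(value).__name__}")
    normalized = []
    for item in value:
        if isinstance(item, Attachment):
            normalized.append(item)                         # keep as-is
        elif isinstance(item, bytes):
            normalized.append(Attachment.from_bytes(item))  # bytes -> Attachment
        else:
            raise TypeError(f"Unsupported attachment item type: {type(item).__name__}")
    return normalized
```

Mixed lists of bytes and Attachment objects are therefore accepted, while a bare string or dict raises `TypeError`.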
handle_none_metadata(metadata_value)
classmethod
Handle None values for image_metadata by using empty dict.
handle_none_number_of_captions(caption_value)
classmethod
Handle None values for number_of_captions by using default.
handle_none_values(str_value)
classmethod
Handle None values by converting them to default values.
CaptionResult
Bases: Caption
Result of a caption operation.
Attributes:
| Name | Type | Description |
|---|---|---|
| `captions` | `str \| list[str] \| dict[str, Any]` | The caption result. |
Keyframe
Bases: BaseModel
Represents a keyframe extracted from a video segment.
Attributes:
| Name | Type | Description |
|---|---|---|
| `time_offset` | `float` | Time within the segment where the keyframe occurs. |
| `caption` | `str \| None` | Text description of this specific keyframe. |
Mermaid
Bases: BaseModel
Additional metadata for Mermaid diagram generation.
Attributes:
| Name | Type | Description |
|---|---|---|
| `diagram_type` | `str` | Type of the diagram to be generated. |
| `context` | `str` | Additional context used to generate the Mermaid diagram. |
Segment
Bases: BaseModel
Represents a video segment with its captions, transcripts, and keyframes.
Attributes:
| Name | Type | Description |
|---|---|---|
| `start_time` | `float \| None` | The segment's starting time in seconds. |
| `end_time` | `float \| None` | The segment's ending time in seconds. |
| `transcripts` | `list[AudioTranscript]` | Optional list of transcripts for the segment. |
| `segment_caption` | `list[str]` | The single, rich description of the segment's action/plot. |
| `keyframes` | `list[Keyframe]` | Optional list of keyframes extracted from the segment. |
ensure_caption()
Ensure the segment has a caption, falling back to keyframes/transcripts if needed.
ensure_keyframes()
Ensure all keyframe time offsets are non-negative.
ensure_transcripts()
Ensure all transcript time offsets are non-negative.
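One plausible reading of the non-negativity checks is that negative offsets are clamped to zero; this is an assumption for illustration only, as the real validators may instead raise a validation error:

```python
def clamp_time_offsets(offsets: list[float]) -> list[float]:
    # Assumed behavior: negative time offsets are clamped to 0.0.
    # The actual ensure_keyframes/ensure_transcripts validators in
    # gllm_multimodal may reject such values instead.
    return [max(0.0, t) for t in offsets]
```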
TextResult
Bases: BaseModel
Base class for all image-to-text operation results.
This class provides the foundation for structured results from any image-to-text operation, including:

- Image Captioning
- Scene Text Detection
Attributes:
| Name | Type | Description |
|---|---|---|
| `text` | `str` | The extracted or generated text from the image. This is the primary output of any image-to-text operation. May be empty if the operation fails or no text is found. |
| `metadata` | `dict[str, Any] \| BaseModel` | Additional metadata from the conversion process. |
VideoCaptionMetadata
Bases: BaseModel
Metadata for video captioning results.
Attributes:
| Name | Type | Description |
|---|---|---|
| `video_summary` | `str` | A high-level summary of the entire video's plot, topic, or main events. |
| `segments` | `list[Segment]` | List of video segments with their captions and metadata. |
ensure_segment_end_time_greater_than_start_time()
Ensure each segment's end time is greater than its start time.
If the end time is equal to or lower than the start time, the next segment's start time minus 1 is used as the end time.
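The end-time rule above can be sketched as follows; segments are plain dicts here for illustration, whereas the real validator operates on `Segment` models:

```python
def fix_segment_end_times(segments: list[dict]) -> list[dict]:
    """If a segment's end_time is less than or equal to its start_time,
    fall back to the next segment's start_time minus 1 (the rule
    documented for ensure_segment_end_time_greater_than_start_time)."""
    for i, seg in enumerate(segments):
        bad_end = seg["end_time"] is not None and seg["end_time"] <= seg["start_time"]
        if bad_end and i + 1 < len(segments):
            seg["end_time"] = segments[i + 1]["start_time"] - 1
    return segments

segments = [
    {"start_time": 0.0, "end_time": 0.0},    # invalid: end <= start
    {"start_time": 10.0, "end_time": 18.0},  # valid, left untouched
]
fix_segment_end_times(segments)
# first segment's end_time becomes 10.0 - 1 = 9.0
```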