Text
Text processing utilities for the gllm_intl package.
NormalizationForm
Bases: StrEnum
Unicode normalization forms (See: [1]).
Attributes:
| Name | Type | Description |
|---|---|---|
| NFD | `str` | Canonical Decomposition. |
| NFC | `str` | Canonical Composition. |
| NFKD | `str` | Compatibility Decomposition. |
| NFKC | `str` | Compatibility Composition. |
coerce(form)
classmethod
Coerce a normalization form input into a NormalizationForm enum value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| form | `NormalizationForm \| str` | Candidate normalization form to coerce. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| NormalizationForm | `NormalizationForm` | Enum value representing the normalization form. |
Raises:
| Type | Description |
|---|---|
| ValueError | If … |
validate(form)
classmethod
Validate normalization form strings against NormalizationForm enum.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| form | `str` | Normalization form candidate to validate. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If … |
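The coercion and validation described above can be sketched with a plain string-valued enum. This is a minimal illustration, not the package's implementation: `FormSketch` is a hypothetical stand-in for `NormalizationForm`, the error message is invented, and it is an assumption that `validate` returns nothing on success.

```python
from enum import Enum


class FormSketch(str, Enum):
    """Hypothetical stand-in for NormalizationForm (not the library class)."""

    NFD = "NFD"
    NFC = "NFC"
    NFKD = "NFKD"
    NFKC = "NFKC"

    @classmethod
    def coerce(cls, form):
        # Enum lookup by value accepts both existing members and their strings.
        try:
            return cls(form)
        except ValueError:
            allowed = ", ".join(member.value for member in cls)
            raise ValueError(
                f"Unsupported form {form!r}; expected one of: {allowed}"
            ) from None

    @classmethod
    def validate(cls, form):
        # Assumed behaviour: raise ValueError on failure, return None otherwise.
        cls.coerce(form)
```

With this sketch, `FormSketch.coerce("NFKC")` and `FormSketch.coerce(FormSketch.NFKC)` both return the same member, while an unknown string raises `ValueError`.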
SupportedScripts
Bases: StrEnum
Enumerates ICU-supported script targets for transliteration.
get_or_create_transliterator(target_script, source_script=None)
Return an ICU transliterator for the requested script pair.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| target_script | `str \| SupportedScripts` | Desired output script. | required |
| source_script | `str \| SupportedScripts \| None` | Optional source script hint. Defaults to None, which causes ICU to auto-detect the source script. | None |
Returns:
| Type | Description |
|---|---|
| `icu.Transliterator` | ICU transliterator instance configured for the requested scripts. |
Raises:
| Type | Description |
|---|---|
| ValueError | If either script is unsupported. |
| RuntimeError | If creating the ICU transliterator fails. |
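The name suggests a lookup-then-create cache keyed by script pair. The sketch below shows one way that could work, under stated assumptions: the cache dictionary, `build_transliterator_id`, and `get_or_create_sketch` are hypothetical names, and the `factory` argument stands in for whatever creates the real object (with PyICU that would presumably be `icu.Transliterator.createInstance`). ICU transliterator IDs do follow the `Source-Target` pattern, with `Any` as the source when the script should be detected.

```python
from typing import Callable, Dict, Optional

# Module-level cache keyed by ICU-style transliterator ID (hypothetical).
_TRANSLITERATORS: Dict[str, object] = {}


def build_transliterator_id(
    target_script: str, source_script: Optional[str] = None
) -> str:
    """Build an ID such as "Cyrillic-Latin"; "Any" asks ICU to detect the source."""
    source = source_script.capitalize() if source_script else "Any"
    return f"{source}-{target_script.capitalize()}"


def get_or_create_sketch(
    target_script: str,
    source_script: Optional[str],
    factory: Callable[[str], object],
) -> object:
    """Return the cached transliterator for the pair, creating it on first use."""
    transliterator_id = build_transliterator_id(target_script, source_script)
    if transliterator_id not in _TRANSLITERATORS:
        try:
            _TRANSLITERATORS[transliterator_id] = factory(transliterator_id)
        except Exception as exc:
            # Mirrors the documented RuntimeError on creation failure.
            raise RuntimeError(
                f"Could not create transliterator {transliterator_id!r}"
            ) from exc
    return _TRANSLITERATORS[transliterator_id]
```

Injecting the factory keeps the sketch free of the ICU dependency; repeated calls with the same script pair return the same object without invoking the factory again.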
normalize_and_strip(text, form=NormalizationForm.NFC)
Normalize text and remove diacritics in one operation.
Examples:
normalize_and_strip("café") # "cafe"
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | `str \| Sequence[str \| None] \| None` | A single string, a sequence of strings, or None. | required |
| form | `NormalizationForm \| str` | Unicode normalization form applied before stripping. Defaults to NFC. | NFC |
Returns:
| Type | Description |
|---|---|
| `str \| list[str]` | The normalized text with diacritics removed, preserving the input type. |
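To make the combined operation concrete, here is a minimal single-string sketch built on the standard library's `unicodedata` module. It is an assumption about how such a pipeline could be ordered, not the package's actual code; `normalize_and_strip_sketch` is a hypothetical name and sequence or None inputs are not handled.

```python
import unicodedata


def normalize_and_strip_sketch(text: str, form: str = "NFC") -> str:
    # Step 1: bring the input into the requested normalization form.
    normalized = unicodedata.normalize(form, text)
    # Step 2: decompose so that accents become separate combining characters.
    decomposed = unicodedata.normalize("NFD", normalized)
    # Step 3: drop every character with a non-zero canonical combining class.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Step 4: return the result in the requested form.
    return unicodedata.normalize(form, stripped)
```

In this sketch the chosen form matters for compatibility characters: with `form="NFKC"` the ligature in `"\u00c4\ufb03n"` is expanded to plain `ffi`, whereas with the default `"NFC"` the ligature character U+FB03 survives and only the diaeresis is removed.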
normalize_text(text, form=NormalizationForm.NFC)
Normalize Unicode text to a canonical or compatibility form.
Normalization Forms
Unicode defines four forms: NFD (canonical decomposition), NFC (canonical composition), NFKD (compatibility decomposition), and NFKC (compatibility composition).
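- "café" (é as the single code point U+00E9) becomes "cafe\u0301" (e followed by the combining acute accent U+0301) in NFD and stays "café" in NFC.
- "Äffin" becomes "A\u0308ffin" (A followed by the combining diaeresis U+0308) in NFKD and "Äffin" (precomposed Ä, U+00C4) in NFKC; the compatibility forms additionally expand ligatures such as "ﬃ" (U+FB03) to "ffi".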
Examples:
normalize_text("café", form=NormalizationForm.NFD) # "cafe\u0301" (e + combining acute accent)
normalize_text("café", form=NormalizationForm.NFC) # "café" (precomposed é, U+00E9)
normalize_text("Äffin", form=NormalizationForm.NFKD) # "A\u0308ffin" (A + combining diaeresis)
normalize_text("Äffin", form=NormalizationForm.NFKC) # "Äffin" (precomposed Ä, U+00C4)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | `str \| Sequence[str \| None] \| None` | A single string, a sequence of strings, or None. | required |
| form | `NormalizationForm \| str` | Unicode normalization form. Defaults to NFC. | NFC |
Returns:
| Type | Description |
|---|---|
| `str \| list[str]` | The normalized text, preserving the input type. |
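Because composed and decomposed strings render identically, the difference between the four forms is easiest to see at the code point level. The snippet below uses only the standard library's `unicodedata.normalize` rather than `normalize_text` itself; `code_points` is a hypothetical helper, and the inputs are written with escapes so the starting code points are unambiguous.

```python
import unicodedata


def code_points(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


cafe = "caf\u00e9"        # precomposed é (U+00E9)
affin = "\u00c4\ufb03n"   # precomposed Ä followed by the ffi ligature (U+FB03)

for form in ("NFD", "NFC", "NFKD", "NFKC"):
    print(
        form,
        code_points(unicodedata.normalize(form, cafe)),
        "|",
        code_points(unicodedata.normalize(form, affin)),
    )
```

The canonical forms (NFD, NFC) leave the ligature untouched and only split or join the accent, while the compatibility forms (NFKD, NFKC) also replace the ligature with the three letters `f`, `f`, `i`.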
remove_diacritics(text)
Remove combining diacritical marks from Unicode text.
Examples:
remove_diacritics("café") # "cafe"
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | `str \| Sequence[str \| None] \| None` | A single string, a sequence of strings, or None. | required |
Returns:
| Type | Description |
|---|---|
| `str \| list[str]` | The text with diacritics removed, preserving the input type. |
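The "preserving the input type" behaviour can be illustrated with a small dispatch on the argument type. This is a sketch under assumptions: `strip_marks` and `remove_diacritics_sketch` are hypothetical names, marks are identified by the Unicode category `Mn`, and None handling is left out because the documentation above does not say what None maps to.

```python
import unicodedata
from typing import List, Sequence, Union


def strip_marks(text: str) -> str:
    """Decompose the text and drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def remove_diacritics_sketch(
    text: Union[str, Sequence[str]]
) -> Union[str, List[str]]:
    # A single string yields a string; any other sequence yields a list.
    if isinstance(text, str):
        return strip_marks(text)
    return [strip_marks(item) for item in text]
```

One consequence of removing only combining marks: letters without a canonical decomposition are left alone. In this sketch `"\u0141\u00f3d\u017a"` (Łódź) becomes `"\u0141odz"`, because Ł (U+0141) is a distinct letter rather than L plus a combining mark.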
to_ascii(text, preserve_case=True)
Convert arbitrary Unicode text to an ASCII representation.
Examples:
from gllm_intl.text.transliteration import to_ascii
to_ascii("Привет мир")
# "Privet mir"
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | `str` | Input text, potentially containing non-ASCII characters. | required |
| preserve_case | `bool` | Whether to keep the output casing as-is. When … | True |
Returns:
| Name | Type | Description |
|---|---|---|
| str | `str` | ASCII-only text suitable for search indexing and slug generation. |
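As an illustration of the slug-generation use case, the sketch below builds a slug from any ASCII-conversion callable. Both `ascii_fold` and `slugify` are hypothetical names. `ascii_fold` is a standard-library stand-in that handles only Latin letters with diacritics and silently drops anything it cannot decompose (Cyrillic, for example); it is not equivalent to `to_ascii`, which the example above shows transliterating Cyrillic. In practice the package's `to_ascii` would be passed in instead.

```python
import re
import unicodedata
from typing import Callable


def ascii_fold(text: str) -> str:
    """Stdlib stand-in: decompose, then discard every non-ASCII code point."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def slugify(text: str, to_ascii: Callable[[str], str] = ascii_fold) -> str:
    """Lowercase the ASCII form and join alphanumeric runs with hyphens."""
    ascii_text = to_ascii(text).lower()
    return re.sub(r"[^a-z0-9]+", "-", ascii_text).strip("-")
```

Passing the converter as a parameter keeps the slug logic independent of how the ASCII form is produced.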
transliterate(text, target_script, source_script=None)
Transliterate text between supported writing systems.
Examples:
from gllm_intl.text.transliteration import transliterate
transliterate("Привет мир", "LATIN", "CYRILLIC")
# "Privet mir"
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | `str` | Source text to convert. Empty strings return empty strings. | required |
| target_script | `str \| SupportedScripts` | Desired output script. Must belong to SupportedScripts. | required |
| source_script | `str \| SupportedScripts \| None` | Optional source script hint. Defaults to None, which causes ICU to auto-detect the source script. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| str | `str` | Transliterated text. Characters without matches remain unchanged. |
Raises:
| Type | Description |
|---|---|
| ValueError | If provided scripts are unsupported. |
| RuntimeError | If ICU transliterator creation fails. |