Transliteration
Transliteration utilities powered by PyICU and Unidecode.
This module exposes a lightweight wrapper around ICU transliterators for script-to-script conversion, alongside ASCII fallback helpers.
References
[1] https://unicode-org.github.io/icu/userguide/transforms/general/
SupportedScripts
Bases: StrEnum
Enumerates ICU-supported script targets for transliteration.
get_or_create_transliterator(target_script, source_script=None)
Return an ICU transliterator for the requested script pair.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| target_script | str \| SupportedScripts | Desired output script. | required |
| source_script | str \| SupportedScripts \| None | Optional source script hint. Defaults to None, which causes ICU to auto-detect the source script. | None |
Returns:
| Type | Description |
|---|---|
| Transliterator | icu.Transliterator: ICU transliterator instance configured for the requested scripts. |
Raises:
| Type | Description |
|---|---|
| ValueError | If either script is unsupported. |
| RuntimeError | If creating the ICU transliterator fails. |
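As a rough illustration of how a script pair might map onto an ICU transform ID, the sketch below builds IDs of the form `Source-Target` and falls back to ICU's `Any` source when no hint is given. The `Script` enum, its members, and `build_transform_id` are hypothetical names chosen for this example; they are not this module's implementation, and the real members of SupportedScripts may differ.

```python
from __future__ import annotations

from enum import Enum


class Script(str, Enum):
    """Hypothetical stand-in for SupportedScripts (real members may differ)."""

    LATIN = "Latin"
    CYRILLIC = "Cyrillic"
    GREEK = "Greek"


def _coerce(script: str | Script) -> Script:
    """Accept an enum member or a case-insensitive member name such as "LATIN"."""
    if isinstance(script, Script):
        return script
    try:
        return Script[script.upper()]
    except KeyError:
        # Mirrors the documented contract: unsupported scripts raise ValueError.
        raise ValueError(f"Unsupported script: {script!r}") from None


def build_transform_id(
    target_script: str | Script,
    source_script: str | Script | None = None,
) -> str:
    """Build an ICU-style transform ID, e.g. "Cyrillic-Latin" or "Any-Latin"."""
    target = _coerce(target_script)
    # "Any" is ICU's catch-all source, used here when no hint is supplied.
    source = "Any" if source_script is None else _coerce(source_script).value
    return f"{source}-{target.value}"
```

With PyICU installed, an ID built this way would typically be handed to `icu.Transliterator.createInstance(...)`; whether this module does so, and whether it caches the resulting instances, is an assumption suggested only by the function name.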
to_ascii(text, preserve_case=True)
Convert arbitrary Unicode text to an ASCII representation.
Examples:
from gllm_intl.text.transliteration import to_ascii
to_ascii("Привет мир")
# "Privet mir"
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text, potentially containing non-ASCII characters. | required |
| preserve_case | bool | Whether to keep the output casing as-is. When | True |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | ASCII-only text suitable for search indexing and slug generation. |
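To make the slug-generation use case concrete, here is a minimal standard-library sketch of ASCII folding followed by slug construction. It is not how `to_ascii` is implemented: `fold_to_ascii` and `slugify` are hypothetical helpers, and the NFKD approach only strips diacritics, so it cannot romanize scripts such as Cyrillic the way the example above does.

```python
import re
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Drop diacritics via NFKD decomposition (stdlib-only approximation)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Combining marks and every other non-ASCII code point are discarded,
    # so non-Latin letters disappear rather than being transliterated.
    return decomposed.encode("ascii", "ignore").decode("ascii")


def slugify(text: str) -> str:
    """Turn folded text into a lowercase, hyphen-separated slug."""
    folded = fold_to_ascii(text).lower()
    # Collapse each run of non-alphanumeric characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", folded).strip("-")


print(fold_to_ascii("Café Zürich"))  # Cafe Zurich
print(slugify("Café Zürich 2024"))   # cafe-zurich-2024
```

In practice the folding step would be swapped for `to_ascii`, which covers non-Latin input as well.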
transliterate(text, target_script, source_script=None)
Transliterate text between supported writing systems.
Examples:
from gllm_intl.text.transliteration import transliterate
transliterate("Привет мир", "LATIN", "CYRILLIC")
# "Privet mir"
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Source text to convert. Empty strings return empty strings. | required |
| target_script | str \| SupportedScripts | Desired output script. Must belong to | required |
| source_script | str \| SupportedScripts \| None | Optional source script hint. Defaults to None, which causes ICU to auto-detect the source script. | None |
Returns:
| Name | Type | Description |
|---|---|---|
| str | str | Transliterated text. Characters without matches remain unchanged. |
Raises:
| Type | Description |
|---|---|
| ValueError | If provided scripts are unsupported. |
| RuntimeError | If ICU transliterator creation fails. |
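The pass-through behaviour described under Returns (characters without matches remain unchanged, and empty input yields empty output) can be pictured with a toy mapping table. This sketch uses `str.translate` with a deliberately tiny, hand-written Cyrillic-to-Latin table; `TOY_TABLE` and `toy_transliterate` are hypothetical and far simpler than the ICU rules that `transliterate` relies on.

```python
# Toy table covering only the letters needed for the demo string below.
# Assumption: chosen purely for illustration; ICU's rule sets are far larger.
TOY_TABLE = str.maketrans(
    {"П": "P", "р": "r", "и": "i", "в": "v", "е": "e", "т": "t", "м": "m"}
)


def toy_transliterate(text: str) -> str:
    """Map known characters and leave everything else untouched."""
    if not text:
        # Empty strings return empty strings.
        return ""
    # str.translate copies any character that is missing from the table.
    return text.translate(TOY_TABLE)


print(toy_transliterate("Привет мир"))     # Privet mir
print(toy_transliterate("Привет, 世界!"))  # Privet, 世界!
```

The second call shows the unmapped punctuation and CJK characters surviving as-is, which matches the documented contract for unmatched characters.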