
Transliteration

Transliteration utilities powered by PyICU and Unidecode.

This module exposes a lightweight wrapper around ICU transliterators for script-to-script conversion, along with ASCII fallback helpers.

Authors

Dimitrij Ray (dimitrij.ray@gdplabs.id)

References

[1] https://unicode-org.github.io/icu/userguide/transforms/general/

SupportedScripts

Bases: StrEnum

Enumerates ICU-supported script targets for transliteration.

get_or_create_transliterator(target_script, source_script=None)

Return an ICU transliterator for the requested script pair.

Parameters:

target_script (str | SupportedScripts, required): Desired output script.

source_script (str | SupportedScripts | None, default None): Optional source script hint. When None, ICU auto-detects the source script.

Returns:

icu.Transliterator: ICU transliterator instance configured for the requested scripts.

Raises:

ValueError: If either script is unsupported.

RuntimeError: If creating the ICU transliterator fails.
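The lookup described above can be sketched roughly as follows. This is a minimal illustration and not the module's actual implementation: the `Script` enum, `normalize_script`, `build_transform_id`, `get_or_create`, and the `factory` parameter are all hypothetical names, and the real SupportedScripts members may differ. The only ICU facts relied on are that transform IDs take the form "Source-Target" (for example "Cyrillic-Latin" or "Any-Latin") and that PyICU exposes `icu.Transliterator.createInstance`.

```python
from enum import Enum


class Script(str, Enum):
    """Hypothetical stand-in for SupportedScripts; the real members may differ."""

    LATIN = "Latin"
    CYRILLIC = "Cyrillic"
    GREEK = "Greek"


def normalize_script(value):
    """Resolve a name such as "latin" or a Script member, else raise ValueError."""
    if isinstance(value, Script):
        return value
    try:
        return Script[str(value).upper()]
    except KeyError:
        raise ValueError(f"Unsupported script: {value!r}") from None


def build_transform_id(target_script, source_script=None):
    """Build an ICU transform ID of the form "Source-Target"."""
    target = normalize_script(target_script)
    # "Any" is the ICU source used when no specific source script is given.
    source = "Any" if source_script is None else normalize_script(source_script).value
    return f"{source}-{target.value}"


_CACHE = {}  # transform ID -> transliterator instance


def get_or_create(target_script, source_script=None, factory=None):
    """Return a cached transliterator, creating it on first use."""
    transform_id = build_transform_id(target_script, source_script)
    if transform_id not in _CACHE:
        if factory is None:
            import icu  # PyICU; imported lazily so the rest runs without it

            factory = icu.Transliterator.createInstance
        try:
            _CACHE[transform_id] = factory(transform_id)
        except Exception as exc:
            raise RuntimeError(
                f"Could not create transliterator {transform_id!r}"
            ) from exc
    return _CACHE[transform_id]
```

Passing a `factory` is purely a convenience of this sketch: it lets the validation and caching logic be exercised without PyICU installed.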

to_ascii(text, preserve_case=True)

Convert arbitrary Unicode text to an ASCII representation.

Examples:

from gllm_intl.text.transliteration import to_ascii

to_ascii("Привет мир")
# "Privet mir"

Parameters:

text (str, required): Input text, potentially containing non-ASCII characters.

preserve_case (bool, default True): Whether to keep the output casing as-is. When False, the result is lowercased for case-insensitive comparisons.

Returns:

str: ASCII-only text suitable for search indexing and slug generation.
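To illustrate the `preserve_case` contract without any third-party dependency, here is a small standard-library sketch. It is a different and much weaker technique than the Unidecode-backed to_ascii documented here: it only strips combining marks after NFKD decomposition, so characters with no ASCII base letter (such as Cyrillic) are dropped instead of romanized. The name `ascii_fold` is hypothetical.

```python
import unicodedata


def ascii_fold(text, preserve_case=True):
    """Strip diacritics via NFKD decomposition and drop remaining non-ASCII."""
    # NFKD splits "é" into "e" plus a combining accent, which is then discarded.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Lowercasing makes the result usable as a case-insensitive comparison key.
    return ascii_text if preserve_case else ascii_text.lower()


print(ascii_fold("Café Zürich"))                       # Cafe Zurich
print(ascii_fold("Café Zürich", preserve_case=False))  # cafe zurich
```

For input like "Привет мир" this fallback keeps only the space, which is why a transliteration table such as Unidecode's is needed for real ASCII conversion.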

transliterate(text, target_script, source_script=None)

Transliterate text between supported writing systems.

Examples:

from gllm_intl.text.transliteration import transliterate

transliterate("Привет мир", "LATIN", "CYRILLIC")
# "Privet mir"

Parameters:

text (str, required): Source text to convert. Empty strings return empty strings.

target_script (str | SupportedScripts, required): Desired output script. Must belong to SupportedScripts.

source_script (str | SupportedScripts | None, default None): Optional source script hint. When None, ICU auto-detects the source script.

Returns:

str: Transliterated text. Characters without matches remain unchanged.

Raises:

ValueError: If the provided scripts are unsupported.

RuntimeError: If ICU transliterator creation fails.
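Two behaviours noted above, empty input returning an empty string and unmatched characters passing through unchanged, can be demonstrated with a toy table-based sketch. This is not how ICU works internally; `toy_transliterate` and its seven-letter table are hypothetical and cover only the letters of the "Привет мир" example.

```python
# Toy Cyrillic-to-Latin table covering only the letters in "Привет мир".
_TOY_TABLE = str.maketrans({
    "П": "P", "р": "r", "и": "i", "в": "v",
    "е": "e", "т": "t", "м": "m",
})


def toy_transliterate(text):
    """Map known characters through the table; leave everything else as-is."""
    if not text:
        return ""
    # str.translate leaves any character missing from the table untouched.
    return text.translate(_TOY_TABLE)


print(toy_transliterate("Привет мир"))     # Privet mir
print(toy_transliterate("Привет, 世界!"))  # Privet, 世界!
```

The second call shows the pass-through: the comma, space, CJK characters, and exclamation mark have no table entry, so they are emitted unchanged.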