Text

Text processing utilities for the gllm_intl package.

NormalizationForm

Bases: StrEnum

Unicode normalization forms (See: [1]).

Attributes:

Name Type Description
NFD str

Canonical Decomposition.

NFC str

Canonical Composition.

NFKD str

Compatibility Decomposition.

NFKC str

Compatibility Composition.

coerce(form) classmethod

Coerce a normalization form input into a NormalizationForm enum value.

Parameters:

Name Type Description Default
form NormalizationForm | str

Candidate normalization form to coerce.

required

Returns:

Name Type Description
NormalizationForm NormalizationForm

Enum value representing the normalization form.

Raises:

Type Description
ValueError

If form is not a supported normalization form.

validate(form) classmethod

Validate normalization form strings against NormalizationForm enum.

Parameters:

Name Type Description Default
form str

Normalization form candidate to validate.

required

Raises:

Type Description
ValueError

If form is not one of the supported normalization forms.
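The relationship between coerce and validate can be sketched with a stand-in enum. This is an illustrative sketch, not the gllm_intl source: the class name NormalizationFormSketch is hypothetical, it subclasses (str, Enum) instead of StrEnum so that it also runs on Python versions before 3.11, and the exact-match, case-sensitive lookup is an assumption.

```python
from enum import Enum


class NormalizationFormSketch(str, Enum):
    """Hypothetical stand-in for NormalizationForm (not the library class)."""

    NFD = "NFD"
    NFC = "NFC"
    NFKD = "NFKD"
    NFKC = "NFKC"

    @classmethod
    def validate(cls, form: str) -> None:
        # Assumption: only exact, upper-case names are accepted.
        supported = [member.value for member in cls]
        if form not in supported:
            raise ValueError(
                f"Unsupported normalization form {form!r}; expected one of {supported}"
            )

    @classmethod
    def coerce(cls, form):
        # Enum members pass through untouched; strings are validated first.
        if isinstance(form, cls):
            return form
        cls.validate(form)
        return cls(form)


coerced = NormalizationFormSketch.coerce("NFKC")
```

In this sketch coerce delegates to validate, so both raise the same ValueError for an unsupported string.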

SupportedScripts

Bases: StrEnum

Enumerates ICU-supported script targets for transliteration.

get_or_create_transliterator(target_script, source_script=None)

Return an ICU transliterator for the requested script pair.

Parameters:

Name Type Description Default
target_script str | SupportedScripts

Desired output script.

required
source_script str | SupportedScripts | None

Optional source script hint. Defaults to None, which causes ICU to auto-detect the source script.

None

Returns:

Type Description
Transliterator

icu.Transliterator: ICU transliterator instance configured for the requested scripts.

Raises:

Type Description
ValueError

If either script is unsupported.

RuntimeError

If creating the ICU transliterator fails.
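ICU identifies a transform by an ID of the form "Source-Target", and the special source "Any" lets ICU handle input in any script. The sketch below shows one way a script pair could be turned into such an ID and cached. The helper names build_transform_id and cached_transliterator are hypothetical, the title-casing of script names and the use of lru_cache are assumptions (the latter suggested only by the "get_or_create" name), and PyICU must be installed for the second function to run.

```python
from functools import lru_cache
from typing import Optional


def build_transform_id(target_script: str, source_script: Optional[str] = None) -> str:
    """Build an ICU transform ID such as "Cyrillic-Latin" or "Any-Latin"."""
    # Assumption: script names arrive as plain words like "LATIN" or "CYRILLIC".
    source = source_script.title() if source_script else "Any"
    return f"{source}-{target_script.title()}"


@lru_cache(maxsize=None)
def cached_transliterator(transform_id: str):
    """Create an ICU transliterator once per transform ID (requires PyICU)."""
    import icu  # imported lazily so the ID helper works without PyICU

    try:
        return icu.Transliterator.createInstance(transform_id)
    except icu.ICUError as exc:
        # Mirrors the documented RuntimeError for failed creation.
        raise RuntimeError(f"Could not create transliterator {transform_id!r}") from exc


explicit_id = build_transform_id("LATIN", "CYRILLIC")  # "Cyrillic-Latin"
auto_id = build_transform_id("LATIN")                  # "Any-Latin"
```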

normalize_and_strip(text, form=NormalizationForm.NFC)

Normalize text and remove diacritics in one operation.

Examples:

normalize_and_strip("café")  # "cafe"

Parameters:

Name Type Description Default
text str | Sequence[str | None] | None

A single string, sequence of strings, or None to process.

required
form NormalizationForm | str

Unicode normalization form applied before stripping. Defaults to NormalizationForm.NFC.

NFC

Returns:

Type Description
str | list[str]

The normalized text with diacritics removed, preserving the input type.
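The choice of form matters once compatibility characters are involved. The following standard-library sketch is not the gllm_intl implementation: the function name normalize_then_strip is hypothetical, it handles a single string only, and the order of steps (normalize, decompose, drop nonspacing marks, normalize again) is an assumption about how such a function might work.

```python
import unicodedata


def normalize_then_strip(text: str, form: str = "NFC") -> str:
    """Normalize, drop nonspacing marks (category Mn), and normalize again."""
    normalized = unicodedata.normalize(form, text)
    # Decompose so that accents become separate combining code points.
    decomposed = unicodedata.normalize("NFD", normalized)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize(form, stripped)


plain = normalize_then_strip("caf\u00e9")                    # "cafe"
kept = normalize_then_strip("\ufb01anc\u00e9", "NFC")        # ligature U+FB01 survives
expanded = normalize_then_strip("\ufb01anc\u00e9", "NFKC")   # "fiance"
```

With NFC the "fi" ligature is preserved because canonical forms leave compatibility characters alone; NFKC expands it to the two letters "f" and "i".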

normalize_text(text, form=NormalizationForm.NFC)

Normalize Unicode text to a canonical or compatibility form.

Normalization Forms

Unicode defines four forms: NFD (canonical decomposition), NFC (canonical composition), NFKD (compatibility decomposition), and NFKC (compatibility composition).

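  1. "café" (with precomposed é, U+00E9) becomes "cafe" followed by the combining acute accent U+0301 in NFD, and stays precomposed in NFC. The two results render identically but differ in code points.
  2. "Äﬃn" (with the ligature ﬃ, U+FB03) becomes "A" + U+0308 + "ffin" in NFKD and "Äffin" with precomposed Ä in NFKC. Both compatibility forms expand the ligature to the plain letters "ffi".

Examples:

normalize_text("café", form=NormalizationForm.NFD)  # "cafe\u0301" (5 code points)
normalize_text("café", form=NormalizationForm.NFC)  # "caf\u00e9" (4 code points)
normalize_text("Äﬃn", form=NormalizationForm.NFKD)  # "A\u0308ffin" (6 code points)
normalize_text("Äﬃn", form=NormalizationForm.NFKC)  # "\u00c4ffin" (5 code points)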

Parameters:

Name Type Description Default
text str | Sequence[str | None] | None

A single string, sequence of strings, or None to normalize.

required
form NormalizationForm | str

Unicode normalization form. Defaults to NormalizationForm.NFC.

NFC

Returns:

Type Description
str | list[str]

str | list[str]: The normalized text, preserving the input type (str or list[str]). None inputs are converted to empty strings.
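Because normalized strings often render identically, inspecting code points is the most reliable way to see what a form did. The sketch below uses only the standard-library unicodedata module, not gllm_intl. The helper name code_points is hypothetical.

```python
import unicodedata


def code_points(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


word = "caf\u00e9"  # "café" with precomposed é
nfd = unicodedata.normalize("NFD", word)
nfc = unicodedata.normalize("NFC", nfd)

nfd_points = code_points(nfd)  # "U+0063 U+0061 U+0066 U+0065 U+0301"
nfc_points = code_points(nfc)  # "U+0063 U+0061 U+0066 U+00E9"

# "Ä" + ligature "ffi" (U+FB03) + "n": compatibility forms expand the ligature.
ligature_word = "\u00c4\ufb03n"
lengths = {
    form: len(unicodedata.normalize(form, ligature_word))
    for form in ("NFD", "NFC", "NFKD", "NFKC")
}
# lengths == {"NFD": 4, "NFC": 3, "NFKD": 6, "NFKC": 5}
```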

remove_diacritics(text)

Remove combining diacritical marks from Unicode text.

Examples:

remove_diacritics("café")  # "cafe"

Parameters:

Name Type Description Default
text str | Sequence[str | None] | None

A single string, sequence of strings, or None from which to remove diacritics.

required

Returns:

Type Description
str | list[str]

str | list[str]: The text with diacritics removed, preserving the input type. None inputs are converted to empty strings.
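The input handling described above (a string in, a string out; a sequence in, a list out; None becomes an empty string) can be sketched with the standard library. This is not the gllm_intl implementation: the names strip_marks and _strip_one are hypothetical, and mapping None items inside a sequence to "" is an assumption extrapolated from the documented behavior for a top-level None.

```python
import unicodedata
from typing import List, Optional, Sequence, Union


def _strip_one(value: Optional[str]) -> str:
    """Strip combining marks from one string; None becomes ''."""
    if value is None:
        return ""
    decomposed = unicodedata.normalize("NFD", value)
    # unicodedata.combining() returns 0 for base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_marks(
    text: Union[str, Sequence[Optional[str]], None]
) -> Union[str, List[str]]:
    """Return a str for str/None input and a list for any other sequence."""
    if text is None or isinstance(text, str):
        return _strip_one(text)
    return [_strip_one(item) for item in text]


single = strip_marks("caf\u00e9")             # "cafe"
many = strip_marks(["na\u00efve", None])      # ["naive", ""]
empty = strip_marks(None)                     # ""
```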

to_ascii(text, preserve_case=True)

Convert arbitrary Unicode text to an ASCII representation.

Examples:

from gllm_intl.text.transliteration import to_ascii

to_ascii("Привет мир")
# "Privet mir"

Parameters:

Name Type Description Default
text str

Input text, potentially containing non-ASCII characters.

required
preserve_case bool

Whether to keep the output casing as-is. When False, the result is converted to lowercase, which suits case-insensitive comparisons. Defaults to True.

True

Returns:

Name Type Description
str str

ASCII-only text suitable for search indexing and slug generation.
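To illustrate the preserve_case flag without depending on ICU, here is a rough standard-library approximation. It is deliberately weaker than to_ascii: the hypothetical ascii_fold below only folds characters that decompose to ASCII under NFKD and silently drops everything else, so it cannot reproduce the Cyrillic example above.

```python
import unicodedata


def ascii_fold(text: str, preserve_case: bool = True) -> str:
    """Fold accented Latin text to ASCII; non-decomposable characters are dropped."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" discards combining marks and other non-ASCII.
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_text if preserve_case else ascii_text.lower()


kept_case = ascii_fold("Cr\u00e8me Br\u00fbl\u00e9e")                        # "Creme Brulee"
lowered = ascii_fold("Cr\u00e8me Br\u00fbl\u00e9e", preserve_case=False)     # "creme brulee"
```

Scripts such as Cyrillic need a real transliterator; a decomposition-based fold like this one returns an empty string for "Привет".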

transliterate(text, target_script, source_script=None)

Transliterate text between supported writing systems.

Examples:

from gllm_intl.text.transliteration import transliterate

transliterate("Привет мир", "LATIN", "CYRILLIC")
# "Privet mir"

Parameters:

Name Type Description Default
text str

Source text to convert. Empty strings return empty strings.

required
target_script str | SupportedScripts

Desired output script. Must belong to SupportedScripts.

required
source_script str | SupportedScripts | None

Optional source script hint. Defaults to None, which causes ICU to auto-detect the source script.

None

Returns:

Name Type Description
str str

Transliterated text. Characters without matches remain unchanged.

Raises:

Type Description
ValueError

If provided scripts are unsupported.

RuntimeError

If ICU transliterator creation fails.
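Two documented behaviors, "empty strings return empty strings" and "characters without matches remain unchanged", can be illustrated with a toy mapping. The sketch below is not ICU and not gllm_intl: toy_transliterate and its seven-letter table are hypothetical and cover only the letters in the example phrase.

```python
# Toy table covering only the letters of "Привет мир"; real transliteration
# rules are far larger and context-sensitive.
TOY_CYRILLIC_TO_LATIN = str.maketrans(
    {
        "П": "P",
        "р": "r",
        "и": "i",
        "в": "v",
        "е": "e",
        "т": "t",
        "м": "m",
    }
)


def toy_transliterate(text: str) -> str:
    """Map known characters and leave every other character untouched."""
    if not text:
        return ""
    return text.translate(TOY_CYRILLIC_TO_LATIN)


greeting = toy_transliterate("Привет мир")    # "Privet mir"
mixed = toy_transliterate("Привет, 世界")      # "Privet, 世界"
```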