Normalization

Unicode text normalization utilities for internationalization workflows.

This module defines the public APIs for Unicode text normalization and diacritic stripping. All functions will be implemented using Python's unicodedata standard library module to ensure compliance with the Unicode Standard.

Authors

Dimitrij Ray (dimitrij.ray@gdplabs.id)

References

[1] https://unicode.org/reports/tr15/

NormalizationForm

Bases: StrEnum

Unicode normalization forms (see [1]).

Attributes:

Name Type Description
NFD str

Canonical Decomposition.

NFC str

Canonical Composition.

NFKD str

Compatibility Decomposition.

NFKC str

Compatibility Composition.

coerce(form) classmethod

Coerce a normalization form input into a NormalizationForm enum value.

Parameters:

Name Type Description Default
form NormalizationForm | str

Candidate normalization form to coerce.

required

Returns:

Name Type Description
NormalizationForm NormalizationForm

Enum value representing the normalization form.

Raises:

Type Description
ValueError

If form is not a supported normalization form.
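
The coercion contract described above can be sketched as follows. This is a minimal illustration, not the module's implementation: `FormSketch` is a hypothetical stand-in for `NormalizationForm`, it uses `str, Enum` instead of `StrEnum` so that it runs on older Python versions, and it assumes case-sensitive matching, which the documentation does not specify.

```python
from enum import Enum


class FormSketch(str, Enum):
    """Hypothetical stand-in for NormalizationForm (not the module's code)."""

    NFD = "NFD"
    NFC = "NFC"
    NFKD = "NFKD"
    NFKC = "NFKC"

    @classmethod
    def coerce(cls, form):
        # Enum members pass through unchanged.
        if isinstance(form, cls):
            return form
        try:
            # Lookup by value: FormSketch("NFC") returns FormSketch.NFC.
            return cls(form)
        except ValueError:
            supported = ", ".join(member.value for member in cls)
            raise ValueError(
                f"Unsupported normalization form {form!r}; expected one of: {supported}"
            ) from None


print(FormSketch.coerce("NFKC") is FormSketch.NFKC)  # True
print(FormSketch.coerce(FormSketch.NFD) is FormSketch.NFD)  # True
```

Accepting both enum members and plain strings lets callers pass either `NormalizationForm.NFC` or `"NFC"`, which matches the `NormalizationForm | str` parameter type used throughout this page.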

validate(form) classmethod

Validate a normalization form string against the NormalizationForm enum.

Parameters:

Name Type Description Default
form str

Normalization form candidate to validate.

required

Raises:

Type Description
ValueError

If form is not one of the supported normalization forms.
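
Unlike coerce, validate returns nothing and only signals failure by raising. A minimal sketch of that check-then-use pattern, assuming a hypothetical standalone function `validate_form` and a hypothetical helper `normalize_checked` (neither is part of this module):

```python
import unicodedata

# The four form names accepted by unicodedata.normalize.
SUPPORTED_FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def validate_form(form: str) -> None:
    """Hypothetical stand-in for NormalizationForm.validate: raise or return None."""
    if form not in SUPPORTED_FORMS:
        raise ValueError(
            f"Unsupported normalization form {form!r}; expected one of {SUPPORTED_FORMS}"
        )


def normalize_checked(text: str, form: str) -> str:
    # Validate first so that a bad form fails with a descriptive message.
    validate_form(form)
    return unicodedata.normalize(form, text)


print(normalize_checked("cafe\u0301", "NFC") == "caf\u00e9")  # True
```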

normalize_and_strip(text, form=NormalizationForm.NFC)

Normalize text and remove diacritics in one operation.

Examples:

normalize_and_strip("café")  # "cafe"

Parameters:

Name Type Description Default
text str | Sequence[str | None] | None

A single string, sequence of strings, or None to process.

required
form NormalizationForm | str

Unicode normalization form applied before stripping. Defaults to NormalizationForm.NFC.

NFC

Returns:

Type Description
str | list[str]

The normalized text with diacritics removed, preserving the input type.
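
The choice of form matters because only the compatibility forms expand characters such as ligatures before the marks are stripped. The sketch below illustrates this for a single string. `normalize_and_strip_sketch` is a hypothetical name, and the step order (normalize, decompose, drop nonspacing marks) is an assumption based on the parameter description above, not the module's confirmed implementation.

```python
import unicodedata


def normalize_and_strip_sketch(text: str, form: str = "NFC") -> str:
    """Hypothetical single-string sketch of normalize-then-strip."""
    # Step 1: apply the requested normalization form.
    normalized = unicodedata.normalize(form, text)
    # Step 2: decompose canonically so each accent becomes a separate mark.
    decomposed = unicodedata.normalize("NFD", normalized)
    # Step 3: drop nonspacing marks (general category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


word = "\ufb01anc\u00e9"  # "fi" ligature (U+FB01) + "anc" + precomposed "é"

# NFC keeps the ligature; only the accent is removed.
print(normalize_and_strip_sketch(word, "NFC") == "\ufb01ance")  # True
# NFKC expands the ligature to "fi" before the accent is removed.
print(normalize_and_strip_sketch(word, "NFKC") == "fiance")  # True
```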

normalize_text(text, form=NormalizationForm.NFC)

Normalize Unicode text to a canonical or compatibility form.

Normalization Forms

Unicode defines four forms: NFD (canonical decomposition), NFC (canonical composition), NFKD (compatibility decomposition), and NFKC (compatibility composition).

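  1. "café" becomes "café" in NFD, where "é" is stored as "e" followed by the combining acute accent (U+0301), and stays "café" in NFC, where "é" is the single code point U+00E9. Both results render identically.
  2. "Äffin" becomes "Äffin" in NFKD, with "Ä" decomposed into "A" plus the combining diaeresis (U+0308), and becomes "Äffin" in NFKC, with "Ä" recomposed. The compatibility forms also replace compatibility characters, such as the ligature "ﬃ" (U+FB03), with their plain equivalents ("ffi").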

Examples:

normalize_text("café", form=NormalizationForm.NFD)  # "café"
normalize_text("café", form=NormalizationForm.NFC)  # "café"
normalize_text("Äffin", form=NormalizationForm.NFKD)  # "Äffin"
normalize_text("Äffin", form=NormalizationForm.NFKC)  # "Äffin"
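
Because the results above render identically, the differences are easier to see at the code point level. The following sketch calls `unicodedata.normalize` directly rather than this module's functions. The input strings are written with escapes so that their exact code points are unambiguous, and the second input is assumed to contain the "ﬃ" ligature (U+FB03).

```python
import unicodedata

cafe = "caf\u00e9"  # "é" as the single precomposed code point U+00E9
nfd = unicodedata.normalize("NFD", cafe)

print(len(cafe), len(nfd))  # 4 5
print([f"U+{ord(ch):04X}" for ch in nfd])
# ['U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301']
print(unicodedata.normalize("NFC", nfd) == cafe)  # True

word = "\u00c4\ufb03n"  # "Ä" + "ffi" ligature (U+FB03) + "n"
nfkd = unicodedata.normalize("NFKD", word)
nfkc = unicodedata.normalize("NFKC", word)

print(nfkd == "A\u0308ffin")  # True: "Ä" decomposed, ligature expanded
print(nfkc == "\u00c4ffin")  # True: "Ä" recomposed, ligature expanded
print(len(word), len(nfkd), len(nfkc))  # 3 6 5
```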

Parameters:

Name Type Description Default
text str | Sequence[str | None] | None

A single string, sequence of strings, or None to normalize.

required
form NormalizationForm | str

Unicode normalization form. Defaults to NormalizationForm.NFC.

NFC

Returns:

Type Description
str | list[str]

The normalized text, preserving the input type (str or list[str]). None inputs are converted to empty strings.
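
The input handling described above (a single string, a sequence of strings, or None) can be sketched as follows. `normalize_text_sketch` is a hypothetical stand-in. Returning an empty string for a top-level None, and a list for any non-string sequence, are assumptions drawn from the return description rather than confirmed behavior.

```python
import unicodedata


def normalize_text_sketch(text, form="NFC"):
    """Hypothetical sketch of the str / sequence / None handling."""
    # Assumption: a top-level None becomes an empty string.
    if text is None:
        return ""
    # A single string yields a single string.
    if isinstance(text, str):
        return unicodedata.normalize(form, text)
    # Any other sequence yields a list; None items become empty strings.
    return [
        unicodedata.normalize(form, item) if item is not None else ""
        for item in text
    ]


print(normalize_text_sketch(None) == "")  # True
print(normalize_text_sketch("cafe\u0301") == "caf\u00e9")  # True
print(normalize_text_sketch(["cafe\u0301", None]) == ["caf\u00e9", ""])  # True
```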

remove_diacritics(text)

Remove combining diacritical marks from Unicode text.

Examples:

remove_diacritics("café")  # "cafe"

Parameters:

Name Type Description Default
text str | Sequence[str | None] | None

A single string, sequence of strings, or None from which to remove diacritics.

required

Returns:

Type Description
str | list[str]

The text with diacritics removed, preserving the input type. None inputs are converted to empty strings.
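
A common way to remove combining marks with `unicodedata` is to decompose the text and filter out nonspacing marks. The sketch below shows that technique for a single string. `strip_marks` is a hypothetical name, and whether this module recomposes the result afterward is not stated in the documentation.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Hypothetical sketch: decompose, then drop nonspacing marks."""
    # NFD splits precomposed letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # General category "Mn" (Mark, nonspacing) covers combining diacritics.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_marks("caf\u00e9"))  # cafe
print(strip_marks("\u00c5ngstr\u00f6m"))  # Angstrom
# Letters with no canonical decomposition, such as "ø" (U+00F8), are unchanged.
print(strip_marks("\u00f8") == "\u00f8")  # True
```

This approach only affects characters that decompose into a base letter plus combining marks, so letters like "ø" or "ł" need a separate mapping if they should be converted to ASCII.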