Normalization
Unicode text normalization utilities for internationalization workflows.
This module defines the public APIs for Unicode text normalization and diacritic
stripping. All functions will be implemented using Python's unicodedata standard
library module to ensure compliance with the Unicode Standard.
References
[1] https://unicode.org/reports/tr15/
NormalizationForm
Bases: StrEnum
Unicode normalization forms (see [1]).
Attributes:
| Name | Type | Description |
|---|---|---|
| NFD | `str` | Canonical Decomposition. |
| NFC | `str` | Canonical Composition. |
| NFKD | `str` | Compatibility Decomposition. |
| NFKC | `str` | Compatibility Composition. |
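To make the four members concrete, the sketch below passes each one to `unicodedata.normalize` and prints the resulting code points. The class body is an assumption reconstructed from the attribute table: the documented class derives from `StrEnum`, while a `str`-mixin `Enum` is used here so the sketch also runs on Python versions before 3.11. `code_points` is a hypothetical helper, not part of the module.

```python
import unicodedata
from enum import Enum


class NormalizationForm(str, Enum):
    """Stand-in for the documented enum (member values are assumed)."""

    NFD = "NFD"
    NFC = "NFC"
    NFKD = "NFKD"
    NFKC = "NFKC"


def code_points(text: str) -> list[str]:
    """Hypothetical helper: render a string as a list of U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


# "café" written with the precomposed é (U+00E9).
sample = "caf\u00e9"

# Each member's value is a form name accepted by unicodedata.normalize.
results = {
    form.name: code_points(unicodedata.normalize(form.value, sample))
    for form in NormalizationForm
}

for name, points in results.items():
    print(name, points)
```

The decomposed forms end in a separate combining accent (U+0301), while the composed forms keep the single precomposed code point.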
coerce(form)
classmethod
Coerce a normalization form input into a NormalizationForm enum value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| form | `NormalizationForm \| str` | Candidate normalization form to coerce. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| NormalizationForm | `NormalizationForm` | Enum value representing the normalization form. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
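A minimal sketch of how such a coercion could behave, assuming that enum members pass through unchanged and that strings are looked up by exact value. The class body and the exact-match rule are assumptions for illustration; the real method may accept other spellings.

```python
from enum import Enum


class NormalizationForm(str, Enum):
    """Stand-in for the documented enum (member values are assumed)."""

    NFD = "NFD"
    NFC = "NFC"
    NFKD = "NFKD"
    NFKC = "NFKC"

    @classmethod
    def coerce(cls, form):
        """Return the matching member, raising ValueError for unknown input."""
        if isinstance(form, cls):
            # Already an enum member: nothing to convert.
            return form
        # Enum lookup by value raises ValueError when no member matches.
        return cls(form)


from_string = NormalizationForm.coerce("NFKC")
from_member = NormalizationForm.coerce(NormalizationForm.NFD)
```

Accepting both members and strings lets callers pass either `NormalizationForm.NFC` or `"NFC"` to the functions documented further down.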
validate(form)
classmethod
Validate normalization form strings against NormalizationForm enum.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| form | `str` | Normalization form candidate to validate. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
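One way a validation of this kind could be written is a membership check against the enum's values that returns nothing on success. This is a sketch under assumptions: the class body, the exact-match comparison, and the error message wording are all illustrative.

```python
from enum import Enum


class NormalizationForm(str, Enum):
    """Stand-in for the documented enum (member values are assumed)."""

    NFD = "NFD"
    NFC = "NFC"
    NFKD = "NFKD"
    NFKC = "NFKC"

    @classmethod
    def validate(cls, form: str) -> None:
        """Raise ValueError unless `form` names one of the enum's values."""
        valid = [member.value for member in cls]
        if form not in valid:
            # The message text is an assumption, not the module's wording.
            raise ValueError(
                f"Invalid normalization form {form!r}; expected one of {valid}"
            )


# Returns None for a known form; raises ValueError otherwise.
outcome = NormalizationForm.validate("NFC")
```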
normalize_and_strip(text, form=NormalizationForm.NFC)
Normalize text and remove diacritics in one operation.
Examples:
normalize_and_strip("café") # "cafe"
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | `str \| Sequence[str \| None] \| None` | A single string, sequence of strings, or `None`. | required |
| form | `NormalizationForm \| str` | Unicode normalization form applied before stripping. Defaults to NFC. | NFC |
Returns:
| Type | Description |
|---|---|
| `str \| list[str]` | The normalized text with diacritics removed, preserving the input type. |
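The combined operation can be sketched for the single-string case as follows. This is an illustration built on `unicodedata`, not the module's implementation: it assumes that stripping works by decomposing, dropping combining characters, and re-applying the requested form. Sequence and `None` inputs are left out because their handling is not described above.

```python
import unicodedata


def normalize_and_strip(text: str, form: str = "NFC") -> str:
    """Sketch: normalize `text`, then drop combining marks (strings only)."""
    normalized = unicodedata.normalize(form, text)
    # Decompose so that accents become separate combining characters.
    decomposed = unicodedata.normalize("NFD", normalized)
    # unicodedata.combining returns 0 for non-combining characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Re-apply the requested form so the result is consistently normalized.
    return unicodedata.normalize(form, stripped)


plain = normalize_and_strip("caf\u00e9")
# Input uses a decomposed Ä (A + U+0308) and the "ffi" ligature U+FB03.
expanded = normalize_and_strip("A\u0308\ufb03n", form="NFKC")
```

With a compatibility form such as NFKC, the ligature is also expanded to plain letters before the accent is removed.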
normalize_text(text, form=NormalizationForm.NFC)
Normalize Unicode text to a canonical or compatibility form.
Normalization Forms
Unicode defines four forms: NFD (canonical decomposition), NFC (canonical composition), NFKD (compatibility decomposition), and NFKC (compatibility composition).
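- "café" written with the precomposed é (U+00E9) becomes "cafe" followed by the combining acute accent U+0301 in NFD, and stays "café" (U+00E9) in NFC. The two results render identically but differ in code points.
- "Äffin" becomes "A" followed by the combining diaeresis U+0308 and "ffin" in NFKD, and becomes "Äffin" with the precomposed Ä (U+00C4) in NFKC. The compatibility forms additionally replace compatibility characters, such as the ligature "ﬃ" (U+FB03), with their plain equivalents ("ffi").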
Examples:
normalize_text("café", form=NormalizationForm.NFD) # "cafe\u0301"
normalize_text("café", form=NormalizationForm.NFC) # "caf\u00e9"
normalize_text("Äffin", form=NormalizationForm.NFKD) # "A\u0308ffin"
normalize_text("Äffin", form=NormalizationForm.NFKC) # "\u00c4ffin"
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | `str \| Sequence[str \| None] \| None` | A single string, sequence of strings, or `None`. | required |
| form | `NormalizationForm \| str` | Unicode normalization form. Defaults to NFC. | NFC |
Returns:
| Type | Description |
|---|---|
| `str \| list[str]` | The normalized text, preserving the input type. |
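The "preserving the input type" behavior can be sketched as a dispatch on whether the input is a single string or a sequence. This is a minimal illustration, assuming plain `unicodedata.normalize` calls underneath; `None` inputs and `None` elements are deliberately not handled because their treatment is not described above.

```python
import unicodedata
from collections.abc import Sequence
from typing import Union


def normalize_text(
    text: Union[str, Sequence[str]], form: str = "NFC"
) -> Union[str, list]:
    """Sketch: a string in gives a string out; a sequence gives a list."""
    if isinstance(text, str):
        return unicodedata.normalize(form, text)
    # Any other sequence is normalized element by element.
    return [unicodedata.normalize(form, item) for item in text]


single = normalize_text("caf\u00e9", form="NFD")
batch = normalize_text(["caf\u00e9", "A\u0308\ufb03n"], form="NFKC")
```

The `str` check comes first because a string is itself a sequence of characters and would otherwise be split into one-character items.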
remove_diacritics(text)
Remove combining diacritical marks from Unicode text.
Examples:
remove_diacritics("café") # "cafe"
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | `str \| Sequence[str \| None] \| None` | A single string, sequence of strings, or `None`. | required |
Returns:
| Type | Description |
|---|---|
| `str \| list[str]` | The text with diacritics removed, preserving the input type. |
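A common way to remove combining marks with the standard library is to decompose the text and drop every character in the Unicode category `Mn` (nonspacing mark). The sketch below follows that approach as an assumption about how such a function might work. It handles strings and sequences of strings only, and it returns the stripped text without recomposing it.

```python
import unicodedata
from collections.abc import Sequence
from typing import Union


def _strip_marks(value: str) -> str:
    """Hypothetical helper: decompose, then drop nonspacing marks (Mn)."""
    decomposed = unicodedata.normalize("NFD", value)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def remove_diacritics(text: Union[str, Sequence[str]]) -> Union[str, list]:
    """Sketch: a string in gives a string out; a sequence gives a list."""
    if isinstance(text, str):
        return _strip_marks(text)
    return [_strip_marks(item) for item in text]


single = remove_diacritics("caf\u00e9")
batch = remove_diacritics(["na\u00efve", "\u00c4ffin"])
```

Characters that carry no canonical decomposition, such as "ø" (U+00F8), pass through unchanged under this approach because no separate combining mark exists to remove.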