PII Detector
A package to handle the detection of PII in text.
TextAnalyzer(additional_recognizers=None, nlp_configuration=None)
Bases: BaseTextAnalyzer
TextAnalyzer class to analyze the text and extract the PII entities.
Implemented using the Presidio PII library.
Attributes:
| Name | Type | Description |
|---|---|---|
| `analyzer` | `AnalyzerEngine` | The analyzer engine to analyze the text and extract the PII entities. |
Initialize the TextAnalyzer class and add the custom recognizers.
The predefined recognizers are BankAccountNumberRecognizer, BPJSNumberRecognizer, IndonesianPhoneNumberRecognizer, GDPLabsEmployeeIdRecognizer, FacebookAccountRecognizer, FamilyCardNumberRecognizer, KTPRecognizer, LinkedinAccountRecognizer, MoneyRecognizer, NPWPRecognizer, OrganizationNameRecognizer, and ProjectNameRecognizer. Presidio's EmailRecognizer and PhoneRecognizer are also added for the Indonesian language.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `additional_recognizers` | `list[EntityRecognizer] \| None` | The list of additional recognizers to be added. Default is None. | `None` |
| `nlp_configuration` | `dict[str, Any] \| None` | The configuration for the NLP engine; see https://microsoft.github.io/presidio/analyzer/customizing_nlp_models/. Default is None. | `None` |
analyze(text, language='id', entities=None, score_threshold=0.46, allow_list=None, allow_list_match='exact', regex_flags=re.DOTALL | re.MULTILINE | re.IGNORECASE, **kwargs)
Analyze the text and extract the PII entities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The text to be analyzed. | required |
| `language` | `str` | The language of the text to be analyzed. Default is "id". | `'id'` |
| `entities` | `list[str] \| None` | The list of entities to be extracted. If you need to extract other entities, add the recognizer in the constructor. Default is None. | `None` |
| `score_threshold` | `float \| None` | The minimum score threshold for the extracted entities. Default is 0.46. | `0.46` |
| `allow_list` | `list \| None` | List of words that the user defines as being allowed to keep in the text. Default is None. | `None` |
| `allow_list_match` | `str \| None` | The matching strategy for the allow list. Default is "exact". | `'exact'` |
| `regex_flags` | `int \| None` | The regex flags for the text analysis. | `re.DOTALL \| re.MULTILINE \| re.IGNORECASE` |
| `**kwargs` | `Any` | Additional keyword arguments that may be needed for the text analysis process. | `{}` |
Returns:
| Type | Description |
|---|---|
| `list[RecognizerResult]` | The list of extracted entities. |
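To make the roles of `score_threshold` and `allow_list` concrete, here is a minimal sketch of the kind of post-filtering they describe. It does not use this package or Presidio: `Detection` and `filter_detections` are hypothetical stand-ins, and the rule "keep scores at or above the threshold, drop exact allow-list matches" is an assumption about the behaviour, not a copy of the implementation.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for a recognizer result (not Presidio's class)."""
    entity_type: str
    start: int
    end: int
    score: float


def filter_detections(text, detections, score_threshold=0.46, allow_list=None):
    """Keep detections at or above the threshold whose text is not allow-listed."""
    allowed = set(allow_list or [])
    kept = []
    for d in detections:
        if d.score < score_threshold:
            continue  # below the minimum confidence
        if text[d.start:d.end] in allowed:
            continue  # exact match against the allow list
        kept.append(d)
    return kept


text = "Email budi@example.com or admin@example.com"
detections = [
    Detection("EMAIL_ADDRESS", 6, 22, 1.0),
    Detection("EMAIL_ADDRESS", 26, 43, 1.0),
    Detection("PERSON", 0, 5, 0.3),  # low score, filtered by the threshold
]
kept = filter_detections(text, detections, allow_list=["admin@example.com"])
```

In this sketch only the first email address survives: the second is allow-listed and the low-scoring `PERSON` span falls under the 0.46 threshold.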
TextAnonymizer(text_analyzer, operators=None, add_default_faker_operators=False, faker_seed=42, anonymizer_engine=None, deanonymizer_mapping=None, skip_format_duplicates=False)
Bases: BaseTextAnonymizer
TextAnonymizer class to anonymize the text based on the analyzer results.
Implemented using the Presidio PII library [1].
Attributes:
| Name | Type | Description |
|---|---|---|
| `anonymizer` | `AnonymizerEngine` | The anonymizer engine used in the anonymization process. |
| `operators` | `dict[str, OperatorConfig]` | The operators used in the anonymization process. |
| `anonymizer_mapping` | `MappingDataType` | The anonymizer mapping. |
| `deanonymizer_mapping` | `MappingDataType` | The deanonymizer mapping. |
Initialize the TextAnonymizer class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text_analyzer` | `BaseTextAnalyzer` | The analyzer engine used in the anonymization process. | required |
| `operators` | `dict[str, OperatorConfig] \| None` | The operators used in the anonymization process; see https://microsoft.github.io/presidio/anonymizer/#built-in-operators. | `None` |
| `add_default_faker_operators` | `bool` | Whether to add default faker operators. Defaults to False. | `False` |
| `faker_seed` | `int \| None` | The seed used for faker. Defaults to 42. | `42` |
| `anonymizer_engine` | `AnonymizerEngine` | The anonymizer engine used in the anonymization process. Defaults to None. | `None` |
| `deanonymizer_mapping` | `DeanonymizerMapping` | The deanonymizer mapping used to record mapping between anonymized and original text. Defaults to None. | `None` |
| `skip_format_duplicates` | `bool` | Whether to skip formatting duplicated operators. Defaults to False. | `False` |
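The `faker_seed` parameter suggests that generated replacement values are meant to be reproducible. The sketch below illustrates that idea with the standard library's `random.Random` as a stand-in for Faker; `fake_phone` is a hypothetical helper, and nothing here reflects how the package actually wires the seed into its operators.

```python
import random


def fake_phone(rng):
    """Hypothetical generator for a replacement phone number."""
    return "08" + "".join(str(rng.randint(0, 9)) for _ in range(10))


# Two generators created with the same seed produce the same fake value,
# so repeated runs would replace a given entity consistently.
first = fake_phone(random.Random(42))
second = fake_phone(random.Random(42))
```

Under this assumption, a fixed seed makes anonymized output stable across runs, which helps when comparing results or writing tests.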
anonymizer_mapping: MappingDataType
property
Return the anonymizer mapping.
Returns:
| Name | Type | Description |
|---|---|---|
| `MappingDataType` | `MappingDataType` | The anonymizer mapping. |
deanonymizer_mapping: MappingDataType
property
Return the deanonymizer mapping.
Returns:
| Name | Type | Description |
|---|---|---|
| `MappingDataType` | `MappingDataType` | The deanonymizer mapping. |
add_operators(operators)
Add operators to the anonymizer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `operators` | `dict[str, OperatorConfig]` | Operators to add to the anonymizer. | required |
anonymize(text, entities=None, language='id', conflict_resolution=None, analyzer_score_threshold=0.46, allow_list=None, allow_list_match='exact', regex_flags=re.DOTALL | re.MULTILINE | re.IGNORECASE, **kwargs)
Anonymize the text based on the analyzer results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The text to be anonymized. | required |
| `entities` | `list[str] \| None` | The list of entities to be anonymized. Defaults to None. | `None` |
| `language` | `str` | The language of the text to be analyzed. Defaults to "id". | `'id'` |
| `conflict_resolution` | `ConflictResolutionStrategy \| None` | The conflict resolution strategy. Defaults to None. | `None` |
| `analyzer_score_threshold` | `float` | The score threshold for the analyzer. Defaults to 0.46. | `0.46` |
| `allow_list` | `list \| None` | List of words that the user defines as being allowed to keep in the text. Defaults to None. | `None` |
| `allow_list_match` | `str \| None` | The matching strategy for the allow list. Defaults to "exact". | `'exact'` |
| `regex_flags` | `int \| None` | The regex flags for the text analysis. | `re.DOTALL \| re.MULTILINE \| re.IGNORECASE` |
| `**kwargs` | `Any` | Additional keyword arguments that may be needed for the text analysis process. | `{}` |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | The anonymized text. |
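The default `regex_flags` value combines three standard `re` flags. The following standalone snippet shows what each flag changes, using only Python's `re` module; the sample text and patterns are made up for illustration and are not the patterns used by the package's recognizers.

```python
import re

flags = re.DOTALL | re.MULTILINE | re.IGNORECASE
text = "npwp: 12\nNPWP: 34"

# IGNORECASE lets "npwp" match "NPWP"; MULTILINE lets ^ match at each line start.
labels = re.findall(r"^npwp", text, flags)

# DOTALL lets "." match the newline, so a single match can span two lines.
spanning = re.search(r"12.npwp", text, flags)

# Without DOTALL, "." stops at the newline and the same pattern finds nothing.
no_dotall = re.search(r"12.npwp", text, re.IGNORECASE)
```

Passing a narrower flag set is therefore a way to make pattern-based recognizers stricter, for example case-sensitive or limited to single lines.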
deanonymize(text)
Deanonymize the text based on the anonymizer mapping.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The text to be deanonymized. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | The deanonymized text or the original text if no mapping is found or the mapping is empty. |
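As a rough illustration of the behaviour described above, the sketch below restores original values from a mapping and returns the input unchanged when the mapping is empty. The nested layout `{entity_type: {anonymized_value: original_value}}` is an assumption about `MappingDataType`, and this standalone `deanonymize` function and the `<EMAIL_ADDRESS_1>` placeholder are hypothetical, not the package's method or output format.

```python
def deanonymize(text, mapping):
    """Replace each anonymized value with its original value.

    Returns the text unchanged when the mapping is empty.
    """
    if not mapping:
        return text
    for values in mapping.values():
        for anonymized, original in values.items():
            text = text.replace(anonymized, original)
    return text


# Assumed mapping shape: entity type -> {anonymized value: original value}.
mapping = {"EMAIL_ADDRESS": {"<EMAIL_ADDRESS_1>": "budi@example.com"}}

restored = deanonymize("Contact <EMAIL_ADDRESS_1> today.", mapping)
unchanged = deanonymize("Nothing to restore.", {})
```

Plain string replacement is the simplest strategy that satisfies the description; a real implementation may match more carefully, for instance to tolerate small changes to the anonymized text.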
remove_all_operators()
Remove all operators from the anonymizer.
reset_deanonymizer_mapping()
Reset the deanonymizer mapping.