
PII Detector

A package to handle the detection of PII in text.

TextAnalyzer(additional_recognizers=None, nlp_configuration=None)

Bases: BaseTextAnalyzer

TextAnalyzer class to analyze the text and extract the PII entities.

Implemented using the Presidio PII library.

Attributes:

Name Type Description
analyzer AnalyzerEngine

The analyzer engine to analyze the text and extract the PII entities.

Initialize the TextAnalyzer class and register the custom recognizers.

The predefined recognizers are BankAccountNumberRecognizer, BPJSNumberRecognizer, IndonesianPhoneNumberRecognizer, GDPLabsEmployeeIdRecognizer, FacebookAccountRecognizer, FamilyCardNumberRecognizer, KTPRecognizer, LinkedinAccountRecognizer, MoneyRecognizer, NPWPRecognizer, OrganizationNameRecognizer, and ProjectNameRecognizer. Presidio's EmailRecognizer and PhoneRecognizer are also registered for the Indonesian language.

Parameters:

Name Type Description Default
additional_recognizers list[EntityRecognizer] | None

The list of additional recognizers to be added. Default is None.

None
nlp_configuration dict[str, Any] | None

The configuration for the NLP engine. See https://microsoft.github.io/presidio/analyzer/customizing_nlp_models/ for the expected format. Default is None.

None
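To make the idea of an additional recognizer concrete, the following is a minimal standard-library sketch of what a pattern-based recognizer does: scan the text with a regular expression and report each hit as a labelled span with a confidence score. It deliberately does not use Presidio's classes. `Match`, `find_pattern`, the `EMPLOYEE_ID` label, the `EMP-` format, and the score of 0.6 are all hypothetical and exist only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """A detected span (hypothetical stand-in for a recognizer result)."""
    entity_type: str
    start: int
    end: int
    score: float


def find_pattern(text, entity_type, pattern, score):
    """Return one Match per regex hit, all carrying the same fixed score."""
    return [
        Match(entity_type, m.start(), m.end(), score)
        for m in re.finditer(pattern, text)
    ]


# Hypothetical ID format: "EMP-" followed by exactly six digits.
text = "Contact EMP-004217 or EMP-119003 for access."
matches = find_pattern(text, "EMPLOYEE_ID", r"\bEMP-\d{6}\b", score=0.6)
found = [text[m.start:m.end] for m in matches]
print(found)
```

A real recognizer passed through `additional_recognizers` would be an `EntityRecognizer` instance; the sketch only shows the kind of output such a recognizer contributes.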

analyze(text, language='id', entities=None, score_threshold=0.46, allow_list=None, allow_list_match='exact', regex_flags=re.DOTALL | re.MULTILINE | re.IGNORECASE, **kwargs)

Analyze the text and extract the PII entities.

Parameters:

Name Type Description Default
text str

The text to be analyzed.

required
language str

The language of the text to be analyzed. Default is "id".

'id'
entities list[str] | None

The list of entities to be extracted. If you need to extract other entities, add the recognizer in the constructor. Default is None.

None
score_threshold float | None

The minimum score threshold for the extracted entities. Default is 0.46.

0.46
allow_list list | None

List of words that the user defines as being allowed to keep in the text. Default is None.

None
allow_list_match str | None

The matching strategy for the allow list. Default is "exact".

'exact'
regex_flags int | None

The regex flags for the text analysis.

DOTALL | MULTILINE | IGNORECASE
**kwargs Any

Additional keyword arguments that may be needed for the text analysis process.

{}

Returns:

Type Description
list[RecognizerResult]

The list of extracted entities.
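The interaction between `score_threshold`, `allow_list`, `allow_list_match`, and `regex_flags` can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the library's implementation: `Detection`, `span`, and `filter_detections` are hypothetical names, detections scoring below the threshold are assumed to be dropped, and the non-exact mode is assumed to treat each allow-list entry as a regular expression.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical stand-in for a recognizer result."""
    entity_type: str
    start: int
    end: int
    score: float


def span(text, value, entity_type, score):
    """Build a Detection for the first occurrence of value in text."""
    start = text.index(value)
    return Detection(entity_type, start, start + len(value), score)


def filter_detections(text, detections, score_threshold=0.46,
                      allow_list=None, allow_list_match="exact",
                      regex_flags=re.DOTALL | re.MULTILINE | re.IGNORECASE):
    """Drop detections that score below the threshold or hit the allow list."""
    kept = []
    for d in detections:
        value = text[d.start:d.end]
        if d.score < score_threshold:
            continue
        if allow_list:
            if allow_list_match == "exact":
                allowed = value in allow_list
            else:  # assumption: every allow-list entry is a regex
                allowed = any(re.search(p, value, regex_flags) for p in allow_list)
            if allowed:
                continue
        kept.append(d)
    return kept


text = "Write to budi@example.com or help@example.com, ref 12345."
detections = [
    span(text, "budi@example.com", "EMAIL_ADDRESS", 1.0),
    span(text, "help@example.com", "EMAIL_ADDRESS", 1.0),
    span(text, "12345", "BANK_ACCOUNT", 0.3),  # below the 0.46 default
]

exact = filter_detections(text, detections, allow_list=["help@example.com"])
by_regex = filter_detections(text, detections, allow_list=[r"^HELP@"],
                             allow_list_match="regex")
exact_values = [text[d.start:d.end] for d in exact]
regex_values = [text[d.start:d.end] for d in by_regex]
```

In both calls only `budi@example.com` survives: the low-scoring number is removed by the threshold, and the support address is removed by the allow list. The regex variant matches despite the uppercase pattern because `re.IGNORECASE` is part of the default flags.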

TextAnonymizer(text_analyzer, operators=None, add_default_faker_operators=False, faker_seed=42, anonymizer_engine=None, deanonymizer_mapping=None, skip_format_duplicates=False)

Bases: BaseTextAnonymizer

TextAnonymizer class to anonymize the text based on the analyzer results.

Implemented using the Presidio PII library [1].

Attributes:

Name Type Description
anonymizer AnonymizerEngine

The anonymizer engine used in the anonymization process.

operators dict[str, OperatorConfig]

The operators used in the anonymization process.

anonymizer_mapping MappingDataType

The anonymizer mapping.

deanonymizer_mapping MappingDataType

The deanonymizer mapping.

Initialize the TextAnonymizer class.

Parameters:

Name Type Description Default
text_analyzer BaseTextAnalyzer

The analyzer engine used in the anonymization process.

required
operators dict[str, OperatorConfig] | None

The operators used in the anonymization process. See https://microsoft.github.io/presidio/anonymizer/#built-in-operators for the built-in operators. Defaults to None.

None
add_default_faker_operators bool

Whether to add default faker operators. Defaults to False.

False
faker_seed int | None

The seed used for faker. Defaults to 42.

42
anonymizer_engine AnonymizerEngine

The anonymizer engine used in the anonymization process. Defaults to None.

None
deanonymizer_mapping DeanonymizerMapping

The deanonymizer mapping used to record mapping between anonymized and original text. Defaults to None.

None
skip_format_duplicates bool

Whether to skip formatting duplicated operators. Defaults to False.

False
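The role of `operators` and `faker_seed` can be illustrated without any dependency: operators map an entity type to a replacement strategy, and a fixed seed makes generated replacements reproducible across runs. The sketch below uses `random.Random` in place of Faker, and `make_fake_phone` and `build_operators` are hypothetical helpers rather than part of this package. The `DEFAULT` key mirrors the fallback key Presidio uses in its operator dictionaries.

```python
import random
import string


def make_fake_phone(rng):
    """Build a fake 12-digit phone-like string from a seeded generator."""
    return "08" + "".join(rng.choice(string.digits) for _ in range(10))


def build_operators(seed):
    """Map entity types to zero-argument replacement callables."""
    rng = random.Random(seed)
    return {
        "PHONE_NUMBER": lambda: make_fake_phone(rng),
        "DEFAULT": lambda: "<REDACTED>",  # fallback for unlisted entity types
    }


# Two independently built operator sets with the same seed agree.
first = build_operators(42)["PHONE_NUMBER"]()
second = build_operators(42)["PHONE_NUMBER"]()
fallback = build_operators(42)["DEFAULT"]()
```

Because both generators start from the same seed, `first` and `second` are identical, which is the property a fixed `faker_seed` is meant to provide.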

anonymizer_mapping: MappingDataType property

Return the anonymizer mapping.

Returns:

Name Type Description
MappingDataType MappingDataType

The anonymizer mapping.

deanonymizer_mapping: MappingDataType property

Return the deanonymizer mapping.

Returns:

Name Type Description
MappingDataType MappingDataType

The deanonymizer mapping.

add_operators(operators)

Add operators to the anonymizer.

Parameters:

Name Type Description Default
operators dict[str, OperatorConfig]

Operators to add to the anonymizer.

required

anonymize(text, entities=None, language='id', conflict_resolution=None, analyzer_score_threshold=0.46, allow_list=None, allow_list_match='exact', regex_flags=re.DOTALL | re.MULTILINE | re.IGNORECASE, **kwargs)

Anonymize the text based on the analyzer results.

Parameters:

Name Type Description Default
text str

The text to be anonymized.

required
entities list[str] | None

The list of entities to be anonymized. Defaults to None.

None
language str

The language of the text to be analyzed. Defaults to "id".

'id'
conflict_resolution ConflictResolutionStrategy | None

The conflict resolution strategy. Defaults to None.

None
analyzer_score_threshold float

The score threshold for the analyzer. Defaults to 0.46.

0.46
allow_list list | None

List of words that the user defines as being allowed to keep in the text. Defaults to None.

None
allow_list_match str | None

The matching strategy for the allow list. Defaults to "exact".

'exact'
regex_flags int | None

The regex flags for the text analysis.

DOTALL | MULTILINE | IGNORECASE
**kwargs Any

Additional keyword arguments that may be needed for the text analysis process.

{}

Returns:

Name Type Description
str str

The anonymized text.
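Conceptually, anonymization takes the spans found by the analyzer and substitutes each one using the operator registered for its entity type. A minimal sketch of that substitution step follows, assuming non-overlapping spans (overlaps are what a conflict resolution strategy would settle beforehand). `replace_spans` and the operator callables are hypothetical and do not reflect the package's internals.

```python
def replace_spans(text, spans, operators):
    """Replace (start, end, entity_type) spans, working right to left.

    Going right to left keeps the offsets of the remaining spans valid
    even when a replacement changes the length of the text.
    """
    result = text
    for start, end, entity_type in sorted(spans, reverse=True):
        replace = operators.get(entity_type, operators["DEFAULT"])
        result = result[:start] + replace(text[start:end]) + result[end:]
    return result


text = "Budi lives in Jakarta."
spans = [(0, 4, "PERSON"), (14, 21, "LOCATION")]
operators = {
    "PERSON": lambda value: "<PERSON>",
    "DEFAULT": lambda value: "*" * len(value),  # mask anything unlisted
}
anonymized = replace_spans(text, spans, operators)
print(anonymized)  # <PERSON> lives in *******.
```

`LOCATION` has no dedicated operator here, so it falls back to the `DEFAULT` masking callable.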

deanonymize(text)

Deanonymize the text based on the anonymizer mapping.

Parameters:

Name Type Description Default
text str

The text to be deanonymized.

required

Returns:

Name Type Description
str str

The deanonymized text or the original text if no mapping is found or the mapping is empty.
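The fallback behaviour described above (return the input unchanged when there is nothing to map) can be sketched as follows. The mapping shape used here, entity type to `{anonymized value: original value}`, is an assumption made for illustration; the actual `MappingDataType` is not specified on this page, and `restore_text` is a hypothetical helper.

```python
def restore_text(text, mapping):
    """Swap anonymized values back; return text as-is for an empty mapping."""
    if not mapping:
        return text
    for replacements in mapping.values():
        for anonymized_value, original_value in replacements.items():
            text = text.replace(anonymized_value, original_value)
    return text


# Assumed shape: {entity_type: {anonymized_value: original_value}}
mapping = {
    "PERSON": {"<PERSON_1>": "Budi"},
    "LOCATION": {"<LOCATION_1>": "Jakarta"},
}
restored = restore_text("<PERSON_1> lives in <LOCATION_1>.", mapping)
untouched = restore_text("Nothing to restore here.", {})
```

Plain string replacement works in this sketch because the placeholders are unique; replacements that can collide with ordinary words would need a more careful matching strategy.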

remove_all_operators()

Remove all operators from the anonymizer.

reset_deanonymizer_mapping()

Reset the deanonymizer mapping.