PII Detector

A package to handle the detection of PII in text.

`TextAnalyzer(additional_recognizers=None, nlp_configuration=None)`

Bases: BaseTextAnalyzer

TextAnalyzer class to analyze the text and extract the PII entities.

Implemented using the Presidio PII library.

Attributes:

Name	Type	Description
`analyzer`	`AnalyzerEngine`	The analyzer engine to analyze the text and extract the PII entities.

Initialize the TextAnalyzer class and add custom recognizers.

The predefined recognizers are BankAccountNumberRecognizer, BPJSNumberRecognizer, IndonesianPhoneNumberRecognizer, GDPLabsEmployeeIdRecognizer, FacebookAccountRecognizer, FamilyCardNumberRecognizer, KTPRecognizer, LinkedinAccountRecognizer, MoneyRecognizer, NPWPRecognizer, OrganizationNameRecognizer, and ProjectNameRecognizer. Presidio's EmailRecognizer and PhoneRecognizer are also added for the Indonesian language.

Parameters:

Name	Type	Description	Default
`additional_recognizers`	`list[EntityRecognizer] \| None`	The list of additional recognizers to be added. Default is None.	`None`
`nlp_configuration`	`dict[str, Any] \| None`	The configuration for the NLP engine. See https://microsoft.github.io/presidio/analyzer/customizing_nlp_models/ for details. Default is None.	`None`

`analyze(text, language='id', entities=None, score_threshold=0.46, allow_list=None, allow_list_match='exact', regex_flags=re.DOTALL | re.MULTILINE | re.IGNORECASE, **kwargs)`

Analyze the text and extract the PII entities.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to be analyzed.	required
`language`	`str`	The language of the text to be analyzed. Default is "id".	`'id'`
`entities`	`list[str] \| None`	The list of entities to be extracted. If you need to extract other entities, add the recognizer in the constructor. Default is None.	`None`
`score_threshold`	`float \| None`	The minimum score threshold for the extracted entities. Default is 0.46.	`0.46`
`allow_list`	`list \| None`	List of words that the user defines as being allowed to keep in the text. Default is None.	`None`
`allow_list_match`	`str \| None`	The matching strategy for the allow list. Default is "exact".	`'exact'`
`regex_flags`	`int \| None`	The regex flags for the text analysis.	`DOTALL \| MULTILINE \| IGNORECASE`
`**kwargs`	`Any`	Additional keyword arguments that may be needed for the text analysis process.	`{}`

Returns:

Type	Description
`list[RecognizerResult]`	The list of extracted entities.
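
To make the roles of `score_threshold` and `allow_list` concrete, here is a minimal, standard-library-only sketch of the kind of filtering they describe. It is not the package's or Presidio's implementation: the `Finding` class and `filter_findings` function are hypothetical stand-ins, and the sketch assumes that results scoring below the threshold are dropped and that `allow_list_match='exact'` means a whole-string comparison against the matched text.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical stand-in for a recognizer result (not Presidio's class)."""
    entity_type: str
    start: int
    end: int
    score: float


def filter_findings(text, findings, score_threshold=0.46, allow_list=None):
    """Keep findings at or above the threshold whose text is not allow-listed."""
    allowed = set(allow_list or [])
    kept = []
    for finding in findings:
        if finding.score < score_threshold:
            continue  # assumed behavior: low-confidence results are dropped
        if text[finding.start:finding.end] in allowed:
            continue  # assumed behavior: exact match against the allow list
        kept.append(finding)
    return kept


text = "Email budi@example.com or admin@example.com"
first = text.index("budi@example.com")
second = text.index("admin@example.com")
findings = [
    Finding("EMAIL_ADDRESS", first, first + len("budi@example.com"), 1.0),
    Finding("EMAIL_ADDRESS", second, second + len("admin@example.com"), 1.0),
    Finding("PERSON", 0, 5, 0.3),  # scores below the 0.46 default
]

kept = filter_findings(text, findings, allow_list=["admin@example.com"])
kept_texts = [text[f.start:f.end] for f in kept]
```

In this sketch only the first email address survives: the second is allow-listed and the low-scoring `PERSON` candidate falls under the default threshold.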

`TextAnonymizer(text_analyzer, operators=None, add_default_faker_operators=False, faker_seed=42, anonymizer_engine=None, deanonymizer_mapping=None, skip_format_duplicates=False)`

Bases: BaseTextAnonymizer

TextAnonymizer class to anonymize the text based on the analyzer results.

Implemented using the Presidio PII library [1].

Attributes:

Name	Type	Description
`anonymizer`	`AnonymizerEngine`	The anonymizer engine used in the anonymization process.
`operators`	`dict[str, OperatorConfig]`	The operators used in the anonymization process.
`anonymizer_mapping`	`MappingDataType`	The anonymizer mapping.
`deanonymizer_mapping`	`MappingDataType`	The deanonymizer mapping.

Initialize the TextAnonymizer class.

Parameters:

Name	Type	Description	Default
`text_analyzer`	`BaseTextAnalyzer`	The analyzer engine used in the anonymization process.	required
`operators`	`dict[str, OperatorConfig] \| None`	The operators used in the anonymization process. See https://microsoft.github.io/presidio/anonymizer/#built-in-operators for the built-in operators.	`None`
`add_default_faker_operators`	`bool`	Whether to add default faker operators. Defaults to False.	`False`
`faker_seed`	`int \| None`	The seed used for faker. Defaults to 42.	`42`
`anonymizer_engine`	`AnonymizerEngine`	The anonymizer engine used in the anonymization process. Defaults to None.	`None`
`deanonymizer_mapping`	`DeanonymizerMapping`	The deanonymizer mapping used to record mapping between anonymized and original text. Defaults to None.	`None`
`skip_format_duplicates`	`bool`	Whether to skip formatting duplicated operators. Defaults to False.	`False`

`anonymizer_mapping` `property`

Return the anonymizer mapping.

Returns:

Type	Description
`MappingDataType`	The anonymizer mapping.

`deanonymizer_mapping` `property`

Return the deanonymizer mapping.

Returns:

Type	Description
`MappingDataType`	The deanonymizer mapping.

`add_operators(operators)`

Add operators to the anonymizer.

Parameters:

Name	Type	Description	Default
`operators`	`dict[str, OperatorConfig]`	Operators to add to the anonymizer.	required

`anonymize(text, entities=None, language='id', conflict_resolution=None, analyzer_score_threshold=0.46, allow_list=None, allow_list_match='exact', regex_flags=re.DOTALL | re.MULTILINE | re.IGNORECASE, **kwargs)`

Anonymize the text based on the analyzer results.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to be anonymized.	required
`entities`	`list[str] \| None`	The list of entities to be anonymized. Defaults to None.	`None`
`language`	`str`	The language of the text to be analyzed. Defaults to "id".	`'id'`
`conflict_resolution`	`ConflictResolutionStrategy \| None`	The conflict resolution strategy. Defaults to None.	`None`
`analyzer_score_threshold`	`float`	The score threshold for the analyzer. Defaults to 0.46.	`0.46`
`allow_list`	`list \| None`	List of words that the user defines as being allowed to keep in the text. Defaults to None.	`None`
`allow_list_match`	`str \| None`	The matching strategy for the allow list. Defaults to "exact".	`'exact'`
`regex_flags`	`int \| None`	The regex flags for the text analysis.	`DOTALL \| MULTILINE \| IGNORECASE`
`**kwargs`	`Any`	Additional keyword arguments that may be needed for the text analysis process.	`{}`

Returns:

Type	Description
`str`	The anonymized text.
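
The default `regex_flags` value shared by `analyze` and `anonymize` combines three flags from Python's `re` module. The short sketch below only illustrates what those flags mean in plain `re` calls. It makes no claim about where the package applies them internally, and the sample text and patterns are made up for illustration.

```python
import re

# The same combination used as the default for `regex_flags`.
FLAGS = re.DOTALL | re.MULTILINE | re.IGNORECASE

text = "name: Budi\nNIK: 1234"

# MULTILINE lets ^ and $ anchor at each line; IGNORECASE lets "nik" match "NIK".
line_match = re.search(r"^nik: (\d+)$", text, FLAGS)

# DOTALL lets "." cross the newline between the two lines.
span_match = re.search(r"name:.*1234", text, FLAGS)

# With no flags, neither pattern matches this text.
no_flag_line = re.search(r"^nik: (\d+)$", text)
no_flag_span = re.search(r"name:.*1234", text)
```

Passing a narrower value, for example `re.IGNORECASE` alone, would therefore change which multi-line patterns can match.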

`deanonymize(text)`

Deanonymize the text based on the anonymizer mapping.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to be deanonymized.	required

Returns:

Type	Description
`str`	The deanonymized text, or the original text if no mapping is found or the mapping is empty.
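
The relationship between `anonymize`, the deanonymizer mapping, and `deanonymize` can be sketched as a round trip. The code below is a simplified illustration under stated assumptions, not the package's implementation: the helper names are hypothetical, the placeholder format `<ENTITY_n>` is invented, and the flat `placeholder -> original` dictionary is an assumed shape that may differ from the real `MappingDataType`.

```python
def anonymize_with_mapping(text, spans):
    """Replace each (start, end, entity_type) span with a numbered placeholder.

    Returns the anonymized text and a placeholder -> original mapping.
    Hypothetical helper for illustration only.
    """
    mapping = {}
    counters = {}
    pieces = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        counters[entity_type] = counters.get(entity_type, 0) + 1
        placeholder = f"<{entity_type}_{counters[entity_type]}>"
        mapping[placeholder] = text[start:end]
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])  # untouched tail
    return "".join(pieces), mapping


def deanonymize(text, mapping):
    """Restore originals; return the text unchanged when the mapping is empty."""
    if not mapping:
        return text
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


original = "Hubungi Budi di budi@example.com"
name_start = original.index("Budi")
email_start = original.index("budi@example.com")
spans = [
    (name_start, name_start + len("Budi"), "PERSON"),
    (email_start, email_start + len("budi@example.com"), "EMAIL_ADDRESS"),
]

anonymized, mapping = anonymize_with_mapping(original, spans)
restored = deanonymize(anonymized, mapping)
unchanged = deanonymize("no placeholders here", {})
```

Recording the mapping at anonymization time is what makes the later restoration possible. With an empty mapping, the sketch returns its input as-is, mirroring the fallback described above.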

`remove_all_operators()`

Remove all operators from the anonymizer.

`reset_deanonymizer_mapping()`

Reset the deanonymizer mapping.
