PII Detection and Masking in RAG Pipelines

March 28, 2024


Introduction

In today’s data-driven world, safeguarding Personally Identifiable Information (PII) is paramount. PII encompasses data such as names, addresses, phone numbers, and financial records, which can be used to identify an individual. With the rise of artificial intelligence and its vast data processing capabilities, protecting PII while harnessing its potential for personalized experiences is crucial. Retrieval Augmented Generation (RAG) emerges as a solution, combining information retrieval with advanced language generation models. These systems sift through extensive data repositories to extract relevant information, refining AI-generated outputs for precision and context.

However, the use of user data poses risks of unintentional PII exposure. PII detection technologies mitigate this risk by automatically identifying and concealing sensitive data. With stringent privacy measures in place, RAG models can leverage user data to provide tailored services while upholding privacy standards. This integration underscores the ongoing effort to balance personalized data usage with user privacy, prioritizing data confidentiality as AI technology advances.

Learning Objectives

The article delves into developing a robust PII detection tool with Llama Index and Presidio, a Microsoft anonymization library.
Presidio detects and anonymizes sensitive personal data, offering users customizable PII detection tools built on techniques such as NER, regular expressions, and checksum algorithms.
Users can customize the anonymization process with Presidio’s flexible framework, giving them greater control.
Llama Index integrates Presidio’s functionality for an accessible solution.
The article compares Presidio with the NER PII post-processing tool, showing where Presidio performs better and its practical benefits.
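
To make the regular-expression and checksum techniques mentioned above concrete, here is a minimal Python sketch. It is not Presidio’s implementation: the pattern CARD_PATTERN and the helpers luhn_valid and find_card_numbers are hypothetical names chosen for illustration. A regular expression proposes candidate card numbers, and the Luhn checksum then filters out digit strings that cannot be valid.

```python
import re

# Hypothetical pattern: 16 digits in four groups, optionally separated by "-" or " ".
CARD_PATTERN = re.compile(r"\b\d{4}(?:[- ]?\d{4}){3}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, digit in enumerate(reversed(digits)):
        if i % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list[tuple[str, bool]]:
    """Return (candidate, passes_checksum) pairs for every regex match."""
    return [(m.group(), luhn_valid(m.group())) for m in CARD_PATTERN.finditer(text)]


sample = "Test card 4111 1111 1111 1111, mistyped card 4111 1111 1111 1112."
for candidate, ok in find_card_numbers(sample):
    print(candidate, "->", "valid" if ok else "fails checksum")
```

Under this check, a fabricated number such as 5300-1234-5678-9000 fails (its Luhn sum is 47, which is not a multiple of 10), so a detector that validates checksums would not report it as a card number.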


This article was published as a part of the Data Science Blogathon.

Hands-on PII Detection Using Llama Index Post-processing Tools

Let’s begin our exploration with the NERPIINodePostprocessor tool from Llama Index. For that, we will need to install a few necessary packages.

The necessary packages are listed below:

llama-index==0.10.22
llama-index-agent-openai==0.1.7
llama-index-cli==0.1.11
llama-index-core==0.10.23
llama-index-indices-managed-llama-cloud==0.1.4
llama-index-legacy==0.9.48
llama-index-multi-modal-llms-openai==0.1.4
llama-index-postprocessor-presidio==0.1.1
llama-parse==0.3.9
llamaindex-py-client==0.1.13
presidio-analyzer==2.2.353
presidio-anonymizer==2.2.353
pydantic==2.5.3
pydantic_core==2.14.6
spacy==3.7.4
torch==2.2.1+cpu
transformers==4.39.1

To test the tool, we require dummy data for PII detection. For experimentation, handwritten texts containing fabricated names, dates, credit card numbers, phone numbers, and email addresses have been used. Alternatively, any text of choice can be used for testing, or GPT can be employed to generate text. The following text will be used for our experimentation:

text = """
Hi there! You can call me Max Turner. Reach out at [email protected],
and you'll find me strolling the streets of Vienna. My plastic friend, the
Mastercard, reads 5300-1234-5678-9000. Ever vibed at a gig by Zsofia Kovacs?
I'm curious. As for my card, it has a limit I'd rather not disclose here;
however, my bank details are as follows: AT611904300235473201. Turner is the
family name. Tracing my roots, I've got ancestors named Leopold Turner and
Elisabeth Baumgartner. Also, a quick FYI: I tried to visit your website, but
my IP (203.0.113.5) seems to be barred. I did, however, manage to post a
visual at this link: http://MegaMovieMoments.fi.
"""

Step 1: Initializing the Tool and Importing Dependencies

With the packages installed and the sample text prepared, we proceed to use the NERPIINodePostprocessor tool. We need to import NERPIINodePostprocessor from Llama Index, along with the TextNode schema to create a text node. This step is essential because NERPIINodePostprocessor operates on TextNode objects rather than raw strings.

Below is the code snippet for imports:

from pprint import pprint  # used later to print the post-processed results

from llama_index.core.postprocessor import NERPIINodePostprocessor
from llama_index.core.schema import TextNode
from llama_index.core.schema import NodeWithScore

Step 2: Creating TextNode Objects

Following the imports, we proceed to create a TextNode object using our sample text.

text_node = TextNode(text=text)

Step 3: Post-processing Sensitive Entities

Next, we create a NERPIINodePostprocessor object and apply it to our TextNode object to post-process and mask the sensitive entities.

processor = NERPIINodePostprocessor()

new_nodes = processor.postprocess_nodes(
    [NodeWithScore(node=text_node)]
)

Step 4: Reviewing Post-Processed Text and PII Entity Mapping

After completing the post-processing of our text, we can now examine the post-processed text alongside the PII entity mapping.

pprint(new_nodes[0].node.get_content())

# OUTPUT
# 'Hi there! You can call me [PER_26]. Reach out at [email protected], '
# "and you'll find me strolling the streets of [LOC_122]. My plastic friend, "
# 'the [ORG_153], reads 5300-1234-5678-9000. Ever vibed at a gig by [PER_215]? '
# "I'm curious. As for my card, it has a limit I'd rather not disclose here; "
# 'however, my bank details are as follows: AT611904300235473201. [PER_367] is '
# "the family name. Tracing my roots, I've got ancestors named Leopold "
# '[PER_367] and [PER_456]. Also, a quick FYI: I tried to visit your website, '
# 'but my IP (203.0.113.5) seems to be barred. I did, however, manage to post a '
# 'visual at this link: [ORG_627].fi.'

pprint(new_nodes[0].node.metadata)

# OUTPUT
# {'__pii_node_info__': {'[LOC_122]': 'Vienna',
#                        '[ORG_153]': 'Mastercard',
#                        '[ORG_627]': 'MegaMovieMoments',
#                        '[PER_215]': 'Zsofia Kovacs',
#                        '[PER_26]': 'Max Turner',
#                        '[PER_367]': 'Turner',
#                        '[PER_437]': 'Leopold Turner',
#                        '[PER_456]': 'Elisabeth Baumgartner'}}

Assessing the Limitations of NERPIINodePostprocessor and Introduction to Presidio

Upon reviewing the results, it’s evident that the postprocessor fails to mask highly sensitive entities such as credit card numbers, phone numbers, and email addresses. This outcome deviates from our intention, as we aimed to mask all sensitive entities, including names, addresses, credit card numbers, and email addresses.

While the NERPIINodePostprocessor effectively masks named entities like person and company names, tagging each with its entity type and a numeric suffix, it proves inadequate for masking texts containing highly sensitive content. Now that we understand the functionality of the NERPIINodePostprocessor and its limitations in masking sensitive information, let’s assess the performance of Presidio on the same text. We’ll explore Presidio’s functionality first and then proceed with Llama Index’s Presidio implementation.


Importing Essential Packages for Presidio Integration

To begin, import the requisite packages. This includes the AnalyzerEngine and AnonymizerEngine from Presidio. Additionally, import the PresidioPIINodePostprocessor, which serves as Llama Index’s integration of Presidio.

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from llama_index.postprocessor.presidio import PresidioPIINodePostprocessor

Initializing and Analyzing Text with the Analyzer Engine

Proceed by initializing the Analyzer Engine with the list of supported languages. Set it to a list containing ‘en’ for English. This tells Presidio which languages the engine should support; the language of the text itself is passed explicitly when calling analyze. Then use the analyzer instance to analyze the text.

analyzer = AnalyzerEngine(supported_languages=["en"])

results = analyzer.analyze(text=text, language="en")

The result of the analysis lists, for each detected PII entity, its entity type, its start and end index in the string, and a confidence score.
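
The sketch below shows one way such results might be inspected and filtered. To stay runnable without Presidio installed, it uses a hypothetical Detection dataclass as a stand-in that carries the same four fields (entity_type, start, end, score). The sample values and the 0.5 threshold are illustrative assumptions, not output from the analyzer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Stand-in for one analyzer result (hypothetical, for illustration)."""
    entity_type: str
    start: int
    end: int
    score: float


def filter_by_score(detections, min_score=0.5):
    """Keep detections at or above `min_score`, ordered by position in the text."""
    kept = [d for d in detections if d.score >= min_score]
    return sorted(kept, key=lambda d: d.start)


text = "Max lives in Vienna. IP: 203.0.113.5"
detections = [
    Detection("IP_ADDRESS", 25, 36, 0.95),
    Detection("PERSON", 0, 3, 0.85),
    Detection("LOCATION", 13, 19, 0.85),
    Detection("DATE_TIME", 21, 23, 0.10),  # made-up low-confidence false positive
]

for d in filter_by_score(detections):
    print(f"{d.entity_type:<12} {d.start:>3}-{d.end:<3} {d.score:.2f} {text[d.start:d.end]!r}")
```

Slicing the text with each start and end index is a quick way to confirm that a detection covers the characters you expect before anything is masked.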

Initializing the Anonymizer Engine

After initializing the Analyzer Engine, proceed to initialize the Anonymizer Engine. This component anonymizes the original text based on the results obtained from the Analyzer Engine.

engine = AnonymizerEngine()

new_text = engine.anonymize(text=text, analyzer_results=results)

Below is the output from the anonymizer engine, showing the original text with masked PII entities.

pprint(new_text.text)

# OUTPUT
#  "Hi there! You can call me <PERSON>. Reach out at <EMAIL_ADDRESS>, and you'll "
#  'find me strolling the streets of <LOCATION>. My plastic friend, the '
#  "<IN_PAN>, reads <IN_PAN>5678-9000. Ever vibed at a gig by <PERSON>? I'm "
#  "curious. As for my card, it has a limit I'd rather not disclose here; "
#  'however, my bank details are as follows: AT611904300235473201. <PERSON> is '
#  "the family name. Tracing my roots, I've got ancestors named <PERSON> and "
#  '<PERSON>. Also, a quick FYI: I tried to visit your website, but my IP '
#  '(<IP_ADDRESS>) seems to be barred. I did, however, manage to post a visual '
#  'at this link: <URL>.'
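
Conceptually, span-based anonymization can be pictured as replacing each detected span with a placeholder. The sketch below is a simplified illustration, not the AnonymizerEngine’s implementation: replace_spans is a hypothetical helper, spans are plain (entity_type, start, end) tuples, and they are assumed not to overlap. Replacing from the end of the string backwards keeps the earlier offsets valid.

```python
def replace_spans(text, spans):
    """Replace each (entity_type, start, end) span with <ENTITY_TYPE>.

    Assumes spans do not overlap. Processing from right to left means a
    replacement never shifts the offsets of spans still to be handled.
    """
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


text = "Call Max Turner in Vienna."
spans = [("PERSON", 5, 15), ("LOCATION", 19, 25)]
print(replace_spans(text, spans))
```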


Analyzing PII Masking with Presidio

Presidio masks the PII entities it detects by replacing them with their entity type enclosed in ‘<‘ and ‘>’. In the output above, the names, email address, location, IP address, and URL are masked; the card number, however, is only partially masked (and tagged as IN_PAN), and the bank account number is left unmasked. The masking also lacks unique identifiers for individual entities. Here, the Llama Index integration enhances the process. The Presidio implementation in Llama Index not only returns the masked text with numbered entity types but also provides a deanonymizer map for deanonymization. Let’s explore how to use these features.

First, create a TextNode object using the input text.

text_node = TextNode(text=text)

Next, create an instance of PresidioPIINodePostprocessor and run the postprocessor on the TextNode.

processor = PresidioPIINodePostprocessor()

new_nodes = processor.postprocess_nodes(
    [NodeWithScore(node=text_node)]
)

Finally, we get the masked text from the anonymizer along with the deanonymizer map.

pprint(new_nodes[0].node.get_content())

# OUTPUT
#  'Hi there! You can call me <PERSON_5>. Reach out at <EMAIL_ADDRESS_1>, and '
#  "you'll find me strolling the streets of <LOCATION_1>. My plastic friend, the "
#  '<IN_PAN_2>, reads <IN_PAN_1>5678-9000. Ever vibed at a gig by <PERSON_4>? '
#  "I'm curious. As for my card, it has a limit I'd rather not disclose here; "
#  'however, my bank details are as follows: AT611904300235473201. <PERSON_3> is '
#  "the family name. Tracing my roots, I've got ancestors named <PERSON_2> and "
#  '<PERSON_1>. Also, a quick FYI: I tried to visit your website, but my IP '
#  '(<IP_ADDRESS_1>) seems to be barred. I did, however, manage to post a visual '
#  'at this link: <URL_1>.'


pprint(new_nodes[0].metadata)

# OUTPUT
# {'__pii_node_info__': {'<EMAIL_ADDRESS_1>': '[email protected]',
#                        '<IN_PAN_1>': '5300-1234-',
#                        '<IN_PAN_2>': 'Mastercard',
#                        '<IP_ADDRESS_1>': '203.0.113.5',
#                        '<LOCATION_1>': 'Vienna',
#                        '<PERSON_1>': 'Elisabeth Baumgartner',
#                        '<PERSON_2>': 'Leopold Turner',
#                        '<PERSON_3>': 'Turner',
#                        '<PERSON_4>': 'Zsofia Kovacs',
#                        '<PERSON_5>': 'Max Turner',
#                        '<URL_1>': 'MegaMovieMoments.fi'}}

The masked text generated by PresidioPIINodePostprocessor masks the detected PII entities, indicating their entity type and count; as before, part of the card number and the bank account number remain visible in this run. Additionally, it provides a deanonymizer map, facilitating the subsequent deanonymization of the masked text.
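
How a deanonymizer map supports reversal can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the integration’s code: mask_with_map and deanonymize are hypothetical helpers, spans are non-overlapping (entity_type, start, end) tuples, and the numbering scheme is chosen for simplicity.

```python
from collections import defaultdict


def mask_with_map(text, spans):
    """Replace non-overlapping (entity_type, start, end) spans with numbered
    placeholders and return (masked_text, placeholder -> original mapping)."""
    counters = defaultdict(int)
    mapping = {}
    # Right to left, so earlier offsets stay valid while the text changes.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        counters[entity_type] += 1
        placeholder = f"<{entity_type}_{counters[entity_type]}>"
        mapping[placeholder] = text[start:end]
        text = text[:start] + placeholder + text[end:]
    return text, mapping


def deanonymize(masked_text, mapping):
    """Restore the original values by substituting each placeholder back."""
    for placeholder, original in mapping.items():
        masked_text = masked_text.replace(placeholder, original)
    return masked_text


text = "Max Turner met Zsofia Kovacs in Vienna."
spans = [("PERSON", 0, 10), ("PERSON", 15, 28), ("LOCATION", 32, 38)]
masked, mapping = mask_with_map(text, spans)
print(masked)
print(deanonymize(masked, mapping) == text)
```

Because this sketch numbers entities while walking backwards through the text, the last person mentioned receives the lowest number, which resembles the ordering visible in the output above.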

Applications and Limitations

By leveraging the PresidioPIINodePostprocessor tool, we can anonymize information within our RAG pipeline, prioritizing user data privacy. Within the RAG pipeline, it can serve as a data anonymizer during data ingestion, masking sensitive information before it is stored. Similarly, in the query pipeline, it can function as a deanonymizer, allowing authenticated users to access sensitive information while maintaining privacy. The deanonymizer map can be stored in a protected location, helping preserve the confidentiality of sensitive data throughout the process.
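
The ingestion and query roles described above can be sketched with a toy pipeline. Everything here is a simplifying assumption made for illustration: ToyPipeline is a hypothetical class, a regular expression for email addresses stands in for the PII detector, a Python list stands in for the vector store, and keyword matching stands in for retrieval.

```python
import re

# Simplified email pattern used as a stand-in for a real PII detector.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class ToyPipeline:
    """Masks email addresses at ingestion and reverses the masking at
    query time for authorized callers only."""

    def __init__(self):
        self.store = []   # masked documents (what a vector store would hold)
        self.vault = {}   # placeholder -> original; keep this somewhere protected

    def ingest(self, document):
        def _mask(match):
            placeholder = f"<EMAIL_ADDRESS_{len(self.vault) + 1}>"
            self.vault[placeholder] = match.group()
            return placeholder

        self.store.append(EMAIL.sub(_mask, document))

    def query(self, keyword, authorized=False):
        hits = [doc for doc in self.store if keyword.lower() in doc.lower()]
        if not authorized:
            return hits
        restored = []
        for doc in hits:
            for placeholder, original in self.vault.items():
                doc = doc.replace(placeholder, original)
            restored.append(doc)
        return restored


pipeline = ToyPipeline()
pipeline.ingest("Invoice questions go to billing@example.com.")
print(pipeline.query("invoice"))
print(pipeline.query("invoice", authorized=True))
```

Keeping the mapping in a separate object from the stored documents mirrors the idea that a leak of the document store alone would expose only placeholders.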

The PII anonymizer tool is useful in RAG pipelines dealing with financial documents or sensitive user or organization information that must be protected from unidentified or unauthorized access. Because document contents are stored in the vector store in anonymized form, the sensitive values stay masked even in the event of a data breach. It also proves valuable in RAG pipelines involving organizational or personal emails, where sensitive data like addresses, password change URLs, and OTPs are prevalent and should be ingested in an anonymized state.

Limitations

While the PII detection tool can be useful in RAG pipelines, there are some limitations to implementing it in a RAG pipeline.

Adding PII detection and masking can introduce additional processing time to the RAG pipeline, which may impact the overall performance and latency of the system, especially with large datasets or when real-time processing is required.
No PII detection tool is perfect; there can be instances of false positives, where non-PII data is mistakenly masked, or false negatives, where actual PII is not detected. Both scenarios can have implications for user experience and data protection efficacy.
Presidio may have limitations in understanding context and nuances across different languages, potentially reducing its effectiveness in accurately identifying PII in multilingual datasets.
While the PII anonymization tool can mask sensitive information, the initial ingestion of data still requires careful handling. If a breach occurs before the data is anonymized, sensitive information could be exposed.
In cases where anonymization needs to be reversible, maintaining secure and controlled access to deanonymization keys or maps is critical, and failure to do so could compromise the integrity of the anonymization process.
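
The false positives and false negatives mentioned in the list above can be quantified as precision and recall against a hand-labelled sample. The sketch below is one simple way to do that, assuming exact span matching; precision_recall is a hypothetical helper and the spans are made-up values for illustration, not measurements of any tool.

```python
def precision_recall(predicted, expected):
    """Compare predicted and expected PII spans by exact (type, start, end) match."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hand-labelled spans for a made-up sentence (illustrative values only).
expected = {("PERSON", 0, 10), ("LOCATION", 32, 38), ("IBAN_CODE", 50, 70)}
# What a hypothetical detector returned: one miss and one false positive.
predicted = {("PERSON", 0, 10), ("LOCATION", 32, 38), ("ORGANIZATION", 40, 48)}

precision, recall = precision_recall(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision points to over-masking that hurts usability, while low recall points to PII slipping through unmasked, so both numbers are worth tracking on data that resembles the real corpus.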

Conclusion

In conclusion, the incorporation of PII detection and masking tools like Presidio into RAG pipelines marks a notable stride in AI’s capacity to handle sensitive data while upholding individual privacy. Through advanced techniques and customizable features, Presidio improves the security and flexibility of text generation, meeting the growing need for data privacy in the digital era. Despite potential challenges such as latency and accuracy, the advantages of safeguarding user data with sophisticated anonymization tools are undeniable, positioning it as an essential element of responsible AI development and deployment.

Key Takeaways

With the growing use of AI and big data, the need to protect Personally Identifiable Information (PII) in any system that processes user data is critical.
Retrieval Augmented Generation (RAG) systems, which combine information retrieval with language generation, can potentially expose PII. Therefore, incorporating PII detection and masking mechanisms is essential to maintain privacy standards.
Microsoft’s Presidio offers robust PII detection and anonymization capabilities, making it a suitable choice for integration into RAG pipelines. It provides predefined and customizable PII detectors, leveraging NER, regular expressions, and checksums.
Presidio is preferred over basic NER PII post-processing tools due to its sophisticated anonymization features, flexibility, and higher accuracy in detecting a wide range of PII entities.
The PII anonymization tool is particularly useful in RAG pipelines dealing with financial documents, sensitive organizational data, and emails, ensuring that private information is not exposed to unauthorized users.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
