Definition
The process of identifying and masking Personally Identifiable Information (PII) in source documents or user queries before they are ingested into a vector database or processed by an LLM. In RAG pipelines, this keeps sensitive data out of stored embeddings and away from third-party model providers while preserving the semantic content that retrieval depends on.
In AI, the term refers to removing identity while preserving the semantic context the model needs, rather than simply stripping characters or encrypting fields.
"A high-fidelity photocopy of a medical record where names and dates are covered by generic labels like '[PATIENT_ID]' so a doctor can still diagnose the case without knowing who the patient is."
- PII (Personally Identifiable Information) (Prerequisite)
- Semantic Preservation (Component)
- Vector Database (Component)
- Data Utility vs. Privacy Trade-off (Architectural Consideration)
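The masking step described in the definition can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the names `PII_PATTERNS` and `mask_pii` are hypothetical, and detection is assumed to be regex-only for emails, US-style phone numbers, and US Social Security numbers. Real pipelines usually add a named-entity recognizer, since regexes cannot find personal names or free-form identifiers.

```python
import re

# Hypothetical, deliberately simple patterns (assumption: US-style formats).
# Regexes alone will miss names, addresses, and other free-form PII.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def mask_pii(text):
    """Replace detected PII with typed, numbered placeholders.

    Returns the masked text and a placeholder -> original value mapping.
    The same value always receives the same placeholder, so references
    to one entity stay linked after masking.
    """
    mapping = {}  # placeholder -> original value
    seen = {}     # (label, value) -> placeholder

    for label, pattern in PII_PATTERNS.items():

        def substitute(match, label=label):
            value = match.group(0)
            key = (label, value)
            if key not in seen:
                # Number placeholders per label: [EMAIL_1], [EMAIL_2], ...
                count = sum(1 for k in seen if k[0] == label) + 1
                placeholder = f"[{label}_{count}]"
                seen[key] = placeholder
                mapping[placeholder] = value
            return seen[key]

        text = pattern.sub(substitute, text)

    return text, mapping


if __name__ == "__main__":
    note = (
        "Patient reachable at jane.doe@example.com or 555-123-4567; "
        "SSN 123-45-6789. Follow up via jane.doe@example.com."
    )
    masked, mapping = mask_pii(note)
    print(masked)
```

Typed placeholders such as `[EMAIL_1]` keep the sentence structure and the entity type intact, which is what allows the masked text to be embedded and retrieved meaningfully. The returned mapping should be kept outside the vector database, and only if re-identification is actually needed downstream.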