Why RAG Pipelines Break When You Naively Anonymize
Retrieval-Augmented Generation (RAG) depends on semantic similarity between a user's query and indexed document chunks. When a user asks "What did Maria Rossi order last month?", the retriever searches for chunks containing semantically relevant content — and the LLM synthesizes an answer from the retrieved context.
The problem arises at ingestion time. If you anonymize documents before embedding them, and your anonymization is not consistent, you break the semantic graph:
- Document A: "Maria Rossi" → [PERSON_1]
- Document B: "Maria Rossi" → [PERSON_3]
- Document C: "M. Rossi" → [PERSON_7]
Now the three documents that should co-retrieve together are indexed as referring to three different people. Your retrieval recall collapses. The LLM synthesizes incomplete answers from only a fraction of the relevant context.
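A toy exact-token retriever makes the failure concrete. This is a deliberately minimal sketch (the chunk store and token values are illustrative, not from any real index):

```python
# Toy demo: one person indexed under three different placeholder tokens
# becomes invisible to a query that carries only one of those tokens.
chunks = {
    "doc_a": "[PERSON_1] ordered product A in January.",
    "doc_b": "[PERSON_3] requested a refund.",
    "doc_c": "[PERSON_7] followed up about the order.",
}

def retrieve(query_token: str) -> list[str]:
    """Naive exact-token retrieval over the chunk store."""
    return [doc for doc, text in chunks.items() if query_token in text]

# The anonymized query resolves "Maria Rossi" to [PERSON_1], so only
# one of the three relevant chunks is retrieved.
print(retrieve("[PERSON_1]"))  # ['doc_a'] -- recall: 1 of 3
```

Real retrievers use embedding similarity rather than exact matching, but the effect is the same: chunks about the same person no longer share a common anchor term.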
The Problem: Inconsistent Anonymization Destroys Retrieval
Standard anonymization tools are designed for single-document, single-session use. They assign placeholder tokens sequentially ([PERSON_1], [PERSON_2], ...) without any persistence. Run the same document through the tool twice and you may get different token assignments.
For RAG pipelines, this means:
- Cross-document retrieval fails — the same entity has different tokens across documents, so similarity searches miss related chunks.
- Incremental indexing breaks — new documents added to the index use a fresh token sequence, creating collisions with existing tokens (a new [PERSON_1] may refer to a completely different person).
- Query-time anonymization mismatches — if you anonymize the user's query before retrieval, the query tokens must match the index tokens exactly.
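The collision mode is easy to reproduce. The sketch below is a hypothetical naive anonymizer (not any real tool's API) whose counter resets on every call, exactly the single-session behavior described above:

```python
def names_found_in(text: str, names: list[str]) -> list[str]:
    """Return the known names that occur in this text."""
    return [n for n in names if n in text]

def naive_anonymize(text: str, names: list[str]) -> str:
    """Session-local sequential tokens: the counter restarts per document."""
    result = text
    for i, name in enumerate(names_found_in(text, names), start=1):
        result = result.replace(name, f"[PERSON_{i}]")
    return result

names = ["Maria Rossi", "John Smith"]
print(naive_anonymize("Maria Rossi met John Smith.", names))
# [PERSON_1] met [PERSON_2].
print(naive_anonymize("John Smith replied.", names))
# [PERSON_1] replied.  <- John is now PERSON_1: collides with Maria's token
```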
The Solution: Consistent Pseudonymization
Consistent pseudonymization uses a deterministic mapping: the same input entity always produces the same output token. The mapping is stored in a persistent table (a database or key-value store) that is reused across all documents and sessions.
There are two implementation approaches:
Approach 1: Hash-Based Tokens
Compute a keyed hash of the entity value. The hash is always the same for the same input, so two occurrences of "Maria Rossi" produce identical tokens. Uses a secret key (HMAC) to prevent reverse-engineering.
```python
import hmac, hashlib

HMAC_KEY = b"your-secret-key-32-bytes-minimum"

def consistent_token(entity_type: str, entity_value: str) -> str:
    """Produce a stable, deterministic token for any entity value."""
    digest = hmac.new(
        HMAC_KEY,
        f"{entity_type}:{entity_value.strip().lower()}".encode(),
        hashlib.sha256
    ).hexdigest()[:8]
    return f"[{entity_type}_{digest}]"

# Examples (digests illustrative):
# consistent_token("PERSON", "Maria Rossi") -> "[PERSON_3a7f9c2b]" (always)
# consistent_token("PERSON", "maria rossi") -> "[PERSON_3a7f9c2b]" (same)
# consistent_token("PERSON", "John Smith")  -> "[PERSON_b1e4a8d6]" (different)
```

Approach 2: Lookup Table with Sequential Assignment
Maintain a persistent dictionary mapping entity values to assigned tokens. On first encounter, assign the next available token for that type; on subsequent encounters, return the existing mapping.
```python
import json, pathlib

class PersistentMapping:
    def __init__(self, path: str):
        self.path = pathlib.Path(path)
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}
        # Rebuild per-type counters from the stored tokens, so reloading
        # an existing mapping never reuses an already-assigned number.
        self.counters = {}
        for token in self.data.values():
            etype, num = token.strip("[]").rsplit("_", 1)
            self.counters[etype] = max(self.counters.get(etype, 0), int(num))

    def get_token(self, entity_type: str, entity_value: str) -> str:
        key = f"{entity_type}:{entity_value.strip().lower()}"
        if key not in self.data:
            count = self.counters.get(entity_type, 0) + 1
            self.counters[entity_type] = count
            self.data[key] = f"[{entity_type}_{count}]"
            self.path.write_text(json.dumps(self.data))
        return self.data[key]

    def reverse(self, token: str) -> str | None:
        reverse_map = {v: k.split(":", 1)[1] for k, v in self.data.items()}
        return reverse_map.get(token)

mapping = PersistentMapping("entity_mapping.json")
print(mapping.get_token("PERSON", "Maria Rossi"))  # [PERSON_1]
print(mapping.get_token("PERSON", "Maria Rossi"))  # [PERSON_1] (same)
print(mapping.get_token("PERSON", "John Smith"))   # [PERSON_2]
```

Using the anonymize.solutions API for Consistent Mapping
The anonymize.solutions API supports consistent pseudonymization natively via the session_id parameter. Pass the same session_id across all documents in your corpus, and the API maintains a persistent mapping for that session:
```python
import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.anonymize.solutions/v1"
CORPUS_SESSION = "rag-corpus-v1"  # Stable ID for your corpus

def anonymize_for_rag(text: str) -> dict:
    response = requests.post(
        f"{BASE_URL}/anonymize",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "text": text,
            "preset": "gdpr",
            "session_id": CORPUS_SESSION,  # Ensures consistent tokens
            "consistent": True             # Enable cross-session consistency
        }
    )
    return response.json()

result = anonymize_for_rag("Maria Rossi ordered from Berlin on Jan 15.")
# Returns: {"text": "[PERSON_1] ordered from [LOCATION_1] on [DATE_1].",
#           "session_id": "rag-corpus-v1", "mapping": {...}}
```

Entity Types That Matter in RAG
Not all entity types affect retrieval equally. Focus consistent pseudonymization on entities that appear as query terms or cross-document references:
| Entity Type | Impact on Retrieval | Recommendation |
|---|---|---|
| Person names | Very high — most common query term | Always use consistent pseudonymization |
| Organizations | High — frequent in business RAG | Consistent pseudonymization |
| Locations / Addresses | Medium — depends on use case | Consistent for city/country; redact street addresses |
| IDs (account, medical record) | High — exact match queries | Always consistent — ID lookups require exact match |
| Dates | Low-medium — relative dates matter | Redact specific dates; preserve relative time references |
| Email / Phone | Low — rarely queried | Redact completely; not needed for retrieval |
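The recommendations above can be encoded as a per-entity-type policy that your ingestion pipeline consults. This is an illustrative sketch; the type names and `Strategy` enum are assumptions, not part of any API:

```python
from enum import Enum

class Strategy(Enum):
    CONSISTENT = "consistent"  # stable pseudonym, preserves retrieval
    REDACT = "redact"          # irreversible removal

# Hypothetical policy mirroring the table's recommendations.
RAG_POLICY = {
    "PERSON": Strategy.CONSISTENT,
    "ORGANIZATION": Strategy.CONSISTENT,
    "CITY": Strategy.CONSISTENT,
    "STREET_ADDRESS": Strategy.REDACT,
    "ID": Strategy.CONSISTENT,
    "DATE": Strategy.REDACT,
    "EMAIL": Strategy.REDACT,
    "PHONE": Strategy.REDACT,
}

def strategy_for(entity_type: str) -> Strategy:
    # Fail closed: unknown entity types default to redaction.
    return RAG_POLICY.get(entity_type, Strategy.REDACT)
```

Defaulting unknown types to redaction trades a little retrieval quality for a guarantee that unrecognized PII never reaches the index in the clear.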
The Decrypt Workflow: Restoring Original Values After Retrieval
When the RAG pipeline returns an answer containing pseudonymized tokens, you need to restore the original values before showing the response to the user. This is the de-anonymization step:
```python
def deanonymize_rag_response(
    anonymized_response: str,
    session_id: str
) -> str:
    response = requests.post(
        f"{BASE_URL}/deanonymize",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "text": anonymized_response,
            "session_id": session_id
        }
    )
    return response.json()["text"]

# LLM returned: "[PERSON_1] placed order #[ID_1] on [DATE_1]."
original = deanonymize_rag_response(
    "[PERSON_1] placed order #[ID_1] on [DATE_1].",
    CORPUS_SESSION
)
# Returns: "Maria Rossi placed order #ORD-78234 on January 15, 2026."
```

The de-anonymization step should only run for authorized users. Implement role-based access control: analysts can query the anonymized index, but only authorized data owners can see de-anonymized responses.
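One way to enforce that gate is a thin wrapper that checks the caller's role before restoring any values. The role names and the injected `deanonymize` callable below are illustrative assumptions, not part of the API:

```python
AUTHORIZED_ROLES = {"data_owner"}  # roles allowed to see original values

def guarded_deanonymize(text: str, user_role: str, deanonymize) -> str:
    """Restore original values only for authorized roles; everyone
    else receives the pseudonymized text unchanged."""
    if user_role in AUTHORIZED_ROLES:
        return deanonymize(text)
    return text

# Stand-in for the real de-anonymization call:
restore = lambda t: t.replace("[PERSON_1]", "Maria Rossi")

print(guarded_deanonymize("[PERSON_1] placed an order.", "analyst", restore))
# [PERSON_1] placed an order.
print(guarded_deanonymize("[PERSON_1] placed an order.", "data_owner", restore))
# Maria Rossi placed an order.
```

Keeping the check on the server side, in front of the mapping table, matters more than the exact shape of this wrapper: a client-side check can be bypassed.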
Compliance: GDPR Recital 26 and EU AI Act Article 10
GDPR Recital 26 states that anonymized data — data that cannot be attributed to an identified or identifiable person — falls outside the scope of GDPR. Consistent pseudonymization, where the mapping table is held separately and access-controlled, comes close to this standard. Under GDPR, pseudonymized data is still personal data (because re-identification is possible with the key), but it benefits from reduced obligations and is recognized as a security measure under Article 32.
EU AI Act Article 10 requires that data used in high-risk AI systems be subject to appropriate data governance practices, including measures to detect and correct biases and to ensure relevance and representativeness. Storing RAG corpora in consistently pseudonymized form, with a separate, access-controlled mapping table, is one technical measure that supports this data-governance requirement.
Performance: Sub-Millisecond Overhead, Batch Processing
Consistent pseudonymization adds minimal latency to your ingestion pipeline:
- API latency: 5–25ms per document (network-bound, not compute-bound)
- Batch mode: Send up to 100 documents per API call, reducing per-document overhead to under 1ms
- Local caching: After the first call, cache the session mapping locally. Subsequent documents only hit the API for new entity values.
- Embedding overhead: Anonymized text has slightly different semantic properties than the original. Use the same embedding model throughout; do not mix embeddings of anonymized and non-anonymized text.
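The local-caching point above can be sketched as an in-memory cache in front of the token service, so only unseen entities trigger a remote call. The `api_token` callable is a stand-in for the real API request:

```python
class TokenCache:
    """In-memory cache over a remote consistent-token service.
    Only unseen (type, normalized value) pairs hit the API."""

    def __init__(self, api_token):
        self.api_token = api_token  # e.g. a function wrapping the HTTP call
        self.cache: dict[tuple[str, str], str] = {}
        self.api_calls = 0

    def token(self, entity_type: str, value: str) -> str:
        key = (entity_type, value.strip().lower())
        if key not in self.cache:
            self.api_calls += 1
            self.cache[key] = self.api_token(entity_type, value)
        return self.cache[key]

cache = TokenCache(lambda t, v: f"[{t}_X]")  # stand-in for the API
cache.token("PERSON", "Maria Rossi")
cache.token("PERSON", "maria rossi")  # normalized: cache hit, no API call
print(cache.api_calls)  # 1
```

Normalizing the value before using it as a cache key must match the normalization the service itself applies, or the cache and the server-side mapping will drift apart.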
```python
# Batch anonymization example
documents = [
    "Maria Rossi ordered product A on January 15.",
    "John Smith requested a refund for order ORD-78234.",
    "Maria Rossi followed up about her January order.",
]

batch_response = requests.post(
    f"{BASE_URL}/anonymize/batch",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "texts": documents,
        "preset": "gdpr",
        "session_id": CORPUS_SESSION,
        "consistent": True
    }
)
# All three docs share the same mapping:
# "Maria Rossi" -> "[PERSON_1]" in docs 1 and 3
# "John Smith"  -> "[PERSON_2]" in doc 2
```