Why RAG Pipelines Break When You Naively Anonymize

Retrieval-Augmented Generation (RAG) depends on semantic similarity between a user's query and indexed document chunks. When a user asks "What did Maria Rossi order last month?", the retriever searches for chunks containing semantically relevant content — and the LLM synthesizes an answer from the retrieved context.

The problem arises at ingestion time. If you anonymize documents before embedding them, and your anonymization is not consistent, you break the semantic graph:

  • Document A: "Maria Rossi" → [PERSON_1]
  • Document B: "Maria Rossi" → [PERSON_3]
  • Document C: "M. Rossi" → [PERSON_7]

Now the three documents that should be retrieved together are indexed as referring to three different people. Retrieval recall collapses, and the LLM synthesizes incomplete answers from only a fraction of the relevant context.

The Problem: Inconsistent Anonymization Destroys Retrieval

Standard anonymization tools are designed for single-document, single-session use. They assign placeholder tokens sequentially ([PERSON_1], [PERSON_2], ...) without any persistence. Run the same document through the tool twice and you may get different token assignments.

For RAG pipelines, this means:

  • Cross-document retrieval fails — the same entity has different tokens across documents, so similarity searches miss related chunks.
  • Incremental indexing breaks — new documents added to the index use a fresh token sequence, creating collisions with existing tokens (a new [PERSON_1] may refer to a completely different person).
  • Query-time anonymization mismatches — if you anonymize the user's query before retrieval, the query tokens must match the index tokens exactly.
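A toy sequential anonymizer makes the failure mode concrete (the function and names here are illustrative, not a real tool): with no persistence, the token counter restarts for every document, so the same person receives different tokens depending on what else appears in each document.

```python
import itertools

def naive_anonymize(text: str, detected_names: list[str]) -> str:
    """Toy per-document anonymizer: the counter restarts on every call."""
    counter = itertools.count(1)
    for name in detected_names:
        if name in text:
            text = text.replace(name, f"[PERSON_{next(counter)}]")
    return text

doc_a = naive_anonymize("Maria Rossi ordered product A.", ["Maria Rossi"])
doc_b = naive_anonymize("John Smith emailed Maria Rossi.",
                        ["John Smith", "Maria Rossi"])
print(doc_a)  # [PERSON_1] ordered product A.
print(doc_b)  # [PERSON_1] emailed [PERSON_2].  (Maria Rossi is now [PERSON_2])
```

Maria Rossi is [PERSON_1] in one document and [PERSON_2] in the other, so chunks about the same person no longer co-retrieve.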

The Solution: Consistent Pseudonymization

Consistent pseudonymization uses a deterministic mapping: the same input entity always produces the same output token. The mapping is stored in a persistent table (a database or key-value store) that is reused across all documents and sessions.

There are two implementation approaches:

Approach 1: Hash-Based Tokens

Compute a keyed hash of the entity value. The hash is always the same for the same input, so two occurrences of "Maria Rossi" produce identical tokens. A secret key (HMAC) prevents anyone without the key from reverse-engineering tokens by hashing candidate names.

import hmac, hashlib

# Load this key from a secrets manager in production; never hardcode it.
HMAC_KEY = b"your-secret-key-32-bytes-minimum"

def consistent_token(entity_type: str, entity_value: str) -> str:
    """Produce a stable, deterministic token for any entity value."""
    digest = hmac.new(
        HMAC_KEY,
        f"{entity_type}:{entity_value.strip().lower()}".encode(),
        hashlib.sha256
    ).hexdigest()[:8]  # 8 hex chars keep tokens short; lengthen for very large corpora
    return f"[{entity_type}_{digest}]"

# Examples (digest values are illustrative):
# consistent_token("PERSON", "Maria Rossi") -> "[PERSON_3a7f9c2b]"  (always the same)
# consistent_token("PERSON", "maria rossi") -> "[PERSON_3a7f9c2b]"  (normalization makes it identical)
# consistent_token("PERSON", "John Smith")  -> "[PERSON_b1e4a8d6]"  (different entity, different token)

Approach 2: Lookup Table with Sequential Assignment

Maintain a persistent dictionary mapping entity values to assigned tokens. On first encounter, assign the next available token for that type; on subsequent encounters, return the existing mapping.

import json, pathlib

class PersistentMapping:
    def __init__(self, path: str):
        self.path = pathlib.Path(path)
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}
        # Rebuild per-type counters from the stored mapping; starting them
        # at zero would reuse numbers already assigned in earlier sessions,
        # colliding with existing tokens.
        self.counters: dict[str, int] = {}
        for key in self.data:
            entity_type = key.split(":", 1)[0]
            self.counters[entity_type] = self.counters.get(entity_type, 0) + 1

    def get_token(self, entity_type: str, entity_value: str) -> str:
        key = f"{entity_type}:{entity_value.strip().lower()}"
        if key not in self.data:
            count = self.counters.get(entity_type, 0) + 1
            self.counters[entity_type] = count
            self.data[key] = f"[{entity_type}_{count}]"
            self.path.write_text(json.dumps(self.data))
        return self.data[key]

    def reverse(self, token: str) -> str | None:
        # Rebuilt on each call; fine for small tables, cache it for large ones.
        reverse_map = {v: k.split(":", 1)[1] for k, v in self.data.items()}
        return reverse_map.get(token)

mapping = PersistentMapping("entity_mapping.json")
print(mapping.get_token("PERSON", "Maria Rossi"))  # [PERSON_1]
print(mapping.get_token("PERSON", "Maria Rossi"))  # [PERSON_1] (same)
print(mapping.get_token("PERSON", "John Smith"))   # [PERSON_2]

Using the anonymize.solutions API for Consistent Mapping

The anonymize.solutions API supports consistent pseudonymization natively via the session_id parameter. Pass the same session_id across all documents in your corpus, and the API maintains a persistent mapping for that session:

import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.anonymize.solutions/v1"
CORPUS_SESSION = "rag-corpus-v1"  # Stable ID for your corpus

def anonymize_for_rag(text: str) -> dict:
    response = requests.post(
        f"{BASE_URL}/anonymize",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "text": text,
            "preset": "gdpr",
            "session_id": CORPUS_SESSION,   # Ensures consistent tokens
            "consistent": True              # Enable cross-session consistency
        }
    )
    return response.json()

result = anonymize_for_rag("Maria Rossi ordered from Berlin on Jan 15.")
# Returns: {"text": "[PERSON_1] ordered from [LOCATION_1] on [DATE_1].",
#           "session_id": "rag-corpus-v1", "mapping": {...}}

Entity Types That Matter in RAG

Not all entity types affect retrieval equally. Focus consistent pseudonymization on entities that appear as query terms or cross-document references:

  • Person names — very high impact (most common query term). Always use consistent pseudonymization.
  • Organizations — high impact (frequent in business RAG). Use consistent pseudonymization.
  • Locations / addresses — medium impact (depends on use case). Consistent for city/country; redact street addresses.
  • IDs (account, medical record) — high impact (exact-match queries). Always consistent; ID lookups require exact token matches.
  • Dates — low-to-medium impact (relative dates matter). Redact specific dates; preserve relative time references.
  • Email / phone — low impact (rarely queried). Redact completely; not needed for retrieval.
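The guidance above can be captured as a small per-type policy map that an ingestion pipeline consults. The type names and policy labels below are illustrative, not a fixed schema:

```python
# Illustrative per-type handling policy, mirroring the guidance above.
ENTITY_POLICY = {
    "PERSON": "consistent",           # always consistent pseudonymization
    "ORG": "consistent",
    "LOCATION_COARSE": "consistent",  # city / country
    "LOCATION_STREET": "redact",      # street addresses
    "ID": "consistent",               # exact-match lookups need stable tokens
    "DATE": "redact",
    "EMAIL": "redact",
    "PHONE": "redact",
}

def handling_for(entity_type: str) -> str:
    # Default unknown types to redaction: fail closed, not open.
    return ENTITY_POLICY.get(entity_type, "redact")
```

Defaulting to redaction for unrecognized types means a new entity type added upstream can never silently leak into the index.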

The Decrypt Workflow: Restoring Original Values After Retrieval

When the RAG pipeline returns an answer containing pseudonymized tokens, you need to restore the original values before showing the response to the user. This is the de-anonymization step:

def deanonymize_rag_response(
    anonymized_response: str,
    session_id: str
) -> str:
    response = requests.post(
        f"{BASE_URL}/deanonymize",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "text": anonymized_response,
            "session_id": session_id
        }
    )
    return response.json()["text"]

# LLM returned: "[PERSON_1] placed order #[ID_1] on [DATE_1]."
original = deanonymize_rag_response(
    "[PERSON_1] placed order #[ID_1] on [DATE_1].",
    CORPUS_SESSION
)
# Returns: "Maria Rossi placed order #ORD-78234 on January 15, 2026."

The de-anonymization step should only run for authorized users. Implement role-based access control: analysts can query the anonymized index but only authorized data owners can see de-anonymized responses.
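One way to enforce that access-control point is a small gate in front of the de-anonymization call. This sketch is deliberately role-model-agnostic and takes the de-anonymizer as a callable; the role name and helper are illustrative:

```python
def gated_deanonymize(text: str, user_roles: set[str], deanonymize) -> str:
    """Return restored text only for authorized roles; otherwise pass the
    pseudonymized text through unchanged (fail closed)."""
    if "data_owner" in user_roles:
        return deanonymize(text)
    return text

# Analysts see tokens; data owners see restored values.
restore = lambda t: t.replace("[PERSON_1]", "Maria Rossi")
print(gated_deanonymize("[PERSON_1] placed an order.", {"analyst"}, restore))
# [PERSON_1] placed an order.
print(gated_deanonymize("[PERSON_1] placed an order.", {"data_owner"}, restore))
# Maria Rossi placed an order.
```

In production the `deanonymize` argument would wrap the API call shown above, so the role check happens before any mapping data leaves the access-controlled store.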

Compliance: GDPR Recital 26 and EU AI Act Article 10

GDPR Recital 26 states that anonymized data — data that can no longer be attributed to an identified or identifiable person — falls outside the scope of the GDPR. Consistent pseudonymization does not meet that bar on its own: because re-identification remains possible for whoever holds the mapping table, pseudonymized data is still personal data under the GDPR. It is, however, explicitly recognized as a security measure under Article 32, and holding the mapping table separately under strict access control both reduces risk and lightens compliance obligations.

EU AI Act Article 10 requires that data used in high-risk AI systems be subject to appropriate data governance practices, including measures to detect and correct biases and to ensure relevance and representativeness. Storing RAG corpora in consistently pseudonymized form — with a separate, access-controlled mapping table — is a concrete technical measure that supports this data-governance requirement.

Performance: Sub-Millisecond Overhead, Batch Processing

Consistent pseudonymization adds minimal latency to your ingestion pipeline:

  • API latency: 5–25ms per document (network-bound, not compute-bound)
  • Batch mode: Send up to 100 documents per API call, reducing per-document overhead to under 1ms
  • Local caching: After the first call, cache the session mapping locally. Subsequent documents only hit the API for new entity values.
  • Embedding overhead: Anonymized text has slightly different semantic properties than the original. Use the same embedding model throughout; do not mix embeddings of anonymized and non-anonymized text.

# Batch anonymization example
documents = [
    "Maria Rossi ordered product A on January 15.",
    "John Smith requested a refund for order ORD-78234.",
    "Maria Rossi followed up about her January order.",
]

batch_response = requests.post(
    f"{BASE_URL}/anonymize/batch",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "texts": documents,
        "preset": "gdpr",
        "session_id": CORPUS_SESSION,
        "consistent": True
    }
)
# All three docs share the same mapping:
# "Maria Rossi" -> "[PERSON_1]" in docs 1 and 3
# "John Smith"  -> "[PERSON_2]" in doc 2

Related Articles

  • MCP Server PII Protection Guide — 6 MCP operators for anonymizing AI workflows in Claude Desktop, Cursor, and VS Code.
  • PII in LLM Prompts — what happens when PII enters a large language model, and the technical controls that prevent it.
  • Anonymization vs Pseudonymization — the legal and technical difference, and when each is the right choice under GDPR.

Build Privacy-Preserving RAG Pipelines

Consistent pseudonymization for RAG, document ingestion, and AI workflows. Sub-millisecond overhead, batch mode, GDPR-compliant by design.