95%+ Structured Data Accuracy
317 Deterministic Regex Recognizers
260+ Entity Types Detected
48 Supported Languages

Deterministic & Reproducible

No LLMs in the detection pipeline. Pattern matching and NLP models produce identical results on identical inputs, every time. The same document always yields the same detections.

Pattern Matching First

317 deterministic regex recognizers handle structured data: credit cards (Luhn validation), IBANs (MOD-97 checksum), tax IDs, passport numbers, and more. For formats that carry a checksum, validating it after the pattern match yields very high accuracy (95%+).
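As an illustration of the checksum step, here is a minimal sketch of the standard Luhn check applied to a candidate card number after a regex match. The function name `luhn_valid` is hypothetical and this is not the product's actual recognizer code.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep digits only so spaces and dashes in the match are ignored.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))  # widely used test card number
    print(luhn_valid("4111 1111 1111 1112"))  # last digit altered
```

A digit sequence that matches the card pattern but fails this check can be dropped before it is reported.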

NLP for Context-Dependent Data

Names, organizations, and locations require context understanding. Three NLP (Natural Language Processing) engines — spaCy (open-source NLP library), Stanza (Stanford NLP toolkit), and XLM-RoBERTa (cross-lingual transformer model) — provide named entity recognition across 48 languages. Accuracy varies by language and entity type — English detection is the most mature.

Confidence Scoring

Every detection carries a confidence score from 0.0 to 1.0. High confidence (0.85–1.0) indicates a strong format match with supporting context. Medium confidence (0.5–0.85) flags likely correct matches for review. Low confidence (0.3–0.5) marks generic pattern matches that need manual review.

Detection Quality by Data Type

Detection accuracy depends on the type of data. Structured formats with checksum validation achieve very high accuracy. Context-dependent entities like names vary by language.

| Data Category | Detection Method | Accuracy Characteristic | Examples |
| --- | --- | --- | --- |
| Structured with Checksum | Regex + Checksum Validation | Very high (95%+) | Credit cards (Luhn), IBANs (MOD-97), SSNs, Tax IDs |
| Structured without Checksum | Regex + Context Words | High | Phone numbers, postal codes, IP addresses, dates |
| Names & Organizations | NLP (spaCy/Stanza/XLM-RoBERTa) | Variable by language | Person names, company names, locations |
| Mixed Documents | Hybrid (Regex + NLP) | Highest overall coverage | Contracts, emails, medical records |

Note: Structured data with checksum validation (credit cards, IBANs, tax IDs) achieves the highest accuracy because format and checksum rules are deterministic. NLP-detected entities (names, locations) have accuracy that varies by language and context — English detection is the most mature.
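
For formats without a checksum, the "Regex + Context Words" method can be sketched roughly as follows: a pattern match gets a base score, and the score is raised when a supporting word appears shortly before the match. The pattern, the context word list, the scores, and the 30-character window are all illustrative assumptions, not the product's actual values.

```python
import re

# Hypothetical, deliberately loose phone pattern for illustration only.
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")
# Hypothetical context words that make a phone number more plausible.
CONTEXT_WORDS = {"phone", "tel", "call", "mobile"}


def detect_phone(text, base_score=0.4, boost=0.35, window=30):
    """Return (match, score) pairs; the score is boosted by nearby context words."""
    results = []
    for match in PHONE_PATTERN.finditer(text):
        # Look only at the text shortly before the match.
        before = text[max(0, match.start() - window):match.start()].lower()
        words = set(re.findall(r"[a-z]+", before))
        score = base_score + (boost if words & CONTEXT_WORDS else 0.0)
        results.append((match.group(), round(min(score, 1.0), 2)))
    return results


if __name__ == "__main__":
    print(detect_phone("Call me at +49 30 1234567"))
    print(detect_phone("Invoice total 1234567890"))
```

The same digits score higher after "Call me at" than after "Invoice total", which is the effect context words are meant to have.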

Three NLP Engines, Automatic Selection

Each NLP engine covers a specific set of languages. The system automatically selects the right engine based on detected language. No overlap — each language belongs to exactly one engine.

spaCy (Primary)

Open-source NLP library. Primary engine for the majority of supported languages. Trained models for named entity recognition with high consistency.

  • 25 languages
  • 95%+ accuracy
  • Best for: names, organizations, locations
  • Fastest inference speed

Stanza (Secondary)

Stanford NLP toolkit. Secondary engine for languages where spaCy models are unavailable. Research-grade accuracy.

  • 7 languages
  • 90%+ accuracy
  • Best for: languages not covered by spaCy
  • Stanford research models

XLM-RoBERTa (Fallback)

Cross-lingual transformer model by Meta. Fallback engine for remaining languages. Cross-lingual transfer learning enables detection in languages without dedicated training data.

  • 16 languages
  • 85%+ accuracy
  • Best for: low-resource languages, RTL scripts
  • Cross-lingual transfer

All accuracy percentages are per-engine characteristics documented in the official product documentation. Regex-based detection for structured data (emails, SSNs, credit cards, IBANs) operates independently of NLP engines and produces 100% reproducible results.
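
The one-engine-per-language routing described in this section could look roughly like the sketch below. The language assignments are placeholders chosen for illustration, since this page does not list which languages belong to which engine. The function name `select_engine` is also hypothetical. The English fallback is assumed to apply whenever the detected language is not in the table.

```python
DEFAULT_LANGUAGE = "en"  # fallback language (assumption about when it applies)

# Placeholder assignments for illustration only; the real language lists
# are not given on this page. A dict maps each language to exactly one
# engine, so overlap between engines is impossible by construction.
ENGINE_BY_LANGUAGE = {
    "en": "spacy",
    "de": "spacy",
    "fr": "spacy",
    "hu": "stanza",
    "ar": "xlm-roberta",
}


def select_engine(detected_language: str) -> tuple[str, str]:
    """Return (language, engine), falling back to English if unsupported."""
    language = detected_language.lower()
    if language not in ENGINE_BY_LANGUAGE:
        language = DEFAULT_LANGUAGE
    return language, ENGINE_BY_LANGUAGE[language]


if __name__ == "__main__":
    for code in ("de", "AR", "xx"):
        print(code, "->", select_engine(code))
```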

How We Compare

Feature comparison against common PII detection solutions, based on publicly available documentation.

| Capability | anonymize.solutions | Presidio DIY | Google DLP | Strac | Nightfall |
| --- | --- | --- | --- | --- | --- |
| Entity Types | 260+ | ~50 default | ~120 | ~50 | ~100 |
| Multi-Language Support | 48 languages | Depends on setup | ~50 languages | Limited | ~25 languages |
| Detection Approach | Regex + NLP (deterministic) | Regex + NLP | ML-based | ML-based | ML-based |
| False Positive Handling | Checksum + confidence scoring | Manual tuning | Likelihood levels | Confidence thresholds | Confidence scores |
| Deterministic Output | Yes | Yes | No | No | No |
| EU Data Residency | 100% EU | Self-hosted | US default | US only | US only |
| Managed Service | Yes (SaaS + Private) | No (DIY) | Yes | Yes | Yes |

Note: Competitor information is based on publicly available documentation as of Q1 2026. Entity counts and language support are approximate and subject to change. See individual comparison pages for detailed analysis.

Confidence Scoring & Context-Aware Validation

False positives erode trust. A system that flags “IBAN: DE89370400440532013000” but also flags “December 2025” as a credit card is worse than no system at all. Multi-layered validation reduces noise.

Checksum Validation

IBANs are validated with MOD-97, credit cards with the Luhn algorithm, and SSNs against format rules and known invalid ranges. A sequence of digits that looks like a credit card but fails Luhn is rejected before it reaches the output.
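
A minimal sketch of the MOD-97 check for IBANs, following the standard procedure: move the first four characters to the end, map letters to numbers (A=10 through Z=35), and test that the resulting integer leaves remainder 1 when divided by 97. The function name `iban_valid` is hypothetical, and the sketch checks only the overall length range, not country-specific lengths.

```python
import re


def iban_valid(iban: str) -> bool:
    """Return True if `iban` has a plausible shape and passes MOD-97."""
    compact = re.sub(r"\s+", "", iban).upper()
    # Two letters, two check digits, then 11 to 30 alphanumerics (15 to 34 total).
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", compact):
        return False
    # Move country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps digits to themselves and A..Z to 10..35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


if __name__ == "__main__":
    print(iban_valid("DE89370400440532013000"))  # example IBAN from this page
    print(iban_valid("DE89370400440532013001"))  # last digit altered
```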

Confidence Scoring (0.0–1.0)

Every detection carries a confidence score. High (0.85–1.0): strong format match with context. Medium (0.5–0.85): likely correct, review recommended. Low (0.3–0.5): generic pattern, manual review. Configurable thresholds let you tune the precision/recall tradeoff.
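
The bands and the configurable threshold can be sketched as below. The stated ranges share their boundary values (0.85 and 0.5), so this sketch assumes each boundary belongs to the higher band. The names `confidence_band` and `filter_detections`, the `below_low` label, and the detection dictionary shape are hypothetical.

```python
def confidence_band(score: float) -> str:
    """Map a 0.0-1.0 score to a band; boundaries go to the higher band (assumption)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= 0.85:
        return "high"
    if score >= 0.5:
        return "medium"
    if score >= 0.3:
        return "low"
    return "below_low"  # hypothetical label for scores under 0.3


def filter_detections(detections, threshold=0.5):
    """Keep detections at or above the threshold.

    A higher threshold favors precision; a lower one favors recall.
    """
    return [d for d in detections if d["score"] >= threshold]


if __name__ == "__main__":
    found = [
        {"entity": "IBAN", "score": 0.95},
        {"entity": "PHONE", "score": 0.4},
    ]
    print([confidence_band(d["score"]) for d in found])
    print(filter_detections(found, threshold=0.5))
```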

Context-Aware Disambiguation

The NLP engine considers surrounding words. “Jordan” after “Michael” is a person; “Jordan” after “flew to” is a location. Context eliminates ambiguity that pattern-only systems cannot resolve.

Language-Specific Handling

Each language has unique patterns that affect detection quality, such as German compound nouns, Arabic name patterns, and Japanese honorifics. The NLP engines (spaCy for 25 languages, Stanza for 7, XLM-RoBERTa for 16) handle these language-specific challenges. Automatic language detection selects the right engine, with English as the fallback.

Built on Microsoft Presidio

The detection pipeline is built on Microsoft Presidio, an MIT-licensed open-source PII detection framework. We extend Presidio with 317 custom regex recognizers covering 75+ countries, additional NLP engines beyond the default, and managed hosting with Zero-Knowledge architecture.

See accuracy in action

Run our detection engines against your own data and verify the numbers yourself. No commitment, no credit card.