Accuracy by Design
Our detection pipeline combines deterministic pattern matching with NLP models to achieve high accuracy across structured and unstructured data. Every detection includes a confidence score so you can verify results.
Deterministic & Reproducible
No LLMs in the detection pipeline. Pattern matching and NLP models produce identical results on identical inputs, every time. The same document always yields the same detections.
Pattern Matching First
317 deterministic regex recognizers handle structured data: credit cards (Luhn validation), IBANs (MOD-97 checksum), tax IDs, passport numbers, and more. For formats that carry a checksum, validation rejects look-alike digit sequences and keeps accuracy very high (95%+).
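As an illustration of how a checksum filters out look-alike digit sequences, here is a minimal sketch of the Luhn check in Python. The function name `luhn_valid` is hypothetical and this is the textbook algorithm, not the product's recognizer code.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep digits only, so "4111 1111 1111 1111" and "4111-1111-..." both work.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False

    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # widely used test card number
print(luhn_valid("4111 1111 1111 1112"))  # one digit changed
```

A candidate that matches the card-number pattern but fails this check can be dropped before any confidence score is assigned.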
NLP for Context-Dependent Data
Names, organizations, and locations can only be identified from context. Three NLP (Natural Language Processing) engines provide named entity recognition across 48 languages: spaCy (open-source NLP library), Stanza (Stanford NLP toolkit), and XLM-RoBERTa (cross-lingual transformer model). Accuracy varies by language and entity type, and English detection is the most mature.
Confidence Scoring
Every detection carries a confidence score from 0.0 to 1.0. High confidence (0.85–1.0) indicates a strong format match with supporting context. Medium confidence (0.5–0.85) flags likely correct matches for review. Low confidence (0.3–0.5) marks generic pattern matches that need manual review.
Detection Quality by Data Type
Detection accuracy depends on the type of data. Structured formats with checksum validation achieve very high accuracy. Context-dependent entities like names vary by language.
| Data Category | Detection Method | Accuracy Characteristic | Examples |
|---|---|---|---|
| Structured with Checksum | Regex + Checksum Validation | Very high (95%+) | Credit cards (Luhn), IBANs (MOD-97), SSNs (format and invalid-range rules), Tax IDs |
| Structured without Checksum | Regex + Context Words | High | Phone numbers, postal codes, IP addresses, dates |
| Names & Organizations | NLP (spaCy/Stanza/XLM-RoBERTa) | Variable by language | Person names, company names, locations |
| Mixed Documents | Hybrid (Regex + NLP) | Highest overall coverage | Contracts, emails, medical records |
Note: Structured data with checksum validation (credit cards, IBANs, tax IDs) achieves the highest accuracy because format and checksum rules are deterministic. NLP-detected entities (names, locations) have accuracy that varies by language and context — English detection is the most mature.
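For formats without a checksum, the "Regex + Context Words" row can be pictured as a pattern match whose score is raised when supporting words appear nearby. The sketch below is a simplified illustration: the pattern, the context word list, the scores, and the window size are all assumptions chosen for the example, not the product's values.

```python
import re

# Hypothetical, deliberately loose phone pattern for illustration only.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
# Hypothetical context words that make a phone number more plausible.
CONTEXT_WORDS = {"phone", "tel", "mobile", "call"}


def find_phones(text, base=0.4, boost=0.35, window=30):
    """Return (match, score) pairs; the score rises if a context word precedes the match."""
    results = []
    for match in PHONE_RE.finditer(text):
        # Look only at the characters just before the match.
        before = text[max(0, match.start() - window):match.start()].lower()
        words = set(re.findall(r"[a-z]+", before))
        score = base + boost if words & CONTEXT_WORDS else base
        results.append((match.group(), round(min(score, 1.0), 2)))
    return results


print(find_phones("Call me on +49 30 1234567 tomorrow"))
print(find_phones("Order 2024 10 1234567 shipped"))
```

The same digit pattern receives a higher score in the first sentence than in the second, which is what lets a threshold separate likely phone numbers from order or reference numbers.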
Three NLP Engines, Automatic Selection
Each NLP engine covers a specific set of languages. The system automatically selects the right engine based on detected language. No overlap — each language belongs to exactly one engine.
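The one-language-one-engine rule can be sketched as a lookup table that refuses overlapping assignments. The language codes assigned to each engine below are placeholders, not the product's actual mapping, and the function names are hypothetical. Treating an unknown language as English is an assumption about how the fallback works.

```python
# Placeholder assignments for illustration only; the real mapping covers 48 languages.
ENGINE_LANGUAGES = {
    "spacy": {"en", "de", "fr"},
    "stanza": {"hu", "fi"},
    "xlm-roberta": {"ar", "he"},
}


def build_language_index(engine_languages):
    """Invert engine -> languages into language -> engine, rejecting any overlap."""
    index = {}
    for engine, languages in engine_languages.items():
        for lang in languages:
            if lang in index:
                raise ValueError(f"{lang!r} assigned to both {index[lang]} and {engine}")
            index[lang] = engine
    return index


def select_engine(detected_lang, index, fallback_lang="en"):
    """Pick the engine for a detected language, assuming English handling as fallback."""
    return index.get(detected_lang, index[fallback_lang])


index = build_language_index(ENGINE_LANGUAGES)
print(select_engine("hu", index))   # routed to its single assigned engine
print(select_engine("xx", index))   # unknown code falls back to the English engine
```

Building the index once at startup makes an accidental double assignment fail loudly instead of silently routing a language to two engines.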
spaCy (Primary)
Open-source NLP library. Primary engine for the majority of supported languages. Trained models for named entity recognition with high consistency.
- 25 languages
- 95%+ accuracy
- Best for: names, organizations, locations
- Fastest inference speed
Stanza (Secondary)
Stanford NLP toolkit. Secondary engine for languages where spaCy models are unavailable. Research-grade accuracy.
- 7 languages
- 90%+ accuracy
- Best for: languages not covered by spaCy
- Stanford research models
XLM-RoBERTa (Fallback)
Cross-lingual transformer model by Meta. Fallback engine for remaining languages. Cross-lingual transfer learning enables detection in languages without dedicated training data.
- 16 languages
- 85%+ accuracy
- Best for: low-resource languages, RTL scripts
- Cross-lingual transfer
All accuracy percentages are per-engine characteristics documented in the official product documentation. Regex-based detection for structured data (emails, SSNs, credit cards, IBANs) operates independently of NLP engines and produces 100% reproducible results.
How We Compare
Feature comparison against common PII detection solutions, based on publicly available documentation.
| Capability | anonymize.solutions | Presidio DIY | Google DLP | Strac | Nightfall |
|---|---|---|---|---|---|
| Entity Types | 260+ | ~50 default | ~120 | ~50 | ~100 |
| Multi-Language Support | 48 languages | Depends on setup | ~50 languages | Limited | ~25 languages |
| Detection Approach | Regex + NLP (deterministic) | Regex + NLP | ML-based | ML-based | ML-based |
| False Positive Handling | Checksum + confidence scoring | Manual tuning | Likelihood levels | Confidence thresholds | Confidence scores |
| Deterministic Output | Yes | Yes | No | No | No |
| EU Data Residency | 100% EU | Self-hosted | US default | US only | US only |
| Managed Service | Yes (SaaS + Private) | No (DIY) | Yes | Yes | Yes |
Note: Competitor information is based on publicly available documentation as of Q1 2026. Entity counts and language support are approximate and subject to change. See individual comparison pages for detailed analysis.
Confidence Scoring & Context-Aware Validation
False positives erode trust. A system that flags “IBAN: DE89370400440532013000” but also flags “December 2025” as a credit card is worse than no system at all. Multi-layered validation reduces noise.
Checksum Validation
IBANs are validated with MOD-97, credit cards with the Luhn algorithm, and SSNs against format rules and known invalid ranges. A sequence of digits that looks like a credit card but fails Luhn is rejected before it reaches the output.
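The MOD-97 check for IBANs follows the standard procedure: move the first four characters to the end, replace letters with numbers (A=10 through Z=35), and test whether the result leaves remainder 1 when divided by 97. The sketch below shows that procedure only. The function name `iban_valid` is hypothetical, and country-specific length rules are left out.

```python
def iban_valid(iban: str) -> bool:
    """Return True if `iban` passes the MOD-97 checksum (length by country not checked)."""
    compact = iban.replace(" ", "").upper()
    # IBANs are ASCII letters and digits, between 15 and 34 characters long.
    if not (compact.isascii() and compact.isalnum() and 15 <= len(compact) <= 34):
        return False
    # Move country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(iban_valid("DE89 3704 0044 0532 0130 00"))  # the example IBAN quoted in this section
print(iban_valid("DE89 3704 0044 0532 0130 01"))  # last digit changed
```

Python's arbitrary-precision integers make the single `int(numeric) % 97` safe. In languages with fixed-width integers the remainder is usually computed in chunks.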
Confidence Scoring (0.0–1.0)
Every detection carries a confidence score. High (0.85–1.0): strong format match with context. Medium (0.5–0.85): likely correct, review recommended. Low (0.3–0.5): generic pattern, manual review. Configurable thresholds let you tune the precision/recall tradeoff.
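The bands and the configurable threshold can be sketched as follows. The source ranges share their endpoints (0.85 and 0.5), so this sketch assumes each band includes its lower bound. The label for scores under 0.3, the function names, and the detection dictionary shape are also assumptions.

```python
def confidence_band(score, high=0.85, medium=0.5, low=0.3):
    """Map a 0.0-1.0 score to a band; lower bounds are treated as inclusive (assumption)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    if score >= low:
        return "low"
    return "below_low"  # not described in the text; label chosen for this sketch


def filter_detections(detections, min_score=0.5):
    """Keep detections at or above the threshold; raising it trades recall for precision."""
    return [d for d in detections if d["score"] >= min_score]


detections = [
    {"type": "IBAN", "score": 0.95},
    {"type": "PHONE", "score": 0.60},
    {"type": "DATE", "score": 0.35},
]
for d in detections:
    print(d["type"], confidence_band(d["score"]))
print(filter_detections(detections, min_score=0.85))
```

Raising `min_score` keeps only checksum-backed matches in this example, while lowering it to 0.3 also surfaces the generic pattern matches for manual review.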
Context-Aware Disambiguation
The selected NLP engine considers surrounding words. “Jordan” after “Michael” is a person; “Jordan” after “flew to” is a location. Context resolves ambiguity that pattern-only systems cannot.
Language-Specific Handling
Each language has unique patterns that affect detection quality. Dedicated NLP models (spaCy for 25 languages, Stanza for 7, XLM-RoBERTa for 16) handle language-specific challenges such as German compound nouns, Arabic name patterns, and Japanese honorifics. Automatic language detection selects the right engine, with English as the fallback.
Built on Microsoft Presidio
The detection pipeline is built on Microsoft Presidio, an MIT-licensed open-source PII detection framework. We extend Presidio with 317 custom regex recognizers covering 75+ countries, additional NLP engines beyond the default, and managed hosting with Zero-Knowledge architecture.
See accuracy in action
Run our detection engines against your own data and verify the numbers yourself. No commitment, no credit card.