PII Detection Accuracy — Engines & Confidence Scoring

95%+

Structured Data Accuracy

317

Deterministic Regex Recognizers

320+

Entity Types Detected

48

Supported Languages

HOW WE DETECT

Deterministic & Reproducible

No LLMs in the detection pipeline. Pattern matching and NLP models produce identical results on identical inputs, every time. The same document always yields the same detections.

Pattern Matching First

317 deterministic regex recognizers handle structured data: credit cards (Luhn validation), IBANs (MOD-97 checksum), tax IDs, passport numbers, and more. Checksum validation ensures very high accuracy (95%+) for these formats.
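As an illustration of the regex-plus-checksum idea, here is a minimal standard-library sketch. The pattern `CARD_CANDIDATE` and the helper names are hypothetical and are not taken from the product's recognizers; only the Luhn algorithm itself is standard.

```python
import re

# Hypothetical candidate pattern: 13 to 19 digits, optionally separated
# by single spaces or hyphens. Real recognizers are likely stricter.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return regex candidates that also pass the Luhn check."""
    return [m.group() for m in CARD_CANDIDATE.finditer(text) if luhn_valid(m.group())]
```

A digit run that matches the pattern but fails the checksum is dropped, which is how the checksum step removes look-alike numbers such as order IDs.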

NLP for Context-Dependent Data

Names, organizations, and locations can only be identified from context. Three NLP (natural language processing) engines provide named entity recognition across 48 languages: spaCy (open-source NLP library), Stanza (Stanford NLP toolkit), and XLM-RoBERTa (cross-lingual transformer model). Accuracy varies by language and entity type; English detection is the most mature.

Confidence Scoring

Every detection carries a confidence score from 0.0 to 1.0. High confidence (0.85–1.0) indicates a strong format match with supporting context. Medium confidence (0.5–0.85) flags likely correct matches for review. Low confidence (0.3–0.5) marks generic pattern matches that need manual review.

ACCURACY PROFILES

Detection Quality by Data Type

Detection accuracy depends on the type of data. Structured formats with checksum validation achieve very high accuracy. Context-dependent entities like names vary by language.

Data Category	Detection Method	Accuracy Characteristic	Examples
Structured with Checksum	Regex + Checksum Validation	Very high (95%+)	Credit cards (Luhn), IBANs (MOD-97), SSNs (format and invalid-range rules), Tax IDs
Structured without Checksum	Regex + Context Words	High	Phone numbers, postal codes, IP addresses, dates
Names & Organizations	NLP (spaCy/Stanza/XLM-RoBERTa)	Variable by language	Person names, company names, locations
Mixed Documents	Hybrid (Regex + NLP)	Highest overall coverage	Contracts, emails, medical records

Note: Structured data with checksum validation (credit cards, IBANs, tax IDs) achieves the highest accuracy because format and checksum rules are deterministic. NLP-detected entities (names, locations) have accuracy that varies by language and context — English detection is the most mature.
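For formats without a checksum, the table lists "Regex + Context Words" as the method. One way to picture this is a base score that is raised when a supporting word appears shortly before the match. The sketch below is a standard-library illustration only: the pattern, word list, window size, and score values are all hypothetical and do not reflect the product's actual recognizers.

```python
import re

# All values below are illustrative assumptions, not product settings.
PHONE = re.compile(r"\+?\d[\d ()-]{7,}\d")
CONTEXT_WORDS = {"phone", "tel", "call", "mobile"}
BASE_SCORE = 0.4      # score for a bare pattern match
CONTEXT_BOOST = 0.35  # added when a context word is nearby
WINDOW = 30           # characters inspected before the match


def score_phone_candidates(text: str) -> list[tuple[str, float]]:
    """Return (match, score) pairs, boosting matches preceded by a context word."""
    results = []
    for m in PHONE.finditer(text):
        start = max(0, m.start() - WINDOW)
        nearby = set(re.findall(r"[a-z]+", text[start:m.start()].lower()))
        boost = CONTEXT_BOOST if CONTEXT_WORDS & nearby else 0.0
        results.append((m.group(), round(BASE_SCORE + boost, 2)))
    return results
```

With these assumed values, a number preceded by "Call me on" lands in the medium band, while the same shape of digits after "Invoice total" stays in the low band.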

ENGINE PROFILES

Three NLP Engines, Automatic Selection

Each NLP engine covers a specific set of languages. The system automatically selects the right engine based on the detected language. Engines do not overlap: each language belongs to exactly one engine.
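A routing table is one simple way to express "each language belongs to exactly one engine". The sketch below is an assumption about how such routing could look, not the product's implementation. The language codes assigned to each engine are placeholders, since the actual per-engine language lists are not given here. The English fallback follows the behavior described under Language-Specific Handling.

```python
# Placeholder assignments for illustration only; the real per-engine
# language lists are not stated in this document.
ENGINE_LANGUAGES = {
    "spacy": {"en", "de", "fr"},
    "stanza": {"hu", "bg"},
    "xlm-roberta": {"ar", "he"},
}


def build_routing_table(engine_languages: dict[str, set[str]]) -> dict[str, str]:
    """Map each language code to one engine, rejecting any overlap."""
    routing: dict[str, str] = {}
    for engine, languages in engine_languages.items():
        for lang in sorted(languages):
            if lang in routing:
                raise ValueError(f"{lang} assigned to both {routing[lang]} and {engine}")
            routing[lang] = engine
    return routing


ROUTING = build_routing_table(ENGINE_LANGUAGES)


def select_engine(lang_code: str, fallback_lang: str = "en") -> str:
    """Return the engine for a detected language, falling back to English's engine."""
    return ROUTING.get(lang_code, ROUTING[fallback_lang])
```

Building the table with an explicit overlap check means a configuration mistake fails at startup rather than silently routing one language to two engines.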

spaCy (Primary)

Open-source NLP library. Primary engine for the majority of supported languages. Trained models for named entity recognition with high consistency.

25 languages
95%+ accuracy
Best for: names, organizations, locations
Fastest inference speed

Stanza (Secondary)

Stanford NLP toolkit. Secondary engine for languages where spaCy models are unavailable. Research-grade accuracy.

7 languages
90%+ accuracy
Best for: languages not covered by spaCy
Stanford research models

XLM-RoBERTa (Fallback)

Cross-lingual transformer model by Meta. Fallback engine for remaining languages. Cross-lingual transfer learning enables detection in languages without dedicated training data.

16 languages
85%+ accuracy
Best for: low-resource languages, RTL scripts
Cross-lingual transfer

All accuracy percentages are per-engine figures taken from the official product documentation. Regex-based detection for structured data (emails, SSNs, credit cards, IBANs) operates independently of the NLP engines and produces 100% reproducible results.

COMPARISON

How We Compare

Feature comparison against common PII detection solutions, based on publicly available documentation.

Capability	anonymize.solutions	Presidio DIY	Google DLP	Strac	Nightfall
Entity Types	320+	~50 default	~120	~50	~100
Multi-Language Support	48 languages	Depends on setup	~50 languages	Limited	~25 languages
Detection Approach	Regex + NLP (deterministic)	Regex + NLP	ML-based	ML-based	ML-based
False Positive Handling	Checksum + confidence scoring	Manual tuning	Likelihood levels	Confidence thresholds	Confidence scores
Deterministic Output	Yes	Yes	No	No	No
EU Data Residency	100% EU	Self-hosted	US default	US only	US only
Managed Service	Yes (SaaS + Private)	No (DIY)	Yes	Yes	Yes

Note: Competitor information is based on publicly available documentation as of Q1 2026. Entity counts and language support are approximate and subject to change. See individual comparison pages for detailed analysis.

QUALITY ASSURANCE

Confidence Scoring & Context-Aware Validation

False positives erode trust. A system that flags “IBAN: DE89370400440532013000” but also flags “December 2025” as a credit card is worse than no system at all. Multi-layered validation reduces noise.

Checksum Validation

IBANs are validated with MOD-97, credit cards with the Luhn algorithm, and SSNs against format rules and known invalid ranges. A sequence of digits that looks like a credit card but fails Luhn is rejected before it reaches the output.
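The MOD-97 check for IBANs can be sketched in a few lines of standard-library Python. This is an illustration of the standard algorithm, not the product's recognizer: the function name is hypothetical, and the sketch does not check country-specific lengths.

```python
def iban_valid(iban: str) -> bool:
    """Return True if `iban` passes the MOD-97 check (no per-country length check)."""
    compact = iban.replace(" ", "").upper()
    # Length bounds: 15 is the shortest national format, 34 the standard maximum.
    if not (15 <= len(compact) <= 34):
        return False
    if not (compact.isascii() and compact.isalnum()):
        return False
    # Two-letter country code followed by two check digits.
    if not (compact[:2].isalpha() and compact[2:4].isdigit()):
        return False
    # Move the first four characters to the end, then map letters to
    # numbers (A=10 ... Z=35); int(ch, 36) does this for digits and letters.
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

The example IBAN quoted above, DE89370400440532013000, passes this check, while changing its last digit makes it fail.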

Confidence Scoring (0.0–1.0)

Every detection carries a confidence score. High (0.85–1.0): strong format match with context. Medium (0.5–0.85): likely correct, review recommended. Low (0.3–0.5): generic pattern, manual review. Configurable thresholds let you tune the precision/recall tradeoff.
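The bands and the configurable threshold can be pictured with a short sketch. The `Detection` type and function names are hypothetical. Because the stated ranges share their boundary values, the sketch assumes each boundary belongs to the higher band.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection record used only for this illustration."""
    entity_type: str
    text: str
    score: float


def band(score: float) -> str:
    """Map a score to a band; boundary values are assigned to the higher band (assumption)."""
    if score >= 0.85:
        return "high"
    if score >= 0.5:
        return "medium"
    if score >= 0.3:
        return "low"
    return "below_low"


def filter_by_threshold(detections: list[Detection], threshold: float) -> list[Detection]:
    """Keep detections at or above the threshold."""
    return [d for d in detections if d.score >= threshold]
```

Raising the threshold favors precision by keeping fewer, more certain detections. Lowering it favors recall by keeping more candidates for review.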

Context-Aware Disambiguation

The NLP engine considers surrounding words. “Jordan” after “Michael” is a person; “Jordan” after “flew to” is a location. Context eliminates ambiguity that pattern-only systems cannot resolve.

Language-Specific Handling

Each language has unique patterns that affect detection quality, such as German compound nouns, Arabic name patterns, and Japanese honorifics. The NLP models (spaCy for 25 languages, Stanza for 7, XLM-RoBERTa for 16) handle these language-specific challenges. Automatic language detection selects the right engine, with English as the fallback.

Built on Microsoft Presidio

The detection pipeline is built on Microsoft Presidio, an MIT-licensed open-source PII detection framework. We extend Presidio with 317 custom regex recognizers covering 75+ countries, additional NLP engines beyond the default, and managed hosting with Zero-Knowledge architecture.

See accuracy in action

Run our detection engines against your own data and verify the numbers yourself. No commitment, no credit card.
