Method Overview

Every PII detection system relies on pattern matching, NLP, or a hybrid of the two. Each approach has distinct strengths, limitations, and ideal use cases.

Regex / Pattern Matching

Deterministic rules that match known data formats. Very high precision on structured data, sub-millisecond processing, no machine learning required.

  • IBANs, credit cards, SSNs
  • Email addresses, phone numbers
  • IP addresses, passport numbers
  • Fast & deterministic

NLP / Named Entity Recognition

Machine learning models that use language context to identify entities with no fixed format. The engines used here are spaCy (open-source NLP library), Stanza (Stanford NLP toolkit), and XLM-RoBERTa (cross-lingual transformer model).

  • Person names, organizations
  • Locations, dates in context
  • Context-aware disambiguation
  • Handles unstructured text

Hybrid

Combines both engines plus context analysis and checksum validation. Highest overall accuracy with the lowest false positive rate.

  • Merges NLP + Pattern results
  • Conflict resolution via confidence scoring
  • Checksum validation eliminates false positives
  • Best for production systems

How Each Method Works

These approaches differ in more than accuracy: they differ in which kinds of PII they can detect at all and in how they handle ambiguity.

Regex Detection

Regex (regular expression) detection uses hand-crafted patterns to match known data formats character by character. It is entirely deterministic: the same input always produces the same output. Processing time is measured in microseconds, making it the fastest detection method available.

The limitation is structural: regex can only detect PII that follows a predictable format. It excels at IBANs, credit card numbers, Social Security numbers, email addresses, and phone numbers — but cannot detect a person's name, because names have no fixed pattern.

  • Fully deterministic: identical results every run
  • Sub-millisecond processing per entity
  • No model downloads or ML infrastructure required
  • Limited to known, structured formats

EMAIL DETECTION PATTERN
/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/

IBAN DETECTION PATTERN
/[A-Z]{2}\d{2}[\s]?[\dA-Z]{4}[\s]?(?:[\dA-Z]{4}[\s]?){2,7}[\dA-Z]{1,4}/

CREDIT CARD PATTERN (with Luhn check)
/\b(?:\d[\s-]*?){13,19}\b/
+ Luhn checksum validation post-match
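
As a rough illustration of how a pattern match and a post-match checksum fit together, the sketch below transcribes the email and credit card patterns above into Python and filters card candidates with a Luhn check. The function names (`luhn_valid`, `find_structured_pii`) and the entity labels are assumptions made for this example, not the product's API.

```python
import re

# Patterns transcribed from the text above (Python syntax, no /.../ delimiters).
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[\s-]*?){13,19}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_structured_pii(text: str) -> list[tuple[str, str]]:
    """Return (entity_type, matched_text) pairs (hypothetical labels)."""
    found = [("EMAIL", m.group()) for m in EMAIL_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        # Post-match checksum filter: digit runs that fail Luhn are dropped.
        if luhn_valid(m.group()):
            found.append(("CREDIT_CARD", m.group()))
    return found
```

Called on a string that contains an email address, a Luhn-valid card number, and a 16-digit reference number that fails the checksum, this returns only the first two. The checksum step is what keeps the deliberately broad card pattern from flagging arbitrary digit runs.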

NLP Detection

NLP (Natural Language Processing) detection uses trained statistical models to recognize entity boundaries in context. Instead of matching patterns, the model reads surrounding words to determine whether a token is a person, organization, location, or date.

For example, in “Dr. Smith called John yesterday,” the model identifies “Smith” and “John” as PERSON entities based on context — not because names follow a pattern. This makes NLP essential for detecting PII that has no fixed format.

  • Context-aware — understands surrounding text
  • Handles ambiguity (“Jordan” as person vs. country)
  • Multi-language: spaCy (25), Stanza (7), XLM-RoBERTa (16)
  • Requires model download and inference time

NLP ENGINE SELECTION
  • spaCy: PRIMARY, 25 languages, fastest inference
  • Stanza: SECONDARY, 7 languages not covered by spaCy
  • XLM-R: FALLBACK, 16 additional languages via transformer

Engine is auto-selected based on detected language. No overlap — each language belongs to exactly one NLP engine.
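
One way to picture the "exactly one engine per language" rule is a routing table that refuses to be built if a language is listed twice. The sketch below is illustrative only: the language codes and their engine assignments are placeholders (the actual 25/7/16 split is not listed here), and `build_routing_table` and `select_engine` are hypothetical names.

```python
# Placeholder subsets for illustration. These are NOT the real language
# assignments, which this document does not enumerate.
ENGINE_LANGUAGES = {
    "spacy": {"en", "de", "fr"},   # PRIMARY
    "stanza": {"ar", "hi"},        # SECONDARY
    "xlm-roberta": {"sw", "ur"},   # FALLBACK
}


def build_routing_table(engine_languages: dict[str, set[str]]) -> dict[str, str]:
    """Map each language code to its single engine; reject any overlap."""
    table: dict[str, str] = {}
    for engine, languages in engine_languages.items():
        for lang in sorted(languages):
            if lang in table:
                raise ValueError(
                    f"{lang!r} is assigned to both {table[lang]} and {engine}"
                )
            table[lang] = engine
    return table


ROUTING = build_routing_table(ENGINE_LANGUAGES)


def select_engine(language_code: str) -> str:
    """Return the engine name for an already-detected language code."""
    try:
        return ROUTING[language_code]
    except KeyError:
        raise ValueError(f"unsupported language: {language_code!r}") from None
```

Building the table once at startup turns the no-overlap guarantee into a checked invariant rather than a convention.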

Hybrid Detection

Hybrid detection runs both the Regex and NLP engines in a single pass, then merges their results. When both engines flag the same span of text, confidence scoring and conflict resolution select the more reliable detection.

Checksum validation on structured data (Luhn for credit cards, MOD-97 for IBANs) eliminates false positives that pattern matching alone might produce. Context from the NLP engine resolves ambiguous detections. The result is the highest overall accuracy with the lowest false positive rate.
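
To make the MOD-97 step concrete, here is a minimal sketch of the standard IBAN check (move the first four characters to the end, map letters to numbers with A=10 through Z=35, and require a remainder of 1 when divided by 97), applied to candidates found by the IBAN pattern shown earlier. The function names are assumptions for this example.

```python
import re

# IBAN pattern from the Regex Detection section, in Python syntax.
IBAN_RE = re.compile(
    r"[A-Z]{2}\d{2}[\s]?[\dA-Z]{4}[\s]?(?:[\dA-Z]{4}[\s]?){2,7}[\dA-Z]{1,4}"
)


def iban_mod97_valid(iban: str) -> bool:
    """Return True if `iban` passes the MOD-97 checksum."""
    compact = re.sub(r"\s+", "", iban).upper()
    # Country code, two check digits, then an 11 to 30 character account part.
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", compact):
        return False
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


def find_valid_ibans(text: str) -> list[str]:
    """Return pattern matches that also pass the checksum."""
    return [m.group() for m in IBAN_RE.finditer(text) if iban_mod97_valid(m.group())]
```

A string that merely looks like an IBAN (right length, right character classes) is discarded unless its check digits agree with the rest of the number, which is how the checksum removes pattern-only false positives.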

  • Runs both engines simultaneously
  • Confidence scoring merges overlapping detections
  • Checksum validation eliminates structured data false positives
  • Confidence scoring enables precision/recall tuning

HYBRID MERGE PIPELINE
1. Run Regex Engine (317 recognizers)
2. Run NLP Engine (auto-selected by language)
3. Merge results: resolve overlapping spans
4. Validate checksums (Luhn, MOD-97, format rules)
5. Apply confidence scoring & output final entities
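
The merge step can be sketched as follows. This is a simplified illustration, assuming a greedy "highest confidence wins" rule for overlapping spans; the actual conflict-resolution rule is not specified here, and the `Detection` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    start: int         # inclusive character offset
    end: int           # exclusive character offset
    entity_type: str
    score: float       # confidence, 0.0 to 1.0
    source: str        # "regex" or "nlp"


def overlaps(a: Detection, b: Detection) -> bool:
    """True if the two half-open spans share at least one character."""
    return a.start < b.end and b.start < a.end


def merge_detections(
    regex_hits: list[Detection], nlp_hits: list[Detection]
) -> list[Detection]:
    """Keep the highest-scoring detection among overlapping spans."""
    candidates = sorted(
        regex_hits + nlp_hits, key=lambda d: (-d.score, d.start, d.end)
    )
    kept: list[Detection] = []
    for cand in candidates:
        # Assumed rule: a candidate survives only if nothing already kept
        # (and therefore scored at least as high) overlaps it.
        if not any(overlaps(cand, k) for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda d: d.start)
```

For example, if the regex engine reports a checksum-validated card number at high confidence and the NLP engine reports a low-confidence DATE inside the same span, only the card detection survives, while non-overlapping NLP detections such as a PERSON elsewhere in the text pass through unchanged.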

Method Comparison

Each detection method has different strengths. Regex excels at structured data, NLP excels at context-dependent entities, and Hybrid combines both for the highest coverage.

Characteristic    | Regex Only                | NLP Only               | Hybrid
Structured Data   | Very high (95%+)          | Not applicable         | Very high (95%+)
Name Detection    | Very limited              | Strong (context-aware) | Strong (context-aware)
Speed             | Fastest                   | Medium                 | Medium
False Positives   | Low (checksum validation) | Medium                 | Lowest
Language Support  | Pattern-dependent         | 48 languages           | 48 languages
Setup Complexity  | Simple                    | Model required         | Managed
Context Awareness | None                      | High                   | Highest

Why NLP is essential for names

Regex cannot effectively detect names because names have no fixed format. A regex can match “Mr.” or “Dr.” prefixes, but cannot distinguish “Jordan” the person from “Jordan” the country, or identify “Smith” as a surname without surrounding context. This is precisely where NLP excels.
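
A short experiment shows the limitation. The two patterns below are assumptions written for this example (a title-prefix matcher and a naive capitalized-word matcher); neither is taken from the product's recognizers.

```python
import re

# Matches a surname only when it follows an honorific.
TITLE_NAME_RE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+([A-Z][a-z]+)")

# Naive alternative: treat any capitalized word as a possible name.
CAPITALIZED_RE = re.compile(r"\b[A-Z][a-z]+\b")


def names_by_title(text: str) -> list[str]:
    """Return words that directly follow Mr./Mrs./Ms./Dr."""
    return TITLE_NAME_RE.findall(text)


def capitalized_words(text: str) -> list[str]:
    """Return every capitalized word, whether or not it is a name."""
    return CAPITALIZED_RE.findall(text)


# The title pattern finds "Smith" but has no way to find "John".
by_title = names_by_title("Dr. Smith called John yesterday.")

# The capitalized-word pattern returns "Jordan" for both sentences and
# cannot say which one is a person; it also picks up "She".
as_person = capitalized_words("Jordan signed the contract.")
as_country = capitalized_words("She flew to Jordan last week.")
```

The strict pattern misses names without a title, and the loose pattern over-matches without telling a person from a country. Closing that gap requires the surrounding context that an NER model uses.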

Where Each Method Wins

Different document types require different detection strategies. The scenarios below illustrate why hybrid detection usually gives better coverage than either engine alone.

Medical Records

Medical documents contain a mix of patient names (NLP-dependent), medical record numbers (Regex-dependent), and diagnoses with contextual dates. Neither engine alone achieves full coverage.

Patient: [NAME] ← NLP
MRN: [MRN] ← Regex
Hybrid wins — catches both

Financial Documents

IBANs and credit card numbers are perfectly matched by Regex with checksum validation. But account holder names, beneficiary fields, and transaction descriptions require NLP to detect.

IBAN: [IBAN] ← Regex + MOD-97
Beneficiary: [NAME] ← NLP
Hybrid wins — validates + detects

Legal Contracts

Party names, physical addresses, dates of birth, and witness signatures are heavily context-dependent. Regex catches formatted tax IDs, but the majority of PII in contracts requires NLP.

Party: [ORG] ← NLP
Signatory: [NAME] ← NLP
NLP essential — context-heavy

Log Files

Server logs contain IP addresses, email addresses, and URLs that Regex handles perfectly. But user-agent strings and error messages may embed user names or file paths containing PII.

IP: [IP] ← Regex
User path: /home/[NAME]/ ← NLP
Hybrid catches both
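
The regex half of the log scenario can be sketched in a few lines. The email pattern is the one shown earlier; the IPv4 pattern and the `redact_log_line` helper are assumptions added for this example.

```python
import re

# Email pattern from the Regex Detection section.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Assumed IPv4 pattern: four dot-separated octets, each in the range 0-255.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b")


def redact_log_line(line: str) -> str:
    """Replace email addresses and IPv4 addresses with placeholders."""
    line = EMAIL_RE.sub("[EMAIL]", line)
    line = IPV4_RE.sub("[IP]", line)
    return line


raw = "203.0.113.7 - login failed for jane@example.com path=/home/jsmith/report.txt"
redacted = redact_log_line(raw)
# redacted == "[IP] - login failed for [EMAIL] path=/home/jsmith/report.txt"
```

The user name inside the file path survives the regex pass untouched. That residue is what the NLP layer is meant to catch in the hybrid setup.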

Three-Layer Detection Engine

Our production system uses a three-layer architecture that maximizes detection accuracy while minimizing false positives. Each layer has a specific role.

Layer 1: Regex Engine

317 regex recognizers covering structured data formats across all supported jurisdictions. Each pattern includes format validation, and many include checksum verification (Luhn, MOD-97, format rules).

  • 317 regex recognizers
  • Sub-millisecond per entity
  • Checksum validation built-in
  • Country-specific format rules

Layer 2: NLP Engine

Automatic engine selection based on detected language. spaCy serves as the primary engine (25 languages), Stanza as secondary (7 languages), and XLM-RoBERTa as the fallback (16 languages). No overlap between engines.

  • spaCy PRIMARY (25 languages)
  • Stanza SECONDARY (7 languages)
  • XLM-RoBERTa FALLBACK (16 languages)
  • Auto-selected by language detection

Layer 3: Confidence Scoring & Validation

The merge and validation layer. Resolves conflicts when both engines detect the same span, applies checksum validation to structured detections, and assigns confidence scores from 0.0 to 1.0.

  • Conflict resolution for overlapping spans
  • Checksum validation (Luhn, MOD-97)
  • Confidence scoring (0.0–1.0)
  • Language-specific error reduction
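
Because every detection carries a score between 0.0 and 1.0, precision/recall tuning can be as simple as choosing a cut-off. The sketch below uses made-up scores and a hypothetical `ScoredEntity` type purely to show the effect of moving the threshold.

```python
from typing import NamedTuple


class ScoredEntity(NamedTuple):
    text: str
    entity_type: str
    score: float  # confidence, 0.0 to 1.0


def filter_by_confidence(
    entities: list[ScoredEntity], threshold: float
) -> list[ScoredEntity]:
    """Keep entities whose score is at or above `threshold`."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [e for e in entities if e.score >= threshold]


# Made-up scores for illustration only.
entities = [
    ScoredEntity("4111 1111 1111 1111", "CREDIT_CARD", 0.99),
    ScoredEntity("Jordan", "PERSON", 0.55),
    ScoredEntity("May", "DATE", 0.30),
]

strict = filter_by_confidence(entities, 0.8)   # favors precision
lenient = filter_by_confidence(entities, 0.5)  # favors recall
```

A higher threshold keeps only the checksum-backed detection; a lower one also lets the ambiguous "Jordan" through for review.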

Deterministic, Not LLM-Based

Our detection pipeline uses NLP models and regex patterns, not LLMs. This means identical input always produces identical output. No hallucinations, no variability between runs, and no dependency on external AI APIs. Every detection is reproducible and auditable.

Production Impact

Our engines build on Microsoft Presidio, the open-source PII detection framework. Multi-layered validation (checksum verification, confidence scoring, context-aware recognition) significantly reduces false positives compared to the Presidio baseline. For a 10,000-document batch, that can mean thousands fewer false flags, which saves hours of manual review and increases trust in automated anonymization pipelines.
