Supported Languages — 48 Languages for PII Detection

Languages: 48 with full PII detection
spaCy: 25 languages (PRIMARY)
Stanza: 7 languages (SECONDARY)
XLM-RoBERTa: 16 languages (FALLBACK)
RTL Scripts: Arabic, Hebrew, Persian, Urdu

PRIMARY Engine

25 Languages with spaCy NLP

Highest accuracy Named Entity Recognition using language-specific statistical models. spaCy is the primary engine for major European and Asian languages.

Catalan (ca)
Chinese (zh)
Croatian (hr)
Danish (da)
Dutch (nl)
English (en)
Finnish (fi)
French (fr)
German (de)
Greek (el)
Italian (it)
Japanese (ja)
Korean (ko)
Norwegian (no)
Polish (pl)
Portuguese (pt)
Russian (ru)
Spanish (es)
Swedish (sv)
Bulgarian (bg)
Hungarian (hu)
Lithuanian (lt)
Romanian (ro)
Slovenian (sl)
Ukrainian (uk)

About spaCy NLP

spaCy is an open-source library for industrial-strength Natural Language Processing developed by Explosion AI. Its language-specific statistical models recognize person names, organizations, locations, and other named entities, and they give the highest accuracy of the three engines, which makes spaCy the primary choice for major languages.

SECONDARY Engine

7 Languages with Stanza Neural NLP

Specialized neural network models from the Stanford NLP Group for RTL scripts and South and Southeast Asian languages. Stanza excels at complex grammatical structures.

Arabic (ar, RTL)
Hebrew (he, RTL)
Hindi (hi)
Indonesian (id)
Persian (Farsi) (fa, RTL)
Thai (th)
Vietnamese (vi)

About Stanza NLP

Stanza is a Python NLP library developed by the Stanford NLP Group. It uses bidirectional LSTM neural networks with character-level embeddings, providing strong accuracy for languages with complex morphology, RTL scripts, and non-Latin writing systems.

FALLBACK Engine

16 Languages with XLM-RoBERTa Transformer

Cross-lingual transformer model for low-resource languages. XLM-RoBERTa enables Named Entity Recognition through transfer learning across the 100 languages it was pre-trained on.

Afrikaans (af)
Albanian (sq)
Amharic (am)
Basque (eu)
Bengali (bn)
Czech (cs)
Estonian (et)
Latvian (lv)
Malay (ms)
Serbian (sr)
Slovak (sk)
Swahili (sw)
Tagalog (tl)
Turkish (tr)
Urdu (ur, RTL)
Welsh (cy)

About XLM-RoBERTa

XLM-RoBERTa is a cross-lingual transformer model developed by Facebook AI, pre-trained on 2.5TB of text in 100 languages. It enables Named Entity Recognition for low-resource languages through transfer learning, providing broad coverage where dedicated models don't exist.
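The three engine tiers can be pictured as a simple lookup from language code to engine. The sketch below is illustrative only: the set names and the `select_engine` function are hypothetical, and the real product may route requests differently. The language codes are taken from the lists on this page, and the regex-only branch follows the fallback strategy described for cases where no NLP model is available.

```python
# Language codes copied from the three engine lists on this page.
SPACY_LANGS = {
    "ca", "zh", "hr", "da", "nl", "en", "fi", "fr", "de", "el", "it", "ja", "ko",
    "no", "pl", "pt", "ru", "es", "sv", "bg", "hu", "lt", "ro", "sl", "uk",
}
STANZA_LANGS = {"ar", "he", "hi", "id", "fa", "th", "vi"}
XLMR_LANGS = {
    "af", "sq", "am", "eu", "bn", "cs", "et", "lv",
    "ms", "sr", "sk", "sw", "tl", "tr", "ur", "cy",
}


def select_engine(lang_code: str) -> str:
    """Return the engine tier for a language code (hypothetical helper)."""
    code = lang_code.strip().lower()
    if code in SPACY_LANGS:
        return "spacy"        # PRIMARY
    if code in STANZA_LANGS:
        return "stanza"       # SECONDARY
    if code in XLMR_LANGS:
        return "xlm-roberta"  # FALLBACK
    return "regex-only"       # no NLP model for this language


if __name__ == "__main__":
    for code in ("de", "fa", "sw", "xx"):
        print(code, "->", select_engine(code))
```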

Features

Multilingual Detection Features

Automatic Language Detection

No configuration needed. The system automatically detects the language of your text and selects the appropriate NLP model.

Per-Document Detection — Each document analyzed independently
Mixed Language Support — Documents with multiple languages handled correctly
Confidence Threshold — Language detection with confidence scoring
Fallback Strategy — Uses regex patterns when NLP model unavailable
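As a rough illustration of how a confidence threshold and a regex fallback could fit together, here is a minimal sketch. Everything in it is an assumption: the stopword-based `detect_language` is a toy stand-in for a real language identifier, and the 0.6 threshold and the `choose_pipeline` name are invented for the example.

```python
from typing import Tuple

# Toy stopword lists; a real system would use a trained language identifier.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "die", "und", "ist", "das"},
    "es": {"el", "la", "y", "es", "de"},
}


def detect_language(text: str) -> Tuple[str, float]:
    """Return (language code, confidence) based on stopword overlap."""
    tokens = text.lower().split()
    hits = {lang: sum(tok in words for tok in tokens)
            for lang, words in STOPWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        return ("unknown", 0.0)
    best = max(hits, key=hits.get)
    return (best, hits[best] / total)


def choose_pipeline(text: str, threshold: float = 0.6) -> str:
    """Pick an NLP pipeline per document, or fall back to regex patterns."""
    lang, confidence = detect_language(text)
    if confidence >= threshold:
        return f"nlp:{lang}"
    return "regex-only"  # low confidence: rely on pattern matching alone
```

Each document is passed through `choose_pipeline` independently, which mirrors the per-document behaviour described above.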

Right-to-Left (RTL) Support

Full support for RTL scripts with proper text direction handling during detection and anonymization.

Arabic: Full RTL support
Hebrew: Full RTL support
Persian (Farsi): Full RTL support
Urdu: Full RTL support
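One way to think about RTL handling is that text is stored in logical order, so entity offsets can be replaced by index regardless of display direction. The sketch below assumes that model; `is_rtl` and `redact` are hypothetical helpers, not the product's API. It uses only the standard library's `unicodedata.bidirectional`, which returns the Unicode bidirectional class of a character.

```python
import unicodedata
from typing import List, Tuple

RTL_CLASSES = {"R", "AL"}  # Hebrew-style and Arabic-style strong RTL classes


def is_rtl(text: str) -> bool:
    """True if the first strongly directional character is right-to-left."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            return True
        if bidi == "L":
            return False
    return False


def redact(text: str, spans: List[Tuple[int, int]], placeholder: str = "[NAME]") -> str:
    """Replace (start, end) spans given in logical (storage) order.

    Spans are applied from the end of the string backwards so that earlier
    offsets stay valid after each replacement.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


if __name__ == "__main__":
    sentence = "اسمي أحمد"  # Arabic: "My name is Ahmed"
    name = "أحمد"
    start = sentence.index(name)
    print(is_rtl(sentence), redact(sentence, [(start, start + len(name))]))
```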

Detection

What Gets Detected in Each Language

All 48 languages support the same core PII entity types through a combination of NLP models and regex patterns.

Person Names

First names, last names, full names, nicknames, and titles detected via NLP Named Entity Recognition.

Locations

Addresses, cities, countries, postal codes, and geographic locations identified by NLP models.

Contact Info

Email addresses, phone numbers, and URLs detected via regex patterns across all languages.
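Because these patterns do not depend on grammar, the same regular expressions can run on text in any language. The following is a simplified sketch, not the product's actual pattern set: the three expressions and the `find_contact_info` helper are assumptions for illustration, and the phone pattern only covers international numbers written with a leading `+`.

```python
import re
from typing import List, Tuple

# Deliberately simple patterns; production patterns are usually stricter.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+\d{1,3}[\s-]?\d(?:[\s-]?\d){6,11}"),
    "URL": re.compile(r"https?://[^\s<>]+"),
}


def find_contact_info(text: str) -> List[Tuple[str, str, int, int]]:
    """Return (label, value, start, end) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start(), match.end()))
    return sorted(found, key=lambda item: item[2])


if __name__ == "__main__":
    sample = "Kontakt: anna@example.com, Tel. +49 30 1234567, Web: https://example.com/profil"
    for hit in find_contact_info(sample):
        print(hit)
```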

Financial Data

Credit card numbers, IBANs, bank accounts, and financial identifiers via format-specific patterns.
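Format patterns for financial identifiers are commonly paired with a checksum so that random digit strings are not flagged. This page does not say whether the product does this, so treat the sketch below as an assumption. It shows the standard Luhn check for card numbers and the ISO 13616 mod-97 check for IBANs; the function names are hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 12:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """IBAN mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and test remainder == 1."""
    compact = iban.replace(" ", "").upper()
    if not (15 <= len(compact) <= 34):
        return False
    if not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))
    print(iban_valid("GB82 WEST 1234 5698 7654 32"))
```

Both checks only confirm that a candidate is well-formed; they do not confirm that the card or account exists.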

Dates

Dates of birth, general dates, and other temporal expressions in the formats specific to each locale.

Government IDs

SSNs, passport numbers, driver licenses, and national IDs via country-specific regex patterns.
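Country-specific patterns can be organised as a registry keyed by country code. The sketch below is a simplified assumption about how such a registry might look, with only two entries: a US Social Security number pattern that excludes never-issued ranges, and the Spanish DNI format of eight digits plus a control letter. The registry layout and the `find_government_ids` helper are hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical registry: country code -> (label, compiled pattern).
ID_PATTERNS = {
    # Area 000, 666 and 900-999, group 00, and serial 0000 are never issued.
    "US": ("US_SSN", re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")),
    # Eight digits followed by a control letter (I, O, and U are not used).
    "ES": ("ES_DNI", re.compile(r"\b\d{8}[A-HJ-NP-TV-Z]\b")),
}


def find_government_ids(text: str, country: str) -> List[Tuple[str, str]]:
    """Return (label, value) pairs for the given country's ID pattern."""
    entry = ID_PATTERNS.get(country.upper())
    if entry is None:
        return []  # no pattern registered for this country
    label, pattern = entry
    return [(label, match.group()) for match in pattern.finditer(text)]
```

For example, `find_government_ids("DNI 12345678Z", "ES")` matches the sample number, while a string such as `000-12-3456` is rejected by the US pattern.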

AIR-GAPPED DESKTOP APP

The Air-Gapped Desktop App includes local NLP models for 15+ languages with offline processing. No internet connection required.


Detection Limitations

Language Detection Trade-offs to Consider

Our 48-language coverage is the broadest in the market. However, detection quality is not uniform across languages, and the differences matter for production decisions.

Regex vs. NLP Coverage

All 48 languages support regex-based detection (phone numbers, IDs, emails). Contextual NLP recognition of names, organisations, and locations is most accurate with the dedicated spaCy or Stanza models. Languages covered only by XLM-RoBERTa still get contextual detection, but with lower accuracy than spaCy-based languages.

Mixed-Language Documents

Documents that mix multiple languages in the same text block can see reduced detection accuracy. The engine detects the dominant language and applies the corresponding model. With code-switching (e.g., Spanish text with English labels), some entities in the minority language may therefore be missed.
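To see why minority-language passages are at risk, consider a sketch of dominant-language selection. It assumes segments already carry a language label (how the labels are produced is out of scope here), and both function names are hypothetical. The dominant language is taken to be the one covering the most characters, which is one plausible definition rather than the product's documented rule.

```python
from collections import Counter
from typing import List, Optional, Tuple

Segment = Tuple[str, str]  # (text, language code)


def dominant_language(segments: List[Segment]) -> Optional[str]:
    """Return the language code that covers the most characters."""
    weight = Counter()
    for text, lang in segments:
        weight[lang] += len(text)
    if not weight:
        return None
    return weight.most_common(1)[0][0]


def at_risk_segments(segments: List[Segment]) -> List[str]:
    """Segments whose language differs from the dominant one.

    These would be processed by a model trained for another language,
    so entities inside them are more likely to be missed.
    """
    main = dominant_language(segments)
    return [text for text, lang in segments if lang != main]


if __name__ == "__main__":
    document = [
        ("El cliente vive en Madrid y trabaja en Sevilla.", "es"),
        ("Account holder: John Smith", "en"),
    ]
    print(dominant_language(document), at_risk_segments(document))
```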

Rare Scripts and Dialects

We support 48 standardised languages. Regional dialects, minority scripts, and code-mixed social media text are not supported use cases. The trade-off of statistical NLP models is that they perform best on formal written text similar to their training corpus.