48 languages: full PII detection
25 languages: spaCy (primary)
7 languages: Stanza (secondary)
16 languages: XLM-RoBERTa (fallback)
4 RTL scripts: Arabic, Hebrew, Persian, Urdu

25 Languages with spaCy NLP

Named Entity Recognition (NER) with language-specific statistical models, the most accurate of the three engines. spaCy is the primary engine for major European and Asian languages.

Catalan
ca
Chinese
zh
Croatian
hr
Danish
da
Dutch
nl
English
en
Finnish
fi
French
fr
German
de
Greek
el
Italian
it
Japanese
ja
Korean
ko
Norwegian
no
Polish
pl
Portuguese
pt
Russian
ru
Spanish
es
Swedish
sv
Bulgarian
bg
Hungarian
hu
Lithuanian
lt
Romanian
ro
Slovenian
sl
Ukrainian
uk

About spaCy NLP

spaCy is an open-source library for industrial-strength Natural Language Processing, developed by Explosion AI. Its language-specific statistical models recognize person names, organizations, locations, and other named entities with the highest accuracy of the three engines, which makes it the primary choice for major languages.

7 Languages with Stanza Neural NLP

Specialized neural network models from the Stanford NLP Group for RTL scripts and South and Southeast Asian languages. Stanza excels at complex grammatical structures.

Arabic
ar • RTL
Hebrew
he • RTL
Hindi
hi
Indonesian
id
Persian (Farsi)
fa • RTL
Thai
th
Vietnamese
vi

About Stanza NLP

Stanza is a Python NLP library developed by the Stanford NLP Group. It uses bidirectional LSTM neural networks with character-level embeddings, providing strong accuracy for languages with complex morphology, RTL scripts, and non-Latin writing systems.

16 Languages with XLM-RoBERTa Transformer

Cross-lingual transformer model for low-resource languages. XLM-RoBERTa enables Named Entity Recognition through transfer learning across the 100 languages it was pre-trained on.

Afrikaans
af
Albanian
sq
Amharic
am
Basque
eu
Bengali
bn
Czech
cs
Estonian
et
Latvian
lv
Malay
ms
Serbian
sr
Slovak
sk
Swahili
sw
Tagalog
tl
Turkish
tr
Urdu
ur • RTL
Welsh
cy

About XLM-RoBERTa

XLM-RoBERTa is a cross-lingual transformer model developed by Facebook AI, pre-trained on 2.5TB of text in 100 languages. It enables Named Entity Recognition for low-resource languages through transfer learning, providing broad coverage where dedicated models don't exist.

Multilingual Detection Features

Automatic Language Detection

No configuration needed. The system automatically detects the language of your text and selects the appropriate NLP model.

  • Per-Document Detection — Each document analyzed independently
  • Mixed Language Support — Documents with multiple languages handled correctly
  • Confidence Threshold — Language detection with confidence scoring
  • Fallback Strategy — Uses regex patterns when NLP model unavailable
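
The selection step can be sketched as a lookup from the detected language code to an engine. This is a minimal illustration, not the product's implementation: the language sets are copied from the lists on this page, while `select_engine`, `MIN_CONFIDENCE`, the 0.80 value, and the choice to fall back to regex on low confidence are assumptions made for the example.

```python
from typing import Optional

# Language-to-engine routing, copied from the language lists on this page.
SPACY = {"ca", "zh", "hr", "da", "nl", "en", "fi", "fr", "de", "el", "it", "ja",
         "ko", "no", "pl", "pt", "ru", "es", "sv", "bg", "hu", "lt", "ro", "sl", "uk"}
STANZA = {"ar", "he", "hi", "id", "fa", "th", "vi"}
XLM_ROBERTA = {"af", "sq", "am", "eu", "bn", "cs", "et", "lv", "ms", "sr",
               "sk", "sw", "tl", "tr", "ur", "cy"}

# Hypothetical cut-off: the page does not state the real threshold.
MIN_CONFIDENCE = 0.80


def select_engine(lang: Optional[str], confidence: float) -> str:
    """Return the engine for a detected language, or "regex" as the fallback."""
    # Assumption: an unknown or low-confidence language goes to regex-only mode.
    if lang is None or confidence < MIN_CONFIDENCE:
        return "regex"
    if lang in SPACY:
        return "spacy"
    if lang in STANZA:
        return "stanza"
    if lang in XLM_ROBERTA:
        return "xlm-roberta"
    # No NLP model for this language: regex patterns only.
    return "regex"
```

For example, `select_engine("de", 0.99)` returns `"spacy"`, and `select_engine("de", 0.50)` returns `"regex"` because the confidence is below the assumed threshold.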

Right-to-Left (RTL) Support

Full support for RTL scripts with proper text direction handling during detection and anonymization.

Arabic
Full RTL support
Hebrew
Full RTL support
Persian (Farsi)
Full RTL support
Urdu
Full RTL support
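
One way to picture direction handling is shown below. It is a sketch under stated assumptions, not a description of the product: `is_rtl` and `redact` are hypothetical helpers. The idea is that Unicode strings are stored in logical (reading) order, so replacing a detected span by its offsets works the same way for RTL and LTR text, and the standard `unicodedata` module can report whether a text starts with an RTL character.

```python
import unicodedata
from typing import List, Tuple

# Unicode bidirectional classes of right-to-left letters
# ("R" covers Hebrew, "AL" covers Arabic-script letters).
RTL_CLASSES = {"R", "AL"}


def is_rtl(text: str) -> bool:
    """Return True if the first strongly directional character is RTL."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            return True
        if bidi == "L":
            return False
    return False  # no strongly directional character found


def redact(text: str, spans: List[Tuple[int, int]], label: str = "[PERSON]") -> str:
    """Replace (start, end) spans given as logical code-point offsets."""
    # Apply spans from the highest offset down so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + label + text[end:]
    return text
```

Because the offsets refer to logical order, the same `redact` call works on Hebrew, Arabic, Persian, and Urdu input without reversing anything; visual ordering is left to the renderer.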

What Gets Detected in Each Language

All 48 languages support the same core PII entity types through a combination of NLP models and regex patterns.

Person Names

First names, last names, full names, nicknames, and titles detected via NLP Named Entity Recognition.

Locations

Addresses, cities, countries, postal codes, and geographic locations identified by NLP models.

Contact Info

Email addresses, phone numbers, and URLs detected via regex patterns across all languages.
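
A rough sketch of language-independent contact detection with Python's `re` module follows. The patterns and the `find_contacts` helper are simplified assumptions for illustration; production patterns would need to be considerably stricter, and the phone pattern here covers only international numbers written with a leading `+`.

```python
import re
from typing import List, Tuple

# Simplified, illustrative patterns (assumptions, not the product's rules).
PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    # International format only: "+", country code, then 7 to 14 more digits.
    "PHONE": re.compile(r"\+\d{1,3}[\s.-]?\d(?:[\s.-]?\d){6,13}"),
    # The final character class keeps trailing punctuation out of the match.
    "URL": re.compile(r"https?://[^\s<>]*[^\s<>.,;:!?)]"),
}


def find_contacts(text: str) -> List[Tuple[str, str, int, int]]:
    """Return (label, value, start, end) tuples ordered by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(), match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])
```

Since these patterns depend on format rather than vocabulary, the same call works on French, German, or Thai input alike.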

Financial Data

Credit card numbers, IBANs, bank accounts, and financial identifiers via format-specific patterns.
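
Format-specific patterns are commonly paired with checksum validation to cut false positives. The sketch below shows two standard checks: the Luhn algorithm for card numbers and the ISO 13616 mod-97 check for IBANs. Whether the product applies these exact checks is an assumption; `luhn_valid` and `iban_valid` are names chosen for this example.

```python
def luhn_valid(number: str) -> bool:
    """Check a card number with the Luhn algorithm (spaces and hyphens allowed)."""
    s = number.replace(" ", "").replace("-", "")
    if not (s.isascii() and s.isdigit()) or not 12 <= len(s) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, ch in enumerate(reversed(s)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Check an IBAN with the mod-97 rule (spaces allowed)."""
    s = iban.replace(" ", "").upper()
    if not (s.isascii() and s.isalnum()) or not 15 <= len(s) <= 34:
        return False
    # Move the country code and check digits to the end, then map A=10 ... Z=35.
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

A candidate that matches the pattern but fails its checksum can then be dropped or scored lower, depending on how conservative the detection should be.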

Dates

Dates of birth, other calendar dates, and temporal expressions in the formats specific to each locale.
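
Why locale matters for dates can be shown with a small sketch: the same string can denote different days depending on the convention. The `DATE_FORMATS` table and `parse_date` helper below are illustrative assumptions, not the product's format list, and `en` is treated as US-style month-first here.

```python
from datetime import date, datetime
from typing import Optional

# Illustrative numeric formats per language code (assumed for this example).
DATE_FORMATS = {
    "en": ["%m/%d/%Y", "%Y-%m-%d"],  # assumes US-style month-first
    "de": ["%d.%m.%Y"],
    "fr": ["%d/%m/%Y"],
}


def parse_date(text: str, lang: str) -> Optional[date]:
    """Try each format registered for the language; return None if none fits."""
    for fmt in DATE_FORMATS.get(lang, []):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None
```

Here `parse_date("03/04/1990", "en")` yields 4 March 1990, while `parse_date("03/04/1990", "fr")` yields 3 April 1990.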

Government IDs

SSNs, passport numbers, driver's license numbers, and national IDs via country-specific regex patterns.
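
A country-specific pattern registry might look like the sketch below. Only the US Social Security number is included; the pattern excludes area numbers 000, 666, and 900 to 999, group 00, and serial 0000, which are never issued. The `ID_PATTERNS` registry and `find_ids` helper are hypothetical names for this example, and other countries would add their own entries.

```python
import re
from typing import List, Tuple

# Registry keyed by an illustrative label; only one country is shown here.
ID_PATTERNS = {
    "US_SSN": re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b"),
}


def find_ids(text: str) -> List[Tuple[str, str]]:
    """Return (label, value) pairs for every registered ID pattern that matches."""
    hits = []
    for label, pattern in ID_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits
```

For instance, `find_ids("SSN: 123-45-6789")` returns `[("US_SSN", "123-45-6789")]`, while `find_ids("SSN: 000-12-3456")` returns an empty list.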

AIR-GAPPED DESKTOP APP

The Air-Gapped Desktop App includes local NLP models for 15+ languages with offline processing. No internet connection required.


Need a language not listed?

Contact us for custom language model integration or to discuss your specific multilingual requirements.