What is HIPAA Safe Harbor De-identification?
Under 45 CFR § 164.514(b), covered entities and their business associates may de-identify protected health information using one of two methods:
- Safe Harbor Method (§ 164.514(b)(2)): Remove all 18 specified identifiers and verify that the covered entity has no actual knowledge that the remaining information could be used to identify an individual.
- Expert Determination Method (§ 164.514(b)(1)): A qualified statistical expert certifies that the risk of identification is "very small."
Safe Harbor is the more commonly used method because it provides a clear, bright-line rule. If all 18 identifiers are removed and the covered entity has no actual knowledge that the remaining information could identify an individual, the data is de-identified by definition, with no statistical expertise required. This guide focuses on Safe Harbor.
Key benefit: De-identified health information is not PHI. It falls outside the scope of HIPAA's Privacy Rule entirely. You can share it freely for research, publish it, use it to train AI models, and process it with third-party tools — no BAA required.
The 18 PHI Identifiers Under 45 CFR § 164.514(b)(2)
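Every one of the following identifiers, whether it belongs to the individual or to the individual's relatives, employers, or household members, must be removed or transformed to achieve Safe Harbor de-identification: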
| # | Identifier | Examples | How anonymize.solutions Handles It |
|---|---|---|---|
| 1 | Names | Patient name, next of kin | NLP entity detection → Replace with [PERSON_N] |
| 2 | Geographic data smaller than state | Street, city, county, ZIP | Address detection → Replace; ZIP: first 3 digits (see below) |
| 3 | Dates (except year) | Birth date, admission date, discharge date, date of death | Date detection → Replace with [DATE_N] or year-only |
| 4 | Phone numbers | Home, mobile, work | Regex + NLP → Replace with [PHONE_N] |
| 5 | Fax numbers | Provider fax, facility fax | Pattern detection → Replace with [FAX_N] |
| 6 | Email addresses | Patient email, provider email | Regex → Replace with [EMAIL_N] |
| 7 | Social Security numbers | Full or partial SSN | Regex (9-digit, hyphenated) → Replace with [SSN_N] |
| 8 | Medical record numbers | MRN, patient ID | Context-aware NLP → Replace with [MRN_N] |
| 9 | Health plan beneficiary numbers | Insurance ID, member ID | Pattern + context → Replace with [INSURANCE_ID_N] |
| 10 | Account numbers | Bank account, billing account | Pattern detection → Replace with [ACCOUNT_N] |
| 11 | Certificate/license numbers | NPI, DEA, medical license | Pattern + NLP → Replace with [LICENSE_N] |
| 12 | Vehicle identifiers | VIN, license plate | Pattern detection → Replace with [VEHICLE_ID_N] |
| 13 | Device identifiers | Serial numbers, MAC addresses | Pattern + NLP → Replace with [DEVICE_ID_N] |
| 14 | Web URLs | Personal profile URLs, patient portals | URL detection → Replace with [URL_N] |
| 15 | IP addresses | IPv4, IPv6 | Regex → Replace with [IP_N] |
| 16 | Biometric identifiers | Fingerprints, voiceprints, retina scans | Metadata detection → Flag for manual review |
| 17 | Full-face photos and comparable images | Patient photos, intake photos | Metadata detection → Flag; image processing via separate API |
| 18 | Any other unique identifying number, characteristic, or code | Account codes, study IDs unique to individual | Custom entity patterns configurable per deployment |
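
To make the "Regex → Replace with [TYPE_N]" entries in the table concrete, here is a minimal sketch of numbered-placeholder replacement. It is not the anonymize.solutions implementation: the patterns, the `PATTERNS` table, and the `replace_identifiers` function are illustrative assumptions, and real detection needs far broader coverage (names, addresses, and context-dependent IDs cannot be caught by regex alone).

```python
import re
from collections import defaultdict

# Deliberately simplified, hypothetical patterns for three of the 18 identifiers.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def replace_identifiers(text: str) -> str:
    """Replace each distinct match with a numbered placeholder such as [SSN_1]."""
    seen = defaultdict(dict)  # label -> {matched value: placeholder}

    for label, pattern in PATTERNS.items():
        def substitute(match, label=label):
            value = match.group(0)
            mapping = seen[label]
            # The same value always maps to the same placeholder within one text.
            if value not in mapping:
                mapping[value] = f"[{label}_{len(mapping) + 1}]"
            return mapping[value]

        text = pattern.sub(substitute, text)
    return text


print(replace_identifiers("SSN 234-56-7890, contact a@b.org, SSN again 234-56-7890"))
```

Reusing the same placeholder for a repeated value keeps references within a document consistent without exposing the underlying identifier.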
Safe Harbor vs Expert Determination: When to Use Which
| Factor | Safe Harbor | Expert Determination |
|---|---|---|
| Expertise required | None — rule-based | Qualified statistician required |
| Data utility | Lower — all 18 identifiers removed | Higher — only re-identification risk reduced |
| Cost | Low — automated implementation | High — expert fees + documentation |
| Audit defensibility | Very high — bright-line rule | High — but depends on expert methodology |
| Best for | Routine data sharing, AI training, research | Complex datasets where full removal destroys utility |
The "No Actual Knowledge" Requirement
Even after removing all 18 identifiers, § 164.514(b)(2)(ii) requires that the covered entity "does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual."
This means that small-cell data poses a risk. If your de-identified dataset contains only one patient with a particular rare condition in a particular age range in a particular geographic area, the combination of remaining data points may still be identifying — even with all 18 identifiers removed.
Practical guidance:
- Apply cell suppression for demographic combinations with fewer than 5 individuals
- Generalize age to 5-year bands rather than exact year when the dataset is small
- Document your analysis showing no actual knowledge of residual identification risk
- For sensitive research, consider Expert Determination despite the higher cost
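
The first two points of the guidance above can be sketched as follows. This is an illustration under stated assumptions, not a compliance tool: the record layout (plain dictionaries), the function names, and the choice to drop rather than mask small cells are all hypothetical, and the threshold of 5 simply mirrors the guidance.

```python
from collections import Counter

MIN_CELL_SIZE = 5  # threshold taken from the practical guidance above


def age_band(age: int, width: int = 5) -> str:
    """Map an exact age to a band such as '40-44'; ages 90+ collapse into one category."""
    if age >= 90:
        return "90 or older"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppress_small_cells(records, keys, min_size=MIN_CELL_SIZE):
    """Keep only records whose combination of `keys` is shared by at least min_size rows."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if counts[tuple(r[k] for k in keys)] >= min_size]


# Five patients share one demographic cell; one patient is alone in another.
records = [{"age_band": age_band(42), "state": "MA"} for _ in range(5)]
records.append({"age_band": age_band(71), "state": "MA"})
kept = suppress_small_cells(records, keys=("age_band", "state"))
print(len(kept))
```

Dropping the record is the bluntest option; replacing the rare attribute with a broader category is an alternative when losing rows is too costly.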
Geographic Data: The ZIP Code Rule
The geographic data rule under § 164.514(b)(2)(i)(B) is nuanced and frequently misunderstood. The regulation requires removal of all geographic subdivisions smaller than a state — except that the first three digits of a ZIP code may be retained if the geographic unit formed by all ZIP codes with the same three initial digits contains more than 20,000 people.
In practice:
- All ZIP codes whose first three digits represent 20,000 or fewer people must have those digits replaced with "000"
- The rule relies on current publicly available Census Bureau population data, so the list of restricted 3-digit prefixes must be maintained and updated
- Street address, city, county, precinct, and other geographic units smaller than state must always be removed
- State-level data may be retained
The anonymize.solutions HIPAA preset handles this automatically using a maintained list of qualifying 3-digit ZIP prefixes updated with each Census Bureau release.
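
The ZIP rule itself is simple once the list of restricted prefixes is available. The sketch below assumes that list is supplied by the caller; the function name `safe_harbor_zip` and the sample prefix set are hypothetical placeholders, not an authoritative list and not the preset's actual implementation.

```python
import re

# Accepts 5-digit ZIP codes and ZIP+4, capturing the first three digits.
ZIP_RE = re.compile(r"^\s*(\d{3})\d{2}(?:-\d{4})?\s*$")


def safe_harbor_zip(zip_code: str, restricted_prefixes: set[str]) -> str:
    """Return the 3-digit prefix, or '000' when the prefix is restricted or unparseable."""
    match = ZIP_RE.match(zip_code)
    if match is None:
        return "000"  # fail closed on input that does not look like a ZIP code
    prefix = match.group(1)
    return "000" if prefix in restricted_prefixes else prefix


# Placeholder set for illustration only; derive the real one from current Census data.
restricted = {"036", "059"}
print(safe_harbor_zip("02115", restricted))
print(safe_harbor_zip("03601-1234", restricted))
```

Returning "000" for unparseable input is a design choice: an unexpected format is treated as restricted rather than passed through.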
Date Restrictions: What You Can and Cannot Keep
Dates are among the most commonly mishandled PHI identifiers. The rule:
- Must remove: All elements of dates (except year) directly related to the individual — including birth date, admission date, discharge date, date of death, and exact ages over 89
- May keep: Year only (e.g., "2024" instead of "March 15, 2024")
- Special rule for age 90+: All ages 90 and above must be aggregated into a single category ("90 or older") — even the year of birth must be removed for these individuals
- Date shifting: Not sufficient under Safe Harbor, because shifted dates still carry month and day elements derived from the real dates. Consider Expert Determination if date intervals are required.
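
The year-only and age 90+ rules above can be expressed in a few lines. This is a minimal sketch under assumptions: the function names and the returned dictionary shape are hypothetical, and `as_of` stands for whatever reference date the dataset uses to compute age.

```python
from datetime import date


def age_on(birth: date, as_of: date) -> int:
    """Age in completed years on the reference date."""
    years = as_of.year - birth.year
    if (as_of.month, as_of.day) < (birth.month, birth.day):
        years -= 1  # birthday has not yet occurred in the reference year
    return years


def safe_harbor_birth(birth: date, as_of: date) -> dict:
    """Keep only the birth year, and drop even that for individuals aged 90 or older."""
    age = age_on(birth, as_of)
    if age >= 90:
        return {"birth_year": None, "age": "90 or older"}
    return {"birth_year": str(birth.year), "age": str(age)}


print(safe_harbor_birth(date(1941, 3, 15), as_of=date(2026, 2, 14)))
print(safe_harbor_birth(date(1930, 1, 1), as_of=date(2026, 2, 14)))
```

Other dates tied to the individual, such as admission and discharge, follow the simpler path: keep `d.year` and discard the rest.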
Automated Safe Harbor Implementation
The anonymize.solutions API provides a dedicated HIPAA Safe Harbor preset that automatically handles all 18 identifiers according to the regulatory specifications:
```python
import requests

API_KEY = "your-api-key"


def hipaa_safe_harbor(text: str) -> dict:
    """De-identify text using HIPAA Safe Harbor method (45 CFR § 164.514(b)(2))."""
    response = requests.post(
        "https://api.anonymize.solutions/v1/anonymize",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "text": text,
            "preset": "hipaa",
            "method": "safe_harbor",
            "zip_rule": True,        # Apply 3-digit ZIP population rule
            "age_90_rule": True,     # Aggregate ages 90+ as "90 or older"
            "date_year_only": True,  # Keep year, remove day/month
        },
        timeout=30,  # Avoid hanging indefinitely on a stalled connection
    )
    response.raise_for_status()  # Surface HTTP errors instead of parsing an error body
    return response.json()


# Example input:
sample = """
Patient: John Michael Davis
DOB: 03/15/1941 (age 84)
SSN: 234-56-7890
MRN: MRN-78234
Address: 42 Maple Street, Boston, MA 02115
Admission: 02/14/2026
Discharge: 02/18/2026
Diagnosis: Type 2 diabetes with peripheral neuropathy
"""

result = hipaa_safe_harbor(sample)
print(result["text"])
# Patient: [PERSON_1]
# DOB: 1941 (age 84)
# SSN: [SSN_1]
# MRN: [MRN_1]
# Address: [ADDRESS_1], MA 021
# Admission: 2026
# Discharge: 2026
# Diagnosis: Type 2 diabetes with peripheral neuropathy
```

Note that the diagnosis, which is clinical information and not itself an identifier, is preserved. Only the 18 specified identifiers are removed. This is the key advantage of Safe Harbor compared with overly aggressive redaction: clinical utility is maintained.
Documentation Requirements
To demonstrate Safe Harbor compliance in an audit, maintain the following documentation:
- De-identification policy: Written policy specifying which method (Safe Harbor) is used, which entity types are removed, and who is responsible
- Technical specification: Documentation of the tools used (e.g., "anonymize.solutions API, HIPAA preset, version X.Y"), including version history
- Processing log: Record of each de-identification run — timestamp, document count, identifiers removed — for audit trail purposes
- No actual knowledge attestation: For each dataset released, a brief documented analysis confirming no residual identification risk is known
- Recipient agreements: While de-identified data is not PHI, document who receives it and for what purpose — good practice that supports your overall compliance posture
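
As an illustration of the processing log item above, one option is to append a JSON line per de-identification run. The field names and the `build_log_entry` helper are assumptions made for this sketch; they do not describe the format produced by any particular product.

```python
import json
from datetime import datetime, timezone


def build_log_entry(document_count: int, identifiers_removed: dict, tool_version: str) -> str:
    """Serialize one de-identification run as a single JSON line for an audit trail."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # UTC avoids timezone ambiguity
        "method": "safe_harbor",
        "tool_version": tool_version,
        "document_count": document_count,
        "identifiers_removed": identifiers_removed,  # counts per identifier type, never values
    }
    return json.dumps(entry, sort_keys=True)


print(build_log_entry(3, {"PERSON": 3, "SSN": 2, "DATE": 7}, tool_version="X.Y"))
```

The log records counts per identifier type only. Writing the removed values themselves into the log would turn the audit trail into a new store of PHI.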
The anonymize.solutions Managed Private package includes automated compliance logging and a compliance export feature that generates a ready-to-file de-identification documentation report.