Duplicate detection
Two checks: the same NPI under multiple resource IDs for Practitioner, and collapse on normalized name plus state and city for Organization.
Headline
Practitioner dedup is clean: 0 excess rows across 7,441,212 NPIs (H14). Organizations multiply, though: 70.5% of the 1,999,118 unique Org NPIs map to more than one Organization resource (1,415,777 excess rows; at most 5 resources for a single NPI). By normalized (name, state, city), 70.3% of keys repeat. Downstream consumers who assume one Organization resource equals one real-world entity will be wrong for roughly seven in ten Org NPIs.
1.2M / 9.2M = 13.35%
unit: percent
What this means
Everyone using NDH
COUNT(Organization) is roughly 1.7× the number of unique Org NPIs (1,415,777 excess rows on 1,999,118 NPIs), so it overstates the number of real-world organizations. De-duplicate by _npi before treating org counts as unique entities. Practitioner dedup is clean (0 excess rows).
Payer data teams
An org that appears multiple times in NDH under different resource IDs may be legitimate (one FHIR Organization per service location) or a defect (a true duplicate). Either way, your match-to-internal-roster logic needs a normalization pass on NPI or on (name, state, city).
Researchers
About 70% of Org NPIs map to multiple Organization resources. Any study that treats the NDH Organization count as a population figure will overstate it by ~1.7× at the entity level.
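The guidance above amounts to counting resources per NPI. The following is a minimal Python sketch of that count, not the query behind the published figures. The function name `npi_multiplicity` and the (resource_id, npi) input shape are hypothetical, and it assumes "excess rows" means every resource beyond the first for a given NPI. The last lines rerun the ~1.7× arithmetic from the headline numbers.

```python
from collections import Counter


def npi_multiplicity(resources):
    """Summarize how many resources share each NPI.

    `resources` is an iterable of (resource_id, npi) pairs (hypothetical shape).
    Rows without an NPI are skipped.
    """
    counts = Counter(npi for _resource_id, npi in resources if npi)
    unique = len(counts)
    total = sum(counts.values())
    multi = sum(1 for c in counts.values() if c > 1)
    return {
        "unique_npis": unique,
        # Assumption: excess rows = resources beyond the first per NPI.
        "excess_rows": total - unique,
        "share_multi": multi / unique if unique else 0.0,
        "inflation": total / unique if unique else 0.0,
        "max_copies": max(counts.values(), default=0),
    }


# Arithmetic check against the headline figures.
unique_org_npis = 1_999_118
excess_rows = 1_415_777
inflation = (unique_org_npis + excess_rows) / unique_org_npis  # about 1.71
```

On real data, a `share_multi` near 0.705 and a `max_copies` of 5 would match the Organization figures reported here.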
Null hypothesis
Duplicate rate is below 1% for both Practitioner (by NPI) and Organization (by normalized name + address).
Denominator
All `Practitioner` and `Organization` resources.
Data source
CMS NPD bulk export.
Notes
The BigQuery dataset has primary-key dedup applied at ingest (-4.6M Practitioner, -383K Organization at _id), so what is reported here are residual entity-level duplicates.
H14 keys Practitioner by _npi. Max copies observed for a single Practitioner NPI: 1.
H15 keys Organization by (LOWER(name) stripped of LLC/INC/PC/PA/PLLC/CORP/LLP/LTD/CO/COMPANY/THE and non-alphanumerics, UPPER(state), UPPER(TRIM(city))); orgs with a missing name, state, or city are excluded. Max copies for one key: 2206.
H15-bonus keys Organization by _npi. Max copies for one Org NPI: 5.
Caveat: some portion of the Organization multiplicity may reflect CMS modeling one FHIR Organization resource per service location rather than true duplication. Either interpretation breaks the common downstream assumption that COUNT(Organization) equals the number of unique organizations.
Fuzzy matching (Jaro-Winkler, suite-unit tolerance) is a v2 enhancement.
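The H15 key described in these notes can be sketched in Python as below. This illustrates the stated rule and is not the BigQuery expression actually used. The names `org_key` and `repeat_stats` are hypothetical. It assumes the listed tokens are removed as whole words before non-alphanumerics are dropped, and that a field which is empty after trimming counts as missing.

```python
import re
from collections import Counter

# Tokens the notes list as stripped from organization names.
STOP_TOKENS = {"llc", "inc", "pc", "pa", "pllc", "corp", "llp", "ltd",
               "co", "company", "the"}


def org_key(name, state, city):
    """Return the normalized (name, state, city) key, or None if a field is missing."""
    if not name or not state or not city:
        return None
    # Assumption: split on non-alphanumerics, drop stop tokens as whole words,
    # then join what remains with no separator.
    tokens = re.split(r"[^a-z0-9]+", name.lower())
    norm_name = "".join(t for t in tokens if t and t not in STOP_TOKENS)
    norm_city = city.strip().upper()
    if not norm_name or not norm_city:
        return None
    return (norm_name, state.upper(), norm_city)


def repeat_stats(orgs):
    """Count keys, repeated keys, and max copies over (name, state, city) tuples."""
    keys = (org_key(name, state, city) for name, state, city in orgs)
    counts = Counter(k for k in keys if k is not None)
    repeated = sum(1 for c in counts.values() if c > 1)
    return {
        "keys": len(counts),
        "repeated_keys": repeated,
        "max_copies": max(counts.values(), default=0),
    }
```

Under these assumptions, "The Acme Clinic, LLC" and "ACME CLINIC INC." in the same city and state collapse to one key. Dotted forms such as "P.C." split into single letters and are not stripped, which is one reason an exact-match key like this undercounts relative to fuzzy matching.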