Data sources
Every public dataset AINPI ingests, considers, or rejects — with primary-source URLs, license terms, refresh cadence, and the hypothesis each maps to.
This page exists because methodology trust starts with showing the inputs. State Medicaid agencies, auditors, and researchers can verify each row against its primary source. If a row is wrong, file an issue.
Eugene Vestel — Founder, FHIR IQ · Health interoperability consultant
gene@fhiriq.com · Last reviewed 2026-04-29
Federal screening databases (42 CFR § 455.436)
The four databases state Medicaid agencies are legally required to check for provider identity and exclusion status, plus the CMS Preclusion List for completeness.
Federal registry of NPIs, mandatory under HIPAA. Self-attested provider demographics, taxonomy, and addresses.
- License
- Public domain
- Refresh cadence
- Continuous; full file dissemination monthly
- Mapped to AINPI
- H10, H11, H13 — match rate, name agreement, specialty consistency
- Notes
- Source: bigquery-public-data.nppes.npi_raw (BigQuery public dataset, dated). Switch-aware match against all 15 taxonomy slots per provider, not just slot 1.
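A minimal sketch of what a match across all 15 taxonomy slots could look like, under one reading of "switch-aware": a hit in the slot flagged by the primary taxonomy switch is reported as `primary`, and a hit in any other slot as `secondary`. The column names and the `match_taxonomy` helper are assumptions for illustration, not the pipeline's actual schema or code.

```python
from typing import Mapping, Optional

TAXONOMY_SLOTS = range(1, 16)  # NPPES carries up to 15 taxonomy slots per NPI


def taxonomy_codes(row: Mapping[str, Optional[str]]) -> list[tuple[str, bool]]:
    """Return (code, is_primary) for every populated taxonomy slot."""
    found = []
    for slot in TAXONOMY_SLOTS:
        # Assumed column names; adjust to the real table schema.
        code = (row.get(f"healthcare_provider_taxonomy_code_{slot}") or "").strip()
        switch = (row.get(f"healthcare_provider_primary_taxonomy_switch_{slot}") or "").strip()
        if code:
            found.append((code, switch.upper() == "Y"))
    return found


def match_taxonomy(row: Mapping[str, Optional[str]], candidates: set[str]) -> Optional[str]:
    """Classify a match as 'primary', 'secondary', or None (no slot matched)."""
    result = None
    for code, is_primary in taxonomy_codes(row):
        if code in candidates:
            if is_primary:
                return "primary"  # strongest outcome, stop early
            result = "secondary"
    return result
```

Checking only slot 1 would miss providers whose relevant specialty sits in a later slot, which is the case this loop is meant to cover.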
Federal database of providers excluded from Medicare, Medicaid, and all other Federal health care programs under SSA §§ 1128 and 1156.
- License
- Public domain
- Refresh cadence
- Monthly full file + monthly supplement
- Mapped to AINPI
- H24 — LEIE-NDH match. Composite weight 1.5 in H23 high-risk cohort.
- Notes
- Direct download: oig.hhs.gov/exclusions/downloadables/UPDATED.csv. ~83K active rows; ~10.8% have a populated NPI (the rest are pre-NPI-era). Keep only rows where REINDATE = "00000000" to drop reinstatements.
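The reinstatement filter and the NPI coverage figure described above can be sketched as follows. The `REINDATE` value comes from the note; the `NPI` column name and the all-zeros placeholder for a missing NPI are assumptions about the file layout, and the function names are hypothetical.

```python
import csv
import io

NOT_REINSTATED = "00000000"  # value named in the note above
EMPTY_NPI = "0000000000"     # assumption: placeholder used when no NPI is on file


def active_exclusions(csv_text: str) -> list[dict]:
    """Keep only rows that have no reinstatement date."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    return [r for r in reader if (r.get("REINDATE") or "").strip() == NOT_REINSTATED]


def npi_coverage(rows: list[dict]) -> float:
    """Fraction of rows carrying a populated NPI."""
    if not rows:
        return 0.0
    populated = sum(
        1 for r in rows if (r.get("NPI") or "").strip() not in ("", EMPTY_NPI)
    )
    return populated / len(rows)
```

Rows without an NPI cannot be joined on NPI at all, so the coverage number bounds how much of the list an NPI-keyed match can reach.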
Federal-wide debarment and suspension list across all federal contracting and assistance programs. Required check under 42 CFR § 455.436.
- License
- Public domain (US federal government work)
- Refresh cadence
- Daily updates; Public Extract V2 published as CSV
- Mapped to AINPI
- H25 — SAM-NDH match. Composite weight 1.5 in H23 high-risk cohort (independent of LEIE).
- Notes
- Loaded from the SAM.gov Public Extract V2 (sam.gov/data-services/Exclusions/Public V2). 167K rows, ~4% with a real-format NPI; the rest are non-healthcare exclusions (OFAC sanctions, EPA contractor debarment). HHS slice overlaps OIG LEIE; OPM slice (FEHBP debarment under 5 USC 8902a) is net-new federal-screening signal not visible from LEIE alone.
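One plausible way to decide whether a value is a "real-format NPI" is the standard NPI check-digit test: ten digits that pass the Luhn algorithm once the prefix 80840 is prepended. The sketch below assumes that reading; whether the pipeline applies this exact test is not stated, and `is_valid_npi` is a hypothetical name.

```python
import re

NPI_PREFIX = "80840"  # prefix prepended before the Luhn check on an NPI


def is_valid_npi(value: str) -> bool:
    """True if value is 10 digits and passes the Luhn check with the 80840 prefix."""
    if not re.fullmatch(r"[0-9]{10}", value or ""):
        return False
    digits = [int(c) for c in NPI_PREFIX + value]
    total = 0
    # Walk right to left; double every second digit, starting left of the check digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

A format check like this separates healthcare identifiers from placeholder or free-text values in the NPI column; it says nothing about whether the NPI is actually enumerated in NPPES.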
SSA Death Master File (DMF)
out-of-scope · Social Security Administration record of deceased individuals.
- License
- Limited Access DMF requires Section 1110 certification under 42 USC § 1306c
- Refresh cadence
- Weekly updates to subscribed users
- Mapped to AINPI
- —
- Notes
- The Public DMF excludes deaths within the prior 3 years. Full DMF requires NTIS subscription and SSA certification — a procurement effort each state Medicaid agency manages independently.
CMS Preclusion List
out-of-scope · Providers precluded from receiving payment for Medicare Advantage items, services, or Part D drugs. Created by 42 CFR § 422.222 / § 423.120.
- License
- Restricted — Medicare Advantage Part C plans and Part D sponsors only
- Refresh cadence
- Monthly, first business day of the month
- Mapped to AINPI
- —
- Notes
- NOT publicly downloadable. AINPI cannot ingest this list. State Medicaid agencies relying on Preclusion signal must coordinate with their MCOs directly. Documenting it here so the limitation is explicit.
Audit inputs
The data AINPI joins, validates, and aggregates to produce its published findings.
FHIR R4 NDJSON bulk export of the federal provider directory: 6 resource types, 21.7M resources at the 2026-05-08 release (down from 27.2M in April).
- License
- Public domain (US federal government work)
- Refresh cadence
- Periodic — most recent release pinned per audit
- Mapped to AINPI
- All findings. The NDH artifact is what AINPI audits.
- Notes
- Distributed as zstd-compressed NDJSON (2.8 GB compressed, 40.7 GB uncompressed). Loaded into BigQuery as resource:JSON columns plus extracted flat _* fields per resource type.
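A quick sanity check on a release is to tally resources by type and compare the totals with the published counts. The sketch below assumes the NDJSON has already been decompressed and decoded into text lines; decompression is left out, and `count_resource_types` is a hypothetical helper rather than part of the pipeline.

```python
import json
from collections import Counter
from typing import Iterable


def count_resource_types(lines: Iterable[str]) -> Counter:
    """Tally FHIR resources by resourceType from NDJSON lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        resource = json.loads(line)
        counts[resource.get("resourceType", "<missing>")] += 1
    return counts
```

Because the function takes any iterable of strings, it can be fed from an open file handle and never needs the full 40.7 GB in memory.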
Standardized 900+ specialty classification codes used by NPPES Practitioner.qualification and CMS-Medicare crosswalks.
- License
- Public; permission required for redistribution
- Refresh cadence
- Quarterly
- Mapped to AINPI
- H12 — taxonomy validity. Companion to H13.
- Notes
- Source: bigquery-public-data.nppes.healthcare_provider_taxonomy_code_set_170. Pinned to v17.0 for the 2026-05-08 audit.
Authoritative CMS crosswalk between NUCC taxonomy codes and CMS Medicare specialty codes. The mapping is one-to-many in both directions.
- License
- Public domain
- Refresh cadence
- Quarterly
- Mapped to AINPI
- H13 — bridges PractitionerRole.specialty (CMS Medicare codes) to Practitioner.qualification (NUCC codes). Pinned to October 2025 release.
- Notes
- The CSV has embedded newlines that BigQuery's default loader rejects, so the pipeline parses it with the Python csv module (RFC-4180-compliant) and streams the rows as NDJSON.
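The CSV-to-NDJSON step can be sketched with the standard library alone. The csv module keeps a quoted field intact even when it spans several physical lines, and `json.dumps` escapes the newline so each record still occupies exactly one NDJSON line. The function name and the sample column names are hypothetical.

```python
import csv
import io
import json
from typing import Iterator, TextIO


def csv_to_ndjson(source: TextIO) -> Iterator[str]:
    """Yield one JSON document per CSV record, header row used as keys."""
    # The source should be opened with newline="" so the csv module,
    # not the file layer, decides where records end.
    for record in csv.DictReader(source):
        yield json.dumps(record, ensure_ascii=False)
```

Usage, assuming a local file path: `with open(path, newline="", encoding="utf-8") as fh:` followed by writing each yielded string plus `"\n"` to the output file.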
Live FHIR endpoints (probe)
ingested · Empirical L0–L7 reachability scoring of every Endpoint.address declared in the NDH bulk export.
- License
- Apache-2.0 (the crawler code); endpoints themselves are public APIs
- Refresh cadence
- Out-of-band; monthly today, weekly target
- Mapped to AINPI
- H1–H5 endpoint liveness; H22 network adequacy gauge
- Notes
- Polite crawler — 1 req/sec/host, named User-Agent, documented source IP. Runs on a dedicated host outside CI runners to avoid bad-neighbor behavior.
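The one-request-per-second-per-host limit can be enforced with a small throttle keyed on hostname. This is a sketch of the idea, not the crawler's actual code: `PerHostThrottle` is a hypothetical class, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlsplit


class PerHostThrottle:
    """Space out requests so each host sees at most one per min_interval seconds."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until the URL's host may be contacted; return seconds waited."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self._sleep(delay)
        # Schedule the next slot relative to when this request actually goes out.
        self._next_allowed[host] = max(now, ready_at) + self.min_interval
        return delay
```

Calling `throttle.wait(url)` before each request keeps slow hosts from being hammered while leaving requests to different hosts unaffected by each other.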
Considered, not ingested
Datasets that come up in directory-quality conversations but sit outside the current AINPI scope. Documented here so the boundary is explicit.
CAQH ProView
out-of-scope · Commercial credentialing source maintained by the Council for Affordable Quality Healthcare.
- License
- Subscription / member access only
- Refresh cadence
- Continuous
- Mapped to AINPI
- —
- Notes
- Not in the federal NDH ingestion pipeline (commercial-payer data). See /insights for the full provenance discussion of the CAQH gap.
CMS Open Payments
roadmap · Public-domain transparency database of payments from drug/device manufacturers to providers under Section 6002 of the ACA.
- License
- Public domain
- Refresh cadence
- Annual + interim updates
- Mapped to AINPI
- Future — provider-context enrichment, not directory accuracy. Lower priority than 42 CFR § 455.436 sources.
- Notes
- Could enrich /findings/[slug] pages with payments context per NPI. Not relevant to revalidation decisions per se.
CMS Medicare Provider Utilization & Payment
out-of-scope · Aggregated services rendered per provider per HCPCS code under Medicare Part B.
- License
- Public domain
- Refresh cadence
- Annual
- Mapped to AINPI
- —
- Notes
- Claims-derived; AINPI is provider-directory only. Out of scope.
HRSA HPSA / MUA / NHSC
roadmap · Health Professional Shortage Areas, Medically Underserved Areas, and National Health Service Corps participants.
- License
- Public domain
- Refresh cadence
- Continuous
- Mapped to AINPI
- Future — geography enrichment for state-scoped pages (rural / underserved-area context).
- Notes
- Could surface "% of state Medicaid roster in an HPSA" alongside state-scoped findings.
Each state publishes its own provider exclusion list. Format and access vary widely across the 43+ state-published lists.
- License
- Public per state
- Refresh cadence
- Varies (monthly to ad-hoc)
- Mapped to AINPI
- Future — state-specific high-risk cohort enrichment for /states/[state] pages.
- Notes
- Aggregating these would be a meaningful contribution. Per-state ingestion adapters needed; some states publish CSV, others PDF, others web-scrape only.
How to add a new data source
- Open an issue using the "new metric proposal" or "data source addition" template. Include the primary-source URL, license, refresh cadence, and which AINPI hypothesis it would inform.
- Pre-register the methodology (null hypothesis, denominator) before any numbers are computed. Add a row to frontend/src/data/findings.ts with status: 'pre-registered'.
- Ship an ingestion script at analysis/ingest_<source>.py with a polite User-Agent, RFC-compliant CSV parsing, and an explicit BigQuery destination. License headers required.
- Publish the finding by writing to frontend/public/api/v1/findings/<slug>.json. Methodology version bump goes in docs/methodology/index.md.
- Add a row to this page so the addition is visible alongside its peers.
Source code: analysis/. PRs welcome under Apache-2.0.