
Data sources

Every public dataset AINPI ingests, considers, or rejects — with primary-source URLs, license terms, refresh cadence, and the hypothesis each maps to.

This page exists because methodology trust starts with showing the inputs. State Medicaid agencies, auditors, and researchers can verify each row against its primary source. If a row is wrong, file an issue.

Eugene Vestel, Founder, FHIR IQ · Health interoperability consultant

gene@fhiriq.com · Last reviewed 2026-04-29

Federal screening databases (42 CFR § 455.436)

The four databases state Medicaid agencies are legally required to check for provider identity and exclusion status, plus the CMS Preclusion List for completeness.

Federal registry of NPIs, mandatory under HIPAA. Self-attested provider demographics, taxonomy, and addresses.

License
Public domain
Refresh cadence
Continuous; full file dissemination monthly
Mapped to AINPI
H10, H11, H13 — match rate, name agreement, specialty consistency
Notes
Source: bigquery-public-data.nppes.npi_raw (BigQuery public dataset, dated). Switch-aware match against all 15 taxonomy slots per provider, not just slot 1.
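
As an illustration of matching against every taxonomy slot rather than slot 1 only, here is a minimal Python sketch. It assumes "switch-aware" means honoring the primary-taxonomy switch flag carried alongside each slot, and it assumes column names of the form healthcare_provider_taxonomy_code_N and healthcare_provider_primary_taxonomy_switch_N. Both are assumptions for illustration, not confirmed details of the AINPI pipeline.

```python
from typing import Mapping, Optional

TAXONOMY_SLOTS = 15  # number of taxonomy slots per provider, per the note above


def taxonomy_codes(row: Mapping[str, Optional[str]]) -> list:
    """Return (code, is_primary) for every populated taxonomy slot.

    Column names are assumed, not taken from the AINPI source.
    """
    found = []
    for i in range(1, TAXONOMY_SLOTS + 1):
        code = (row.get(f"healthcare_provider_taxonomy_code_{i}") or "").strip()
        if not code:
            continue
        switch = (row.get(f"healthcare_provider_primary_taxonomy_switch_{i}") or "")
        found.append((code, switch.strip().upper() == "Y"))
    return found


def match_taxonomy(row: Mapping[str, Optional[str]], candidates: set) -> str:
    """Classify a match as 'primary', 'secondary', or 'none'."""
    found = taxonomy_codes(row)
    if any(code in candidates and primary for code, primary in found):
        return "primary"
    if any(code in candidates for code, _ in found):
        return "secondary"
    return "none"
```

A slot-1-only comparison would miss a provider whose primary taxonomy sits in slot 3; scanning all slots and reading the switch avoids that.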

Federal database of providers excluded from Medicare, Medicaid, and all other Federal health care programs under SSA §§ 1128 and 1156.

License
Public domain
Refresh cadence
Monthly full file + monthly supplement
Mapped to AINPI
H24 — LEIE-NDH match. Composite weight 1.5 in H23 high-risk cohort.
Notes
Direct download: oig.hhs.gov/exclusions/downloadables/UPDATED.csv. ~83K active rows; ~10.8% have a populated NPI (the rest are pre-NPI-era). Keep only rows where REINDATE = "00000000" to drop reinstated providers.
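
The reinstatement filter and the populated-NPI check can be sketched in a few lines of Python. The sample rows are synthetic, the column subset is trimmed for brevity, and treating an all-zero NPI as "unpopulated" is an assumption about the file layout rather than something stated above.

```python
import csv
import io

# Synthetic rows for illustration only; not real LEIE records.
SAMPLE = (
    "LASTNAME,FIRSTNAME,NPI,EXCLTYPE,EXCLDATE,REINDATE\n"
    "DOE,JANE,1234567893,1128a1,20190520,00000000\n"
    "ROE,RICHARD,0000000000,1128b4,19980312,00000000\n"
    "POE,ALEX,1234567893,1128a1,20100101,20150101\n"
)


def active_exclusions(csv_text: str):
    """Yield rows that have not been reinstated (REINDATE is all zeros)."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["REINDATE"].strip() == "00000000":
            yield row


def has_npi(row: dict) -> bool:
    """True when the NPI column holds ten digits other than all zeros (assumed convention)."""
    npi = (row.get("NPI") or "").strip()
    return len(npi) == 10 and npi.isascii() and npi.isdigit() and npi != "0000000000"


active = list(active_exclusions(SAMPLE))
with_npi = [r for r in active if has_npi(r)]
```

Rows without an NPI cannot be joined on NPI at all, which is why the populated-NPI share matters for the match rate.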

Federal-wide debarment and suspension list across all federal contracting and assistance programs. Required check under 42 CFR § 455.436.

License
Public domain (US federal government work)
Refresh cadence
Daily updates; Public Extract V2 published as CSV
Mapped to AINPI
H25 — SAM-NDH match. Composite weight 1.5 in H23 high-risk cohort (independent of LEIE).
Notes
Loaded from the SAM.gov Public Extract V2 (sam.gov/data-services/Exclusions/Public V2). 167K rows, ~4% with a real-format NPI; the rest are non-healthcare exclusions (OFAC sanctions, EPA contractor debarment). HHS slice overlaps OIG LEIE; OPM slice (FEHBP debarment under 5 USC 8902a) is net-new federal-screening signal not visible from LEIE alone.
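
One plausible way to decide whether a value is a "real-format" NPI is the standard NPI check-digit rule: ten digits, first digit 1 or 2, and a Luhn checksum computed over the value prefixed with 80840. The sketch below assumes that definition; the exact test AINPI applies is not stated here.

```python
def luhn_valid(digits: str) -> bool:
    """Standard Luhn checksum over a string of ASCII digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def is_real_format_npi(value) -> bool:
    """Ten digits, leading 1 or 2, and a valid check digit under the 80840 prefix."""
    npi = (value or "").strip()
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    if npi[0] not in "12":
        return False
    return luhn_valid("80840" + npi)
```

A check like this separates healthcare exclusions that carry a usable NPI from rows where the field is blank, zero-filled, or holds some other identifier.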

Social Security Administration record of deceased individuals.

License
Limited Access DMF requires certification under 15 CFR Part 1110 (implementing 42 USC § 1306c)
Refresh cadence
Weekly updates to subscribed users
Mapped to AINPI
Notes
The Public DMF excludes deaths within the prior 3 years. Full DMF requires NTIS subscription and SSA certification — a procurement effort each state Medicaid agency manages independently.

Providers precluded from receiving payment for Medicare Advantage items, services, or Part D drugs. Created by 42 CFR § 422.222 / § 423.120.

License
Restricted — Medicare Advantage Part C plans and Part D sponsors only
Refresh cadence
Monthly, first business day of the month
Mapped to AINPI
Notes
NOT publicly downloadable. AINPI cannot ingest this list. State Medicaid agencies relying on Preclusion signal must coordinate with their MCOs directly. Documenting it here so the limitation is explicit.

Audit inputs

The data AINPI joins, validates, and aggregates to produce its published findings.

FHIR R4 NDJSON bulk export of the federal provider directory: 6 resource types, 21.7M resources at the 2026-05-08 release (down from 27.2M in April).

License
Public domain (US federal government work)
Refresh cadence
Periodic — most recent release pinned per audit
Mapped to AINPI
All findings. The NDH artifact is what AINPI audits.
Notes
Distributed as zstd-compressed NDJSON (2.8 GB compressed, 40.7 GB uncompressed). Loaded into BigQuery as resource:JSON columns plus extracted flat _* fields per resource type.
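
The "resource JSON plus flat fields" row shape can be sketched as follows. Decompression is left out, and the flat field names _id and _resource_type are hypothetical stand-ins for whatever _* fields the pipeline actually extracts.

```python
import json
from collections import Counter


def to_row(line: str) -> dict:
    """Turn one NDJSON line into a row: the full resource plus a few flat fields."""
    resource = json.loads(line)
    return {
        "resource": json.dumps(resource, separators=(",", ":")),
        "_id": resource.get("id"),                       # hypothetical flat field
        "_resource_type": resource.get("resourceType"),  # hypothetical flat field
    }


def load_rows(lines):
    """Convert an iterable of NDJSON lines, skipping blanks, and count resource types."""
    rows, counts = [], Counter()
    for line in lines:
        if not line.strip():
            continue
        row = to_row(line)
        counts[row["_resource_type"]] += 1
        rows.append(row)
    return rows, counts
```

Keeping the full resource next to the flat fields lets simple filters run on plain columns while less common elements stay queryable from the JSON.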

A standardized set of 900+ specialty classification codes, used by NPPES, by Practitioner.qualification, and by the CMS Medicare crosswalk.

License
Public; permission required for redistribution
Refresh cadence
Quarterly
Mapped to AINPI
H12 — taxonomy validity. Companion to H13.
Notes
Source: bigquery-public-data.nppes.healthcare_provider_taxonomy_code_set_170. Pinned to v17.0 for the 2026-05-08 audit.

Authoritative CMS crosswalk between NUCC taxonomy codes and CMS Medicare specialty codes. The mapping is many-to-many: one code on either side can map to several on the other.

License
Public domain
Refresh cadence
Quarterly
Mapped to AINPI
H13 — bridges PractitionerRole.specialty (CMS Medicare codes) to Practitioner.qualification (NUCC codes). Pinned to October 2025 release.
Notes
The CSV has embedded newlines inside quoted fields, which BigQuery's default loader rejects, so the pipeline parses it with the Python csv module (RFC 4180-compliant) and streams the rows as NDJSON.
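
A minimal sketch of that CSV-to-NDJSON step, using only the standard library. The function name and the sample column names are hypothetical; the point is that the csv module keeps a quoted multi-line field inside one record, and json.dumps escapes the newline so each record lands on exactly one output line.

```python
import csv
import io
import json


def csv_to_ndjson(src, dst) -> int:
    """Stream CSV records from src to dst as NDJSON; return the record count.

    When src is a real file, open it with newline="" so the csv module
    sees embedded line breaks unchanged.
    """
    count = 0
    for row in csv.DictReader(src):
        dst.write(json.dumps(row, ensure_ascii=False) + "\n")
        count += 1
    return count


# Synthetic input with a line break inside a quoted field.
src = io.StringIO('code,description\n"207Q00000X","Family\nMedicine"\n')
dst = io.StringIO()
written = csv_to_ndjson(src, dst)
```

Streaming row by row keeps memory use flat regardless of file size.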

Empirical L0–L7 reachability scoring of every Endpoint.address declared in the NDH bulk export.

License
Apache-2.0 (the crawler code); endpoints themselves are public APIs
Refresh cadence
Out-of-band; monthly today, weekly target
Mapped to AINPI
H1–H5 endpoint liveness; H22 network adequacy gauge
Notes
Polite crawler — 1 req/sec/host, named User-Agent, documented source IP. Runs on a dedicated host outside CI runners to avoid bad-neighbor behavior.
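
The 1 req/sec/host limit could be enforced with a small per-host gate like the one below. This is a sketch under assumptions, not the crawler's actual code: the class name is hypothetical, and the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time
from urllib.parse import urlsplit


class HostRateLimiter:
    """Allow at most one request per min_interval seconds for each host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time the next request may start

    def wait(self, url: str) -> float:
        """Block until the URL's host may be contacted; return seconds slept."""
        host = urlsplit(url).netloc.lower()
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self._sleep(delay)
        # The request starts at max(now, ready_at); schedule the next slot after it.
        self._next_allowed[host] = max(now, ready_at) + self.min_interval
        return delay
```

Calling wait(url) before each request spaces out hits to the same host while leaving requests to different hosts unthrottled.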

Considered, not ingested

Datasets that come up in directory-quality conversations but sit outside the current AINPI scope. Documented here so the boundary is explicit.

CAQH ProView

out-of-scope

Commercial credentialing source maintained by the Council for Affordable Quality Healthcare.

License
Subscription / member access only
Refresh cadence
Continuous
Mapped to AINPI
Notes
Not in the federal NDH ingestion pipeline (commercial-payer data). See /insights for the full provenance discussion of the CAQH gap.

Public-domain transparency database of payments from drug/device manufacturers to providers under Section 6002 of the ACA.

License
Public domain
Refresh cadence
Annual + interim updates
Mapped to AINPI
Future — provider-context enrichment, not directory accuracy. Lower priority than 42 CFR § 455.436 sources.
Notes
Could enrich /findings/[slug] pages with payments context per NPI. Not relevant to revalidation decisions per se.

Aggregated services rendered per provider per HCPCS code under Medicare Part B.

License
Public domain
Refresh cadence
Annual
Mapped to AINPI
Notes
Claims-derived; AINPI is provider-directory only. Out of scope.

Health Professional Shortage Areas, Medically Underserved Areas, and National Health Service Corps participants.

License
Public domain
Refresh cadence
Continuous
Mapped to AINPI
Future — geography enrichment for state-scoped pages (rural / underserved-area context).
Notes
Could surface "% of state Medicaid roster in an HPSA" alongside state-scoped findings.

Each state publishes its own provider exclusion list. Format and access vary widely across the 43+ state-published lists.

License
Public per state
Refresh cadence
Varies (monthly to ad-hoc)
Mapped to AINPI
Future — state-specific high-risk cohort enrichment for /states/[state] pages.
Notes
Aggregating these would be a meaningful contribution. Per-state ingestion adapters are needed: some states publish CSV, others PDF, and others only web pages that must be scraped.

How to add a new data source

  1. Open an issue using the "new metric proposal" or "data source addition" template. Include the primary-source URL, license, refresh cadence, and which AINPI hypothesis it would inform.
  2. Pre-register the methodology (null hypothesis, denominator) before any numbers are computed. Add a row to frontend/src/data/findings.ts with status: 'pre-registered'.
  3. Ship an ingestion script at analysis/ingest_<source>.py with a polite User-Agent, RFC-compliant CSV parsing, and an explicit BigQuery destination. License headers required.
  4. Publish the finding by writing to frontend/public/api/v1/findings/<slug>.json. Methodology version bump goes in docs/methodology/index.md.
  5. Add a row to this page so the addition is visible alongside its peers.

Source code: analysis/. PRs welcome under Apache-2.0.