Electronic Health Record Datasets

1. All of Us Research Program

The All of Us Research Program is a historic, longitudinal effort initiated by the National Institutes of Health (NIH) in the United States. It aims to gather data from one million or more people living in the United States to accelerate research and improve health. Its objectives include taking into account individual differences in lifestyle, environment, and biology to find ways to deliver precision medicine.

All of Us | View All-of-Us Videos

2. UK Biobank

The UK Biobank is a large-scale biomedical database and research resource, containing in-depth genetic and health information from half a million UK participants. The project aims to enable research into the prevention, diagnosis, and treatment of a wide range of serious and life-threatening illnesses. It includes EHR data sourced from the National Health Service (NHS), among other types of data.

UK Biobank | View UK Biobank EHR Data

3. MIMIC-III

The Medical Information Mart for Intensive Care (MIMIC)-III database provides critical care data for over 40,000 patients admitted to intensive care units at the Beth Israel Deaconess Medical Center (BIDMC).

MIMIC-III | Publication

4. MIMIC-IV

MIMIC-IV aims to carry on the success of MIMIC-III, with a number of changes to improve usability of the data and enable more research applications.

MIMIC-IV | Publication

5. PCORnet

PCORnet is a national resource, funded by PCORI, in which high-quality health data, patient partnership, and research expertise deliver fast, trustworthy answers that advance health outcomes.

PCORnet

6. eICU

The eICU Collaborative Research Database is a freely available multi-center database for critical care research.

eICU | Code

7. AmsterdamUMCdb

AmsterdamUMCdb is the first freely accessible intensive care database from within the European Union. It contains de-identified health data related to tens of thousands of European intensive care unit admissions, including demographics, vital signs, laboratory tests, and medications.

AmsterdamUMCdb | Github

8. HiRID

HiRID is a freely accessible critical care dataset containing data relating to more than 33,000 admissions to the Department of Intensive Care Medicine of the Bern University Hospital, Switzerland, an interdisciplinary 60-bed unit admitting more than 6,500 patients per year.

HiRID | Publication

9. GENIE

GENIE (Generative Note Information Extraction) is an end-to-end model designed to structure free text from electronic health records (EHRs). It processes EHRs in a single pass, extracting biomedical named entities along with their assertion statuses, body locations, modifiers, values, units, and intended purposes, outputting this information in a structured JSON format.
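To make the idea of structured JSON output concrete, here is a minimal sketch of what one extracted record could look like. The class name `ExtractedEntity`, every field name, and the two sample entities are hypothetical and chosen only to mirror the attributes listed above; they are not GENIE's actual schema or output.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List, Optional


@dataclass
class ExtractedEntity:
    # Hypothetical field names mirroring the attributes described above;
    # this is NOT GENIE's real output schema.
    entity: str
    assertion_status: str
    body_location: Optional[str] = None
    modifiers: List[str] = field(default_factory=list)
    value: Optional[str] = None
    unit: Optional[str] = None
    purpose: Optional[str] = None


def to_json(entities: List[ExtractedEntity]) -> str:
    """Serialize a list of extracted entities to a JSON string."""
    return json.dumps([asdict(e) for e in entities], indent=2)


# Invented sample entities, for illustration only.
example = [
    ExtractedEntity(
        entity="chest pain",
        assertion_status="present",
        body_location="chest",
        modifiers=["intermittent"],
    ),
    ExtractedEntity(
        entity="metoprolol",
        assertion_status="present",
        value="25",
        unit="mg",
        purpose="hypertension",
    ),
]

print(to_json(example))
```

Attributes that do not apply to an entity are left as `None` here, so they serialize to JSON `null`; a real system might omit them instead.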

GENIE

10. PMC-Patients

PMC-Patients is a first-of-its-kind dataset consisting of 167k patient summaries extracted from case reports in PubMed Central (PMC), together with 3.1M patient-article relevance annotations and 293k patient-patient similarity annotations defined by the PubMed citation graph.

Supported Tasks: PMC-Patients serves as the foundation for benchmarking Retrieval-based Clinical Decision Support systems through two key tasks: Patient-to-Article Retrieval, which focuses on identifying the most relevant articles for a given patient case, and Patient-to-Patient Retrieval, which identifies similar patient cases to assist in clinical reasoning and decision-making.
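The retrieval setup can be illustrated with a toy sketch: rank candidate texts against a patient summary, then score the ranking with reciprocal rank. The Jaccard token-overlap similarity, the function names, and the sample texts below are all assumptions made for illustration; they are not the benchmark's data, retrievers, or official evaluation code.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-overlap similarity between two texts (toy stand-in for a real retriever)."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def rank_candidates(query, candidates):
    """Return candidate ids sorted by descending similarity to the query."""
    scored = [(jaccard(query, text), cid) for cid, text in candidates.items()]
    # Sort by score (highest first), breaking ties by id for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [cid for _, cid in scored]


def reciprocal_rank(ranking, relevant):
    """1 / position of the first relevant id in the ranking, or 0.0 if none appears."""
    for position, cid in enumerate(ranking, start=1):
        if cid in relevant:
            return 1.0 / position
    return 0.0


# Invented toy texts, not drawn from PMC-Patients.
patient = "fever and cough in an adult with pneumonia"
articles = {
    "a1": "management of knee fracture in athletes",
    "a2": "adult pneumonia presenting with fever and cough",
    "a3": "cough in children",
}

ranking = rank_candidates(patient, articles)
print(ranking)
print(reciprocal_rank(ranking, {"a2"}))
```

The same functions would cover the patient-to-patient case if `articles` were replaced by a dictionary of other patient summaries.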

Notes: This dataset is generated from free text available in PMC.

PMC-Patients | Publication

11. RareArena

RareArena is a comprehensive rare disease diagnostic dataset with nearly 50,000 patients covering more than 4,000 diseases.

Notes: This dataset is derived from PMC-Patients and has been further developed to focus on rare diseases.

RareArena