Datasets

Breast Cancer
UCI
OpenML

This dataset includes 286 instances described by nine attributes, including categorical features, and is an example of imbalanced data. The goal of the corresponding predictive task is to predict the recurrence of breast cancer.
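Because class imbalance affects how a model on this data should be evaluated, it can help to quantify it before training. The following is a minimal sketch, assuming the labels are available as a plain Python sequence; the function name `imbalance_ratio` and the toy labels are hypothetical and do not reflect the real class distribution of the dataset.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (class counts, majority/minority ratio) for a label sequence."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to measure imbalance")
    majority = max(counts.values())
    minority = min(counts.values())
    return dict(counts), majority / minority


# Hypothetical toy labels, NOT the real class distribution of the dataset.
toy_labels = ["negative"] * 7 + ["positive"] * 3
counts, ratio = imbalance_ratio(toy_labels)
print(counts, round(ratio, 2))
```

A ratio close to 1 indicates balanced classes; larger values suggest that metrics such as balanced accuracy or AUC may be more informative than plain accuracy.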

Diabetes 130 US
UCI
OpenML
Kaggle

The dataset represents EHR results collected over ten years (1999-2008) in clinical care units at 130 US hospitals and integrated delivery networks. The data include 101 766 observations covering the patient's condition at the time of admission, information about the diagnosis, and the number of tests performed.

Diagnosis of COVID-19 (Subset)
Kaggle

The dataset contains anonymized information about patients admitted to the Hospital Israelita Albert Einstein in São Paulo, Brazil. Patients were admitted for a SARS-CoV-2 RT-PCR test, and additional laboratory tests were performed during the hospital visit. The dataset was published in 2020.

GOSSIS-1-eICU Model Ready
PhysioNet

The data were collected within the Global Open Source Severity of Illness Score (GOSSIS) project and cover the subset of patients in the USA derived from the eICU Collaborative Research Database (eICU-CRD). The dataset consists of information reported within the first 24 hours after admission for 131 thousand unique patients from 204 hospitals, covering ICU admissions discharged in 2014-15.

HCV data
UCI
OpenML
Kaggle

The dataset contains results for 615 individuals, comprising blood donors and Hepatitis C patients. Demographic features such as age are reported alongside laboratory results.

Heart Disease (Comprehensive)
OpenML

This dataset was curated by combining five datasets over 11 shared features, making it the largest heart disease dataset available for research. Although the data is shared on OpenML, it originates from separate research studies that were merged as part of a meta-analysis.

Hepatitis
UCI
Kaggle

Data for mortality prediction among patients with hepatitis symptoms, including fatigue, anorexia, and an enlarged liver. The EHR-derived variables we consider are albumin and bilirubin levels. This dataset is available mostly for educational purposes and has been employed in machine learning research since the 2000s.

HiRID Preprocessed
PhysioNet

HiRID (High time-resolution ICU dataset) is a freely accessible critical care dataset containing data from almost 34 thousand patients admitted to the Department of Intensive Care Medicine (ICU) of the Bern University Hospital in Switzerland. HiRID offers a high time resolution, most importantly for bedside monitoring, with most parameters recorded every 2 minutes. In this study, we select only the variables included in the preprocessed data provided by the authors.

ILPD
UCI
OpenML
Kaggle

The dataset was collected to detect patients with liver disease. The data come from Andhra Pradesh, India, and contain information about 583 patients described by 11 variables.

metaMIMIC
PhysioNet

A dataset extracted from the MIMIC-IV database. It contains a collection of 12 binary classification tasks, each concerning the occurrence of a specific disease reported as ICD codes. The MIMIC-IV database is one of the most widely used sources of high-volume EHR data.
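To illustrate how disease occurrence reported as ICD codes can be turned into several binary targets, here is a minimal sketch. It is not the metaMIMIC extraction procedure: the function `build_binary_targets`, the patient records, and the task-to-prefix mapping are all hypothetical and chosen only for illustration.

```python
def build_binary_targets(patient_codes, task_prefixes):
    """Derive one 0/1 target per task from each patient's ICD codes.

    patient_codes: dict mapping patient id -> list of ICD code strings
    task_prefixes: dict mapping task name -> tuple of ICD code prefixes
    """
    targets = {}
    for patient_id, codes in patient_codes.items():
        targets[patient_id] = {
            # A task is positive if any code starts with one of its prefixes.
            task: int(any(code.startswith(prefixes) for code in codes))
            for task, prefixes in task_prefixes.items()
        }
    return targets


# Hypothetical records and task definitions (illustrative only).
patients = {"p1": ["E119", "I10"], "p2": ["J189"], "p3": []}
tasks = {"diabetes": ("E10", "E11"), "hypertension": ("I10",)}

for patient_id, labels in build_binary_targets(patients, tasks).items():
    print(patient_id, labels)
```

Each task then yields its own binary label column over the same set of patients, so the tasks can be modeled independently.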

Pima Indians Diabetes
OpenML
Kaggle

The dataset originally came from the National Institute of Diabetes and Digestive and Kidney Diseases, but access to the data was restricted because of ethical guidelines. The objective is to predict whether a patient has diabetes based on certain diagnostic measurements. This is one of the most popular datasets used to introduce machine learning methods.

Thyroid Disease
UCI
OpenML
Kaggle

This dataset was created by combining 6 different sources, all of them collected in Australia. The dataset is used to identify prognostic factors in thyroid disease among 30 different features. These include blood test results as well as information from the patient interview.

Cardiovascular Study
Kaggle

Open-source data from a cardiovascular study on residents of Framingham, Massachusetts. The classification goal is to assess a patient's 10-year risk of future coronary heart disease (CHD). The dataset contains 238 instances and 16 variables, including demographic data, survey information, and a few EHR-based fields.

Diabetes Health Indicators
Kaggle

The dataset includes 253 680 survey responses from the 2015 edition of the Behavioral Risk Factor Surveillance System (BRFSS), a health-related telephone survey collected annually since 1984. The original data describes over 300 variables, but the cleaned version contains 22 features.

Heart Disease Indicators
Kaggle

The dataset includes 253 680 survey responses from the 2015 edition of the Behavioral Risk Factor Surveillance System (BRFSS), a health-related telephone survey collected annually since 1984. The original data describes over 300 variables, but the cleaned version contains 22 features.

Stroke Prediction
Kaggle

The dataset describes 5 110 instances with 12 variables used to predict whether a patient is likely to have a stroke. The data has no credentialed source and was made available for educational purposes. Variables include gender, age, information about comorbidities, and smoking status.

EHR

Survey