import pandas as pd
heart = pd.read_parquet("https://notes.cs307.org/data/heart.parquet")
Heart Disease
Predicting Angiographic Disease Status
Introduction
This page describes the Heart Disease dataset, which is used throughout the CS 307 notes.
Source
The Heart Disease data was accessed through the UCI Machine Learning Repository.
The data was collected from the following four locations:
- Cleveland Clinic Foundation
- Hungarian Institute of Cardiology, Budapest
- V.A. Medical Center, Long Beach, CA
- University Hospital, Zurich, Switzerland
The contributors of the data have requested that any publications resulting from the use of the data include the names of the principal investigators responsible for the data collection at each institution. They are:
- Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
- University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
- University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
- V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
Heart Disease
Heart disease, particularly coronary artery disease, remains one of the leading causes of death worldwide. Coronary artery disease occurs when the major blood vessels that supply the heart with blood, oxygen, and nutrients become damaged or diseased. A common cause is atherosclerosis, the buildup of cholesterol-containing deposits (plaques) that narrows the coronary arteries.
Angiography is the gold standard diagnostic procedure for detecting coronary artery disease. During coronary angiography, a special dye is injected into the bloodstream, and X-ray imaging is used to visualize the coronary arteries. This procedure allows physicians to see the exact location and severity of blockages in the arteries. The diagnosis is typically based on the percentage of diameter narrowing in the major vessels. In this data, a narrowing of more than 50% in one or more major vessels is considered an indication of significant heart disease.
While angiography provides definitive diagnosis, it is an invasive and expensive procedure. Therefore, physicians often use non-invasive clinical measurements and tests to assess a patient’s risk before recommending angiography. These include demographic factors (age, sex), physical measurements (blood pressure, cholesterol), and diagnostic tests (electrocardiogram results, echocardiogram results, exercise stress tests). The goal is to predict which patients are most likely to have significant coronary artery disease and would benefit from further investigation.
Data Dictionary
The Heart Disease dataset includes clinical measurements from patients at four medical institutions. The data includes demographic information, physical measurements, and test results used to diagnose the presence (and severity) of heart disease based on angiographic narrowing of coronary arteries.
age
- an integer denoting the patient’s age in years
sex
- a categorical variable denoting the patient’s sex
  - 1: male
  - 0: female
cp
- a categorical variable denoting chest pain type
  - 1: typical angina
  - 2: atypical angina
  - 3: non-anginal pain
  - 4: asymptomatic
trestbps
- an integer denoting resting blood pressure (mm Hg) on admission to the hospital
chol
- an integer denoting serum cholesterol in mg/dl
fbs
- a categorical variable denoting fasting blood sugar > 120 mg/dl
  - 1: true
  - 0: false
restecg
- a categorical variable denoting resting electrocardiographic results
  - 0: normal
  - 1: ST-T wave abnormality
  - 2: showing probable or definite left ventricular hypertrophy
thalach
- an integer denoting maximum heart rate achieved
exang
- a categorical variable denoting exercise-induced angina
  - 1: yes
  - 0: no
oldpeak
- a number denoting ST depression induced by exercise relative to rest
slope
- a categorical variable denoting the slope of the peak exercise ST segment
  - 1: up-sloping
  - 2: flat
  - 3: down-sloping
ca
- an integer denoting the number of major vessels (0-3) colored by fluoroscopy
thal
- a categorical variable denoting thalassemia
  - 3: normal
  - 6: fixed defect
  - 7: reversible defect
location
- a categorical variable denoting the data collection location
  - cl: Cleveland Clinic Foundation
  - ch: University Hospital, Zurich, Switzerland
  - hu: Hungarian Institute of Cardiology, Budapest
  - va: V.A. Medical Center, Long Beach, CA
num
- a categorical variable denoting the diagnosis of heart disease (angiographic disease status)
  - v0: no presence (no major vessels with more than 50% diameter narrowing)
  - v1: presence (more than 50% diameter narrowing in 1 major vessel)
  - v2: presence (more than 50% diameter narrowing in 2 major vessels)
  - v3: presence (more than 50% diameter narrowing in 3 major vessels)
  - v4: presence (more than 50% diameter narrowing in 4 major vessels)
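Because v0 denotes no major vessel with more than 50% narrowing and v1 through v4 all denote presence, num can be collapsed into a binary indicator when only presence versus absence matters. The following is a minimal sketch on a small hand-built frame rather than the real data; the column name `disease` is a hypothetical choice and is not part of the dataset.

```python
import pandas as pd

# A small stand-in for the `num` column; the values follow the data dictionary.
toy = pd.DataFrame({"num": ["v0", "v2", "v1", "v0", "v4"]})

# Any value other than "v0" indicates more than 50% narrowing in at least one
# major vessel. This assumes `num` has no missing values, since a missing
# entry would also compare as "not v0".
# `disease` is a hypothetical column name used only for illustration.
toy["disease"] = (toy["num"] != "v0").astype(int)

print(toy)
```

Whether to keep the five-level response or the binary version depends on the modeling task; the sketch only shows that the conversion is a one-line comparison.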
Data in Python
The Heart Disease data is made available for download as a Parquet file.
This data can be loaded directly from the web.
heart
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | location | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.00 | 1.00 | 1.00 | 145.00 | 233.00 | 1.00 | 2.00 | 150.00 | 0.00 | 2.30 | 3.00 | 0.00 | 6.00 | cl | v0 |
| 1 | 67.00 | 1.00 | 4.00 | 160.00 | 286.00 | 0.00 | 2.00 | 108.00 | 1.00 | 1.50 | 2.00 | 3.00 | 3.00 | cl | v2 |
| 2 | 67.00 | 1.00 | 4.00 | 120.00 | 229.00 | 0.00 | 2.00 | 129.00 | 1.00 | 2.60 | 2.00 | 2.00 | 7.00 | cl | v1 |
| 3 | 37.00 | 1.00 | 3.00 | 130.00 | 250.00 | 0.00 | 0.00 | 187.00 | 0.00 | 3.50 | 3.00 | 0.00 | 3.00 | cl | v0 |
| 4 | 41.00 | 0.00 | 2.00 | 130.00 | 204.00 | 0.00 | 2.00 | 172.00 | 0.00 | 1.40 | 1.00 | 0.00 | 3.00 | cl | v0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 915 | 54.00 | 0.00 | 4.00 | 127.00 | 333.00 | 1.00 | 1.00 | 154.00 | 0.00 | 0.00 | NaN | NaN | NaN | va | v1 |
| 916 | 62.00 | 1.00 | 1.00 | NaN | 139.00 | 0.00 | 1.00 | NaN | NaN | NaN | NaN | NaN | NaN | va | v0 |
| 917 | 55.00 | 1.00 | 4.00 | 122.00 | 223.00 | 1.00 | 1.00 | 100.00 | 0.00 | 0.00 | NaN | NaN | 6.00 | va | v2 |
| 918 | 58.00 | 1.00 | 4.00 | NaN | 385.00 | 1.00 | 2.00 | NaN | NaN | NaN | NaN | NaN | NaN | va | v0 |
| 919 | 62.00 | 1.00 | 2.00 | 120.00 | 254.00 | 0.00 | 2.00 | 93.00 | 1.00 | 0.00 | NaN | NaN | NaN | va | v1 |
920 rows × 15 columns
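The preview shows missing values (NaN) in several columns, particularly in the rows from the va location. The following is a minimal sketch of how missing values might be counted per column and per location. It uses a stand-in frame built from rows 0, 1, 915, and 916 of the preview with a subset of the columns; the names `sample`, `missing_per_column`, and `missing_by_location` are hypothetical. With the full data, the same calls would be applied to `heart`.

```python
import numpy as np
import pandas as pd

# Stand-in built from rows 0, 1, 915, and 916 of the preview (subset of columns).
sample = pd.DataFrame(
    {
        "age": [63.0, 67.0, 54.0, 62.0],
        "trestbps": [145.0, 160.0, 127.0, np.nan],
        "ca": [0.0, 3.0, np.nan, np.nan],
        "thal": [6.0, 3.0, np.nan, np.nan],
        "location": ["cl", "cl", "va", "va"],
    }
)

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Number of missing values in each column, split by collection location.
missing_by_location = (
    sample.drop(columns="location").isna().groupby(sample["location"]).sum()
)

print(missing_per_column)
print(missing_by_location)
```

Counts from four rows say nothing about the full dataset; the point is only the pattern of calls.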
Data Development
The following code was used to create the data frame loaded above.
import pandas as pd
import numpy as np

# read in each "processed" dataset from UCI
base_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/"
hd_cl = pd.read_csv(f"{base_url}processed.cleveland.data", header=None)
hd_hu = pd.read_csv(f"{base_url}processed.hungarian.data", header=None)
hd_ch = pd.read_csv(f"{base_url}processed.switzerland.data", header=None)
hd_va = pd.read_csv(f"{base_url}processed.va.data", header=None)

# add location variable for each dataset
hd_ch["location"] = "ch"
hd_cl["location"] = "cl"
hd_hu["location"] = "hu"
hd_va["location"] = "va"

# combine the four locations into one dataset
heart = pd.concat([hd_cl, hd_ch, hd_hu, hd_va], ignore_index=True)

# add column names
heart.columns = [
    "age",
    "sex",
    "cp",
    "trestbps",
    "chol",
    "fbs",
    "restecg",
    "thalach",
    "exang",
    "oldpeak",
    "slope",
    "ca",
    "thal",
    "num",
    "location",
]

# reorder columns to place location before num
cols = [col for col in heart.columns if col not in ["location", "num"]] + ["location", "num"]
heart = heart[cols]

# switch "?" to a missing indicator
heart = heart.replace("?", np.nan)

# convert should-be numeric columns to a numeric type (float64)
numeric_cols = ["age", "trestbps", "chol", "thalach", "oldpeak"]
for col in numeric_cols:
    heart[col] = pd.to_numeric(heart[col], errors="coerce")

# convert should-be integer columns to a numeric type (float64)
int_cols = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal", "num"]
for col in int_cols:
    heart[col] = pd.to_numeric(heart[col], errors="coerce")

# rename response variable values
heart["num"] = heart["num"].map({0: "v0", 1: "v1", 2: "v2", 3: "v3", 4: "v4"})
This code can be used as an alternative to the above loading procedures.
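The integer-coded columns end up as float64 because a standard NumPy integer column cannot hold missing values. One possible alternative, which is not part of the development code above, is the pandas nullable Int64 dtype. The following is a minimal sketch on a stand-in column; the names `raw`, `as_float`, and `as_int` are hypothetical.

```python
import pandas as pd

# Stand-in for an integer-coded column as read from a raw file,
# with "?" marking a missing value.
raw = pd.Series(["0.0", "3.0", "?", "2.0"], name="ca")

# errors="coerce" turns any entry that cannot be parsed, including "?",
# into NaN, so the result is a float64 column.
as_float = pd.to_numeric(raw, errors="coerce")

# Optional extra step (an assumption, not part of the original pipeline):
# the nullable Int64 dtype stores whole-number codes and still allows
# missing values, which appear as <NA>.
as_int = as_float.astype("Int64")

print(as_float.dtype)
print(as_int.dtype)
```

Keeping float64, as the development code does, stays closer to what most NumPy-based tools expect, so the nullable dtype is a matter of preference rather than a correction.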