Heart Disease

Predicting Angiographic Disease Status

Introduction

This page presents information about the Heart Disease dataset, which will be used throughout the CS 307 notes.

Source

The Heart Disease data was accessed through the UCI Machine Learning Repository.

Documentation: UCI Machine Learning Repository

The data was collected from the following four locations:

Cleveland Clinic Foundation
Hungarian Institute of Cardiology, Budapest
V.A. Medical Center, Long Beach, CA
University Hospital, Zurich, Switzerland

The contributors of the data have requested that any publications resulting from the use of the data include the names of the principal investigator responsible for the data collection at each institution. They are:

Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

Heart Disease

Heart disease, particularly coronary artery disease, remains one of the leading causes of death worldwide. Coronary artery disease occurs when the major blood vessels that supply the heart with blood, oxygen, and nutrients become damaged or diseased. The usual cause is atherosclerosis, the buildup of cholesterol-containing deposits (plaques) in the coronary arteries, which narrows the vessels and restricts blood flow.

Angiography is the gold standard diagnostic procedure for detecting coronary artery disease. During coronary angiography, a special dye is injected into the bloodstream, and X-ray imaging is used to visualize the coronary arteries. This procedure allows physicians to see the exact location and severity of blockages in the arteries. The diagnosis is typically based on the percentage of diameter narrowing in the major vessels. In this data, a narrowing greater than 50% in one or more major vessels is considered an indication of significant heart disease.

While angiography provides definitive diagnosis, it is an invasive and expensive procedure. Therefore, physicians often use non-invasive clinical measurements and tests to assess a patient’s risk before recommending angiography. These include demographic factors (age, sex), physical measurements (blood pressure, cholesterol), and diagnostic tests (electrocardiogram results, echocardiogram results, exercise stress tests). The goal is to predict which patients are most likely to have significant coronary artery disease and would benefit from further investigation.

Data Dictionary

The Heart Disease dataset includes clinical measurements from patients at four medical institutions. The data includes demographic information, physical measurements, and test results used to diagnose the presence (and severity) of heart disease based on angiographic narrowing of coronary arteries.

age

an integer denoting the patient’s age in years

sex

a categorical variable denoting the patient’s sex
- 1: male
- 0: female

cp

a categorical variable denoting chest pain type
- 1: typical angina
- 2: atypical angina
- 3: non-anginal pain
- 4: asymptomatic
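
Coded variables such as cp are stored as numbers, so it can help to attach the labels from this dictionary before plotting or summarizing. A minimal sketch, assuming the codes arrive as floats (as in the loaded data frame); the `cp` values below are made up for illustration, and `cp_labels` and `cp_named` are hypothetical names:

```python
import pandas as pd

# Labels taken from the data dictionary entry for cp.
cp_labels = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}

# Made-up example codes, stored as floats.
cp = pd.Series([1.0, 4.0, 4.0, 3.0, 2.0], name="cp")

# Map codes to labels, then fix the category order to match the dictionary.
cp_named = cp.map(cp_labels).astype(
    pd.CategoricalDtype(categories=list(cp_labels.values()))
)
print(cp_named.value_counts(sort=False))
```

The same pattern should carry over to the other coded variables (restecg, slope, thal) by swapping in their dictionaries.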

trestbps

an integer denoting resting blood pressure (mm Hg) on admission to the hospital

chol

an integer denoting serum cholesterol in mg/dl

fbs

a categorical variable denoting fasting blood sugar > 120 mg/dl
- 1: true
- 0: false

restecg

a categorical variable denoting resting electrocardiographic results
- 0: normal
- 1: ST-T wave abnormality
- 2: showing probable or definite left ventricular hypertrophy

thalach

an integer denoting maximum heart rate achieved

exang

a categorical variable denoting exercise-induced angina
- 1: yes
- 0: no

oldpeak

a number denoting ST depression induced by exercise relative to rest

slope

a categorical variable denoting the slope of the peak exercise ST segment:
- 1: up-sloping
- 2: flat
- 3: down-sloping

ca

an integer denoting the number of major vessels (0 - 3) colored by fluoroscopy

thal

a categorical variable denoting the thallium stress test result (sometimes mislabeled as thalassemia):
- 3: normal
- 6: fixed defect
- 7: reversible defect

location

a categorical variable denoting the data collection location
- cl: Cleveland Clinic Foundation
- ch: University Hospital, Zurich, Switzerland
- hu: Hungarian Institute of Cardiology, Budapest
- va: V.A. Medical Center, Long Beach, CA

num

a categorical variable denoting the diagnosis of heart disease (angiographic disease status)
- v0: no presence (no major vessels with more than 50% diameter narrowing)
- v1: presence (more than 50% diameter narrowing in 1 major vessel)
- v2: presence (more than 50% diameter narrowing in 2 major vessels)
- v3: presence (more than 50% diameter narrowing in 3 major vessels)
- v4: presence (more than 50% diameter narrowing in 4 major vessels)
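
Because every level other than v0 indicates significant disease, num is often collapsed to a binary presence indicator. A minimal sketch, assuming num contains only the labels v0 through v4 with no missing values; the `toy` frame and the `disease` and `n_vessels` column names are hypothetical, introduced only for illustration:

```python
import pandas as pd

# Hypothetical miniature frame; only the num column matters here.
toy = pd.DataFrame({"num": ["v0", "v2", "v1", "v0", "v4"]})

# 1 if at least one major vessel has more than 50% narrowing, else 0.
toy["disease"] = (toy["num"] != "v0").astype(int)

# Number of affected vessels, recovered from the digit in the label.
toy["n_vessels"] = toy["num"].str[1:].astype(int)

print(toy)
```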

Data in Python

The Heart Disease data is made available for download as a Parquet file.

heart.parquet

This data can be loaded directly from the web.

import pandas as pd
heart = pd.read_parquet("https://notes.cs307.org/data/heart.parquet")

heart

	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	ca	thal	location	num
0	63.00	1.00	1.00	145.00	233.00	1.00	2.00	150.00	0.00	2.30	3.00	0.00	6.00	cl	v0
1	67.00	1.00	4.00	160.00	286.00	0.00	2.00	108.00	1.00	1.50	2.00	3.00	3.00	cl	v2
2	67.00	1.00	4.00	120.00	229.00	0.00	2.00	129.00	1.00	2.60	2.00	2.00	7.00	cl	v1
3	37.00	1.00	3.00	130.00	250.00	0.00	0.00	187.00	0.00	3.50	3.00	0.00	3.00	cl	v0
4	41.00	0.00	2.00	130.00	204.00	0.00	2.00	172.00	0.00	1.40	1.00	0.00	3.00	cl	v0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
915	54.00	0.00	4.00	127.00	333.00	1.00	1.00	154.00	0.00	0.00	NaN	NaN	NaN	va	v1
916	62.00	1.00	1.00	NaN	139.00	0.00	1.00	NaN	NaN	NaN	NaN	NaN	NaN	va	v0
917	55.00	1.00	4.00	122.00	223.00	1.00	1.00	100.00	0.00	0.00	NaN	NaN	6.00	va	v2
918	58.00	1.00	4.00	NaN	385.00	1.00	2.00	NaN	NaN	NaN	NaN	NaN	NaN	va	v0
919	62.00	1.00	2.00	120.00	254.00	0.00	2.00	93.00	1.00	0.00	NaN	NaN	NaN	va	v1

920 rows × 15 columns
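
The NaN entries in the final rows suggest that missingness may differ by location. A rough sketch of how one might quantify this, using only seven of the rows displayed above (rows 0, 1, and 915 to 919) and four of the columns; `sample` and `missing_by_location` are illustrative names, and the fractions computed here describe this small excerpt, not the full data:

```python
import numpy as np
import pandas as pd

# Excerpt of the displayed output: rows 0, 1 (cl) and 915 to 919 (va).
sample = pd.DataFrame(
    {
        "trestbps": [145, 160, 127, np.nan, 122, np.nan, 120],
        "slope": [3, 2, np.nan, np.nan, np.nan, np.nan, np.nan],
        "ca": [0, 3, np.nan, np.nan, np.nan, np.nan, np.nan],
        "thal": [6, 3, np.nan, np.nan, 6, np.nan, np.nan],
        "location": ["cl", "cl", "va", "va", "va", "va", "va"],
    }
)

# Fraction of missing values in each column, computed within each location.
missing_by_location = (
    sample.drop(columns="location").isna().groupby(sample["location"]).mean()
)
print(missing_by_location)
```

Replacing `sample` with `heart` in the last expression would give the per-location fractions for the full data frame.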

Data Development

The following code was used to create the data frame loaded above.

import pandas as pd
import numpy as np

# read in each "processed" dataset from UCI
base_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/"
hd_cl = pd.read_csv(f"{base_url}processed.cleveland.data", header=None)
hd_hu = pd.read_csv(f"{base_url}processed.hungarian.data", header=None)
hd_ch = pd.read_csv(f"{base_url}processed.switzerland.data", header=None)
hd_va = pd.read_csv(f"{base_url}processed.va.data", header=None)

# add location variable for each dataset
hd_ch["location"] = "ch"
hd_cl["location"] = "cl"
hd_hu["location"] = "hu"
hd_va["location"] = "va"

# combine the four locations into one dataset
heart = pd.concat([hd_cl, hd_ch, hd_hu, hd_va], ignore_index=True)

# add column names
heart.columns = [
    "age",
    "sex",
    "cp",
    "trestbps",
    "chol",
    "fbs",
    "restecg",
    "thalach",
    "exang",
    "oldpeak",
    "slope",
    "ca",
    "thal",
    "num",
    "location",
]

# reorder columns to place location before num
cols = [col for col in heart.columns if col not in ["location", "num"]] + ["location", "num"]
heart = heart[cols]

# switch "?" to a missing indicator
heart = heart.replace("?", np.nan)

# convert should-be numeric columns to a numeric type (float64)
numeric_cols = ["age", "trestbps", "chol", "thalach", "oldpeak"]
for col in numeric_cols:
    heart[col] = pd.to_numeric(heart[col], errors="coerce")

# convert should-be integer columns to a numeric type (float64)
int_cols = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal", "num"]
for col in int_cols:
    heart[col] = pd.to_numeric(heart[col], errors="coerce")

# rename response variable values
heart["num"] = heart["num"].map({0: "v0", 1: "v1", 2: "v2", 3: "v3", 4: "v4"})

This code can be used as an alternative to loading the Parquet file above.
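
As a possible simplification, not the approach used above, `pd.read_csv` can treat "?" as missing at read time through its `na_values` parameter, so the columns are parsed as numeric directly. A minimal sketch on two made-up rows laid out like the UCI "processed" files; the row values, `raw`, `names`, and `toy` are all illustrative:

```python
import io
import pandas as pd

# Two made-up rows in the comma-separated layout of the "processed" files.
raw = io.StringIO(
    "63,1,4,140,260,0,1,112,1,3.0,2,?,?,2\n"
    "44,0,2,?,209,0,0,127,0,0.0,?,?,?,0\n"
)

names = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num",
]

# na_values="?" converts the placeholder to NaN while parsing.
toy = pd.read_csv(raw, header=None, names=names, na_values="?")
print(toy.dtypes)
```

With this option, the separate replace and `pd.to_numeric` steps would likely become unnecessary, though the result should be checked against the Parquet file before relying on it.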