Heart Disease Prediction with ML methods


The World Health Organization has estimated that 12 million deaths occur worldwide every year due to heart disease. In the United States, cardiovascular disease (CVD) is the leading cause of death in adults. Early diagnosis is key to preventing heart-related deaths. Patients with suspected cardiovascular disease are often sent for multiple tests, and the results help doctors make a diagnosis. The accuracy of that diagnosis often depends on the individual doctor’s knowledge and experience. To improve diagnostic accuracy, a group of medical researchers collected patient data from 4 national hospitals: two in the United States and two in Europe. The data set HD.xlsx includes each patient’s age, gender, 11 test results, and final diagnosis. The data from the two US hospitals and the two European hospitals are saved in 4 tabs labeled “US1”, “US2”, “EU1”, and “EU2”, respectively. The dictionary of all variables is saved in the tab “Dictionary”. Patients within a hospital are assumed to be uncorrelated with one another.

To find the factors that contribute significantly to diagnostic accuracy, I took two approaches in sequence:

a) Detect the presence of CVD: dichotomize the severity variable and use the presence of a narrowed vessel as the response variable.

b) Predict the severity of CVD: build a model with the severity of heart disease as the response variable.
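
The dichotomization in step a) can be sketched in a few lines of Python. This is only an illustration, not the code used in the analysis: it assumes the severity variable is a non-negative integer grade in which 0 means no narrowing and any positive value means disease is present. The actual coding is defined in the “Dictionary” tab, and the function name `to_presence` and the sample values are made up.

```python
def to_presence(severity: int) -> int:
    """Collapse an ordinal severity grade into a binary presence label.

    Assumption (not confirmed by the data dictionary): 0 means no
    narrowing, any positive grade means heart disease is present.
    """
    if severity < 0:
        raise ValueError(f"severity must be non-negative, got {severity}")
    return 1 if severity > 0 else 0


# Made-up severity grades, for illustration only.
severity_grades = [0, 2, 1, 0, 4, 3]
presence = [to_presence(s) for s in severity_grades]
print(presence)  # [0, 1, 1, 0, 1, 1]
```

The binary label is the response for approach a), while the original grade stays the response for approach b).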

I developed several statistical models, including GLM, KNN, SVM, and SDA, that use the medical test results to predict the likelihood that heart disease is present, as well as its severity. The overall prediction accuracy is 84.78% for detecting the presence of CVD and 64.1% for predicting its severity.
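
For readers unfamiliar with the metric, here is a minimal sketch of how a prediction accuracy like the ones above is usually computed. It assumes the standard definition, the share of patients whose predicted label matches the actual diagnosis. The function name `accuracy` and the labels are hypothetical and are not taken from HD.xlsx.

```python
def accuracy(actual, predicted):
    """Return the fraction of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty sample")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Made-up labels: 1 = disease present, 0 = absent.
actual_labels = [0, 1, 1, 0, 1]
predicted_labels = [0, 1, 0, 0, 1]
print(f"{accuracy(actual_labels, predicted_labels):.2%}")  # 80.00%
```

The same function works for the multi-class severity labels, since it only checks whether each prediction equals the true value.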

I also considered the cost of each test and revised the final model to use 6 predictors, balancing cost against effectiveness. The prediction accuracy of the revised model is 80.43% for the presence of CVD and 53.3% for its severity.

Test        Cost                  Turnaround
cp          No additional cost    Immediate results
thestbps    No additional cost    Immediate results
chol        $7.27                 One day of laboratory work
fbs         $5.20                 One day of laboratory work
restecg     $15.50                One day of laboratory work
thalach     $102.90               One day of laboratory work
exang       $87.30                One day of laboratory work
oldpeak     $87.30                One day of laboratory work
slope       $87.30                One day of laboratory work
ca          $100.90               One day of laboratory work
thal        $102.90               One day of laboratory work
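
To make the cost side of the trade-off concrete, the sketch below adds up the listed prices for a chosen set of predictors. It rests on two assumptions: each variable is billed separately (variables that come from the same procedure may share a single charge, in which case this overcounts), and the two immediate tests cost $0. The subset shown is hypothetical and is not the 6-predictor set of the revised model.

```python
# Prices in USD, copied from the cost table; "no additional cost" is 0.
TEST_COST_USD = {
    "cp": 0.00, "thestbps": 0.00, "chol": 7.27, "fbs": 5.20,
    "restecg": 15.50, "thalach": 102.90, "exang": 87.30,
    "oldpeak": 87.30, "slope": 87.30, "ca": 100.90, "thal": 102.90,
}


def panel_cost(tests):
    """Sum the listed price of each distinct test (assumes separate billing)."""
    unknown = sorted(t for t in set(tests) if t not in TEST_COST_USD)
    if unknown:
        raise KeyError(f"no listed price for: {unknown}")
    return round(sum(TEST_COST_USD[t] for t in set(tests)), 2)


# Hypothetical low-cost subset, for illustration only.
cheap_subset = ["cp", "thestbps", "chol", "fbs"]
print(panel_cost(cheap_subset))         # 12.47
print(panel_cost(TEST_COST_USD))        # 596.57 for all 11 tests
```

Comparing the cost of a candidate subset with the accuracy it achieves is one simple way to weigh cost against effectiveness.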
Tianran Zhang
A professional data scientist, an unprofessional hiker, cooker, video creator, day dreamer, and life-long learner.
