Heart Disease Prediction Model (5+ models)

Data Overview

This dataset contains 14 attributes (features). The "target" (goal) field refers to the presence of heart disease in the patient. In the original UCI database it is integer valued from 0 (no presence) to 4; the Kaggle version used here collapses it to a binary label, so in the end we will classify each patient as 0 (no heart disease) or 1 (heart disease).

The performance metrics for our model:

  • Both precision and recall are important
  • Confusion matrix, plus log loss for comparing the models (a minimal sketch of these metrics is shown below)
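
Since both precision and recall matter and the models are later compared by log loss, here is a minimal sketch of how these metrics are computed with scikit-learn. The labels and probabilities below are made up purely for illustration; they are not from the heart-disease data.

# Hypothetical example of the metrics used in this article (illustration only)
import numpy as np
from sklearn.metrics import precision_score, recall_score, confusion_matrix, log_loss

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])                    # true labels
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])                    # hard class predictions
y_prob = np.array([0.2, 0.9, 0.4, 0.1, 0.8, 0.6, 0.7, 0.95])   # predicted P(target = 1)

print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Log loss :", log_loss(y_true, y_prob))          # lower is better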

Outline of the whole article

  1. Importing all the needed libraries
  2. Reading the dataset
  3. Exploratory Data Analysis
  4. Data Processing
  5. Applying machine learning models (SVM, DT, KNN, XGBoost, LR, RF)
  6. Feature Importance
  7. Comparing all the models (log loss)
  8. Conclusion

1. Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib.cm import rainbow
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import plotly.graph_objs as go
import plotly.tools as tls
import os
import gc
import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

2. Reading the dataset

dataset=pd.read_csv('../input/heart-disease-uci/heart.csv')
dataset.head()

dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age 303 non-null int64
sex 303 non-null int64
cp 303 non-null int64
trestbps 303 non-null int64
chol 303 non-null int64
fbs 303 non-null int64
restecg 303 non-null int64
thalach 303 non-null int64
exang 303 non-null int64
oldpeak 303 non-null float64
slope 303 non-null int64
ca 303 non-null int64
thal 303 non-null int64
target 303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB

3. Exploratory Data Analysis

print('~> Do not have heart disease (target = 0):\n   {}%'.format(100 - round(dataset['target'].mean()*100, 2)))
print('\n~> Have heart disease (target = 1):\n   {}%'.format(round(dataset['target'].mean()*100, 2)))
~> Do not have heart disease (target = 0):
   45.54%

~> Have heart disease (target = 1):
   54.46%
rcParams['figure.figsize'] = 20, 14
plt.matshow(dataset.corr())
plt.yticks(np.arange(dataset.shape[1]), dataset.columns)
plt.xticks(np.arange(dataset.shape[1]), dataset.columns)
plt.colorbar()
plt.figure(figsize=(6,6))
dataset.groupby("target").count().plot.bar()
dataset.hist()
import seaborn as sns
sns.pairplot(dataset, palette='rainbow')
import seaborn as sns
plt.figure(figsize=(12, 8))
plt.subplot(1,2,1)
sns.violinplot(x = 'target', y = 'chol', data = dataset[0:])
plt.show()
sns.lmplot(x='chol',y='target',data=dataset)
plt.figure(figsize=(5, 5))
unique_variations = dataset['chol'].value_counts()
print('Number of Unique chol:', unique_variations.shape[0])
# the top 10 chol values that occurred most often
print(unique_variations.head(10))
s = sum(unique_variations.values);
h = unique_variations.values/s;
plt.plot(h, label="Histogram of chol values")
plt.xlabel('Index of a chol value')
plt.ylabel('Number of Occurrences')
plt.legend()
plt.grid()
plt.show()
c = np.cumsum(h)
print(c)
plt.figure(figsize=(5, 5))
plt.plot(c,label='Cumulative distribution of Variations')
plt.grid()
plt.legend()
plt.show()
Number of Unique chol: 152
234 6
204 6
197 6
269 5
212 5
254 5
226 4
243 4
240 4
239 4
Name: chol, dtype: int64
rcParams['figure.figsize'] = 8,6
plt.bar(dataset['target'].unique(), dataset['target'].value_counts(), color = ['red', 'green'])
plt.xticks([0, 1])
plt.xlabel('Target Classes')
plt.ylabel('Count')
plt.title('Count of each Target Class')

4. Data Processing

nan_rows = dataset[dataset.isnull().any(axis=1)]
print (nan_rows)
Empty DataFrame
Columns: [age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, target]
Index: []
categorical_feature_mask = dataset.dtypes==object
# filter categorical columns using the mask and turn them into a list
categorical_cols = dataset.columns[categorical_feature_mask].tolist()
print(categorical_cols)
print("number of categorical features ",len(categorical_cols))
[]
number of categorical features 0
standardScaler = StandardScaler()
columns_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dataset[columns_to_scale] = standardScaler.fit_transform(dataset[columns_to_scale])
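
One caveat: the scaler above is fit on the full dataset before the train/test split, so the test rows influence the scaling statistics. Below is a minimal sketch of a leakage-free alternative, assuming X still held the raw (unscaled) features; the KNN model and its n_neighbors value are placeholders, not the article's tuned settings.

# Sketch: fit the scaler inside a Pipeline so it only ever sees the training split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

preprocess = ColumnTransformer(
    [('scale', StandardScaler(), ['age', 'trestbps', 'chol', 'thalach', 'oldpeak'])],
    remainder='passthrough')            # leave the other (0/1 or ordinal) columns untouched

pipe = Pipeline([('prep', preprocess),
                 ('model', KNeighborsClassifier(n_neighbors=8))])   # placeholder model
# pipe.fit(X_train, y_train); pipe.score(X_test, y_test)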

5. Applying Machine Learning Model

y = dataset['target']
X = dataset.drop(['target'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)
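
The split above is purely random; with only 303 rows it can also help to stratify on the target so that both splits keep the roughly 54/46 class balance. A small sketch of the same call with stratification (an alternative, not the split used to produce the numbers below):

# Sketch: stratified split keeps the class ratio identical in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)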

Performance metrics

# This function plots the confusion, precision and recall matrices given y_i, y_i_hat.
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(test_y, predict_y):
    C = confusion_matrix(test_y, predict_y)
    # C is a 2x2 matrix: cell (i, j) is the number of points of class i predicted as class j

    A = (((C.T)/(C.sum(axis=1))).T)   # rows normalised  -> recall matrix
    B = (C/C.sum(axis=0))             # columns normalised -> precision matrix

    plt.figure(figsize=(20,4))

    labels = [0, 1]
    cmap = sns.light_palette("blue")

    # representing C in heatmap format
    plt.subplot(1, 3, 1)
    sns.heatmap(C, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Confusion matrix")

    # representing B in heatmap format
    plt.subplot(1, 3, 2)
    sns.heatmap(B, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Precision matrix")

    # representing A in heatmap format
    plt.subplot(1, 3, 3)
    sns.heatmap(A, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Recall matrix")

    plt.show()
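
To make the precision and recall matrices concrete, here is a tiny worked example with made-up counts (not from the model outputs): dividing each column of the confusion matrix by its sum gives the precision matrix B, and dividing each row by its sum gives the recall matrix A.

# Worked example with hypothetical counts: rows = true class, columns = predicted class
import numpy as np
C = np.array([[40, 10],    # 40 true 0s predicted as 0, 10 true 0s predicted as 1
              [ 5, 45]])   #  5 true 1s predicted as 0, 45 true 1s predicted as 1
B = C / C.sum(axis=0)         # precision matrix: e.g. precision of class 1 = 45 / (10 + 45) ≈ 0.818
A = (C.T / C.sum(axis=1)).T   # recall matrix:    e.g. recall of class 1    = 45 / (5 + 45)  = 0.900
print(B)
print(A)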

5.1 K Neighbors Classifier

knn_scores = []
for k in range(1, 21):
    knn_classifier = KNeighborsClassifier(n_neighbors=k)
    knn_classifier.fit(X_train, y_train)
    knn_scores.append(knn_classifier.score(X_test, y_test))

from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, log_loss

plt.figure(figsize=(8,5))
plt.plot([k for k in range(1, 21)], knn_scores, color='red')
for i in range(1, 21):
    plt.text(i, knn_scores[i-1], (i, knn_scores[i-1]))
plt.xticks([i for i in range(1, 21)])
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Scores')
plt.title('K Neighbors Classifier scores for different K values')

# retrain with the K value chosen from the plot above
clf = KNeighborsClassifier(n_neighbors=14)
clf.fit(X_train, y_train)
clf = CalibratedClassifierCV(clf, method="sigmoid")
clf.fit(X_train, y_train)
predict_y = clf.predict_proba(X_test)
print("The log loss is:", log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
# plot confusion matrix
print(len(predict_y))
plot_confusion_matrix(y_test, clf.predict(X_test))
The log loss is: 0.4194913074929284
100
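
Note that the K above was picked by looking at accuracy on the held-out test set, which effectively tunes on the test data. A hedged sketch of choosing K by cross-validation on the training split only (the grid, scoring and cv values here are assumptions, not the article's settings):

# Sketch: choose K with cross-validation on the training data instead of the test set
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': list(range(1, 21))},
                    scoring='neg_log_loss', cv=5)
grid.fit(X_train, y_train)
print("Best K:", grid.best_params_['n_neighbors'])
print("CV log loss:", -grid.best_score_)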

5.2 Support Vector Classifier

svc_scores = []
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
for i in range(len(kernels)):
    svc_classifier = SVC(kernel=kernels[i])
    svc_classifier.fit(X_train, y_train)
    svc_scores.append(svc_classifier.score(X_test, y_test))

colors = rainbow(np.linspace(0, 1, len(kernels)))
plt.bar(kernels, svc_scores, color=colors)
for i in range(len(kernels)):
    plt.text(i, svc_scores[i], svc_scores[i])
plt.xlabel('Kernels')
plt.ylabel('Scores')
plt.title('Support Vector Classifier scores for different kernels')

# retrain with the best kernel
clf = SVC(kernel='rbf')
clf = CalibratedClassifierCV(clf, method="sigmoid")
clf.fit(X_train, y_train)
predict_y = clf.predict_proba(X_test)
print("The log loss is:", log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
# plot confusion matrix
print(len(predict_y))
plot_confusion_matrix(y_test, clf.predict(X_test))
The log loss is: 0.4152790470358795
100

5.3 Decision Tree Classifier

dt_scores = []
for i in range(1, len(X.columns) + 1):
    dt_classifier = DecisionTreeClassifier(max_features=i, random_state=0)
    dt_classifier.fit(X_train, y_train)
    dt_scores.append(dt_classifier.score(X_test, y_test))

plt.plot([i for i in range(1, len(X.columns) + 1)], dt_scores, color='green')
for i in range(1, len(X.columns) + 1):
    plt.text(i, dt_scores[i-1], (i, dt_scores[i-1]))
plt.xticks([i for i in range(1, len(X.columns) + 1)])
plt.xlabel('Max features')
plt.ylabel('Scores')
plt.title('Decision Tree Classifier scores for different number of maximum features')

# retrain with a max_features value chosen from the plot above
clf = DecisionTreeClassifier(max_features=10, random_state=0)
clf.fit(X_train, y_train)
clf = CalibratedClassifierCV(clf, method="sigmoid")
clf.fit(X_train, y_train)
predict_y = clf.predict_proba(X_test)
print("The log loss is:", log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
# plot confusion matrix
print(len(predict_y))
plot_confusion_matrix(y_test, clf.predict(X_test))
The log loss is: 0.5341998825924514
100

5.4 Random Forest Classifier

rf_scores = []
estimators = [10, 100, 200, 500, 1000]
for i in estimators:
    rf_classifier = RandomForestClassifier(n_estimators=i, random_state=0)
    rf_classifier.fit(X_train, y_train)
    rf_scores.append(rf_classifier.score(X_test, y_test))

colors = rainbow(np.linspace(0, 1, len(estimators)))
plt.bar([i for i in range(len(estimators))], rf_scores, color=colors, width=0.8)
for i in range(len(estimators)):
    plt.text(i, rf_scores[i], rf_scores[i])
plt.xticks(ticks=[i for i in range(len(estimators))], labels=[str(estimator) for estimator in estimators])
plt.xlabel('Number of estimators')
plt.ylabel('Scores')
plt.title('Random Forest Classifier scores for different number of estimators')

# retrain with an n_estimators value chosen from the plot above
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
clf = CalibratedClassifierCV(clf, method="sigmoid")
clf.fit(X_train, y_train)
predict_y = clf.predict_proba(X_test)
print("The log loss is:", log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
# plot confusion matrix
print(len(predict_y))
plot_confusion_matrix(y_test, clf.predict(X_test))
The log loss is: 0.39348441062524186
100

5.5 XGBoost (ensemble technique)

import xgboost as xgb

params = {}
params['objective'] = 'binary:logistic'
params['eval_metric'] = 'logloss'
params['eta'] = 0.02
params['max_depth'] = 4

d_train = xgb.DMatrix(X_train, label=y_train)
d_test = xgb.DMatrix(X_test, label=y_test)
watchlist = [(d_train, 'train'), (d_test, 'valid')]
bst = xgb.train(params, d_train, 400, watchlist, early_stopping_rounds=20, verbose_eval=10)

predict_y = bst.predict(d_test)     # predicted probability of target = 1
print("The test log loss is:", log_loss(y_test, predict_y, labels=[0, 1], eps=1e-15))
print(len(predict_y))
# threshold the predicted probabilities at 0.5 to get class labels for the confusion matrix
plot_confusion_matrix(y_test, (predict_y > 0.5).astype(int))
[0] train-logloss:0.681559 valid-logloss:0.684108
Multiple eval metrics have been passed: 'valid-logloss' will be used for early stopping.
Will train until valid-logloss hasn't improved in 20 rounds.
[10] train-logloss:0.585605 valid-logloss:0.60957
[20] train-logloss:0.512725 valid-logloss:0.557567
[30] train-logloss:0.455003 valid-logloss:0.519062
[40] train-logloss:0.407503 valid-logloss:0.49514
[50] train-logloss:0.368371 valid-logloss:0.478558
[60] train-logloss:0.336583 valid-logloss:0.466736
[70] train-logloss:0.308998 valid-logloss:0.460353
[80] train-logloss:0.285831 valid-logloss:0.453601
[90] train-logloss:0.265688 valid-logloss:0.448218
[100] train-logloss:0.248369 valid-logloss:0.441384
[110] train-logloss:0.233118 valid-logloss:0.438462
[120] train-logloss:0.219548 valid-logloss:0.43408
[130] train-logloss:0.208009 valid-logloss:0.434458
[140] train-logloss:0.197807 valid-logloss:0.43351
[150] train-logloss:0.189129 valid-logloss:0.43454
Stopping. Best iteration:
[134] train-logloss:0.203719 valid-logloss:0.432805
The test log loss is: 0.43494783621281385
100
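
Because early stopping was used, the booster has trained a few rounds past its best validation score (best iteration 134 above). Depending on the installed xgboost version it is possible to predict using only the trees up to that best iteration; the attribute and argument names below differ between versions, so treat this as an assumption to verify against your install.

# Sketch: predict with only the best early-stopped iteration (API differs across xgboost versions)
if hasattr(bst, 'best_ntree_limit'):                                   # older xgboost releases
    predict_y = bst.predict(d_test, ntree_limit=bst.best_ntree_limit)
else:                                                                  # xgboost >= 1.4
    predict_y = bst.predict(d_test, iteration_range=(0, bst.best_iteration + 1))
print("Log loss at best iteration:", log_loss(y_test, predict_y, eps=1e-15))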

5.6 Logistic Regression

alpha = [10 ** x for x in range(-5, 2)]   # hyperparameter for the SGD classifier
log_error_array = []
for i in alpha:
    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)
    clf.fit(X_train, y_train)
    sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
    sig_clf.fit(X_train, y_train)
    predict_y = sig_clf.predict_proba(X_test)
    log_error_array.append(log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
    print('For values of alpha = ', i, "The log loss is:", log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))

fig, ax = plt.subplots()
ax.plot(alpha, log_error_array, c='g')
for i, txt in enumerate(np.round(log_error_array, 3)):
    ax.annotate((alpha[i], np.round(txt, 3)), (alpha[i], log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()

best_alpha = np.argmin(log_error_array)
clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)
clf.fit(X_train, y_train)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")
sig_clf.fit(X_train, y_train)
predict_y = sig_clf.predict_proba(X_train)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:", log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(X_test)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:", log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))
predicted_y = np.argmax(predict_y, axis=1)
print("Total number of data points :", len(predicted_y))
plot_confusion_matrix(y_test, predicted_y)
For values of alpha = 1e-05 The log loss is: 0.44012529267994355
For values of alpha = 0.0001 The log loss is: 0.42826340471379937
For values of alpha = 0.001 The log loss is: 0.4560308526539795
For values of alpha = 0.01 The log loss is: 0.41489190299443834
For values of alpha = 0.1 The log loss is: 0.40163904302879466
For values of alpha = 1 The log loss is: 0.40556761746085845
For values of alpha = 10 The log loss is: 0.4124918369135891
For values of best alpha =  0.1 The train log loss is: 0.37686236801831313
For values of best alpha = 0.1 The test log loss is: 0.40163904302879466
Total number of data points : 100

6. Feature Importance

clf=RandomForestClassifier(n_estimators = 500, random_state = 0)
clf.fit(X_train,y_train)
features = X_train.columns
importances = clf.feature_importances_
indices = (np.argsort(importances))[-25:]
plt.figure(figsize=(10,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='r', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
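
Impurity-based importances from a random forest can be biased toward features with many distinct values, so as a cross-check here is a sketch using scikit-learn's permutation importance on the held-out test set (an additional check, not part of the original analysis):

# Sketch: permutation importance as a sanity check on the impurity-based ranking above
from sklearn.inspection import permutation_importance

result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
for name, score in sorted(zip(X_test.columns, result.importances_mean),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name:10s} {score:.4f}")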

7. Comparing Models

error_rate = np.array([0.3828, 0.4158, 0.3972, 0.5409, 0.3803, 0.4135])
plt.figure(figsize=(16, 5))
print(error_rate)

# compare the models by their test log loss
models = ['LR', 'XGBOOST', 'RF', 'DT', 'SVM', 'KNN']
plt.xticks(range(len(models)), models)
plt.xlabel('Model')
plt.plot(error_rate)
lowest_loss = np.argmin(error_rate)
print("lowest log loss : ", min(error_rate))
print(models[lowest_loss])
[0.3828 0.4158 0.3972 0.5409 0.3803 0.4135]
lowest log loss :  0.3803
SVM
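
The error_rate array above is hard-coded from earlier runs. A small sketch of collecting each model's test log loss into a dictionary as it is evaluated, so the comparison always reflects the current run (the values here are simply copied from the array above):

# Sketch: collect each model's test log loss in a dict instead of hard-coding an array
# (at the end of each model section: results['KNN'] = log_loss(y_test, clf.predict_proba(X_test)), etc.)
results = {'LR': 0.3828, 'XGBOOST': 0.4158, 'RF': 0.3972,
           'DT': 0.5409, 'SVM': 0.3803, 'KNN': 0.4135}   # values copied from the array above
best_model = min(results, key=results.get)
print("lowest log loss :", results[best_model], "->", best_model)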

8. Conclusion

Comparing the six models on test log loss, SVM came out best (0.3803), with logistic regression (0.3828) and random forest (0.3972) close behind, while the decision tree performed worst (0.5409). Each model's behaviour on both classes was also inspected through its confusion, precision and recall matrices, since both precision and recall matter for heart-disease prediction.
