http://annalsofrscb.ro 904
A Cardiovascular Disease Prediction using Machine Learning Algorithms
Rubini PE1, Dr.C.A.Subasini2, Dr.A.Vanitha Katharine3, V.Kumaresan4, S.GowdhamKumar5, T.M. Nithya6
1Assistant Professor, Department of Computer Science and Engineering, CMR Institute of Technology, Bengaluru.
2Associate Professor, Department of Computer Science and Engineering, St. Joseph's Institute of Technology, Chennai-119.
3Associate Professor, Department of Computer Applications, PSNA College of Engineering and Technology, Dindugul.
4Assistant Professor (Senior Grade), Department of Electrical and Electronics Engineering, Kongu Engineering College (Autonomous), Perundurai, Erode-638060.
5Training Officer, PSG Industrial Institute (PSG COLLEGE OF TECHNOLOGY), Peelamedu, Coimbatore-641041
6 Assistant Professor, Department of Computer Science and Engineering, K. Ramakrishnan College of Engineering, Trichy.
ABSTRACT
Heart disease has risen sharply in the modern age. As doctors deal with precious human life, it is very important that their results are right. Thus, an application was developed which can predict the vulnerability to heart disease, given basic symptoms like age, gender, pulse rate, resting blood pressure, cholesterol, fasting blood sugar, resting electrocardiographic results, exercise induced angina, ST depression, the slope of the ST segment at peak exercise, number of major vessels colored by fluoroscopy and maximum heart rate achieved. This can be used by doctors to recheck and confirm their patient's condition. Existing surveys have considered only 10 features for prediction, but in this proposed research work 14 necessary features were taken into consideration. Also, this paper presents a comparative analysis of machine learning techniques like Random Forest (RF), Logistic Regression, Support Vector Machine (SVM), and Naïve Bayes in the classification of cardiovascular disease. In the comparative analysis, Random Forest proved to be the most accurate and reliable algorithm and is hence used in the proposed system. This system also examines the relation between diabetes and heart disease and how much diabetes influences it.
Keywords:
Heart disease; Machine learning algorithms; Random Forest; Logistic regression; Support Vector Machine;
Naïve Bayes; Diabetes Influence
1. Introduction
Heart disease is the leading cause of death in the world. In 2012, around 17.5 million people died from heart disease, meaning that it accounts for 31% of all global deaths. Moreover, the heart disease death toll rises every year and is expected to grow to more than 23.6 million by 2030. Research from January 2017 showed that the leading cause of death worldwide is cardiovascular disease. Cardiovascular disease is considered the world's biggest killer; it has held the top position in the list of the ten leading causes of death for the past 15 years, and in 2015 it accounted for fifteen million deaths. Many human lives could be saved by diagnosing the disease in time. Diagnosis is therefore important, but it is also a very complicated task. Automating this process would overcome the problems with diagnosis. The use of machine learning in disease classification is common, and researchers are particularly interested in developing such systems for easier tracking and diagnosis of cardiovascular diseases.
Since machine learning (ML) allows computer programs to learn from data, building a model that recognizes common patterns and makes decisions based on the gathered information, it does not have difficulties with the incompleteness of the medical databases used. The proposed model gathers relevant data on all the factors related to heart disease and the parameters influencing it, trains on the data according to the proposed machine learning algorithm, and predicts how strong the likelihood is that a patient will develop heart disease. The relationship with the diabetes-related attributes is also considered to establish their influence [2].
2. Methodology
The methodology for predicting cardiovascular disease uses the following four algorithms, and their results are compared. Fig. 1 shows the architecture diagram for predicting cardiovascular disease.
1. Random Forest
2. Logistic Regression
3. Naive Bayes algorithm
4. Support Vector Machines
Figure 1: Methodology to predict heart disease
A. Random Forest Algorithm
The Random Forest algorithm can be understood as a forest made up of trees. First, it creates decision trees on randomly selected data samples from the dataset. It then gets the prediction from each tree and selects the best solution by means of voting. It is an enhancement of decision trees [3]. Some of its applications are image classification, recommendation engines and feature selection. The algorithm is considered highly accurate and robust because of the number of trees participating in the process. One of its advantages is that it is much less prone to overfitting than a single decision tree. Finally, it combines the predictions from every tree, which averages out the errors of the individual trees.
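The two ingredients described above, random sampling of the data for each tree and voting over the trees' predictions, can be illustrated with a minimal sketch. The function names and the example votes are hypothetical and are not taken from the paper's implementation; the training of the individual trees is omitted.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one random sample per tree)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the class label predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical predictions of five trees for one patient
# (1 = heart disease, 0 = no heart disease).
tree_votes = [1, 0, 1, 1, 0]
print(majority_vote(tree_votes))  # three of the five trees vote 1

# Each tree would be trained on its own bootstrap sample of the rows.
rng = random.Random(0)
sample = bootstrap_sample([10, 20, 30, 40], rng)
print(len(sample))
```

Because every tree sees a different sample, their individual errors tend to differ, and the vote smooths them out.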
1. Dataset collection and pre-processing
The datasets used for analysis are the "Framingham" dataset, obtained from Kaggle, and the heart disease dataset with 14 features, obtained from the UCI Machine Learning Repository [17]. The data is cleaned by replacing all missing values with the median of the values in that column. Categorical data are assigned numerical values.
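The cleaning step can be sketched as follows. This is only an illustration under simple assumptions: missing values are represented as None, and the column contents and helper names are hypothetical rather than taken from the datasets themselves.

```python
from statistics import median

def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]

def encode_categories(column):
    """Map each distinct category to an integer code (sorted order)."""
    codes = {label: i for i, label in enumerate(sorted(set(column)))}
    return [codes[v] for v in column], codes

chol = [233, None, 250, 204]       # hypothetical cholesterol column
sex = ["male", "female", "male"]   # hypothetical categorical column

print(impute_median(chol))         # [233, 233, 250, 204]
print(encode_categories(sex)[0])   # [1, 0, 1]
```

The median is used rather than the mean because it is less sensitive to extreme values in a column.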
2. Implementation
The implementation of random forest works as follows:
a. Load the heart disease dataset.
b. After preprocessing, split the heart disease dataset into train and test data in the proportion 60:40 for use with the Random Forest Classifier.
c. Apply K-Fold Cross Validation, in which the data set is split into K sections (folds) and every fold is used as the testing set at some point.
d. Train the model using train set.
e. Make predictions on the test fold.
f. Map predictions to outcomes (only possible outcomes are 1 and 0).
g. Calculate the accuracy.
Accuracy = ((TP + TN) / (TP + TN + FP + FN)) * 100
where
TP - True Positive (prediction is yes, and they do have the disease)
TN - True Negative (prediction is no, and they don't have the disease)
FP - False Positive (prediction is yes, but they don't actually have the disease; also known as a "Type I error")
FN - False Negative (prediction is no, but they actually do have the disease; also known as a "Type II error")
The accuracy obtained by using random forest algorithm is 84.81%
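As an illustration of how such an accuracy figure is computed, the sketch below counts TP, TN, FP and FN from a list of true labels and predictions and applies Accuracy = ((TP + TN) / (TP + TN + FP + FN)) * 100. The label lists are hypothetical and do not reproduce the reported result.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = disease, 0 = no disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Percentage of predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn) * 100

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical outcomes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical predictions

counts = confusion_counts(y_true, y_pred)
print(counts)             # (3, 3, 1, 1)
print(accuracy(*counts))  # 75.0
```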
Figure 2: Sample Code of Random Forest
Figure 3: Accuracy result of Random Forest algorithm
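The 60:40 split and the K-fold cross validation used in the implementation steps can be sketched with index lists, as below. This is a minimal illustration, not the code shown in Figure 2; the function names, the seed and the sample count are assumptions.

```python
import random

def train_test_split_indices(n, test_fraction, seed):
    """Shuffle the indices 0..n-1 and split them into train and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]

def k_fold_indices(n, k):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

train_idx, test_idx = train_test_split_indices(10, 0.4, seed=42)
print(len(train_idx), len(test_idx))  # 6 4

for train, test in k_fold_indices(10, 5):
    print(test)  # [0, 1], then [2, 3], and so on
```

In practice the model would be trained on the train indices of each fold and scored on the corresponding test indices, and the fold accuracies averaged.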
B. Support Vector Machine
1. Introduction
Support Vector Machine is a classification technique which separates data points by constructing hyperplanes. Hyperplanes can take different shapes depending on the spread of the data, but only those points which help in differentiating between the classes (the support vectors) are considered for classification.
2. Kernel Functions
If the data points are not linearly separable, a kernel function maps them so that a linear decision surface can separate them.
Some Kernel functions are as follows:
a. Linear Function: With this kind of kernel the hyperplane is a straight line. Linear kernel functions can provide the best results for classifiers which have exactly two target classes.
b. Polynomial Function: With this kind of kernel the hyperplane is generally a polynomial curve such as a parabola or hyperbola.
c. Radial Basis Function: The Radial Basis Function is used when points cannot be separated in a linear fashion. The function arranges the points in a mostly radial (circular) fashion so that they can then be separated.
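The three kernels listed above can be written in their commonly used forms, as in the sketch below. The parameter values (degree, coef0, gamma) are illustrative assumptions; the paper does not state which settings were used.

```python
import math

def linear_kernel(x, z):
    """Plain dot product of the two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """(x . z + coef0) raised to the chosen degree."""
    return (linear_kernel(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """exp(-gamma * squared Euclidean distance); equals 1 for identical points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = [1.0, 2.0], [3.0, 4.0]   # hypothetical feature vectors
print(linear_kernel(x, z))       # 11.0
print(polynomial_kernel(x, z))   # 144.0
print(rbf_kernel(x, x))          # 1.0
```

The RBF value shrinks towards 0 as two points move apart, which is what gives this kernel its radial character.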
3. Implementation
The implementation of Support Vector Machine is described as follows:
a. Load the data sets and clean the values; in case a particular feature has no value in a row, replace it with the median value of that feature from the dataset.
b. Split the data set into train and test sets in a 60:40 ratio respectively.
c. Choose the kernel function: Linear Kernel Function or Radial Basis Function.
d. Apply SVM by first creating a hyperplane with the help of the train data set.
e. Calculate the accuracy using the following procedure:
The train data is taken and each kernel function, namely the Linear Kernel Function and the Radial Basis Function, is applied.
The test data set is applied to the trained model.
The model uses the hyperplane and finds the closest proximity to either class, that is, having heart disease (yes/1) or not having heart disease (no/0).
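For the linear kernel, the final classification step amounts to checking on which side of the hyperplane a patient's feature vector lies. The sketch below shows this for an already trained model; the weights, bias and feature values are hypothetical, and the training of the hyperplane itself is not shown.

```python
def decision_value(weights, bias, x):
    """Signed score w . x + b; its sign tells the side of the hyperplane."""
    return sum(w * v for w, v in zip(weights, x)) + bias

def predict(weights, bias, x):
    """Return 1 (heart disease) on the non-negative side, otherwise 0."""
    return 1 if decision_value(weights, bias, x) >= 0 else 0

# Hypothetical hyperplane over two scaled features.
weights, bias = [0.8, -0.5], -0.1

print(predict(weights, bias, [1.0, 0.2]))  # 1
print(predict(weights, bias, [0.1, 1.0]))  # 0
```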
Kernel Function                 Accuracy (%)
Linear Kernel Function          74.05
Radial Basis Function (RBF)     58.577
TABLE 1: Comparison of SVM accuracies with Kernel Functions
Table 1 compares the classification accuracies of the SVM models with the RBF and linear functions as kernels. The Linear Kernel Function provides higher accuracy than RBF. This is because the problem is a two-class classification problem, so a hyperplane in the form of a line is the best way to classify such values. In comparison, RBF uses a circular hyperplane, thus producing lower accuracy. The hyperplane plot of the SVM for predicting heart disease is shown in Fig. 4, in which the yellow points represent patients having heart disease and the purple points represent patients not having heart disease.
Figure 4: Hyper plane and distribution of data points on either side of hyper plane for Heart Disease Prediction
C. Naïve Bayes Classification
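Naïve Bayes is a probabilistic classifier: the class-conditional probabilities of the features are estimated in the training phase, under the assumption that the features are independent given the class. Because redundant features violate this assumption, redundant data is removed from the datasets before the algorithm is applied.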
1. Implementation
The implementation of Naive Bayes is as follows:
a. Extract the dataset.
b. Apply cleaning on the dataset to remove unwanted values.
c. In case any values are missing then find the median value of the column and fill the missing value.
d. Find the conditional probability of occurrence of heart disease with respect to the 14 parameters.
e. Then find the conditional probability of non-occurrence of heart disease with respect to the 14 parameters.
f. Train the model using the probability formulas given below.
[Equations (1) to (5): formulas not preserved in the source text]
where x1 - age; x2 - sex; x3 - cp; x4 - rest bp; x5 - chol; x6 - fbs; x7 - rest ecg; x8 - thalach; x9 - exang; x10 - oldpeak; x11 - slope; x12 - ca; x13 - thal; x14 - pulse rate.
g. Once the model is trained, apply the test data set.
h. Remove the last column of the test data set, which indicates whether the person has heart disease or not.
i. Apply the model to the test data set and extract the predicted values.
j. Compare the result between the last column and the predicted values.
k. Calculate the accuracy.
Figure 5: Working of Naïve Bayes
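The training and prediction steps listed above can be sketched with the standard Naïve Bayes rule, in which the class score is the prior P(y) multiplied by the product of the per-feature likelihoods P(xi | y). The sketch assumes Gaussian likelihoods for numeric features, which is one common choice and not necessarily the form used by the authors; the two-feature data set is hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior and the per-feature mean and variance of each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        cols = list(zip(*rows))
        means = [sum(col) / n for col in cols]
        # A small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(cols, means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_posterior(model, label, x):
    """log P(y) + sum of log P(xi | y), up to a constant shared by all classes."""
    prior, means, variances = model[label]
    total = math.log(prior)
    for v, m, var in zip(x, means, variances):
        total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
    return total

def predict(model, x):
    """Choose the class with the highest posterior score."""
    return max(model, key=lambda label: log_posterior(model, label, x))

X = [[63, 233], [67, 286], [41, 204], [37, 250]]  # hypothetical [age, chol]
y = [1, 1, 0, 0]
model = fit_gaussian_nb(X, y)
print(predict(model, [65, 270]))  # 1
print(predict(model, [40, 220]))  # 0
```

Working with logarithms avoids numerical underflow when the likelihoods of all 14 features are multiplied together.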
D. Logistic Regression
Logistic regression is a machine learning algorithm used for classification. It is based on the concept of probability. Logistic regression is used to assign observations to a discrete class.
The output is transformed using the sigmoid (logistic) function. The logistic regression hypothesis limits the output to the range between 0 and 1. Therefore, a linear function cannot represent it, as it can take a value greater than 1 or less than 0, which is not possible according to the logistic regression hypothesis.
1. Implementation
The implementation steps for logistic regression are given as follows:
a. Obtain the probabilities: Map predicted values to probabilities using the sigmoid function:
$\sigma(y) = \frac{1}{1 + e^{-y}}$ (6)
where y is the input to the function and e is the base of the natural logarithm. Obtain the probabilities by the following equations:
$P = \frac{e^{y}}{1 + e^{y}}$ (7)
where P is the probability of success, and q is the probability of failure, written as:
$q = 1 - P = 1 - \frac{e^{y}}{1 + e^{y}}$ (8)
On dividing (7) by (8), we get
$\frac{P}{1 - P} = e^{y}$ (9)
On taking the log on both sides,
$\log\left(\frac{P}{1 - P}\right) = y$ (10)
where P/(1 - P) is the odds. When y is positive, the probability of success is more than 50%.
b. Decision Boundary-Mapping probabilities to classes
The prediction function returns a probability score between 0 and 1. To assign an observation to a discrete class, a threshold value is selected: above it the observation is classified as class 1, otherwise as class 2. For example, if the threshold is 0.5 and the function value is 0.7, the observation is classified as positive; for a value of 0.3, the classification is negative. Logistic regression can also handle multiple classes, in which case the class with the highest predicted probability is chosen.
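The mapping from a linear score to a probability and then to a class can be sketched as follows. The sketch assumes the sigmoid 1 / (1 + e^(-y)) and a threshold of 0.5, with values at or above the threshold treated as positive; the function names are illustrative only.

```python
import math

def sigmoid(y):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-y))

def classify(probability, threshold=0.5):
    """Return 1 (positive) at or above the threshold, otherwise 0 (negative)."""
    return 1 if probability >= threshold else 0

def log_odds(p):
    """Natural log of the odds p / (1 - p); the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

print(sigmoid(0.0))    # 0.5
print(classify(0.7))   # 1
print(classify(0.3))   # 0
# Taking the log-odds of the sigmoid output recovers the original score.
print(round(log_odds(sigmoid(2.0)), 6))  # 2.0
```

The last line checks numerically that the log of the odds equals the input score, which is the relationship the derivation of the logistic model rests on.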
2. Analysis of result:
The result can be analyzed in following ways.
a. Using Confusion Matrix: Accuracy is calculated by the formula Accuracy = ((TP + TN) / (TP + TN + FP + FN)) * 100
where TP - True Positive, TN - True Negative, FP - False Positive, FN - False Negative.
b. ROC curve: The receiver operating characteristic (ROC) curve summarizes the performance by evaluating the trade-off between sensitivity and 1 - specificity. To plot the ROC curve, assume p > 0.5. The area under the curve (AUC), referred to as the index of accuracy or concordance index, is a performance metric for the ROC curve. The larger the area under the curve, the better the predictive power of the model.
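How the ROC curve and its area are obtained from predicted probabilities can be sketched as below: each distinct score is used in turn as a threshold, giving one (1 - specificity, sensitivity) point, and the area is computed with the trapezoidal rule. The labels and scores are hypothetical and unrelated to the curve in Figure 7.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

y_true = [1, 1, 0, 1, 0, 0]                # hypothetical outcomes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]    # hypothetical predicted probabilities

points = roc_points(y_true, scores)
print(points[-1])             # (1.0, 1.0)
print(round(auc(points), 4))  # 0.8889
```

An area of 1.0 would mean every patient with the disease is scored above every patient without it, while 0.5 corresponds to random guessing.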
Figure 7. ROC Curve - Logistic Regression
Figure 8. Accuracy result of Logistic Regression
3. Result
Results from Random Forest, Support Vector Machine, Logistic Regression and naïve Bayes are analyzed, and Random Forest Algorithm has given the highest accuracy. Hence Random Forest has been implemented in the proposed system.
Figure 9. Graphical Representation of Accuracy
ALGORITHM                                                      ACCURACY (%)
RANDOM FOREST                                                  84.81
LOGISTIC REGRESSION                                            83.828
SUPPORT VECTOR MACHINE (Using Linear Kernel Function)          74.05
SUPPORT VECTOR MACHINE (Using Radial Basis Kernel Function)    58.577
NAÏVE BAYES                                                    54.08401
TABLE II: Comparison of Accuracies
4. Conclusion and Future Scope
Heart disease prediction using machine learning algorithms gives users a prediction of whether they have heart disease. Recent advancements in technology have caused machine learning algorithms to evolve. In the proposed method the Random Forest algorithm was used because of its efficiency and accuracy. This algorithm is also used to find the heart disease prediction percentage from the correlation between diabetes and heart disease. Similar prediction systems can be built by calculating the correlation between heart disease and other diseases. New algorithms can also be used to achieve increased accuracy. Better performance is obtained when more parameters are used in these algorithms.
References
[1] Jaymin Patel, Prof. Tejal Upadhyay, Dr. Samir Patel, "Heart disease prediction using Machine learning and Data Mining Technique", Volume 7, Number 1, Sept 2015-March 2016.
[2] Thenmozhi K. and Deepika P., "Heart Disease Prediction using classification with different decision tree techniques", International Journal of Engineering Research & General Science, Vol 2(6), pp 6-11, Oct 2014.
[3] Igor Kononenko, "Machine learning for medical diagnosis: history, state of art & perspective", Elsevier - Artificial Intelligence in Medicine, Volume 23, Aug 2001.
[4] Gregory F. Cooper, Constantin F. Aliferis, Richard Ambrosino, John Aronis, Bruce G. Buchanan, Richard Caruana, Michael J. Fine, Clark Glymour, Geoffrey Gordon, Barbara H. Hanusa, Janine E. Janosky, Christopher Meek, Tom Mitchell, Thomas Richardson, Peter Spirtes, "An evaluation of machine-learning methods for predicting pneumonia mortality", Elsevier, Feb 1997.
[5] Sana Bharti, Shailendra Narayan Singh, "Analytical study of heart disease comparing with different algorithms", Computing, Communication & Automation (ICCCA), 2015 International Conference.
[6] B. Dhomse Kanchan, M. Mahale Kishore, "Study of Machine learning algorithms for special disease predictions using the principal of component analysis", Global Trends in Signal Processing, Information Computing and Communication (ICGTSPICC), 2016.
[7] Matjaz Kuka, Igor Kononenko, Cyril Groselj, Katrina Kalif, Jure Fettich, "Analysing and improving the diagnosis of ischaemic heart disease with machine learning", Elsevier - Artificial Intelligence in Medicine, Volume 23, May 1999.
[8] Geert Meyfroidt, Fabian Guiza, Jan Ramon, Maurice Brynooghe, "Machine learning techniques to examine large patient databases", Best Practice & Research Clinical Anaesthesiology, Elsevier, Volume 23 (1), Mar 1, 2009.
[9] Gregory F. Cooper, Constantin F. Aliferis, Richard Ambrosino, "An evaluation of Machine learning methods for predicting pneumonia mortality", Elsevier, 1997.
[10] Sanjay Kumar Sen, "Predicting and Diagnosing of Heart Disease Using Machine Learning Algorithms", International Journal of Engineering And Computer Science, ISSN: 2319-7242, Volume 6, Issue 6, June 2017.
[11] Abhishek Taneja, "Heart Disease Prediction System Using Data Mining Techniques", Vol. 6, No. (4), December 2013.
[12] Animesh Hazra, Subrata Kumar Mandal, Amit Gupta, Arkomita Mukherjee and Asmita Mukherjee, "Heart Disease Diagnosis and Prediction Using Machine Learning and Data Mining Techniques: A Review", Advances in Computational Sciences and Technology, ISSN 0973-6107, Volume 10, Number 7 (2017).
[13] Beant Kaur, Williamjeet Singh, "Review on Heart Diseases Prediction System using different Data Mining Techniques", International Journal on Recent and Innovation Trends in Computing and Communication, Volume 2, Issue 10, October 2014.
[14] Sonam Nikhar, A.M. Karandikar, "Prediction of Heart Disease Using different Machine Learning Algorithms", Vol-2, Issue-6, June 2016.
[15] S. U. Ghumbre and A. A. Ghatol, "Heart Disease Diagnosis Using Machine Learning Algorithm", Advances in Intelligent and Soft Computing, Proceedings of the International Conference on Information Systems Design and Intelligent Applications.
[16] "Machine learning based decision support systems (DSS) for heart disease diagnosis: a review", Online: 25 March 2017, DOI: 10.1007/s10462-01
[17] Dataset URL: https://archive.ics.uci.edu/ml/machielearnindatabses/heartdisease