Prognosis of COVID-19 Patients with Machine Learning Techniques

Dr. Bharati S Ainapure¹*, Prof. Reshma Pise², Aniket Anil Wagh³, Jitesh Tejnani⁴, Kaushal Oza⁵

¹,²,³,⁴,⁵ Department of Computer Engineering, Vishwakarma University, Pune, India.

ABSTRACT

The COVID-19 pandemic, caused by the spread of the novel coronavirus, has continued persistently for over one and a half years. Every country in the world has experienced different periods of surges, which are called coronavirus waves. More lives have been affected during the second wave than during the first. Because of the overwhelming burden on hospitals during the second wave, doctors have advised people with mild symptoms to quarantine at home. People in home care need to be monitored continuously to determine whether they need hospitalization or other medication, and to track health indicators such as fever and oxygen level. Hospitalized people also need to be monitored. Machine learning techniques can provide better information to health workers for patient care. This research proposes machine learning models that classify a patient's condition into four classes: "Need home care", "Completely cured", "Need hospitalization", and "Mortality". The models are trained on clinical data of COVID-19 patients. Four machine learning techniques are used to train the models: K-nearest neighbor (KNN), Random Forest Classifier (RFC), ExtraTreeClassifier (ETC) and an ensemble technique. The models are further validated using k-fold cross-validation during the training phase. Experiments were carried out on ten clinical parameters, which are sufficient to identify the status of a COVID-19 patient. Results show that the models performed well, with accuracies of 98.77% (KNN), 98.51% (ETC), 98.05% (RFC), and 98.77% (ensemble). Prognosis of patients can assist medical practitioners in making decisions related to health risks and in identifying home-quarantined patients who may need hospitalization.

Keywords

Coronavirus, COVID-19, Machine learning, KNN, RFC, ETC, Ensemble, Prognosis.

Introduction

The first case of infection with a novel coronavirus, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), was found in Wuhan, China, in December 2019. The contagious disease caused by this virus was named COVID-19 by the World Health Organization (WHO) on 11th February 2020 [1]. In view of the rapid spread of the disease all over the world, WHO declared it a 'pandemic' on 11th March 2020, with more than 100,000 cases and 4,000 deaths in 114 countries. COVID-19 shows variable symptoms such as cough, fever, fatigue, cold, headache, loss of smell and taste, and breathing problems. Symptoms may develop 1 to 14 days after exposure to the virus. Sometimes symptoms persist for a longer time and lead to complications that may cause organ damage.

Standard preventive measures that reduce the spread of the virus include social distancing, regular hand washing, use of hand sanitizers, and covering the mouth and nose when coughing or sneezing. Avoiding mass gatherings also decreases the spread of the virus. Many countries have enforced lockdowns and shut down the whole country to control the spread of the disease.

The pandemic has continued persistently for over one and a half years. As of 9th June 2021, many variants of the coronavirus had been found in different countries. Coronaviruses are not new to the world; they belong to a large family of viruses. In the past, such viruses were found in animals and were transmitted to human beings. Scientists have divided coronaviruses into four sub-groupings, called alpha, beta, gamma, and delta [2]. Ribonucleic acid (RNA) is the genetic material of the coronavirus. When this material changes, a mutation occurs and a variant of the coronavirus is produced. Table 1 shows the different variants and the countries in which they were first detected [3].

Table 1. COVID-19 variants and their details

WHO label   Version of mutation   Country first detected   Date of detection   Transmission rate
Alpha       B.1.1.7               United Kingdom           September 2020      Dominating
            B.1.1.7+E484K         United Kingdom           December 2020       Outbreaks
Beta        B.1.351               South Africa             September 2020      Community
Gamma       P.1                   Brazil                   December 2020       Community
Delta       B.1.617.2             India                    December 2020       Community

Every country in the world has experienced periods of surge over the past one and a half years. These surges are called waves [4]. Figure 1 shows two spikes in the graph [5], which correspond to the two waves that India has faced. The graph shows a much more rapid rise in the second wave than in the first. During the first wave, the number of coronavirus cases took 108 days to rise from 80,000 to 97,000, over the period from 2nd June 2020 to 17th September 2020. In the second wave, cases rose within only 63 days from about 8,000 on 2nd February 2021 to 1,03,558 on 5th April 2021, and surged to more than 3 lakh cases by May 2021. During the first wave, many elderly people were severely infected, but in the second wave more young people below the age of 40 have been badly affected [6]. The death rate in the second wave is also higher than in the first. Some experts are predicting a third wave, which may be more severe than the previous two and may affect the social determinants of health and the next generation [7].

Figure 1. Graphs showing two surges of coronavirus in India

Due to the sudden increase in coronavirus cases during the second wave, hospitals were overwhelmed in many parts of India. Authorities set up many emergency hospitals specifically for coronavirus patients in hotels, schools, colleges, train stations and playgrounds.

Because hospitals were overwhelmed, doctors advised many people with mild symptoms to quarantine at home. Patients in home care need to be monitored continuously to determine whether they need hospitalization or other medicines, and to track health indicators such as fever and oxygen level. People admitted to hospitals may also need continuous monitoring. Therefore, machine learning models are proposed to help the caretakers of home-isolated patients, and the doctors treating them, predict the status of a patient in the upcoming days.

Literature Review

Due to the outbreak of COVID-19 all over the world, many researchers have proposed machine learning and Artificial Intelligence (AI) based models for early detection and prevention of its spread. Researchers have worked on different kinds of data, such as clinical datasets, blood reports, RT-PCR tests, and X-ray and CT scan images. This section reviews the latest COVID-19 related work based on machine learning and AI models.

The authors of [8] proposed a combined clustering and classification technique to classify medical big data. The technique jointly applies the k-means clustering method and the random forest (RF) classification method. The authors used k-means to cluster the data; clusters are then selected randomly on certain attributes to generate decision trees. The values from these decision trees are used to create the required class labels with the random forest classifier. They compared their work with the existing LC-KNN and RC-KNN methods and claimed that the combined technique increases prediction accuracy.

An extensive survey of AI and ML models for the prognosis of COVID-19 was presented in [9]. The authors' aims were to: 1. understand the intelligent approaches applied to build COVID-19 prediction or classification models, 2. study the efficiency and impact of the methods applied to the prognosis of COVID-19 infection, and 3. understand the advanced methods and study the nature of the data processing challenges.

The authors of [10] identified the five most important challenges in responding to the COVID-19 outbreak: 1. managing limited healthcare resources, 2. developing personalized patient management and treatment plans, 3. informing policies and enabling effective collaboration, 4. understanding and accounting for uncertainty, and 5. expediting clinical trials. They also discussed how these challenges can be overcome with the help of AI and ML.

A support vector machine (SVM) based model was built to detect the severity of COVID-19 in [11]. The authors used blood and urine reports of 132 clinically confirmed COVID-19 positive patients to train the model and extracted 32 features from these reports. Correlations between these features were first computed, and the Pearson Correlation Coefficient (PCC) was then used to identify the class labels. Twenty epochs were used to achieve an overall accuracy of 0.8148. The authors trained a binary classification model to classify patients into two classes: severely ill and mild symptoms.

Machine learning based mortality risk prediction for COVID-19 patients is proposed in [12]. The authors proposed an AI-based model that helps the concerned authorities decide which COVID-19 patients require the highest attention, so that those patients can be admitted to hospital with high priority. They achieved an overall accuracy of 89.98% in predicting mortality. The model was trained with machine learning algorithms such as RFC, ANN, logistic regression, KNN and SVM. It was finally evaluated using a confusion matrix to analyze the classifiers in depth and to calculate sensitivity and specificity.

A machine learning model for forecasting the future course of COVID-19 is proposed in [13]. The model predicts three quantities for the next 10 days: the mortality rate, the number of new COVID-19 positive patients, and the recovery rate. It uses supervised machine learning algorithms such as linear regression (LR), the least absolute shrinkage and selection operator (LASSO), support vector machine (SVM) and exponential smoothing (ES). The dataset was obtained from a GitHub repository. The authors used the R² score, adjusted R², Mean Square Error (MSE), Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) to evaluate the performance of the model.

A machine learning model to predict the deterioration of COVID-19 patients is proposed in [14]. The authors used 6995 patient records to build the model. Three machine learning algorithms were used to train it: neural network, random forest, and classification and regression decision tree (CRT). The performance of the model was evaluated using metrics such as mean, sensitivity, specificity, positive predictive value and accuracy. The model outperformed the APACHE II score in predicting critical COVID-19 patients (ROC AUC of 0.92 vs. 0.79, respectively), with 92.7% specificity, 92.0% accuracy and 88.0% sensitivity.

Another machine learning based model for predicting the severity of COVID-19 in home-quarantined patients is proposed in [15]. The authors used samples from 287 COVID-19 patients at King Fahad University Hospital, Saudi Arabia, to build the model. The model uses 20 clinical parameters, such as body temperature, pulse rate and oxygen level, to produce the class labels "Survived" and "Deceased". The authors analyzed the complete dataset using three algorithms: logistic regression (LR), random forest (RF), and extreme gradient boosting (XGB). The performance of the model was evaluated using the F-score, specificity, sensitivity, precision and accuracy. The best performance was achieved by the random forest classifier, with an accuracy of 0.95.

Statistical methods and machine learning techniques were used to predict and analyze the time to discharge of COVID-19 patients in [16]. This study used clinical data of 1182 patients. The parameters of the clinical data are: case ID, age, gender, date of symptom onset, date of hospitalization, date of infection confirmation, death or discharge time, death or discharge status, symptoms, chronic disease history, travel history, and location. Compared with other machine learning algorithms, stagewise GB predicted the discharge time more accurately. Using the Kaplan-Meier and Cox regression methods, the authors found that recovery time is directly related to the age and sex of hospitalized patients.

An analysis of information about the coronavirus using machine learning methods is presented in [17]. The authors mainly focused on five types of analysis and prediction: 1. the coronavirus transmission rate, 2. the correlation between weather conditions and the coronavirus, 3. how the pandemic will end, 4. prediction of the spread of the coronavirus across different regions, and 5. analysis of the coronavirus growth rate in different countries and the types of mitigation. The authors concluded that many researchers are using deep learning and ML techniques to predict, analyze and screen for COVID-19. Their survey shows how biomedical science and technology can be combined to solve problems related to COVID-19.

A machine learning approach to detect coronavirus infection from chest X-ray images was proposed in [18]. The authors used 85 freely available chest X-ray images to build the model. They proposed a supervised machine learning technique that automatically detects COVID-19. Their model achieved an average precision and recall of 0.965 in differentiating COVID-19 from other pulmonary diseases in chest X-ray images.

Another deep learning model is proposed to predict COVID-19 diagnosis on chest X-ray in [19]. The authors trained the model on a dataset consisting of 6868 CT images of 418 patients. Their model successfully extracted 2D features from the images and achieved an accuracy of 0.956 on a test data set of 90 patients. The authors verified their model by comparing it with the readings of two radiologists on the same independent test image data set. They also validated the model using rule-in and rule-out criteria.

All of the above literature conveys the importance of machine learning in COVID-19 patient-related prediction, forecasting and analysis. Some authors only surveyed the use of machine learning in COVID-19, while others reported findings based on clinical records or X-ray images. The literature review found that no robust real-time machine learning model has been proposed so far to predict and forecast the status of patients, particularly those at high risk, even though the second wave has hit all countries very badly. The main aim of the proposed model is the continuous monitoring of home-quarantined and hospitalized patients and the identification of high-risk patients who should be treated with top priority.

The paper is organized into four sections. Section 1 gives the introduction, section 2 presents the literature review, section 3 describes the proposed model and its implementation, and section 4 discusses the experimental setup and results.

Proposed model

To build the machine learning model, a dataset of clinically confirmed SARS-CoV-2 positive patients, tested by Real-Time Reverse Transcription-Polymerase Chain Reaction (RT-PCR), is used. The information collected is for analysis purposes only and will not be used to reveal the identity of any patient. The dataset is based on the following clinical parameters of COVID-19 patients, as released by the Government of India [20]:

1. Blood Pressure (BP)
2. Oxygen saturation (OS)
3. Respiratory Rate (RR)
4. Temperature (T)
5. High Resolution Computed Tomography (HRCT)
6. D-dimer (DD)
7. C-reactive protein (CRP)
8. Ferritin (FE)
9. Erythrocyte sedimentation rate (ESR)
10. Interleukin (INTR)

Data were collected over a span of 7 alternate days. A total of 450 patient records were collected. The dataset contains both male and female patients, with an average age of 49.9 years and 66 features for each patient. Figure 2 shows the high-level system architecture of the proposed model. During the data preprocessing stage, irrelevant data samples were removed. Missing values were substituted with the mean for continuous data and with the mode for binary data. The dataset was kept balanced so that the model remains accurate and unbiased. It was randomly divided into separate training and testing sets.
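The imputation rule described above (mean for continuous columns, mode for binary columns) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `impute` and the toy columns are assumptions made for the example and are not taken from the study's dataset or code.

```python
from statistics import mean, mode

def impute(values, kind):
    """Fill None entries with the column mean (continuous) or mode (binary)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if kind == "continuous" else mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical columns: temperature readings (continuous) and gender (binary).
temperature = [37.0, None, 38.0, 39.0]
gender = [1, 0, None, 1]

temperature_filled = impute(temperature, "continuous")
gender_filled = impute(gender, "binary")
```

Here the missing temperature is replaced by the mean of the three observed readings, and the missing gender code by the most frequent observed code.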

Figure 2. High-level system architecture of the proposed model

The dataset includes four binary class outcome labels, namely "Need home care (Y/N)", "Completely cured (Y/N)", "Need hospitalization (Y/N)" and "Mortality (Y/N)". A total of 66 features, which include the symptoms, were extracted from the original data set for each patient. Figure 3 shows the distribution of the dataset across these classes. The class labels 'Yes' and 'No' are encoded as 1 and 0, respectively. Table 2 summarizes the dataset. The fields RR1, RR3, RR5, RR7, RR9 and RR11 are the respiratory rate records, measured in breaths per minute; 7 readings are recorded for each patient from confirmation of COVID-19 positive status up to 11 days. Similarly, BLP_L and BLP_H are the low and high blood pressure readings, OS is the oxygen saturation level, and TD_1 to TD_11 represent the temperature in degrees Celsius. The HRCT, D-dimer, ferritin, ESR and interleukin data values are also described in Table 2.

Table 2. Summary of dataset

Parameters              Data type   Count   Mean          Standard deviation   Minimum   Maximum
Reports                 int64       450     225.5         130.048068           1         450
Age                     int64       450     45.91777778   15.29006532          17        100
Gender                  int64       450     0.484444444   0.50031418           0         1
RR_D1 to RR_D11         int64       450     --            --                   --        --
BPH_D1 to BPH_D11       float64     450     --            --                   --        --
BPL_D1 to BPL_D11       float64     450     --            --                   --        --
OS_D1 to OS_D11         float64     450     --            --                   --        --
T_D1 to T_D11           float64     450     --            --                   --        --
HRCT_D1 to HRCT_D11     float64     450     --            --                   --        --
Dd_D1 to Dd_D11         float64     450     --            --                   --        --
CRP_D1 to CRP_D11       float64     450     --            --                   --        --
FE_D1 to FE_D11         float64     450     --            --                   --        --
ESR_D1 to ESR_D11       float64     450     --            --                   --        --
INTR_D1 to INTR_D11     float64     450     --            --                   --        --

Figure 3. Dataset distribution in classes.

Prediction Model

The collected data, 450 records with 66 features each, were divided into two parts: training and testing. 80% of the data points were used to train the model and 20% were used for testing. The 80% training portion was validated using the k-fold cross-validation method. Four machine learning classification techniques were used to build the model: K-nearest neighbor, Random Forest Classifier, ExtraTreeClassifier and an ensemble technique. These algorithms are explained in the following sections.
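The 80/20 split followed by k-fold validation of the training portion can be illustrated with a short standard-library sketch. The number of folds is not stated in the text, so k=5 below is an assumption, as are the function names and the seed; the sketch only shows how indices are partitioned, not the authors' actual pipeline.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=7):
    """Shuffle sample indices and hold out a fraction for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists; each sample validates exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# 450 patient records: 360 for training (with k-fold validation), 90 for testing.
train_idx, test_idx = train_test_split_indices(450)
folds = list(k_fold_indices(train_idx, k=5))
```

With 450 records, every fold validates on 72 of the 360 training records and fits on the remaining 288, while the 90 test records are never seen during validation.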

K-Nearest neighbor algorithm

In the training phase, the 80% training portion of the randomly partitioned data is classified using the K-nearest neighbor (KNN) algorithm. KNN is one of the simplest supervised machine learning algorithms for classification and works on a similarity measure. KNN compares a new data point with the available data points and places it in the class to which it is most similar [21]. The value 'K' represents the number of nearest data points to be considered, and the algorithm makes its decision based on these K neighbors. The proposed work uses K=7 and applies KNN directly to the training data set, which is 80% of the total data set. A new data point is classified by searching the entire training dataset for its K most similar neighbors and assigning it to the class that has the most data points among them. For continuous data, similar data points are identified using the following Euclidean distance equation.

$$\text{Euclidean distance} = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2} \tag{1}$$

If the data points are categorical, the model uses the following Hamming distance equation.

$$\text{Hamming distance } D = \sum_{i=1}^{k} |x_i - y_i| \tag{2}$$

where $x = y \Rightarrow D = 0$ and $x \neq y \Rightarrow D = 1$.

Figure 4 shows how KNN has classified the new data point during the training phase, when K=7.

Figure 4. Prediction using KNN

Table 3 shows the parameters used to implement the KNN classifier.

Table 3. KNN parameters used for classification

Name of the parameter   Value used
n_neighbors             7
metric                  Euclidean
weights                 uniform
leaf_size               15
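To make the KNN procedure concrete, the following standard-library sketch implements the Euclidean and Hamming distances from equations (1) and (2) and a uniform-weight majority vote over the K=7 nearest neighbors. The toy data (oxygen saturation and temperature pairs with a binary label) and all function names are illustrative assumptions; this is not the scikit-learn implementation used in the study.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two continuous feature vectors, equation (1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming(x, y):
    """Hamming distance: number of positions where categorical values differ, equation (2)."""
    return sum(1 for a, b in zip(x, y) if a != b)

def knn_predict(train_points, train_labels, query, k=7, distance=euclidean):
    """Return the majority label among the k training points closest to the query."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (oxygen saturation %, temperature in Celsius) readings;
# label 1 stands for "needs hospitalization", 0 for "does not".
points = [(97, 36.8), (98, 37.0), (96, 37.1), (97, 36.9), (95, 37.2),
          (88, 39.0), (86, 39.4), (89, 38.8), (87, 39.1)]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]

prediction = knn_predict(points, labels, (90, 38.7), k=7)
```

For the query (90, 38.7), the seven nearest neighbors contain four points of class 1 and three of class 0, so the uniform vote assigns class 1.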

Random Forest Classifier (RFC) Algorithm

The second method used to train the model is the random forest algorithm [22][21]. It is an ensemble model used for both classification and regression problems. Ensemble methods combine multiple classifiers to solve complex problems and improve the performance of the model. RFC is well suited to large datasets where high accuracy is expected, and it is computationally fast compared with many other machine learning algorithms.

The proposed RFC consists of many decision trees. While building the trees, random data points are drawn from the training sample with replacement. With replacement, some training samples may be used multiple times while constructing the trees, which leads to lower variance without increasing the bias. The algorithm constructs a decision tree for every subset of the sample and obtains the prediction of every tree. In testing, the prediction is made by voting: votes are collected over the predictions of the individual trees, and the most voted prediction is taken as the final prediction. The nodes are split based on a subset of all features, which is again picked randomly. The number of features considered when splitting a node is set to sqrt(n_features).
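The two sources of randomness described above, sampling records with replacement and restricting each split to a random subset of sqrt(n_features) features, can be sketched as follows. The helper names, the seed, and the use of 360 training records and 66 features are assumptions chosen to match the figures quoted in the text; the sketch does not build the trees themselves.

```python
import math
import random

def bootstrap_sample(n_samples, rng):
    """Draw n_samples record indices with replacement (some repeat, some are left out)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(n_features, rng):
    """Pick sqrt(n_features) distinct feature indices to consider at one split."""
    size = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), size))

rng = random.Random(7)
sample = bootstrap_sample(360, rng)        # one tree's training sample
features = random_feature_subset(66, rng)  # candidate features for one node
```

With 66 features, each split considers 8 candidate features, and a bootstrap sample of 360 records typically contains repeated records while omitting others.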

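The decision tree nodes are selected using the following entropy equation.

$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i \tag{3}$$

where $E(S)$ is the entropy and $p_i$ is the probability of class $i$.

The root node is decided using the information gain equation as follows:

$$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X) \tag{4}$$

where $\mathrm{Entropy}(T, X)$ is the entropy of $T$ after it is split on attribute $X$.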
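A small worked example may help in reading equations (3) and (4). The sketch below computes the entropy of a label list and the information gain of a candidate split, taking the post-split entropy to be the size-weighted average entropy of the child nodes, which is the standard definition and is assumed here. The function names and toy labels are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: sum over classes of -p_i * log2(p_i)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(labels, groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted

# Eight hypothetical patients, half labeled 1 and half labeled 0.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
perfect_split = [[1, 1, 1, 1], [0, 0, 0, 0]]
useless_split = [[1, 1, 0, 0], [1, 1, 0, 0]]

gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
```

A balanced binary parent has an entropy of 1 bit; a split that separates the classes completely recovers all of it, while a split that leaves both children balanced gains nothing.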

Table 4 shows the parameters used to implement the random forest classifier.

Table 4. Parameters used to implement the random forest classifier

Name of the parameter   Value used
n_estimators            100
max_depth               10
random_state            7
min_samples_split       3
min_samples_leaf        1

ExtraTreeClassifier (ETC)

The third technique used to build the model is ExtraTreeClassifier [23]. Like RFC, it is an ensemble method in which the results of multiple decision trees are aggregated to predict the final classification. It is very similar to RFC but differs in the way the trees are constructed, and it is also called an extremely randomized trees classifier. In this technique, multiple trees are constructed from the original training data sample instead of using sampling with replacement. In the proposed model, each tree in the forest is built by selecting the best features from the training data set. The split of a node is random and based on the Gini index [24], as given below:

$$\mathrm{Gini} = 1 - \sum_{i=1}^{c} (p_i)^2 \tag{5}$$
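The Gini index of equation (5), and the idea of scoring a randomly placed split with it, can be sketched as follows. The random threshold drawn uniformly between the minimum and maximum of a feature is an assumption about how "random split" is realized, and the function names and toy readings are hypothetical.

```python
import random
from collections import Counter

def gini_impurity(labels):
    """Gini index of a list of class labels: 1 - sum(p_i^2)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def random_split_gini(values, labels, rng):
    """Split one feature at a random threshold and score the split by weighted Gini."""
    threshold = rng.uniform(min(values), max(values))
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    groups = [g for g in (left, right) if g]  # ignore an empty side
    weighted = sum(len(g) / len(labels) * gini_impurity(g) for g in groups)
    return threshold, weighted

# Hypothetical oxygen saturation readings and a binary outcome label.
oxygen = [97, 96, 95, 89, 88, 86]
hospitalized = [0, 0, 0, 1, 1, 1]
threshold, score = random_split_gini(oxygen, hospitalized, random.Random(7))
```

A lower weighted Gini score indicates purer child nodes; a balanced binary node has a Gini index of 0.5 and a pure node has 0.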

With this technique, the variance was low, the accuracy improved, and the computation time was very short.

Table 5 shows the parameters used to implement the ExtraTreeClassifier.

Table 5. Parameters used to implement ExtraTreeClassifier

Name of the parameter   Value used
n_estimators            100
max_depth               10
random_state            7
criterion               Gini index
min_samples_leaf        1
min_samples_split       2

Ensemble model

Ensemble models combine multiple classifiers to boost the performance of the model. This technique can be used for both regression and classification problems. Voting is an ensemble technique in machine learning used to improve the performance of a model. For regression problems, voting makes the prediction by averaging the predictions of the individual regression models. For classification problems, the prediction is based on majority vote: the votes for each class label are summed, and the class label with the most votes is taken as the final prediction [25]. Classification involves two types of voting: 1. soft voting and 2. hard voting. In soft voting, the predicted class probabilities from multiple models are summed, and the class with the largest summed probability is the final prediction. In hard voting, the crisp class labels predicted by multiple models are counted, and the class with the most votes is the final prediction. The proposed model makes its prediction based on the following equation:

$$\sum_{t=1}^{T} d_{t,J} = \max_{j=1}^{C} \sum_{t=1}^{T} d_{t,j} \tag{6}$$

where $T$ is the number of classifiers, $C$ is the number of classes, and $d_{t,j} \in \{0, 1\}$ is the decision of the $t$-th classifier for class $j$, with $t = 1, \ldots, T$ and $j = 1, \ldots, C$. The ensemble chooses the class $J$ that receives the largest total vote.
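The difference between hard and soft voting can be seen in a short example. The sketch below is a plain-Python illustration with made-up probabilities from three hypothetical classifiers; the function names are assumptions, and it does not reproduce the study's ensemble.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over crisp class labels, one label per classifier."""
    return Counter(predictions).most_common(1)[0][0]

def soft_vote(probabilities):
    """Sum class probabilities over classifiers and return the class with the largest sum."""
    n_classes = len(probabilities[0])
    totals = [sum(p[j] for p in probabilities) for j in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)

# Class probabilities [P(class 0), P(class 1)] from three hypothetical classifiers.
probabilities = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
crisp_labels = [max(range(2), key=p.__getitem__) for p in probabilities]

hard = hard_vote(crisp_labels)
soft = soft_vote(probabilities)
```

In this example the crisp labels are 0, 1 and 1, so hard voting selects class 1, whereas the summed probabilities (1.75 for class 0 against 1.25 for class 1) make soft voting select class 0. The two schemes can therefore disagree when one classifier is much more confident than the others.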

Model Evaluation

The performance of the model is evaluated using standard metrics such as accuracy, sensitivity (recall), specificity and F1-score. The number of positive samples correctly predicted by the model is known as the true positives (TP), and the number of samples correctly predicted as negative is known as the true negatives (TN). The model may sometimes misclassify samples: the false positives (FP) are the negative samples incorrectly predicted as positive, and the false negatives (FN) are the positive samples incorrectly predicted as negative.

Accuracy measures how many class labels are identified correctly. It is the ratio of correctly predicted samples to the total number of samples. The following equation is used to calculate the accuracy of the model.

$$\mathrm{Accuracy} = \frac{TN + TP}{TN + TP + FN + FP} \tag{7}$$

Sensitivity measures the true positive predictions, that is, how often the model predicts positive class labels as positive. It is the ratio of true positive predictions to the total number of positive samples. The proposed model uses the following equation to calculate sensitivity.

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{8}$$

Specificity measures the true negative predictions made by the model. It is the ratio of true negative predictions to the total number of negative samples. The proposed model uses the following equation to calculate specificity.

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{9}$$
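The metrics of this section, including the F1-score defined next, can be computed directly from the four confusion-matrix counts. The sketch below uses a made-up set of ten binary labels and assumes that both classes occur and that at least one positive prediction is made, so no denominator is zero; the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall), specificity, precision and F1-score."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tn + tp) / (tn + tp + fn + fp),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Hypothetical ground truth and predictions for one binary label.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

For these labels TP = 3, TN = 5, FP = 1 and FN = 1, giving an accuracy of 0.8, a sensitivity of 0.75 and a specificity of 5/6.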

The F1-score represents the harmonic mean of precision and recall. It reaches its maximum value of 1 when both precision and recall are equal to 1, and in practice it maintains a balance between precision and recall. The equation for the F1-score is as follows.

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{10}$$

Results and discussion

The models were developed in Python using PyCharm Community Edition 2021.1 and the Pandas, NumPy, scikit-learn (sklearn) and pickle libraries. The proposed methodology aims to predict four class labels at once, so MultiOutputClassifier is used. Because the task is a classification problem, the AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve is used to measure the performance of the model. It measures how well the model distinguishes the class labels across various thresholds: the higher the AUC, the better the model performance. ROC curves plot the true positive rate against the false positive rate. To keep track of bias and variance and to check for over-fitting and under-fitting, the KFold method (sklearn.model_selection.KFold) is used.
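As an aid to interpreting the AUC values, the sketch below computes the ROC AUC for one binary label using its rank-based definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. This is equivalent to the area under the ROC curve, but it is a plain-Python illustration on made-up scores, not the library routine used in the study.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the probability that a positive sample outranks a negative one."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
```

Of the four positive-negative pairs, three are ranked correctly, so the AUC is 0.75; a model that ranks every positive above every negative reaches 1.0, and random scoring gives about 0.5.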

Figure 5 shows the most affected parameters on each day in the model. Oxygen saturation is the parameter most affected on the 11th day of data collection.

Figure 5. Clinical parameters distribution for 7 days

As mentioned earlier, KNN, RFC, ETC and a voting ensemble are used to develop the model. Of these, the voting ensemble outperformed the other models, with an accuracy of 98.5%. Table 6 shows the performance scores of KNN with respect to the four class labels.

Table 6. Performance Score of KNN and Ensemble

Table 7 shows the performance score for models built using RFC. This gives better performance than the KNN.

Table 7. The performance score of RFC

Table 8 shows the performance score of the model built using ETC. This model outperformed the other two models.

Table 8. The performance score of ETC

Figure 6 shows the learning curves of the models. A learning curve shows how the model accuracy varies as the number of training samples increases. It shows the validation and training scores of the model and indicates how much the model suffers from bias error and variance error. The graphs make clear that all four models continue to improve as the amount of training data increases. This indicates high variance, which may lead to overfitting; the models can therefore use more training samples to avoid the overfitting problem. The proposed model uses voting to form an ensemble of the other three methods, but no significant improvement in performance is observed.

Figure 6 a. Accuracy of KNN and Ensemble during training phase using AUC- ROC

Figure 6 b. Accuracy of RFC during the training phase using AUC-ROC

Figure 6 c. Accuracy of ETC during the training phase using AUC-ROC

Figure 7 shows the performance of each model using the learning curve method. The models were reviewed during training to avoid over-fitting or under-fitting problems. The graphs make clear that the models performed well in the training phase.

Figure 7 a. Performance accuracy of KNN and Ensemble

Figure 7 b. Performance accuracy of RFC

Figure 7 c. Performance accuracy of ETC

Finally, to visualize the predictions, a User Interface (UI) was developed that health workers can use to predict the patient status from the clinical parameters. The user selects a file with the patient details, and the user interface then shows the prediction results. The user interface is designed entirely in Python with the Flask library. Figure 8 shows a sample screenshot of the UI.

Figure 8. UI of the model

Conclusion

A machine learning based COVID-19 prognosis model is proposed in this work. The model can help health workers predict the severity of a COVID-19 patient's condition. It classifies COVID-19 patients into four classes, namely "Need home care", "Completely cured", "Need hospitalization", and "Mortality", based on the patient's clinical parameters. The proposed model explores the impact of clinical parameters such as fever, oxygen saturation level and blood pressure in COVID-19 patients. Four machine learning techniques were used to build the model: KNN, RFC, ETC and an ensemble technique. All the models performed well during training and testing. The model was validated using the k-fold cross-validation method during the training phase. Although the voting ensemble did not give much improvement in performance, the performance of the model can be improved in future by using a much larger number of samples. In future, the model is also expected to incorporate CT scan and X-ray images to predict the status of COVID-19 patients.

References

[1] ―WHO EMRO | About COVID-19 | COVID-19 | Health topics.‖ [Online]. Available:

http://www.emro.who.int/health-topics/corona-virus/about-covid-19.html. [Accessed: 19-Jun-2021].

[2] ―How Many Coronavirus Strains are There? Novel Coronavirus Types and More.‖ [Online]. Available:

https://www.webmd.com/lung/coronavirus-strains#2. [Accessed: 19-Jun-2021].

[3] ―SARS-CoV-2 variants of concern as of 18 June 2021.‖ [Online]. Available:

https://www.ecdc.europa.eu/en/covid-19/variants-concern. [Accessed: 19-Jun-2021].

[4] “Second wave of Covid-19 to hit India? Here’s why the country must be ready for coronavirus’ second wave - The Financial Express.” [Online]. Available: https://www.financialexpress.com/lifestyle/health/second-wave-of-covid-19-to-hit-india-heres-why-the-country-must-be-ready-for-coronavirus-second-wave/2097089/. [Accessed: 19-Jun-2021].

[5] “Covid-19 in India: Why second coronavirus wave is devastating - BBC News.” [Online]. Available: https://www.bbc.com/news/world-asia-india-56811315. [Accessed: 19-Jun-2021].

[6] T. Fisayo and S. Tsukagoshi, “Three waves of the COVID-19 pandemic,” Postgrad. Med. J., vol. 97, no. 1147, p. 332, 2021.

[7] L. Elliot Major and S. Machin, “Covid-19 and social mobility,” CEP COVID-19 Analysis, no. 004, 2020.

[8] R. Saravana Kumar and P. Manikandan, “Medical big data classification using a combination of random forest classifier and K-means clustering,” Int. J. Intell. Syst. Appl., vol. 10, no. 11, pp. 11–19, 2018.

[9] J. Nayak and B. Naik, “Intelligent system for COVID-19 prognosis: a state-of-the-art survey,” pp. 2908–2938, 2021.

[10] M. van der Schaar et al., “How artificial intelligence and machine learning can help healthcare systems respond to COVID-19,” Mach. Learn., vol. 110, no. 1, pp. 1–9, 2021.

[11] H. Yao et al., “Severity Detection for the Coronavirus Disease 2019 (COVID-19) Patients Using a Machine Learning Model Based on the Blood and Urine Tests,” Front. Cell Dev. Biol., vol. 8, no. July, pp. 1–10, 2020.

[12] M. Pourhomayoun and M. Shakibi, “Predicting mortality risk in patients with COVID-19 using machine learning to help medical decision-making,” Smart Heal., vol. 20, no. April 2020, p. 100178, 2021.

[13] R. Kumar, A. Yadav, A. V Prabhu, and Y. Natarajan, “Since January 2020 Elsevier has created a COVID-19 resource centre with free information in English and Mandarin on the novel coronavirus COVID-19. The COVID-19 resource centre is hosted on Elsevier Connect, the company’s public news and information,” no. January, 2020.

[14] D. Assaf et al., “Utilization of machine-learning models to accurately predict the risk for critical COVID-19,” Internal and Emergency Medicine, vol. 15, no. 8, pp. 1435–1443, 2020.

[15] S. S. Aljameel, I. U. Khan, N. Aslam, M. Aljabri, and E. S. Alsulmi, “Machine Learning-Based Model to Predict the Disease Severity and Outcome in COVID-19 Patients,” Sci. Program., vol. 2021, 2021.

[16] M. Nemati, J. Ansary, and N. Nemati, “Machine-Learning Approaches in COVID-19 Survival Analysis and Discharge-Time Likelihood Prediction Using Clinical Data,” Patterns, vol. 1, no. 5, p. 100074, 2020.

[17] M. Yadav, M. Perumal, and M. Srinivas, “Since January 2020 Elsevier has created a COVID-19 resource centre with free information in English and Mandarin on the novel coronavirus COVID-19. The COVID-19 resource centre is hosted on Elsevier Connect, the company’s public news and information,” no. January, 2020.

[18] L. Brunese, F. Martinelli, F. Mercaldo, and A. Santone, “Machine learning for coronavirus covid-19 detection from chest x-rays,” Procedia Comput. Sci., vol. 176, pp. 2212–2221, 2020.

[19] D. Javor, H. Kaplan, A. Kaplan, S. B. Puchner, C. Krestan, and P. Baltzer, “Deep learning analysis provides accurate COVID-19 diagnosis on chest computed tomography,” Eur. J. Radiol., vol. 133, no. October, p. 109402, 2020.

[20] “Clinical management protocol for COVID-19,” vol. 12, no. 13, pp. 754–757, 2016.

[21] T. M. Mitchell, Machine Learning. 1997.

[22] “Random forests - classification description.” [Online]. Available: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm. [Accessed: 21-Jun-2021].

[23] “ML | Extra Tree Classifier for Feature Selection - GeeksforGeeks.” [Online]. Available: https://www.geeksforgeeks.org/ml-extra-tree-classifier-for-feature-selection/. [Accessed: 21-Jun-2021].

[24] “sklearn.tree.ExtraTreeClassifier - scikit-learn 0.17.dev0 documentation.” [Online]. Available: http://scikit-learn.sourceforge.net/dev/modules/generated/sklearn.tree.ExtraTreeClassifier.html. [Accessed: 21-Jun-2021].

[25] “How to Develop Voting Ensembles With Python.” [Online]. Available: https://machinelearningmastery.com/voting-ensembles-with-python/. [Accessed: 21-Jun-2021].

Author Details:

Dr. Bharati Ainapure completed her B.E. in Computer Science and Engineering at Karnataka University and her M.Tech in Computer Science and Engineering at Visvesvaraya Technological University, Karnataka, in 2008. She received her Ph.D. from JNTU, Anantapur, India. She is currently an Associate Professor in the Computer Engineering Department, Vishwakarma University, Pune, India. She has more than 20 years of experience in teaching and industry and has published more than 30 research papers in renowned international journals and conferences. She was granted an Australian patent in 2020. Her research interests include Cloud Computing, Parallel Computing and High Performance Computing.

Reshma Pise completed her B.E. in Computer Engineering at Karnataka University and her M.E. in Computer Engineering at Savitribai Phule Pune University (SPPU) in 2004, and is currently pursuing a Ph.D. at Vishwakarma University, Pune. She is an Assistant Professor in the Computer Engineering Department, Vishwakarma University, Pune, India, and has more than 20 years of teaching experience. Her research interests include Machine Learning, Data Science and Compiler Design.

Aniket Anil Wagh has been a student at Vishwakarma University, Pune, India, since 2019, where he is pursuing a BTech in Computer Science Engineering. He has been doing research in the field of AI for three years and has worked on many projects based on AI and IoT. He recently published a paper on controlling the spread of the Covid virus using image processing and IoT. His research interests include Data Analysis, Machine Learning, Deep Learning, IoT and Robotics.

Jitesh Dhalu Tejnani has been a student at Vishwakarma University, Pune, India, since 2019, where he is pursuing a BTech in Computer Science and Engineering with a specialization in Business Analytics. He recently published a paper on controlling social distancing during the pandemic using Computer Vision and Deep Learning. His research interests include Data Analysis, Machine Learning, Deep Learning, and AI.

Kaushal Rohit Oza has been a student at Vishwakarma University, Pune, India, since 2019, where he is pursuing a BTech in Computer Science and Engineering with a specialization in Business Analytics. He recently worked on a project that aimed to control social distancing using Computer Vision and Machine Learning. He is passionate about Data Science, and his research interests include Machine Learning, Deep Learning, Artificial Intelligence and IoT.