Academic year: 2022

A Random Forest - Genetic algorithm integration approach for Hepatocellular Carcinoma Early prediction

Esraa M. Hashem1*, Mohamed Refaat Aboel-fotouh2

1Biomedical Engineering Department, Misr University for Science and Technology (MUST University), 6th of October, Egypt

2PhD, Industrial Engineer, Researcher

*[email protected]

ABSTRACT

Hepatocellular carcinoma (HCC) is a malignant liver cancer. It strongly affects individuals’ lives, and early investigation may decrease the number of annual deaths. There are several forms of liver disease, including fatty liver, cirrhosis, hepatitis, chronic liver disease, liver cancer, and hepatocellular carcinoma (HCC). As the incidence of this disease has risen dramatically in recent years, several researchers have devoted great attention to this issue through expert systems and machine learning. Machine learning has been applied more recently to cancer prognosis and prediction. In this work, machine learning methods based on k-nearest neighbor, support vector machine, naïve Bayes, and random forest classifiers, integrated with a genetic algorithm, classify 165 liver cancer patients described by 49 features; all classifiers are tested by assessing their output in terms of accuracy, error rate, sensitivity, prevalence, and specificity. Results show that the random forest with parameter optimization using a genetic algorithm (GA) yielded the highest accuracy and specificity, indicating improvement over the classical approaches commonly used in predictive models of hepatocellular carcinoma. The findings point to various useful attributes, such as alpha-fetoprotein, hemoglobin, alkaline phosphatase, total bilirubin, albumin, and total proteins, which can help in the early prediction of hepatocellular carcinoma.

Keywords

machine learning; hepatocellular carcinoma; random forest; genetic algorithm.

Introduction

The liver is the human body’s largest internal organ; it plays a significant role in metabolism and has many essential functions. Hepatic disease raises a variety of problems for medical care delivery. Liver disease refers to many diseases and disorders that can impair liver function. Dysfunction may be primary, but the liver is also secondarily affected by disorders of other organ systems, since it is involved in many processes of metabolism and detoxification [1]. There are various forms of liver disease, including fatty liver, cirrhosis, hepatitis, chronic liver disease, liver cancer, and hepatocellular carcinoma (HCC) [2].

According to the latest data, HCC is one of the deadliest cancers in the world, causing more than 600,000 deaths annually [3]. The disease is less common in females, being two to three times more frequent in males due to their higher exposure to risk factors and likely epigenetic factors [4]. HCC has been reported as the third most prevalent cause of death from cancer worldwide and is ultimately fatal in patients with cirrhosis. It typically occurs in patients with chronic hepatic disease and/or cirrhosis.

The key causes of HCC are viral infection with hepatitis B and C, alcohol, and aflatoxin B1, which are the most common underlying causes of chronic hepatic disease leading to cirrhosis of the liver and chronic hepatitis [4]. HCV is a positive-stranded, single-stranded RNA virus. These risk factors can induce damage and mutations in DNA sequences, such as aflatoxin-induced p53 mutation and DNA damage caused by HBV genome insertion [5].

Infection with HCV affects 3–4 million people worldwide, and about 170 million people are chronically infected with hepatitis C virus [6]. In all cases, the tumor size when HCC is first identified does not predict the progression of the disease: for a small HCC, the median volume doubling time may range from 1 to 20 months [7]. Detection and characterization of tumor vascularity are critical for the differential diagnosis, the choice of treatment method, and the assessment of HCC therapeutic response. The clinical behavior of HCC is difficult to estimate.

Therefore, there is a critical demand for new techniques for the early detection of hepatic cancer [8].

To date, no clear evidence has emerged that high-risk patient surveillance increases survival gain. Diagnostic tools widely used to enhance patient survival are the alpha-fetoprotein (AFP) serum tumor marker, radiographic imaging, liver biopsy, and biomarkers [4].

Nonetheless, a cancer prognosis usually involves several physicians from various specialties who use various subsets of biomarkers and multiple clinical variables, including the patient’s age, general health, the location and type of cancer, and the tumor grade and size [9]. For both hepatitis viruses, the corresponding key markers include measurements of different antigens and antibodies. Cirrhosis, in turn, is typically measured with the Child-Pugh (CP) score, which employs five clinical measures of hepatic disease (total bilirubin, albumin, encephalopathy, ascites, and prothrombin time). Cirrhosis occurs in more than 80% of HCC cases and is specifically known as the primary cause of this disease [10].

Machine learning is an artificial intelligence branch that applies various statistical, optimization and probabilistic techniques that enable computers to “learn” from previous examples and identify patterns that are difficult to distinguish from big, complex or noisy data sets [11].

Machine learning is not new to cancer science. Decision trees (DTs) have been used for nearly 20 years for cancer diagnosis and detection [12].

Machine learning methods are used today in a wide range of applications, from the detection and classification of tumors via X-ray and CT images [13] to the classification of malignancies from genomic (microarray) and proteomic assays. Recently, machine learning has been applied to early disease detection and diagnosis, and several types of classification algorithms, generally known as classifiers, are used for cancer diagnosis. The Artificial Neural Network (ANN), Support Vector Machine (SVM), Genetic Algorithm (GA), Fuzzy Set (FS), and Rough Set (RS) are some of these. They are used to classify cancer data into malignant and benign tumors [9].

This work focuses on the pre-diagnosis of 165 patients with hepatocellular carcinoma using the machine learning classifiers k-nearest neighbor (K-NN), naive Gaussian Bayes (NB), support vector machine (SVM), and random forest (RF), integrated with a genetic algorithm to improve classifier accuracy and to rank all 49 attributes affecting HCC diagnosis.

Methods

Case study

The work was carried out using the dataset provided by Coimbra’s Hospital and University Centre (CHUC), Portugal [14]. The dataset contains 165 confirmed patients with HCC. Table 1 provides an overview of the heterogeneous dataset, consisting of 23 quantitative features and 26 qualitative features (n=23+26=49 clinical variables, including ratio-scaled, dichotomous, and ordinal variables). Only 4.85% of the dataset has complete information about all the features, and missing data represent 10.22% of the whole dataset.

The gender, symptoms, alcohol, cirrhosis, smoking, diabetes, obesity, hemochromatosis, arterial hypertension, chronic renal insufficiency, human immunodeficiency virus, nonalcoholic steatohepatitis, esophageal varices, splenomegaly, portal hypertension, portal vein thrombosis, liver metastasis, radiological hallmark, and class are nominal values. The age at diagnosis and the number of nodules are integer values. The performance status, encephalopathy degree, and ascites degree are ordinal values, while grams of alcohol per day, packs of cigarettes per year, International Normalized Ratio, Alpha-Fetoprotein (ng/mL), Hemoglobin (g/dL), Mean Corpuscular Volume (fl), Leukocytes (G/L), Platelets (G/L), Albumin (mg/dL), Total Bilirubin (mg/dL), Alanine transaminase (U/L), Aspartate transaminase (U/L), Gamma glutamyl transferase (U/L), Alkaline phosphatase (U/L), Total Proteins (g/dL), Creatinine (mg/dL), Major dimension of nodule (cm), Direct Bilirubin (mg/dL), Iron (mcg/dL), Oxygen Saturation (%), and Ferritin (ng/mL) are continuous values.

The features were selected according to the EASL-EORTC (European Association for the Study of the Liver – European Organization for Research and Treatment of Cancer) Clinical Practice Guidelines. The survival target variable is encoded as a binary variable with values 0 and 1, meaning a patient has not survived or has survived, respectively. Missing values are imputed statistically by calculating the median, mean, or most frequent value. NB, SVM, RF, and K-NN are commonly used machine learning classifiers; all were applied to the data to identify the classifier with the best accuracy.
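As an illustration (not the authors’ code), the column-wise imputation rule above can be sketched in dependency-free Python; the column kinds ("cat", "int", "dec") are a hypothetical encoding of the three attribute groups in Table 1:

```python
from collections import Counter
from statistics import median

def impute(rows, kinds):
    """Fill missing values (None) column-wise: categorical columns get the
    most frequent value, integer columns the median, decimal columns the mean."""
    cols = list(zip(*rows))
    filled = []
    for col, kind in zip(cols, kinds):
        present = [v for v in col if v is not None]
        if kind == "cat":
            fill = Counter(present).most_common(1)[0][0]  # most frequent
        elif kind == "int":
            fill = median(present)                        # median
        else:
            fill = sum(present) / len(present)            # mean
        filled.append([fill if v is None else v for v in col])
    return [list(r) for r in zip(*filled)]
```

A row such as `[None, None, None]` is thus completed with the mode, median, and mean of its respective columns.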

Table 1 HCC data set attributes (imputation statistic in parentheses)

Boolean and categorical data (most frequent): Gender; Symptoms; Alcohol; Hepatitis B Surface Antigen (HBsAg); Hepatitis B e Antigen (HBeAg); Hepatitis B Core Antibody (HBcAb); Hepatitis C Virus Antibody (HCVAb); Cirrhosis; Endemic Countries; Smoking; Diabetes; Obesity; Hemochromatosis; Arterial Hypertension (AHT); Chronic Renal Insufficiency (CRI); Human Immunodeficiency Virus (HIV); Nonalcoholic Steatohepatitis (NASH); Esophageal Varices; Splenomegaly; Portal Hypertension; Portal Vein Thrombosis; Liver Metastasis; Radiological Hallmark.

Integer data (median): Age at diagnosis; Grams of Alcohol per day; Packs of cigarettes per year; Platelets; Alanine transaminase (U/L); Aspartate transaminase (U/L); Gamma glutamyl transferase (U/L); Alkaline phosphatase (U/L); Number of Nodules; Iron (mcg/dL); Oxygen Saturation (%); Performance Status; Encephalopathy degree; Ascites degree.

Decimal data (mean): International Normalized Ratio (INR); Alpha-Fetoprotein (AFP, ng/mL); Hemoglobin (g/dL); Mean Corpuscular Volume (MCV, fl); Leukocytes (G/L); Albumin (mg/dL); Total Bilirubin (mg/dL); Creatinine (mg/dL); Major dimension of nodule (cm); Direct Bilirubin (mg/dL); Ferritin; Total Proteins (TP).


Methodology

To find the best techniques for the pre-diagnosis of cancer, researchers have carried out multiple comparative studies in cancer classification; the outcomes of these previous studies are, however, not always consistent [9]. Numerous studies have addressed HCC and hepatic diseases. Santos et al. [14] studied HCC using the Heterogeneous Euclidean-Overlap Metric (HEOM) for heterogeneous and missing data together with clustering techniques (K-means). The findings showed the proposed method could effectively detect HCC.

Hassoon et al. [15] introduced a new approach that optimizes the rules developed by a boosted C5.0 classifier using a genetic algorithm to detect liver disease in good time. The genetic algorithm does not generate rules but deletes unnecessary rule conditions. The suggested solution improved the C5.0 accuracy from 81% to 93% [16].

Our proposed classifiers (SVM, NB, K-NN, RF) are implemented in Python and measured by performance criteria such as error rate, sensitivity, specificity, accuracy, and prevalence, defined as follows. The error rate (eq. 1) is the proportion of the test set that the classifier classifies incorrectly; accuracy is the percentage of correct classifications (1 − error rate).

\[ \text{Error rate} = \frac{\text{incorrectly classified samples}}{\text{total classified samples}} \times 100 \quad (\text{eq. } 1) \]

Sensitivity (eq. 2) is the true positive rate:

\[ \text{Sensitivity} = \frac{\text{correctly classified positive samples}}{\text{true positive samples}} \times 100 \quad (\text{eq. } 2) \]

Specificity (eq. 3) is the true negative rate, i.e. the proportion of true negative samples correctly classified:

\[ \text{Specificity} = \frac{\text{correctly classified negative samples}}{\text{true negative samples}} \times 100 \quad (\text{eq. } 3) \]

Prevalence (eq. 4) is the proportion of true positive samples in the entire sample:

\[ \text{Prevalence} = \frac{\text{true positive samples}}{\text{total number of samples}} \times 100 \quad (\text{eq. } 4) \]
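Under the definitions of eqs. 1–4 (with prevalence read as the share of actually positive samples), the five criteria can be computed from a confusion matrix as follows; this is an illustrative sketch, not the authors’ implementation:

```python
def performance(y_true, y_pred):
    """Error rate, accuracy, sensitivity, specificity, prevalence (eqs. 1-4),
    all as percentages, for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(pairs)
    return {
        "error_rate": 100 * (fp + fn) / n,    # eq. 1
        "accuracy": 100 * (tp + tn) / n,      # 1 - error rate
        "sensitivity": 100 * tp / (tp + fn),  # eq. 2
        "specificity": 100 * tn / (tn + fp),  # eq. 3
        "prevalence": 100 * (tp + fn) / n,    # eq. 4
    }
```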

The data used in this work and the four well-known machine learning methods are summarized in Figure 1.

Figure 1 Overall process

Raw data: 165 patients, 49 attributes.

Pre-processing:
 Missing Boolean and categorical data: complete with the most frequent value of the given attribute
 Missing integer data: fill with the median value of the attribute
 Missing decimal data: fill with the mean value of the attribute

Machine learning classifiers: Gaussian Naive Bayes, K-Nearest Neighbors (K-NN), Support Vector Machine (SVM), Random Forest (RF).

Evaluation criteria: error rate, sensitivity, specificity, prevalence, accuracy.

Feature ranking based on the best classifier.

Genetic Algorithm (GA) parameter optimization: number of neighbors and power parameter for the Minkowski metric (K-NN); regularization parameter (SVM); number of trees in the forest (RF).

1. Naive Bayes (NB) Algorithm

The naive Gaussian Bayes method (NB) is a simple approach to probabilistic inference that has been applied successfully in a variety of machine learning applications [17].

The Naïve Bayes classifier is a recognition model based on the Bayes rule and is known to perform well compared with several other classification methods. The Bayes formula underlying the NB theorem is given in eq. 5 [18].


\[ P(c \mid x) = \frac{p(c)\, p(x \mid c)}{p(x)}, \quad x = (x_1, x_2, \ldots, x_n) \quad (\text{eq. } 5) \]

Here x is the attribute vector, c is a class, P(c|x) is the probability of class c given that x has occurred, p(x|c) is the probability of x given c, p(c) is the prior probability of class c, and p(x) is the probability of x. Substituting the individual attributes for x, the Bayes formula can be written as follows [19]:

\[ P(c \mid x_1, x_2, \ldots, x_n) = \frac{p(c)\, p(x_1, x_2, \ldots, x_n \mid c)}{p(x_1, x_2, \ldots, x_n)} \quad (\text{eq. } 6) \]

In the testing phase, the class label of a testing data point is calculated using the conditional probabilities and the class probabilities. In a two-class data set, the data point is assigned to the class whose probability is greater [20].
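A minimal Gaussian NB sketch (illustrative only; the small variance floor of 1e-9 is an assumption added for numerical safety) that applies eq. 6 in log form under the usual conditional-independence assumption:

```python
import math
from collections import defaultdict

def train_gnb(X, y):
    """Estimate per-class priors p(c) and per-feature Gaussian (mean, var)."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model = {}
    for c, rows in by_class.items():
        stats = []
        for col in zip(*rows):
            m = sum(col) / len(col)
            var = sum((v - m) ** 2 for v in col) / len(col) + 1e-9  # variance floor
            stats.append((m, var))
        model[c] = (len(rows) / len(X), stats)
    return model

def predict_gnb(model, x):
    """argmax over classes of log p(c) + sum_i log N(x_i; mean_i, var_i)."""
    def log_posterior(c):
        prior, stats = model[c]
        s = math.log(prior)
        for v, (m, var) in zip(x, stats):
            s += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return s
    return max(model, key=log_posterior)
```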

2. K-Nearest Neighbors (K-NN) Algorithm

K-NN is one of the supervised learning methods used in statistical pattern recognition, data mining, and many other areas. It classifies objects in feature space based on the closest training examples [17]. The algorithm stores all available cases and classifies new cases according to a similarity measure. It constructs a multi-dimensional feature space in which the different dimensions correspond to different signal features [21].
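A dependency-free sketch of the K-NN rule under the Minkowski distance; the number of neighbors k and the power parameter p are exactly the two values that are later tuned with the GA:

```python
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, p=2):
    """Classify x by majority vote among its k nearest training cases under
    the Minkowski distance with power parameter p (p=2 gives Euclidean)."""
    dists = sorted(
        (sum(abs(a - b) ** p for a, b in zip(xi, x)) ** (1 / p), yi)
        for xi, yi in zip(X_train, y_train)
    )
    return Counter(label for _, label in dists[:k]).most_common(1)[0][0]
```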

3. Support Vector Machine (SVM) Algorithm

SVM is an efficient mechanism that can be applied to both classification and regression. SVM separates the data into two categories by constructing an N-dimensional hyperplane. These models are strongly linked to conventional multilayer perceptron neural networks [1]. In the SVM literature, an independent variable is named an attribute, and a transformed attribute used to describe the hyperplane is called a feature [22]. In cancer classification, the classes are benign and malignant tumors. The purpose of this design is to find the optimal hyperplane that divides vector clusters so that cases with one target category are on one side of the plane and cases with the other category are on the other side, as shown in Figure 2. The vectors near the hyperplane are the support vectors [17].

Figure 2 Support vector machine classifier
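The optimal-hyperplane idea can be illustrated with a small linear soft-margin SVM trained by sub-gradient descent on the hinge loss (a sketch, not a production solver; the learning rate and epoch count are arbitrary illustrative choices, and C is the regularization parameter the paper later tunes):

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Sub-gradient descent on ||w||^2/2 + C * sum_i max(0, 1 - y_i*(w.x_i + b)),
    with labels y_i in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin: hinge term is active
                w = [wj - lr * (wj - C * yi * xj) for wj, xj in zip(w, xi)]
                b += lr * C * yi
            else:           # correctly classified with margin: only shrink w
                w = [wj - lr * wj for wj in w]
    return w, b

def svm_predict(w, b, x):
    """Return the side of the hyperplane: +1 or -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```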


4. Random Forest (RF) Classifier

This is a group of unpruned classification trees generated by tree induction utilizing a random selection of features and bootstrap iterations of the training dataset [23]. Unlike other decision tree (DT) algorithms, each generated tree, built from an integration of various features, is grown to the maximum possible depth on its training dataset; these fully grown trees are not pruned [24].

Random forest is a combination of unpruned classification trees generated by utilizing bootstrap samples of the training data and random selection of features during tree induction. For a p-dimensional random vector X = (X1, ..., Xp)^T expressing the real-valued input or predictor variables and a random variable Y expressing the real-valued response, assume an unknown joint distribution P_XY(X, Y).
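The two randomization steps named above (bootstrap rows, random feature subset per tree) can be illustrated with a toy forest; depth-1 decision stumps stand in for the unpruned full-depth trees so the sketch stays short:

```python
import random
from collections import Counter

def best_stump(X, y, features):
    """Best single-feature threshold split by training accuracy."""
    best = None
    for j in features:
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [1 if sign * (row[j] - t) > 0 else 0 for row in X]
                acc = sum(p == yi for p, yi in zip(pred, y)) / len(y)
                if best is None or acc > best[0]:
                    best = (acc, j, t, sign)
    return best[1:]                                # (feature, threshold, sign)

def train_forest(X, y, n_trees=25, seed=0):
    """One stump per tree, each fitted on a bootstrap sample of the rows
    and a random subset of about sqrt(p) features."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    k = max(1, int(p ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap rows
        feats = rng.sample(range(p), k)             # random feature subset
        forest.append(best_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def forest_predict(forest, x):
    """Majority vote over the trees."""
    votes = Counter(1 if s * (x[j] - t) > 0 else 0 for j, t, s in forest)
    return votes.most_common(1)[0][0]
```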

The objective is to find a prediction function f(X) for estimating Y. Prediction quality is described by a loss function L(Y, f(X)), and the expected loss is minimized [25]:

\[ E_{XY}\big[ L(Y, f(X)) \big] \quad (\text{eq. } 7) \]

5. Genetic Algorithm (GA)

A genetic algorithm is an optimization technique that can be used to solve significant optimization problems. A novel hybrid algorithm combining Genetic Algorithm optimization with ensemble classifiers is proposed: the ensemble classifiers, consisting of a decision tree classifier and an RF classifier, serve as the classification base, while the Genetic Algorithm is used as the Random Sub-spacing (RS) tool and the feature selector to promote data classification. GA’s use in this algorithm is two-fold: on the one hand, GA acts as an RS tool and a feature selector to identify and rank different features based on their significance.

By using GA to build different subsets, different decision trees can be generated, and GA selects the best ones for later iterations. This helps the decision tree avoid the pitfall of locally optimal classification and identify essential features.
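A compact GA sketch for a single integer hyperparameter (e.g. the RF tree count over 10–200, as in Table 2). The fitness function is a stand-in: in the paper’s setting it would be cross-validated classifier accuracy. Uniform parent selection, blend crossover, and random-reset mutation are simplifications of the settings listed in Table 2:

```python
import random

def genetic_search(fitness, low, high, pop_size=10, iters=10,
                   cx_prob=0.7, mut_prob=0.7, seed=0):
    """Evolve one integer parameter in [low, high]: uniform parent choice,
    blend crossover, random-reset mutation, elitism on the best individual."""
    rng = random.Random(seed)
    pop = [rng.randint(low, high) for _ in range(pop_size)]
    for _ in range(iters):
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        children = []
        while len(children) < pop_size - 1:
            a, b = rng.choice(parents), rng.choice(parents)
            child = (a + b) // 2 if rng.random() < cx_prob else a
            if rng.random() < mut_prob:
                child = rng.randint(low, high)  # random-reset mutation
            children.append(child)
        pop = [ranked[0]] + children            # keep the elite
    return max(pop, key=fitness)
```

With a cross-validated RF accuracy as `fitness`, `genetic_search(fitness, 10, 200)` would search the tree-count range reported in Table 2.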

Results

An early diagnosis of liver problems increases the patient’s survival rate. A significant task in cancer research is to differentiate healthy individuals from tumor patients and to identify specific cancer subtypes based on patients’ cytogenetic profiles; this is a classification problem. Many machine learning algorithms have been established to support medical diagnosis, but in several real-world tasks the data sets contain missing values, which adversely affect classifier performance. Our approach imputes the missing values and then classifies with K-NN, NB, SVM, and RF, dividing the dataset into 25 percent test and 75 percent training sets to reduce sample bias. The system’s performance in detecting liver cancer is then evaluated using Python.

To achieve optimum accuracy for the classification methods, we integrate a genetic algorithm (GA) to tune the classifier parameters. For K-NN, the number of neighbors was 2; after applying GA, its search range is 1 to 50. For SVM, the regularization parameter was 1.02; after applying GA, its search range is 0 to 1500. For RF, the number of trees in the forest was 129; after applying GA, its search range is 10 to 200. Table 2 shows the parameters of all classifiers before and after applying GA.

Table 2 Factors for the genetic algorithm

Before GA:
K-Nearest Neighbors: number of neighbors = 2; power parameter for the Minkowski metric = 17.
Support Vector Machine: regularization parameter = 1.02.
Random forest: number of trees in the forest = 129.

GA parameters:
K-Nearest Neighbors: number of neighbors range = 1~50; power parameter for the Minkowski metric = 1~50; crossover probability = 0.7; mutation probability = 0.7; crossover method = new generation; selection method = uniform; population size = 10; iterations = 10.
Support Vector Machine: regularization parameter = 0~1500; crossover probability = 0.7; mutation probability = 0.7; crossover method = new generation; selection method = uniform; population size = 50; iterations = 10.
Random forest: number of trees in the forest = 10~200; crossover probability = 0.7; mutation probability = 0.7; crossover method = new generation; selection method = uniform; population size = 10; iterations = 5.

Table 3 and Figure 3 compare all performance criteria for all algorithms using the feature selection technique. The performance of the NB, SVM, K-NN, and RF classification algorithms is tested and evaluated by calculating error rate, sensitivity, specificity, prevalence, and accuracy. It can be seen from Table 3 that, after genetic-algorithm optimization of all classifiers, the RF classifier obtained the best accuracy (73.81%) and specificity (83.33%) and the lowest error rate (26.19%); the error rate for SVM is 28.57%, for K-NN 40.48%, and for NB 30.95%. After comparing the classifiers’ performance, the features are ranked by priority using the ranking algorithm for the RF classifier to find useful attributes that can help in the early diagnosis of liver disease. Table 4 summarizes the ranking of the attributes. AFP is a serum glycoprotein that was first recognized as a marker for HCC more than 40 years ago and has since been described to detect preclinical HCC [17]. Albumin and hemoglobin are important and independent prognosticators for many cancers, including HCC [26].
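The paper ranks attributes with the RF classifier’s own importance measure; as a generic, dependency-free stand-in, permutation importance ranks features by how much shuffling each column degrades a fitted model’s accuracy:

```python
import random

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Rank feature indices by the mean drop in accuracy after shuffling
    each column in turn; most important feature first."""
    rng = random.Random(seed)
    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)
    base = accuracy(X)
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the link between feature j and the labels
            total += base - accuracy(
                [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)])
        drops.append(total / n_repeats)
    return sorted(range(len(drops)), key=lambda j: -drops[j])
```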

Table 3 Comparative performance of all classifiers

Evaluation criteria   NB       K-NN     SVM      RF
Error rate            30.95%   40.48%   28.57%   26.19%
Sensitivity           65.62%   62.50%   70.37%   70.00%
Specificity           80.00%   55.56%   73.33%   83.33%
Prevalence            76.19%   57.14%   64.29%   71.43%
Accuracy              69.05%   59.52%   71.43%   73.81%


Figure 3 Comparative performance of all classifiers

Table 4 Ordering of attributes using the ranking algorithm for the RF classifier

Attributes Ranking

Alpha-Fetoprotein 1

Hemoglobin 2

Albumin 3

Alkaline phosphatase 4

Mean Corpuscular Volume 5

Aspartate transaminase (AST) 6

Ferritin 7

Platelets 8

Iron 9

International Normalized Ratio (INR) 10

Leukocytes 11

Creatinine 12

Age at diagnosis 13

Gamma glutamyl transferase (GGT) 14

Ascites degree 15

Total Bilirubin 16

Major dimension of nodule 17

Performance Status 18

Alanine transaminase 19

Direct Bilirubin 20

Oxygen Saturation 21

Total Proteins 22

Packs of cigarettes per year 23

Grams of Alcohol per day 24

Number of Nodules 25

Symptoms 26

Arterial Hypertension 27

Alcohol 28


Encephalopathy degree 29

Smoking 30

Endemic Countries 31

Portal Hypertension 32

Esophageal Varices 33

Splenomegaly 34

Hepatitis B Surface Antigen 35

Diabetes 36

Chronic Renal Insufficiency 37

Liver Metastasis 38

Hepatitis B Core Antibody 39

Portal Vein Thrombosis 40

Gender 41

Hepatitis C Virus Antibody 42

Obesity 43

Radiological Hallmark 44

Hemochromatosis 45

Cirrhosis 46

Discussions

Liver disease is any damage to the function of the liver that causes illness. Hepatic cirrhosis is the most prevalent risk factor for HCC. HCC can have varying patterns of growth: some malignant tumors begin as a single tumor that grows larger and only later spreads to other parts of the liver.

Early intervention in liver problems may improve patients’ survival rates. Machine learning is not new to cancer science, and there have been many achievements in cancer detection and in classifying cancer types.

Our RF classifier with GA agrees with related studies of the characteristics of hepatic cancer [1][17]: serum AFP level showed a direct significant correlation with the patient’s age, transaminases, degree of inflammation, and fibrosis stage, and a strong inverse association with platelet count, serum albumin level, and total bilirubin [27].

Relationships of AFP serum levels with serum albumin, AST, ALT, bilirubin, and platelets have been reported [27]. Baig et al. concluded that AFP was a significant HCC marker and also an HCC risk indicator, mostly in patients with cirrhosis and HCV/HBV infections [28][29]. Parameters reported to be independent HCC prognostic factors include bilirubin [29], albumin [29], international normalized ratio (INR) [26], alkaline phosphatase (ALP) [30], albumin-bilirubin (ALBI) grade [30], and the ALP-to-platelet ratio [30].

Conclusion

Machine learning algorithms have been commonly used to analyze and extract useful information and patterns from massive datasets with noise and missing values in different areas. Automatic classification techniques can reduce the burden on doctors. In this work, the KNN, NB, SVM, and RF classification algorithms were evaluated for classification efficiency in terms of accuracy, prevalence, error rate, sensitivity, and specificity on the 165-patient HCC dataset.

Another interesting finding, however, is that the proposed RF methodology combined with the GA optimization technique provides better results than the other three commonly used approaches on all the previously identified performance measures. This can be attributed to the useful attributes, such as alpha-fetoprotein, hemoglobin, INR, iron, leukocytes, total bilirubin, direct bilirubin, indirect bilirubin, albumin, gender, age, and total proteins, available in the 165-patient HCC dataset.

Increasing the number of characteristics increases the classification algorithm’s efficiency, which can aid in the early detection and treatment of hepatic cancer. In conclusion, it is useful to evaluate serum AFP levels as a non-invasive measure of the severity of liver dysfunction, the degree of inflammation, and the stage of fibrosis.

The results suggest that the GA-ensemble algorithm is a promising sample classification algorithm.

References

[1] Esraa MH and Mai SM (2014). A Study of Support Vector Machine Algorithm for Liver Disease Diagnosis, American Journal of Intelligent Systems, 4-9.

[2] Gatos I, Tsantis S and Spiliopoulos S (2017). A Machine-Learning Algorithm Toward Color Analysis for Chronic Liver Disease Classification, Employing Ultrasound Shear Wave Elastography, Ultrasound in Medicine & Biology, 34:1797.

[3] Esraa MH, Mai M and Ayman ME (2015). Clinical and Genomic Strategies for Detecting Hepatocellular Carcinoma in Early Stages: A Systematic Review, American Journal of Biomedical Engineering, 5:101.

[4] Saigo K, Yoshida K and Ikeda R (2008). Integration of Hepatitis B Virus DNA into the Myeloid/Lymphoid or Mixed-Lineage Leukemia (MLL4) Gene and Rearrangements of MLL4 in Human Hepatocellular Carcinoma, Human Mutation, 29:703.

[5] Petruzziello A, Marigliano S and Loquercio G (2016). Global Epidemiology of Hepatitis C Virus Infection: An Up-Date of the Distribution and Circulation of Hepatitis C Virus Genotypes, World Journal of Gastroenterology, 22:7824.

[6] Okazaki N, Yoshino M and Yoshida T (1989). Evaluation of the Prognosis for Small Hepatocellular Carcinoma Based on Tumor Volume Doubling Time. A Preliminary Report, Cancer, 63:2207.

[7] Burke HB, Bostwick DG, Meiers I and Montironi R (2005). Prostate Cancer Outcome: Epidemiology and Biostatistics, Analytical and Quantitative Cytology and Histology, 27:211.

[8] Santos MS, Abreu PH and García-Laencina PJ (2015). A New Cluster-Based Oversampling Method for Improving Survival Prediction of Hepatocellular Carcinoma Patients, Journal of Biomedical Informatics, 58:49.

[9] Cruz JA and Wishart DS (2006). Applications of Machine Learning in Cancer Prediction and Prognosis, Cancer Informatics, 2:59.

[10] Cicchetti DV (1992). Neural Networks and Diagnosis in the Clinical Laboratory: State of the Art, Clinical Chemistry, 38:9.

[11] Bocchi L, Coppini G and Nori J (2004). Detection of Single and Clustered Microcalcifications in Mammograms Using Fractals Models and Neural Networks, Medical Engineering and Physics, 26:303.

[12] Sallehuddin R (2013). Cancer Detection Using Artificial Neural Network and Support Vector Machine: A Comparative Study, Jurnal Teknologi, 65:73.

[13] Hassoon M, Kouhi S and Abdar M (2017). Rule Optimization of Boosted C5.0 Classification Using Genetic Algorithm for Liver Disease Prediction, International Conference on Computer and Applications (ICCA), 299.

[14] Abdar M, Yen NY and Hung JC (2018). Improving the Diagnosis of Liver Disease Using Multilayer Perceptron Neural Network and Boosted Decision Trees, Journal of Medical and Biological Engineering, 38:953.

[15] Ramana BV, Babu MP and Venkateswarlu NB (2011). A Critical Study of Selected Classification Algorithms for Liver Disease Diagnosis, International Journal of Database Management Systems, 3.

[16] Han J, Kamber M and Pei J (2011). Data Mining: Concepts and Techniques, USA: Elsevier.

[17] Zhang H (2004). The Optimality of Naive Bayes, in the Seventeenth International Florida Artificial Intelligence Research Conference, Miami Beach, Florida, USA.

[18] Turhan CG, Kaya M and Yildiz O (2013). Breast Cancer Diagnosis Based on Naïve Bayes Machine Learning Classifier with KNN Missing Data Imputation, in 3rd World Conference on Innovation and Computer Sciences.

[19] Książek W, Abdar M and Acharya UR (2019). A Novel Machine Learning Approach for Early Detection of Hepatocellular Carcinoma Patients, Cognitive Systems Research, 54:116.

[20] Sorich MJ, Miners JO, McKinnon R and Winkler DA (2019). Comparison of Linear and Nonlinear Classification Algorithms for the Prediction of Drug and Chemical Metabolism by Human UDP-Glucuronosyltransferase Isoforms, Journal of Chemical Information and Computer Sciences, 43.

[21] Dureja H, Gupta S and Madan AK (2008). Topological Models for Prediction of Pharmacokinetic Parameters of Cephalosporins Using Random Forest, Decision Tree and Moving Average Analysis, Scientia Pharmaceutica, 76:377.

[22] Pal M (2005). Random Forest Classifier for Remote Sensing Classification, International Journal of Remote Sensing, 26:217.

[23] Cutler A, Cutler DR and Stevens JR (2005). Random Forests, in Ensemble Machine Learning: Methods and Applications, 157.

[24] Abd-Elfatah S and Khalil F (2014). Evaluation of the Role of Alpha-Fetoprotein (AFP) Levels in Chronic Viral Hepatitis C Patients without Hepatocellular Carcinoma (HCC), Al-Azhar Assiut Medical Journal, 12.

[25] Baig JA, Alam JM, Mahmood SR and Baig M (2009). Hepatocellular Carcinoma (HCC) and Diagnostic Significance of A-fetoprotein (AFP), Journal of Ayub Medical College, Abbottabad: JAMC, 21:72.

[26] Fox R, Berhane S and Teng M (2014). Biomarker-Based Prognosis in Hepatocellular Carcinoma: Validation and Extension of the BALAD Model, British Journal of Cancer, 110:2090.

[27] Schöniger-Hekele M, Müller C and Kutilek M (2001). Hepatocellular Carcinoma in Central Europe: Prognostic Features and Survival, Gut, 48:103.

[28] Wu SJ, Lin YX and Ye H (2016). Prognostic Value of Alkaline Phosphatase, Gamma-Glutamyl Transpeptidase and Lactate Dehydrogenase in Hepatocellular Carcinoma Patients Treated with Liver Resection, International Journal of Surgery, 36:143.

[29] Li MX, Zhao H and Bi XY (2017). Prognostic Value of the Albumin-Bilirubin Grade in Patients with Hepatocellular Carcinoma: Validation in a Chinese Cohort, Hepatology Research, 47:731.

[30] YQ Y, L. J and L. Y (2016). The Preoperative Alkaline Phosphatase-to-Platelet Ratio Index Is an Independent Prognostic Factor for Hepatocellular Carcinoma after Hepatic Resection, Medicine (Baltimore), 95-51.
