**A Random Forest-Genetic Algorithm Integration Approach for Hepatocellular Carcinoma Early Prediction**

**Esraa M. Hashem**^{1*}**, Mohamed Refaat Aboel-fotouh**^{2}

^{1}Biomedical Engineering Department, Misr University for Science and Technology (MUST University), 6th of October, Egypt

^{2}PhD, Industrial Engineer, Researcher

**ABSTRACT **

Hepatocellular carcinoma (HCC) is a malignant liver cancer. It has a strong effect on individuals’ lives, and early investigation may decrease the number of annual deaths. Liver disease takes several forms, including fatty liver, cirrhosis, hepatitis, chronic liver disease, liver cancer, and hepatocellular carcinoma (HCC). As the incidence of this disease has risen dramatically in recent years, several researchers have devoted great attention to it through expert systems and machine learning, and machine learning has more recently been applied to cancer prognosis and prediction. In this work, machine learning methods based on k-nearest neighbor, support vector machine, naïve Bayes, and random forest classifiers, integrated with a genetic algorithm, are used to classify 165 liver cancer patients described by 49 features. All classifiers are tested by assessing their output in terms of accuracy, error rate, sensitivity, prevalence, and specificity. Results show that the random forest with parameter optimization using a genetic algorithm (GA) yielded the highest accuracy and specificity, indicating an improvement over the classical approaches commonly used in predictive models of hepatocellular carcinoma. The findings point to several useful attributes, such as alpha-fetoprotein, hemoglobin, alkaline phosphatase, total bilirubin, albumin, and total proteins, which can help in the early prediction of hepatocellular carcinoma.

**Keywords **

machine learning; hepatocellular carcinoma; random forest; genetic algorithm.

**Introduction **

The liver is the human body’s largest internal organ; it plays a significant role in metabolism and has many essential functions. Hepatic disease raises a variety of problems for medical care delivery. Liver disease refers to the many diseases and disorders that can impair liver function. Dysfunction may be primary, but the liver is also secondarily affected by disorders of other organ systems, since it is involved in many processes of metabolism and detoxification [1]. There are various forms of liver disease, including fatty liver, cirrhosis, hepatitis, chronic liver disease, liver cancer, and hepatocellular carcinoma (HCC) [2].

According to the latest data, HCC is one of the deadliest cancers in the world, causing more than 600,000 deaths annually [3]. The disease is less common in females, with males affected two to three times as often, owing to the higher risk factors in males and likely epigenetic factors [4]. HCC has been reported as the third most prevalent cause of death from cancer worldwide and a leading cause of death in patients with cirrhosis. It typically occurs in patients with chronic hepatic disease and/or cirrhosis.

The key causes of HCC are viral infection with hepatitis B and C, alcohol, and aflatoxin B1. These are the most common underlying causes of chronic hepatic disease, which leads to chronic hepatitis and cirrhosis of the liver [4]. HCV is a positive-sense, single-stranded RNA virus. These risk factors can induce damage and mutations in DNA sequences, such as aflatoxin-induced p53 mutation and DNA damage caused by HBV genome insertion [5].

Infection with HCV affects 3 to 4 million people worldwide, and about 170 million people are chronically infected with hepatitis C virus [6]. In all cases, the tumor size when HCC is first identified does not predict the progression of the disease. In fact, for a small HCC, the median volume doubling time may range from 1 to 20 months [7]. Detection and characterization of tumor vascularity are critical for differential diagnosis, the choice of treatment method, and the assessment of therapeutic response in HCC. The clinical behavior of HCC is difficult to estimate.

Therefore, there is a critical need for new techniques for the early detection of hepatic cancer [8].

To date, no clear evidence has emerged that surveillance of high-risk patients increases survival. Diagnostic tools widely used to enhance patient survival are the alpha-fetoprotein (AFP) serum tumor marker, radiographic imaging, liver biopsy, and biomarkers [4].

Nonetheless, a cancer prognosis usually involves several physicians from various specialties who use various subsets of biomarkers and multiple clinical variables, including the patient’s age, general health, the location and type of cancer, and the tumor grade and size [9]. For both hepatitis viruses, the corresponding key markers include measurements of different antigens and antibodies. Cirrhosis, in turn, is typically measured with the Child-Pugh (CP) score, which employs five clinical measures of hepatic disease (total bilirubin, albumin, encephalopathy, ascites, and prothrombin time). Cirrhosis occurs in more than 80% of HCC cases and is specifically known as the primary cause of this disease [10].

Machine learning is a branch of artificial intelligence that applies various statistical, optimization, and probabilistic techniques that enable computers to “learn” from previous examples and identify patterns that are difficult to distinguish in big, complex, or noisy data sets [11].

Machine learning is not new to cancer science. Decision trees (DTs) have been used for nearly 20 years for cancer diagnosis and detection [12].

Machine learning methods are used today in a wide range of applications, from the detection and classification of tumors via X-ray and CT images [13] to the classification of malignancies from genomic (microarray) and proteomic assays. Recently, machine learning has been applied to early disease detection and diagnosis, and several types of classification algorithm, generally known as classifiers, are used for cancer diagnosis. The Artificial Neural Network (ANN), Support Vector Machine (SVM), Genetic Algorithm (GA), Fuzzy Set (FS), and Rough Set (RS) are some of these. They are used to classify cancer data into malignant and benign tumors [9].

This work focuses on the pre-diagnosis of 165 patients with hepatocellular carcinoma using the machine learning classifiers k-nearest neighbor (K-NN), Gaussian naive Bayes (NB), support vector machine (SVM), and random forest (RF), with a genetic algorithm used to improve classifier accuracy, and on ordering all 49 attributes affecting HCC diagnosis.

**Methods **

**Case study **

The work was carried out using the dataset provided by Coimbra’s Hospital and University Centre (CHUC), Portugal [14]. The dataset contains 165 confirmed patients with HCC. Table 1 provides an overview of the heterogeneous dataset, which consists of 23 quantitative features and 26 qualitative features (n = 23 + 26 = 49 clinical variables, including ratio-scaled, dichotomous, and ordinal variables). Only 4.85% of the dataset has complete information about all the features. Overall, missing data represents 10.22% of the whole dataset.

The gender, symptoms, alcohol, cirrhosis, smoking, diabetes, obesity, hemochromatosis, arterial hypertension, chronic renal insufficiency, human immunodeficiency virus, nonalcoholic steatohepatitis, esophageal varices, splenomegaly, portal hypertension, portal vein thrombosis, liver metastasis, radiological hallmark, and class are nominal values. The age at diagnosis and the number of nodules are integer values. The performance status, encephalopathy degree, and ascites degree are ordinal values, while Grams of Alcohol per day, Packs of cigarettes per year, International Normalized Ratio, Alpha-Fetoprotein (ng/mL), Hemoglobin (g/dL), Mean Corpuscular Volume (fl), Leukocytes (G/L), Platelets (G/L), Albumin (mg/dL), Total Bilirubin (mg/dL), Alanine transaminase (U/L), Aspartate transaminase (U/L), Gamma glutamyl transferase (U/L), Alkaline phosphatase (U/L), Total Proteins (g/dL), Creatinine (mg/dL), Major dimension of nodule (cm), Direct Bilirubin (mg/dL), Iron (mcg/dL), Oxygen Saturation (%), and Ferritin (ng/mL) are continuous values.

The features were selected according to the EASL-EORTC (European Association for the Study of the Liver – European Organization for Research and Treatment of Cancer) Clinical Practice Guidelines. The survival target variable is encoded as a binary variable with values 0 and 1, meaning that a patient has not survived or has survived, respectively. Missing values are imputed with simple statistical methods: the median, the mean, or the most frequent value of the attribute. The commonly used machine learning classifiers NB, SVM, RF, and k-NN are then applied to all the data to find the classifier that gives the best accuracy.
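The imputation rule described here (most frequent value for Boolean and categorical attributes, median for integer attributes, mean for decimal attributes) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `impute_column`, the `kind` labels, and the toy columns are assumptions made for the example and are not taken from the study's code or data.

```python
from statistics import mean, median, mode


def impute_column(values, kind):
    """Fill None entries in one column.

    kind: "categorical" -> most frequent observed value,
          "integer"     -> median of the observed values,
          "decimal"     -> mean of the observed values.
    """
    observed = [v for v in values if v is not None]
    if kind == "categorical":
        fill = mode(observed)
    elif kind == "integer":
        fill = median(observed)
    elif kind == "decimal":
        fill = mean(observed)
    else:
        raise ValueError(f"unknown column kind: {kind}")
    return [fill if v is None else v for v in values]


# Hypothetical toy columns; None marks a missing entry.
gender = impute_column([1, 1, None, 0, 1], "categorical")
nodules = impute_column([1, None, 5, 2, 3, 2], "integer")
albumin = impute_column([3.4, None, 4.1, 3.0], "decimal")
```

Each column is imputed independently, so the fill value depends only on the observed entries of that attribute.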

Table 1 HCC data sets and attributes

| **Boolean and categorical data (most frequent)** | **Integer data (median)** | **Decimal data (mean)** |
|---|---|---|
| Gender | Age at diagnosis | International Normalized Ratio: INR |
| Symptoms | Grams of Alcohol per day (Grams/day) | Alpha-Fetoprotein (ng/mL): AFP |
| Alcohol | Packs of cigarettes per year (Packs/year) | Hemoglobin (g/dL) |
| Hepatitis B Surface Antigen: HBsAg | Platelets | Mean Corpuscular Volume (fl): MCV |
| Hepatitis B e Antigen: HBeAg | Alanine transaminase (U/L) | Leukocytes (G/L) |
| Hepatitis B Core Antibody: HBcAb | Aspartate transaminase (U/L) | Albumin (mg/dL) |
| Hepatitis C Virus Antibody: HCVAb | Gamma glutamyl transferase (U/L) | Total Bilirubin (mg/dL) |
| Cirrhosis | Alkaline phosphatase (U/L) | Creatinine (mg/dL) |
| Endemic Countries | Number of Nodules | Major dimension of nodule (cm) |
| Smoking | Iron (mcg/dL) | Direct Bilirubin (mg/dL) |
| Diabetes | Oxygen Saturation % | Ferritin |
| Obesity | Performance Status | Total Proteins: TP |
| Hemochromatosis | Encephalopathy degree | |
| Arterial Hypertension: AHT | Ascites degree | |
| Chronic Renal Insufficiency: CRI | | |
| Human Immunodeficiency Virus: HIV | | |
| Nonalcoholic Steatohepatitis: NASH | | |
| Esophageal Varices | | |
| Splenomegaly | | |
| Portal Hypertension | | |
| Portal Vein Thrombosis | | |
| Liver Metastasis | | |
| Radiological Hallmark | | |

**Methodology **

To find the best techniques for the pre-diagnosis of cancer, researchers have carried out multiple comparative studies in cancer classification. The outcomes obtained from the previous studies are, however, consistent [9]. Numerous studies have been made on HCC and hepatic diseases. Santos et al. [14] studied HCC using a dataset with both heterogeneous and missing data, handled with the heterogeneous Euclidean-overlap metric (HEOM), and clustering techniques (K-means). The findings showed that the proposed approach would effectively detect HCC.

Hassoon et al. [15] introduced a new approach that uses a genetic algorithm to optimize the rules produced by boosted C5.0 in order to detect liver disease in good time. The genetic algorithm aims not to generate rules but to delete unnecessary limits. The suggested solution improved C5.0 accuracy from 81% to 93% [16].

Our proposed classifiers (SVM, NB, K-NN, RF) are implemented in Python and assessed with the performance criteria error rate, sensitivity, specificity, accuracy, and prevalence, which are defined as follows.

Equation 1: the error rate of a classifier is the proportion of the test set that the classifier classifies incorrectly. Accuracy is the percentage of correct classifications (100% minus the error rate, since both are expressed as percentages).

$$\text{Error rate} = \frac{\text{incorrectly classified samples}}{\text{classified samples}} \times 100 \qquad \text{eq. 1}$$

Equation 2: sensitivity is the true positive rate, that is, the proportion of true positive samples (samples whose actual class is positive) that are classified correctly.

$$\text{Sensitivity} = \frac{\text{correctly classified positive samples}}{\text{true positive samples}} \times 100 \qquad \text{eq. 2}$$

Equation 3: specificity is the true negative rate, that is, the proportion of true negative samples (samples whose actual class is negative) that are classified correctly.

$$\text{Specificity} = \frac{\text{correctly classified negative samples}}{\text{true negative samples}} \times 100 \qquad \text{eq. 3}$$

Equation 4: prevalence is the proportion of true positive samples in the entire sample.

$$\text{Prevalence} = \frac{\text{true positive samples}}{\text{total number of samples}} \times 100 \qquad \text{eq. 4}$$
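As a worked illustration of equations 1 to 4, the criteria can be computed from the four cells of a binary confusion matrix. The sketch below reads "true positive samples" in the denominators as all samples whose actual class is positive (and likewise for negatives), which is our reading of the definitions. The function name and the counts are hypothetical and are not reported in the paper.

```python
def evaluation_criteria(tp, fn, tn, fp):
    """Return the five criteria (in percent) from confusion-matrix counts.

    tp: positive samples classified as positive
    fn: positive samples classified as negative
    tn: negative samples classified as negative
    fp: negative samples classified as positive
    """
    total = tp + fn + tn + fp
    positives = tp + fn  # samples whose actual class is positive
    negatives = tn + fp  # samples whose actual class is negative
    return {
        "error_rate": 100.0 * (fp + fn) / total,
        "accuracy": 100.0 * (tp + tn) / total,
        "sensitivity": 100.0 * tp / positives,
        "specificity": 100.0 * tn / negatives,
        "prevalence": 100.0 * positives / total,
    }


# Hypothetical counts for a 42-sample test set.
scores = evaluation_criteria(tp=21, fn=11, tn=8, fp=2)
for name, value in scores.items():
    print(f"{name}: {value:.2f}%")
```

With these definitions the error rate and the accuracy always sum to 100%.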

The data used in this work and the four well-known machine learning methods are summarized in Figure 1.

Figure 1 Overall process. The flowchart contains the following stages.

Raw data: 165 patients, 49 attributes.

Pre-processing: missing Boolean and categorical data are completed with the most frequent value of the given attribute; missing integer data with the median value of the attribute; missing decimal data with the mean of the attribute.

Machine learning classifiers: Gaussian Naive Bayes, K-Nearest Neighbors (K-NN), Support Vector Machine (SVM), Random Forest (RF).

Genetic Algorithm (GA) optimizing parameters: number of neighbors and power parameter for the Minkowski metric (K-NN); regularization parameter (SVM); number of trees in the forest (RF).

Evaluation criteria: error rate, sensitivity, specificity, prevalence, accuracy.

Ranking of features based on the best classifier.

**1. ** **Naive Bayes(NB) Algorithm **

The Gaussian naive Bayes (NB) method is a simple approach to probabilistic inference that has been applied successfully in a variety of machine learning applications [17].

The naïve Bayes classifier is a recognition model based on the Bayes rule and is known to perform better than some other classification methods. The Bayes formula, which is the basis of the NB classifier, is given in equation 5 [18].

$$P(c \mid x) = \frac{p(c)\, p(x \mid c)}{p(x)}, \quad \text{where } x = (x_1, x_2, x_3, \cdots, x_n) \qquad \text{eq. 5}$$

Here x is the vector of attributes, c is the class, $P(c \mid x)$ is the probability of event c given that x has occurred, $p(x \mid c)$ is the probability of event x given that c has occurred, $p(c)$ is the probability of event c, and $p(x)$ is the probability of event x. Substituting x, the Bayes formula can be written as equation 6 [19].

$$P(c \mid x_1, x_2, \cdots, x_n) = \frac{p(c)\, p(x_1, x_2, \cdots, x_n \mid c)}{p(x_1, x_2, \cdots, x_n)} \qquad \text{eq. 6}$$

The conditional probabilities and the class probabilities are estimated in the training phase and are then used to calculate the class label of a testing data point. In a two-class data set, the data point is assigned to the class whose probability is greater [20].
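To make the Bayes rule above concrete, the following is a minimal from-scratch sketch of a Gaussian naive Bayes classifier. It assumes the usual naive factorization of p(x_1, ..., x_n | c) into a product of per-attribute Gaussian densities and compares classes by the logarithm of the numerator p(c) p(x | c), since p(x) is the same for every class. The function names and the toy data are hypothetical; this is not the implementation used in the study.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the prior p(c) and per-attribute mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The small constant keeps every variance strictly positive
        # (an assumption of this sketch).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def log_scores(model, x):
    """Return log p(c) + sum_i log p(x_i | c) for every class c."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for xi, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (xi - m) ** 2 / (2 * var)
        scores[label] = score
    return scores


def predict(model, x):
    """Assign x to the class with the greater posterior score."""
    scores = log_scores(model, x)
    return max(scores, key=scores.get)


# Hypothetical two-attribute toy data with two classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
```

Working in log space avoids numerical underflow when many attributes are multiplied together.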

**2. ** **K-Nearest Neighbors (K-NN) Algorithm **

K-NN is one of the supervised learning methods used in statistical pattern recognition, data mining, and many other fields. It classifies objects in feature space based on the nearest training examples [17]. The algorithm stores all existing cases in their entirety, and new cases are classified according to a similarity calculation. It constructs a multi-dimensional feature space in which the different dimensions correspond to different signal features [21].
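The similarity-based voting described above can be sketched as follows, using the Minkowski distance, whose power parameter is one of the K-NN factors tuned later in the paper. The function names, default values, and toy data are assumptions made for illustration.

```python
from collections import Counter


def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)


def knn_predict(X_train, y_train, x, n_neighbors=3, p=2):
    """Label x by majority vote among its n_neighbors closest stored cases."""
    distances = sorted(
        (minkowski(row, x, p), label) for row, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in distances[:n_neighbors])
    return votes.most_common(1)[0][0]


# Hypothetical stored cases: the algorithm keeps them all and does no training.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]
label_a = knn_predict(X_train, y_train, [0.5, 0.5])
label_b = knn_predict(X_train, y_train, [5.5, 5.5])
```

An odd number of neighbors avoids tied votes in a two-class problem.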

**3. ** **Support Vector Machine (SVM) Algorithm **

SVM is an efficient mechanism that can be applied to both classification and regression. For classification, SVM constructs an N-dimensional hyperplane that divides the data into two categories. These models are closely related to conventional multilayer perceptron neural networks [1]. In the SVM literature, an independent variable is called an attribute, and a transformed attribute used to define the hyperplane is called a feature [22]. In the classification of cancer, the classes are benign and malignant tumors. The purpose of this design is to find the optimal hyperplane that divides the clusters of vectors such that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side. The vectors near the hyperplane are the support vectors [17], as shown in Figure 2.

Figure 2 Support vector machine classifier
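The role of the hyperplane can be illustrated with a small sketch. It does not train an SVM; it assumes a hyperplane w·x + b = 0 is already given (here chosen by hand) and shows how points are assigned to a side and how the vectors nearest the plane, the candidates for support vectors, are identified. All names and values are hypothetical.

```python
def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def distance_to_hyperplane(w, b, x):
    """Geometric distance from x to the hyperplane."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return abs(decision_value(w, b, x)) / norm


# Hand-picked hyperplane x1 + x2 - 5 = 0 and toy points (assumptions).
w, b = [1.0, 1.0], -5.0
points = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
labels = [classify(w, b, p) for p in points]
closest = min(points, key=lambda p: distance_to_hyperplane(w, b, p))
```

A trained SVM chooses w and b so that the distance from the hyperplane to the closest vectors of each class is as large as possible.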

**4. ** **Random Forest (RF) Classifier **

A random forest is a group of unpruned classification trees generated by utilizing bootstrap samples of the training data and random selection of features in tree induction [23]. Unlike other decision tree (DT) algorithms, each tree is grown to the maximum possible depth on its new training dataset using combinations of features, and these fully grown trees are not pruned [24].

For a p-dimensional random vector $X = (X_1, \dots, X_p)^T$ representing the real-valued input or predictor variables and a random variable $Y$ representing the real-valued response, assume an unknown joint distribution $P_{XY}(X, Y)$. The objective is to find a prediction function $f(X)$ for estimating $Y$. The prediction function is determined by a loss function $L(Y, f(X))$ and is chosen to minimize the expected value of the loss (equation 7) [25].

$$E_{XY}\big(L(Y, f(X))\big) \qquad \text{eq. 7}$$
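The two sources of randomness in a random forest, bootstrap sampling of the training data and random selection of features, can be sketched as follows. To stay short, this sketch deliberately replaces the fully grown unpruned trees with depth-one trees (decision stumps), each built on one randomly chosen feature, so it illustrates the bagging and voting mechanism rather than a complete random forest. All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """Best single-threshold split on one feature (fewest training errors)."""
    values = sorted(set(row[feature] for row in X))
    majority = Counter(y).most_common(1)[0][0]
    best = (len(y) + 1, None, majority, majority)  # errors, threshold, left, right
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [lab for row, lab in zip(X, y) if row[feature] <= t]
        right = [lab for row, lab in zip(X, y) if row[feature] > t]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = (sum(lab != l_lab for lab in left)
                  + sum(lab != r_lab for lab in right))
        if errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    return feature, best[1], best[2], best[3]


def stump_predict(stump, x):
    feature, t, l_lab, r_lab = stump
    if t is None:  # the sample had a single distinct value for this feature
        return l_lab
    return l_lab if x[feature] <= t else r_lab


def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap sample
        feature = rng.randrange(len(X[0]))        # random feature selection
        forest.append(
            fit_stump([X[i] for i in idx], [y[i] for i in idx], feature)
        )
    return forest


def forest_predict(forest, x):
    """Majority vote over all trees in the forest."""
    votes = Counter(stump_predict(s, x) for s in forest)
    return votes.most_common(1)[0][0]


# Hypothetical, well-separated toy data.
X = [[1.0, 1.5], [1.2, 1.1], [1.5, 1.9], [1.8, 1.3],
     [8.0, 8.5], [8.3, 8.1], [8.6, 8.9], [8.9, 8.2]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
```

The `n_trees` argument plays the role of the number of trees in the forest, the RF parameter that the paper tunes with the genetic algorithm.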
**5. ** **Genetic Algorithm (GA) **

A genetic algorithm is an optimization technique that can be used to solve substantial optimization problems. A novel hybrid algorithm that combines genetic algorithm optimization with ensemble classifiers is proposed. The ensemble classifiers, consisting of a decision tree classifier and an RF classifier, are used as the classification committee, while the genetic algorithm is used as the random sub-spacing (RS) tool and the feature selector to improve data classification. GA’s use in this algorithm is two-fold. On the one hand, GA acts as an RS tool and a feature selector to identify and rank different features based on their significance.

On the other hand, by using GA to build different feature subsets, different decision trees can be generated, and GA selects the fittest ones for later iterations. This helps the decision tree avoid the pitfall of locally optimal classification and identify essential features.
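A genetic search over a single integer parameter, such as the number of trees in the forest, can be sketched as below. The crossover and mutation operators, the way parents are chosen, and the stand-in fitness function are all assumptions made for illustration; in practice the fitness would be a validation accuracy of the classifier, and the paper does not specify its operators in this detail.

```python
import random


def genetic_search(fitness, low, high, pop_size=10, iterations=5,
                   crossover_prob=0.7, mutation_prob=0.7, seed=0):
    """Search the integer range [low, high] for a value with high fitness."""
    rng = random.Random(seed)
    population = [rng.randint(low, high) for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(iterations):
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(population, 2)   # parents drawn uniformly at random
            if rng.random() < crossover_prob:
                child = (a + b) // 2           # arithmetic crossover (assumed operator)
            else:
                child = max(a, b, key=fitness)  # otherwise copy the fitter parent
            if rng.random() < mutation_prob:
                child += rng.randint(-10, 10)  # small random perturbation
            children.append(min(high, max(low, child)))  # clamp to the range
        population = children                   # the new generation replaces the old
        best = max(population + [best], key=fitness)  # remember the best seen so far
    return best


def toy_fitness(n_trees):
    """Stand-in for a validation accuracy; the peak at 120 is arbitrary."""
    return -abs(n_trees - 120)


best_n_trees = genetic_search(toy_fitness, low=10, high=200)
```

Because the best candidate seen so far is carried along, the returned value is never worse than the best member of the initial population.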

**Results **

An early diagnosis of liver problems will increase the patient’s survival rate. A significant task in cancer research is to differentiate healthy patients from tumor patients and to assign patients to specific cancer subtypes based on their cytogenetic profiles. This is known as the classification problem. Many machine learning algorithms have been established in the literature to support medical diagnosis, but in several real-world tasks the data sets contain missing values, which adversely affect the performance of the classifier. Our approach is to impute the missing values and then apply K-NN, NB, SVM, and RF, dividing the dataset into 75 percent for training and 25 percent for testing to reduce sample bias. The system’s performance in detecting liver cancer is then evaluated using Python.
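The 75/25 division of the data can be sketched with a shuffled index split. The function name, the fixed seed, and the choice to round the test-set size up are assumptions of this sketch, not details given in the paper.

```python
import math
import random


def train_test_split_indices(n_samples, test_fraction=0.25, seed=0):
    """Shuffle the sample indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * n_samples)  # test size rounded up (assumption)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(165)
print(len(train_idx), len(test_idx))
```

Shuffling before splitting keeps the order in which patients were recorded from influencing which cases end up in the test set.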

To achieve optimum accuracy for the classification methods, we integrate a genetic algorithm (GA) to adjust the classifiers’ parameters. For K-NN, the number of neighbors was 2, and with GA its range is from 1 to 50. For SVM, the regularization parameter was 1.02, and with GA its range is from 0 to 1500. For RF, the number of trees in the forest was 129, and with GA its range is from 10 to 200. Table 2 shows the parameters of all classifiers before and after applying GA.

Table 2 Factors for genetic algorithm

| | **K-Nearest Neighbors** | **Support Vector Machine** | **Random forest** |
|---|---|---|---|
| Classifier parameters | Number of neighbors = 2; power parameter for the Minkowski metric = 17 | Regularization parameter = 1.02 | The number of trees in the forest = 129 |
| **GA parameters** | | | |
| Range | Number of neighbors = 1~50; power parameter for the Minkowski metric = 1~50 | Regularization parameter = 0~1500 | The number of trees in the forest = 10~200 |
| Crossover probability | 0.7 | 0.7 | 0.7 |
| Mutation probability | 0.7 | 0.7 | 0.7 |
| Crossover method | new generation | new generation | new generation |
| Selection method | uniform | uniform | uniform |
| Population size | 10 | 50 | 10 |
| Iterations | 10 | 10 | 5 |

Table 3 and Figure 3 compare all performance criteria for all algorithms using the feature selection technique. The performance of the NB, SVM, K-NN, and RF classification algorithms is tested and evaluated by calculating error rate, sensitivity, specificity, prevalence, and accuracy. It can be seen from Table 3 that, after optimizing all classifiers with the genetic algorithm, the RF classifier obtained the best accuracy (73.81%) and specificity (83.33%) and the lowest error rate (26.19%) compared with the other classifiers; the error rate is 28.57% for SVM, 40.48% for K-NN, and 30.95% for NB. After comparing the classifiers’ performance, the features are ranked by priority using the ranking algorithm of the RF classifier to find useful attributes that can help in the early diagnosis of liver disease. The resulting ranking of the attributes is summarized in the attribute-ordering table below. AFP is a serum glycoprotein that was first recognized as a marker for HCC more than 40 years ago and has since been described as able to detect preclinical HCC [17]. Albumin and hemoglobin are important and independent prognosticators for many cancers, including HCC [26].

Table 3 Comparative performance of all classifiers

| Evaluation criteria | NB | K-NN | SVM | RF |
|---|---|---|---|---|
| **Error rate** | 30.95% | 40.48% | 28.57% | **26.19%** |
| **Sensitivity** | 65.62% | 62.50% | 70.37% | 70.00% |
| **Specificity** | 80.00% | 55.56% | 73.33% | 83.33% |
| **Prevalence** | 76.19% | 57.14% | 64.29% | 71.43% |
| **Accuracy** | 69.05% | 59.52% | 71.43% | **73.81%** |

Figure 3 Comparative performance of all classifiers (bar chart of accuracy: NB 69.05%, K-NN 59.52%, SVM 71.43%, RF 73.81%)

Table 4 Ordering of attributes using ranking algorithm for RF classifier

| *Attributes* | *Ranking* |
|---|---|
| *Alpha-Fetoprotein* | 1 |
| *Hemoglobin* | 2 |
| *Albumin* | 3 |
| *Alkaline phosphatase* | 4 |
| *Mean Corpuscular Volume* | 5 |
| *Aspartate transaminase (AST)* | 6 |
| *Ferritin* | 7 |
| *Platelets* | 8 |
| *Iron* | 9 |
| *International Normalized Ratio (INR)* | 10 |
| *Leukocytes* | 11 |
| *Creatinine* | 12 |
| *Age at diagnosis* | 13 |
| *Gamma glutamyl transferase (GGT)* | 14 |
| *Ascites degree* | 15 |
| *Total Bilirubin* | 16 |
| *Major dimension of nodule* | 17 |
| *Performance Status* | 18 |
| *Alanine transaminase* | 19 |
| *Direct Bilirubin* | 20 |
| *Oxygen Saturation* | 21 |
| *Total Proteins* | 22 |
| *Packs of cigarettes per year* | 23 |
| *Grams of Alcohol per day* | 24 |
| *Number of Nodules* | 25 |
| *Symptoms* | 26 |
| *Arterial Hypertension* | 27 |
| *Alcohol* | 28 |
| *Encephalopathy degree* | 29 |
| *Smoking* | 30 |
| *Endemic Countries* | 31 |
| *Portal Hypertension* | 32 |
| *Esophageal Varices* | 33 |
| *Splenomegaly* | 34 |
| *Hepatitis B Surface Antigen* | 35 |
| *Diabetes* | 36 |
| *Chronic Renal Insufficiency* | 37 |
| *Liver Metastasis* | 38 |
| *Hepatitis B Core Antibody* | 39 |
| *Portal Vein Thrombosis* | 40 |
| *Gender* | 41 |
| *Hepatitis C Virus Antibody* | 42 |
| *Obesity* | 43 |
| *Radiological Hallmark* | 44 |
| *Hemochromatosis* | 45 |
| *Cirrhosis* | 46 |

**Discussions **

Liver disease is any damage to the function of the liver that causes illness. Hepatic cirrhosis is the most prevalent risk factor for HCC. HCC can have varying patterns of growth. Some malignant tumors begin as a single tumor that grows larger and only later spreads to other parts of the liver.

Early intervention in liver problems may improve patient survival rates. Machine learning is not new to cancer science, and there have been many achievements in cancer detection and in classifying the type of cancer.

According to our introduced RF classifier with GA and related studies of the characteristics of hepatic cancer [1][17], serum AFP level shows a direct significant correlation with the patient’s age, transaminases, degree of inflammation, and fibrosis stage, and a strong inverse association with platelet count, serum albumin level, and total bilirubin [27].

Relationships between AFP serum levels and serum albumin, AST, ALT, bilirubin, and platelets have been reported [27]. Baig et al. concluded that AFP was a significant HCC marker and also an indicator of HCC risk, mostly in patients with cirrhosis and HCV/HBV infections [28] [29]. Parameters reported to be independent prognostic factors in HCC include bilirubin [29], albumin [29], international normalized ratio (INR) [26], alkaline phosphatase (ALP) [30], albumin-bilirubin (ALBI) grade [30], and ALP-to-platelet ratio [30].

**Conclusion **

Machine learning algorithms have been commonly used to analyze and extract useful information and patterns from massive datasets with noise and missing values in different areas. Automatic classification techniques can reduce the burden on doctors. In this work, the K-NN, NB, SVM, and RF classification algorithms were evaluated for their classification efficiency in terms of accuracy, prevalence, error rate, sensitivity, and specificity on a dataset of 165 HCC patients.

Another interesting finding is that the proposed RF methodology combined with the GA optimization technique provides better results than the other three commonly used approaches on the previously identified performance measures, which can be attributed to the more useful attributes available in the 165-patient HCC dataset, such as alpha-fetoprotein, hemoglobin, INR, iron, leukocytes, total bilirubin, direct bilirubin, indirect bilirubin, albumin, gender, age, and total proteins.

Increasing the number of characteristics increases the classification algorithm’s efficiency, which can aid in the early detection and treatment of hepatic cancer. In conclusion, it is useful to evaluate serum AFP levels as a non-invasive measure of the severity of liver dysfunction, the degree of inflammation, and the stage of fibrosis.

The results therefore suggest that the GA-ensemble algorithm is a promising sample classification algorithm.

**References **

[1] Esraa MH and Mai SM (2014). A Study of Support Vector Machine Algorithm for Liver Disease Diagnosis, *American Journal of Intelligent Systems*, 4-9.

[2] Gatos I, Tsantis S and Spiliopoulos S (2017). A Machine-Learning Algorithm Toward Color Analysis for Chronic Liver Disease Classification, Employing Ultrasound Shear Wave Elastography, *Ultrasound in Medicine & Biology*, 34:1797.

[3] Esraa MH, Mai M and Ayman ME (2015). Clinical and Genomic Strategies for Detecting Hepatocellular Carcinoma in Early Stages: A Systematic Review, *American Journal of Biomedical Engineering*, 5:101.

[4] Saigo K, Yoshida K and Ikeda R (2008). Integration of Hepatitis B Virus DNA Into the Myeloid/Lymphoid or Mixed-Lineage Leukemia (MLL4) Gene and Rearrangements of MLL4 in Human Hepatocellular Carcinoma, *Hum Mutation*, 29:703.

[5] Petruzziello A, Marigliano S and Loquercio G (2016). Global Epidemiology of Hepatitis C Virus Infection: An Up-Date of the Distribution and Circulation of Hepatitis C Virus Genotypes, *World J Gastroenterology*, 22:7824.

[6] Okazaki N, Yoshino M and Yoshida T (1989). Evaluation of the Prognosis for Small Hepatocellular Carcinoma Based on Tumor Volume Doubling Time. A Preliminary Report, *Cancer*, 63:2207.

[7] Burke HB, Bostwick DG, Meiers I and Montironi R (2005). Prostate Cancer Outcome: Epidemiology and Biostatistics, *Analytical and Quantitative Cytology and Histology*, 27:211.

[8] Santos MS, Abreu PH and García-Laencina PJ (2015). A New Cluster-Based Oversampling Method for Improving Survival Prediction of Hepatocellular Carcinoma Patients, *Journal of Biomedical Informatics*, 58:49.

[9] Cruz JA and Wishart DS (2006). Applications of Machine Learning in Cancer Prediction and Prognosis, *Cancer Informatics*, 2:59.

[10] Cicchetti DV (1992). Neural Networks and Diagnosis in the Clinical Laboratory: State of the Art, *Clin Chemistry*, 38:9.

[11] Bocchi L, Coppini G and Nori J (2004). Detection of Single and Clustered Microcalcifications in Mammograms Using Fractals Models and Neural Networks, *Medical Engineering and Physics*, 26:303.

[12] Sallehuddin R (2013). Cancer Detection Using Artificial Neural Network and Support Vector Machine: A Comparative Study, *Jurnal Teknologi*, 65:73.

[13] Hassoon M, Kouhi S and Abdar M (2017). Rule Optimization of Boosted C5.0 Classification Using Genetic Algorithm for Liver Disease Prediction, *International Conference on Computer and Applications (ICCA)*, 299.

[14] Abdar M, Yen NY and Hung JC (2018). Improving the Diagnosis of Liver Disease Using Multilayer Perceptron Neural Network and Boosted Decision Trees, *Journal of Medical and Biological Engineering*, 38:953.

[15] Ramana BV, Babu MP and Venkateswarlu NB (2011). A Critical Study of Selected Classification Algorithms for Liver Disease Diagnosis, *International Journal of Database Management Systems*, 3.

[16] Han J, Kamber M and Pei J (2011). Data Mining: Concepts and Techniques, USA: Elsevier.

[17] Zhang H (2004). The Optimality of Naive Bayes, in *the Seventeenth International Florida Artificial Intelligence Research*, Miami Beach, Florida, USA.

[18] Turhan CG, Kaya M and Yildiz O (2013). Breast Cancer Diagnosis Based on Naïve Bayes Machine Learning Classifier with KNN Missing Data Imputation, in *3rd World Conference on Innovation and Computer Sciences*.

[19] Książek W, Abdar M and Acharya UR (2019). A Novel Machine Learning Approach for Early Detection of Hepatocellular Carcinoma Patients, *Cognitive Systems Research*, 54:116.

[20] Sorich MJ, Miners JO, McKinnon R and Winkler DA (2019). Comparison of Linear and Nonlinear Classification Algorithms for the Prediction of Drug and Chemical Metabolism by Human UDP-glucuronosyltransferase Isoforms, *Journal of Chemical Information and Computer Sciences*, 43.

[21] Dureja H, Gupta S and Madan AK (2008). Topological Models for Prediction of Pharmacokinetic Parameters of Cephalosporins Using Random Forest, Decision Tree and Moving Average Analysis, *Scientia Pharmaceutica*, 76:377.

[22] Pal M (2005). Random Forest Classifier for Remote Sensing Classification, *International Journal of Remote Sensing*, 26:217.

[23] Cutler A, Cutler DR and Stevens JR (2005). Random Forests, in *Ensemble Machine Learning: Methods and Applications*, 157.

[24] Abd-Elfatah S and Khalil F (2014). Evaluation of the Role of Alpha-Fetoprotein (AFP) Levels in Chronic Viral Hepatitis C Patients, Without Hepatocellular Carcinoma (HCC), *Al-Azhar Assiut Medical Journal*, 12.

[25] Baig JA, Alam JM, Mahmood SR and Baig M (2009). Hepatocellular Carcinoma (HCC) and Diagnostic Significance of A-fetoprotein (AFP), *Journal of Ayub Medical College, Abbottabad: JAMC*, 21:72.

[26] Fox R, Berhane S and Teng M (2014). Biomarker-Based Prognosis in Hepatocellular Carcinoma: Validation and Extension of the BALAD Model, *British Journal of Cancer*, 110:2090.

[27] Schöniger-Hekele M, Müller C and Kutilek M (2001). Hepatocellular Carcinoma in Central Europe: Prognostic Features and Survival, *Gut*, 48:103.

[28] Wu SJ, Lin YX and Ye H (2016). Prognostic Value of Alkaline Phosphatase, Gamma-Glutamyl Transpeptidase and Lactate Dehydrogenase in Hepatocellular Carcinoma Patients Treated With Liver Resection, *International Journal of Surgery*, 36:143.

[29] Li MX, Zhao H and Bi X-Y (2017). Prognostic Value of the Albumin-Bilirubin Grade in Patients With Hepatocellular Carcinoma: Validation in a Chinese Cohort, *Hepatology Research: the official journal of the Japan Society of Hepatology*, 47:731.

[30] YQ Y, L. J and L. Y (2016). The Preoperative Alkaline Phosphatase-to-Platelet Ratio Index Is an Independent Prognostic Factor for Hepatocellular Carcinoma After Hepatic Resection, *Medicine (Baltimore)*, 95-51.