An Enhanced Filter Based Heterogeneous Data Classification Framework on Mixed Breast Cancer Databases
Anusha Derangula1*, Prof. SrinivasaReddy Edara2
1,2Department of Computer Science and Engineering, Acharya Nagarjuna University, Andhra Pradesh, India
* Corresponding author’s Email: [email protected]
Medical disease classification is one of the major challenges facing scientific and medical researchers. Most medical databases contain many homogeneous or heterogeneous features, and their high-dimensional feature space and class imbalance make it difficult to predict the disease label. Feature transformation, feature ranking and data classification are the essential approaches used to classify high-dimensional data with a high true positive rate. Feature transformation helps to improve the feature ranking process in a high-dimensional feature space. Most traditional feature transformation approaches, such as min-max variance, probabilistic normalization and min-max normalization, are independent of the data distribution and of multi-class labels. In this work, a novel feature selection based random forest classifier is proposed to improve classification efficiency on medical datasets.
Experimental results show that the proposed heterogeneous classification framework achieves an accuracy of nearly 98.6%, which is higher than that of the traditional nominal data classification models.
Keywords: Breast cancer, heterogeneous databases, classification model, feature ranking.
Machine learning is the process of identifying and analyzing unknown hidden patterns and their relationships in large, uncertain databases. It is often described as the science of getting computers to act without being explicitly programmed. Over the past decade, machine learning has given us practical speech recognition, self-driving cars, effective web search, and an enormously improved understanding of human perception. It is now so pervasive that people may use it several times a day without knowing it. Many researchers also believe it is an excellent way to advance artificial intelligence towards the human level.
Cancer research is one of the main fields of research in medicine. Predicting the different cancer types is important for better treatment and for minimizing severity for patients.
Microarrays can therefore be used for the classification and prediction of different cancers. For example, gene expression data have been employed with good results in the classification of lymphoma, leukemia, breast cancer, and liver cancer. Microarray technology provides a new tool for automating diagnostic work and improving the precision of traditional diagnosis techniques. The expression of thousands of genes can be examined at once with microarrays, and testing for elevated expression of certain genes can help to predict cancer.
The problem in the analysis of microarrays, however, is that gene expression data are ultra-high dimensional (microarray image).
This high dimensionality makes microarray data extremely difficult to process and raises the time and space complexity of any analysis. Therefore, it is important to reduce the data dimensionality before further processing in order to make microarray processing feasible.
As mentioned, a number of methods for classifying cancer types from gene expression have been developed and studied. However, most studies were confined to binary gene selection, and very few considered feature selection and classification across multiple classes, because multi-class gene selection and classification are significantly more difficult than the binary problem. Notably, the most effective and useful classifiers for the accurate diagnosis of cancer from microarray gene expression data are multi-class cancer classifiers such as random forest and support vector machine (SVM). First-generation SVMs could only be used for binary classification tasks. Most real-life diagnostic tasks, however, particularly in cancer, are not binary. In recent years, several algorithms have been applied to multi-class classification, including statistical approaches, evolutionary algorithms, k-nearest neighbours (KNN), naive Bayes (NB), neural networks (NN), decision trees (DT), SVM and the extreme learning machine (ELM). Many algorithms for gene selection have also been proposed in this area.
The gene selection process is challenged by the large number of genes and the small number of samples. The t-score between classes and gene expressions is commonly used in the analysis of microarray gene expression. The two-sample t-test is a statistical test of whether two datasets have been sampled from the same distribution (or have the same mean). In differential expression analysis, the expression values of a specific gene across two classes are assumed to have unequal sample sizes and unequal variances. Therefore, an unpaired t-test is usually carried out on expression array data. The t-statistic is directly proportional to the difference between the class means and inversely proportional to the within-class standard deviations. Small within-class standard deviations and a large between-class mean difference indicate a good class-discriminating gene (small p-value). The p-value is determined from the overlap of the two distributions. The microarray gene classification of cancer nevertheless remains a major problem.
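The unpaired t-score described above can be illustrated with a short sketch. It assumes the unequal-variance (Welch) form of the statistic and uses only the Python standard library; the function names `welch_t` and `rank_genes`, the gene names and the toy expression values are hypothetical and serve only to show the ranking idea.

```python
import math
from statistics import mean, variance


def welch_t(class_a, class_b):
    """Unpaired (Welch) t-statistic for one gene measured in two classes."""
    n_a, n_b = len(class_a), len(class_b)
    # Sample variances (n - 1 in the denominator) of each class.
    var_a, var_b = variance(class_a), variance(class_b)
    # Standard error of the difference between the two class means.
    std_err = math.sqrt(var_a / n_a + var_b / n_b)
    return (mean(class_a) - mean(class_b)) / std_err


def rank_genes(expression, labels):
    """Rank genes by |t|, largest first.

    expression: dict mapping gene name -> list of expression values
    labels: list of 0/1 class labels aligned with the expression values
    """
    scores = {}
    for gene, values in expression.items():
        group0 = [v for v, y in zip(values, labels) if y == 0]
        group1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(group0, group1))
    return sorted(scores, key=scores.get, reverse=True)


# Toy data, invented for illustration only.
labels = [0, 0, 0, 1, 1, 1, 1]
expression = {
    "gene_a": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 3.2],  # classes well separated
    "gene_b": [2.0, 1.0, 3.0, 2.1, 1.9, 2.5, 1.5],  # classes overlap
}
ranking = rank_genes(expression, labels)
print(ranking)
```

A gene with a large mean difference and small within-class spread receives a large |t| and is ranked first; converting t to a p-value would additionally require the t-distribution, which is omitted here.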
As the size of medical databases increases, traditional machine learning models such as decision trees, SVM, neural networks, naive Bayes and fuzzy ensemble learning find it difficult to process the patterns because of noise, high dimensionality and non-relational instances in the medical databases. The major challenges of the existing models also include disease pattern discovery and quality of service. Feature selection and classification are essential requirements for most medical disease pattern discovery models.
Generally, the SVM classification scheme is based on the statistical learning mechanism. The SVM classifier applies structural risk minimization so that the whole classification process is carried out smoothly and effectively.
Another classification method is bagging. Here, each classifier outputs a predicted category, and each prediction counts as a single vote. The class that gains the majority of the votes is taken as the output of the classification; that is, bagging aggregates the predictions by counting votes. Several other classifiers have been derived by improving the bagging method. Bagging is best performed with trees as the base classifiers, because the structure of trees is easy to interpret.
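The voting scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ensemble used in this work: the base learner `train_stump` is a hypothetical one-feature threshold rule that handles two classes only, and the toy rows are invented.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one 'bag')."""
    return [rng.choice(rows) for _ in rows]


def train_stump(rows):
    """Hypothetical base learner: a one-feature threshold rule.

    Each row is (feature_value, label). The threshold is the midpoint
    between the two class means found in this bag (two classes only).
    """
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    if len(by_label) == 1:  # the bag happens to contain a single class
        only = next(iter(by_label))
        return lambda x: only
    means = {y: sum(xs) / len(xs) for y, xs in by_label.items()}
    low, high = sorted(means, key=means.get)
    threshold = (means[low] + means[high]) / 2
    return lambda x: high if x > threshold else low


def bagging_predict(models, x):
    """Each model casts one vote; the most common label wins."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy rows, invented for illustration only.
rows = [(1.0, "benign"), (1.5, "benign"), (2.0, "benign"),
        (6.0, "malignant"), (6.5, "malignant"), (7.0, "malignant")]
rng = random.Random(0)
models = [train_stump(bootstrap_sample(rows, rng)) for _ in range(25)]
print(bagging_predict(models, 1.2), bagging_predict(models, 6.8))
```

A random forest follows the same pattern with decision trees as base learners and an additional random choice of features at each split.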
High dimensionality and imbalance are the key problems of medical datasets.
Traditional machine learning classifiers consider only a subset of the characteristics for classification and disease prediction, and produce a high true negative rate together with high error rates. The medical field is one of the most information-intensive domains, in which clinical data and knowledge are continually generated. An example of such a complex system is an integrated healthcare system model. Healthcare information systems collect and segregate patients' clinical histories, including attributes, patient demographic data, critical functionalities, test inferences, and unstructured data such as audio and video records.
For the medical field and for patients, proper analysis of such information is vital. Intelligent analysis of such aggregated data can support various tasks, such as rapid diagnosis of disease, optimum treatment selection for patients, estimation of treatment duration and outcomes, complex risk determination and optimization of the use of medical resources. The complete computerization of disease diagnosis and treatment in recent decades allows rapid and effective aggregation. A healthcare disease dataset may contain numerous attributes, many of which may not contribute to an algorithm's classification accuracy during diagnosis. In addition, such extraneous attributes reduce the accuracy of disease prediction and add considerable computation time. Attribute selection therefore acts as an optimization step in which a subset of attributes is selected, filtering out the less relevant and noisy attributes for a more precise and effective data representation. During problem solving there are numerous possible solutions in the search space, and the aim is to choose the one that optimizes the processing and produces the best possible output. Optimization problems are those in which the best solution must be determined among all possible solutions. They can be categorized into two types, depending on the attribute types: with discrete-valued attributes they are referred to as combinatorial problems, and with continuous attributes they are referred to as constrained or multimodal problems.
Using this reduced attribute subset in a classifier to detect the presence of disease improves the classification performance of disease diagnosis.
2. Proposed Model
In the proposed framework, an advanced filter-based machine learning model is designed and implemented on medical databases, as shown in Figure 1. Different types of medical datasets are first processed to find outliers and to apply the data transformation. After the data transformation, traditional classification models such as Naive Bayes, Logistic Regression, Multilayer Neural Network, K-Nearest Neighbour, AdaBoost, C4.5 and Random Forest, together with the proposed Random Forest model, are implemented. Finally, statistical measures are used to compare the performance of the proposed model with that of the conventional models.
Figure 1: Proposed heterogeneous medical data classification framework
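The statistical measures are not enumerated in this section. As a minimal sketch, assuming the usual confusion-matrix measures for a two-class problem (accuracy, precision, recall or true positive rate, and F-measure), they could be computed as follows. The function name `binary_metrics` and the toy labels are illustrative assumptions and not part of the proposed framework.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall (true positive rate) and F-measure."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Guard against empty denominators (e.g. no positive predictions).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f_measure": ratio(2 * precision * recall, precision + recall),
    }


# Toy labels, invented for illustration ("M" = malignant, "B" = benign).
y_true = ["M", "M", "M", "B", "B", "B", "B", "B"]
y_pred = ["M", "M", "B", "B", "B", "B", "B", "M"]
scores = binary_metrics(y_true, y_pred, positive="M")
print(scores)
```

On imbalanced data, accuracy alone can look favourable while the recall on the minority class stays low, which is why the true positive rate is reported alongside it.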
In the proposed framework, heterogeneous datasets are used for class prediction with the proposed disease prediction model. Initially, the input data are prepared based on the different class attributes. Nominal attributes are converted to binary attributes, and missing values are then filled with mean values. These filtered data are given to the classifier for a better disease prediction rate. Here, different nominal attributes are used as class labels for the decision-making process.
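The two filtering steps can be sketched in plain Python as follows. This is an illustration under stated assumptions: rows are dictionaries, missing values are represented by `None`, and the attribute names `tumor_size` and `node_caps` as well as the helper names are hypothetical.

```python
def nominal_to_binary(rows, column):
    """Replace one nominal column with 0/1 indicator columns."""
    categories = sorted({r[column] for r in rows if r[column] is not None})
    out = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for c in categories:
            # A missing nominal value stays missing in every indicator.
            new[f"{column}={c}"] = (
                None if r[column] is None else int(r[column] == c)
            )
        out.append(new)
    return out


def fill_missing_with_mean(rows, column):
    """Replace None in a numeric column with the mean of observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    mean = sum(observed) / len(observed)
    return [{**r, column: mean if r[column] is None else r[column]}
            for r in rows]


# Toy records, invented for illustration only.
rows = [
    {"tumor_size": 20.0, "node_caps": "yes"},
    {"tumor_size": None, "node_caps": "no"},
    {"tumor_size": 30.0, "node_caps": None},
]
clean = nominal_to_binary(rows, "node_caps")
for col in list(clean[0]):
    clean = fill_missing_with_mean(clean, col)
print(clean)
```

Applying the conversion first means that a missing nominal value is imputed as the mean of its 0/1 indicators, that is, the observed proportion of each category.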
Algorithm 1: A Proposed Random forest:
Step 1: Input file (Filtered anomaly data)
Step 2: Preprocess anomaly data for missing values.
Step 3: Data transformation for unequal distribution as
For each attribute Ai in DB Do
$$A_i.\text{value} = G.M(A_i) + \frac{A_i.\text{value} \cdot (\text{ScaleMax} - \text{ScaleMin})}{(A_i.\text{value.Max} - A_i.\text{value.Min})}$$
End if Else Continue;
Step 4: For each randomized sample Si
Kernel Probability: The kernel probability is used to estimate the conditional variance of input data features by using the gaussian estimator.
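$B = \text{uniqueCV}(D)$; // Unique column values
$HB = \text{Histobins} = \text{histogrambin}(D)$
Gaussian kernel: $GK(\mu, \sigma) = e^{-\mu^{2} / (2\sigma^{2})}$; $gkv = GK(\mu_{HB}, \sigma_{B})$
Kernel probability: $KP(D) = \left| \mu_{HB} / (\sigma_{HB} \cdot HB) \right|$
Gaussian entropy: $GE(d_i) = GK\left(\sum_i d_i \log(d_i),\ \sigma_d\right)$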
In the above equations, the Gaussian entropy is used to check the feature entropy value based on the Gaussian estimator.
Proposed entropy formula:
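$$PE = e^{-\mu_{D_1}^{2} / (2\sigma_{D_1}^{2})} \cdot \left| \mu_{HB_{D_1}} / (\sigma_{HB_{D_1}} \cdot HB_{D_1}) \right| + e^{-\mu_{D_2}^{2} / (2\sigma_{D_2}^{2})} \cdot \left| \mu_{HB_{D_2}} / (\sigma_{HB_{D_2}} \cdot HB_{D_2}) \right|$$
For each sample in test data
If (PE > 0)
$S_i' = \text{Classify}((D_i, D_j))$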
In the above ensemble-based anomaly detection model, each attribute is checked against the data distribution. If an attribute is not uniformly distributed, it is transformed to a uniform format. For each attribute in the uniformly distributed dataset, the instances are partitioned into a set of sub-partitions based on the classes. After that, a similarity computation is applied to the sub-partitions to find the relevant relational anomaly features.
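The building blocks of Algorithm 1 can be illustrated with a short sketch. It deliberately uses the standard textbook forms (min-max scaling to a target range, equal-width histogram bins, the Gaussian kernel and Shannon entropy) rather than the exact proposed formulas, so it should be read as an approximation under those assumptions. All function names and the sample values are hypothetical.

```python
import math


def min_max_scale(values, scale_min=0.0, scale_max=1.0):
    """Linearly map values onto [scale_min, scale_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: nothing to spread out
        return [scale_min for _ in values]
    span = scale_max - scale_min
    return [scale_min + (v - lo) / (hi - lo) * span for v in values]


def histogram_bins(values, n_bins=4):
    """Counts of values falling into n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero bin width
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def gaussian_kernel(x, sigma):
    """Standard Gaussian kernel exp(-x^2 / (2 * sigma^2))."""
    return math.exp(-(x ** 2) / (2 * sigma ** 2))


def shannon_entropy(counts):
    """Entropy (in bits) of the distribution given by the bin counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


# Toy attribute values, invented for illustration only.
values = [2.0, 4.0, 6.0, 10.0]
scaled = min_max_scale(values)
counts = histogram_bins(scaled, n_bins=4)
entropy = shannon_entropy(counts)
print(scaled, counts, entropy, gaussian_kernel(1.0, 1.0))
```

An attribute whose scaled values spread evenly over the bins has maximal entropy, while an attribute concentrated in one bin has entropy close to zero.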
Experimental results are simulated in a Python environment with third-party libraries.
In the proposed work, standard numerical and nominal breast cancer datasets with large feature spaces are taken for the experimental study. Initially, the nominal datasets are filtered using nominal-to-binary conversion. In this process, each attribute is checked for missing values. If an attribute contains a missing value, it is replaced with the mean of the attribute. Later, these filtered data are given to the proposed classification algorithm for disease prediction and the decision-making process. Tables 1 and 2 present the heterogeneous datasets after nominal-to-binary conversion for the data classification problem. Table 3 presents the standard breast cancer dataset.
Table 1: Heterogeneous breast cancer dataset 1 with class label breast type.
Figure 2: Visualization of Heterogeneous breast cancer dataset 1
Table 2: Heterogeneous breast cancer dataset 2 with class label age
Figure 3: Visualization of Heterogeneous breast cancer dataset 2
Table 3: Standard breast cancer dataset with numerical attributes
Figure 4: Visualization of standard breast cancer dataset
Figure 5: Performance analysis of proposed heterogeneous classification model to the conventional classification models on breast cancer dataset 1.
Figure 6: Performance analysis of proposed heterogeneous classification model to the conventional classification models on breast cancer dataset 2.
Figure 7: Performance analysis of proposed heterogeneous classification model to the conventional classification models on standard breast cancer dataset.
In this paper, advanced machine learning approaches are implemented on medical databases for better decision making. Since most conventional approaches do not account for outliers and data size, the proposed model is more efficient at outlier handling, filtering and data classification. A novel feature selection based random forest classifier is proposed to improve classification efficiency on medical datasets.
Experimental results show that the proposed heterogeneous classification framework achieves an accuracy of nearly 98.6%, which is higher than that of the traditional nominal data classification models.