An Intelligent Expert System Model to Predict and Diagnose Breast Cancer Disease Using Big Data Analytics
1 M. Vengateshwaran, 2 N. Valarmathi, 3 V. Sharmila, 4 Dr. P. Ezhumalai
1,2,3 Assistant Professor, 4 Professor & HOD
Department of Computer Science and Engineering & Information Technology
1 Sri Sai Ram Institute of Technology (Autonomous), Chennai
2 M.Kumaraswamy College of Engineering (Autonomous), Karur
3,4 R.M.D Engineering College (Autonomous), Chennai, Tamil Nadu, India
E-mail: 1[email protected]
ABSTRACT
Nowadays, people pay little attention to their health because of busy lifestyles. In such situations, big data plays a significant role in providing an effective solution by observing and monitoring the health of patients in hospitals and clinics. Clinical areas are a rich data source that must be organized effectively and usefully. Disease reports and records are used for gathering business intelligence and identifying key trends in technology development.
Cancer is a collection of diseases that can affect any part of the body, including the breast. Breast cancer is one of the most dangerous diseases, causing an increasing number of deaths, and it is among the leading causes of death in women worldwide. The primary method of detecting breast cancer is mammography; when a radiologist's visual inspection of the images suggests disease, follow-up tests such as biopsy and PET are proposed. These tests take considerable time and are costly. In this work, big data analytics techniques are applied to a breast cancer dataset to determine the stage of breast cancer. The accuracy of different classification techniques, namely Support Vector Machine (SVM), Decision Tree, Naive Bayes (NB), k-Nearest Neighbors (k-NN), Classification and Regression Trees (CART), Random Forest, and Logistic Regression, is calculated on the Wisconsin Breast Cancer (Original) dataset, and the model providing the highest accuracy is used to predict the stage of cancer.
Keywords: Breast cancer, Classification techniques, CART, Big data
I. INTRODUCTION

1.1 BIG DATA
Big data refers to data that is enormous in size and exceeds the processing capacity of conventional database systems. It includes the data produced by different devices and applications. It is now used in agriculture, medicine, marketing, social media, and business informatics.
1.2 CHARACTERISTICS OF BIG DATA

Big data has the following characteristics.
Fig.1 Characteristics of BigData
✓ Volume: How much data there is (size grows from terabytes to exabytes to zettabytes)
✓ Velocity: How fast the data is processed (batch data, real-time data, streaming data)
✓ Variety: The various types of data (structured, semi-structured, unstructured)
✓ Veracity: How trustworthy the data is (data inconsistency, ambiguity, deception)
Fig.2 variety of data
Structured Data (RDBMS [ERP, CRM], data warehouses, Microsoft Project plan files): about 10%. Structured data has a defined length and format. Examples of structured data include numbers, dates, and groups of words and numbers called strings (for example, a customer's name, address, etc.), typically stored in SQL databases. Examples: input data, click-stream data, gaming-related data, and so on.
Semi-Structured Data (XML): about 10%.
Unstructured Data (video, audio, text messages, blogs, weather patterns, location coordinates, web logs and clickstreams, sensor/M2M data, e-mail, social media, geospatial data): about 80%. Unstructured data has no defined length or format.
Today about 80% of an organization's data is in unstructured form and 20% is in structured form.
Why Big Data Is Needed:
✓ Increased storage capacity
✓ Increased processing power
✓ Availability of data

1.3 BIG DATA TECHNOLOGIES

Apache's Hadoop File System:
Hadoop is a Java-based free software framework maintained by the Apache Software Foundation. The Hadoop Common package contains the essential Java Archive (JAR) files and scripts needed to start Hadoop, and it provides file-system and OS abstractions. The Hadoop Distributed File System (HDFS) communicates over TCP/IP, and clients communicate with each other via Remote Procedure Call (RPC).
Fig.3 HDFS Framework Architecture
HDFS has a master/slave architecture consisting of one master node and a large number of slave nodes, generally one per machine in the cluster. A file is first split and its blocks are distributed across multiple nodes in the cluster. The NameNode is responsible for opening, closing, and renaming files. The DataNodes serve client read/write requests; they store all files as replicated blocks and retrieve them whenever required. HDFS stores files in a replicated manner after breaking each file into fixed-size blocks; the default block size is 64 MB, and each block is kept in three copies. The NameNode and DataNodes normally run on the GNU/Linux operating system. HDFS is designed for mostly immutable files and is not well suited to concurrent write operations, and interactive jobs are inconvenient to execute on it. HDFS stores huge files (gigabytes to petabytes) across many machines.
1.4 MAPREDUCE
The MapReduce framework provides parallel processing for huge amounts of data. In MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step); the results are then gathered and combined to form an output (the Reduce step). MapReduce programs run on Hadoop and can be written in languages such as Java and Python. MapReduce can exploit data-parallel processing.
Fig.4 Map reduce
"Map" step
The mapper is applied in parallel to the input data. It takes input (k1, v1) pairs from HDFS and produces a list of intermediate (k2, v2) pairs. The mapper output is partitioned per reducer, i.e., according to the number of reduce tasks for the job.
"Reduce" step
The reducer gathers the intermediate results from the data nodes and forms the output. It takes (k2, list(v2)) pairs as input, aggregates the values in list(v2), and produces (k3, v3) pairs as the final result.
How Map and Reduce work together
The Map function is applied in parallel to each input partition and emits key/value pairs; the framework then groups all values that share the same key and passes each group to the Reduce function, which transforms them into the final values.
Uses of MapReduce
MapReduce is used in various applications such as distributed pattern-based searching, document clustering, machine learning, and statistical machine translation. The inputs and outputs of MapReduce jobs are usually stored in a distributed file system.
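The Map and Reduce steps described above can be illustrated with word counting, the canonical MapReduce example. The following is a minimal single-process sketch in plain Python; on a real Hadoop cluster the framework would distribute these calls across nodes, and the shuffle would move data between them.

```python
from collections import defaultdict

def map_step(document):
    """Mapper: emit an intermediate (word, 1) pair for every word."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_step(key, values):
    """Reducer: aggregate the value list for one key into a final (k3, v3) pair."""
    return (key, sum(values))

# Two hypothetical input documents (one per input split).
documents = ["big data needs big tools", "map reduce processes big data"]

# Map phase: apply the mapper to every input split.
intermediate = [pair for doc in documents for pair in map_step(doc)]

# Shuffle + Reduce phase: group by key, then aggregate each group.
counts = dict(reduce_step(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # 'big' appears three times across the two documents
```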
1.5 BREAST CANCER
Breast cancer is one of the most dangerous diseases, causing an increasing number of deaths, and it is among the leading causes of death in women worldwide. The primary method of detecting breast cancer is mammography; when a radiologist's visual inspection of the images suggests disease, the next level of testing, such as biopsy or PET, is recommended. Mammography is first used to identify malignant growth in the body. The images are visually inspected and the presence of disease is reported as positive or negative. The accuracy of image reading is reduced by reader fatigue; to improve reading accuracy, techniques such as computer-aided design, neural networks, and image processing are employed.
Previous works focus on increasing the accuracy of mammography screening, simplifying disease identification, and helping patients self-assess the presence and stage of disease rather than relying on a second examination. In some cases mammographic screening misclassifies non-cancerous growths, and the prediction can be labeled a false positive or a false negative. A false negative error on a positive case can leave a malignancy unidentified during a short but critical period because of its high growth rate. Moreover, very dense breast tissue yields very little information about malignant cells under mammography; because of this lack of accuracy, the radiologist typically refers the patient to a next level of detection such as MRI, PET scan, or biopsy. Computer-aided analysis of the mammography output helps predict cancer and classify the detected growth as benign or malignant, and it helps reduce cost.
Various methodologies such as machine learning, deep learning, and computer-aided design technologies are used to recognize diseased cells from mammographic image data. A performance comparison between machine learning algorithms, namely Support Vector Machine (SVM), Decision Tree (C4.5), Naive Bayes (NB), and k-Nearest Neighbors (k-NN), on the Wisconsin Breast Cancer (Original) dataset is conducted in [5]. The main objective of our paper is to classify the data with respect to efficiency, accuracy, and so on. Experimental results show that the decision tree (CART) gives the highest accuracy (100%) with the lowest error rate.
The primary aim of this paper is to use data analytics techniques to accurately predict whether an extracted tumor is benign or malignant. Compared with traditional methods of cancer detection, this approach reduces both time and cost.
II. LITERATURE SURVEY
2.1 Data Classification Using Support Vector Machine
Authors: D.K. Srivastava and L. Bhambhu
Most existing supervised classification techniques depend on traditional statistics, which give ideal results only when the sample size tends to infinity [1], whereas in practice only a limited number of samples is available. In that paper, a novel learning technique is studied in which the support vectors, which are essential for classification, are obtained by learning from the training samples. The authors present comparative results using different kernel functions over all data samples, showing that the choice of kernel function and the best parameter values for a particular kernel are critical for a given amount of data.
2.2 Approximate Kernel k-means: Solution to Large Scale Kernel Clustering
Authors: R. Chitta, R. Jin, T.C. Havens and A.K. Jain
The explosion of digital data demands scalable tools to organize information in a meaningful and accessible form. Clustering groups similar data items, but kernel clustering algorithms scale poorly to the huge real-world datasets now common. The authors propose an efficient approximation of the kernel k-means algorithm that is suitable for large datasets. The key idea is to avoid computing the full kernel matrix by restricting the cluster centers to a small subspace spanned by a set of randomly sampled data points. They show theoretically and empirically that the proposed algorithm is (i) efficient in both computational complexity and memory requirements, and (ii) can yield clustering results similar to those of kernel k-means using the full kernel matrix. In future work they plan to analytically investigate the sample complexity of the proposed algorithm, i.e., the minimum number of samples needed to yield clustering results comparable to the full kernel k-means algorithm.
III. PROBLEM DEFINITION

3.1 PROBLEM STATEMENT
To make use of data analytics technique to accurately predict whether the extracted tumor is benign or malignant.
3.2 EXISTING SYSTEM
In image processing, image segmentation can be performed and the segmented image can be clustered by means of k-means clustering to identify the classes. k-means clustering is a vector-quantization method widely used for cluster identification in images and data. A k-nearest neighbor classifier is then used to assign new data to the existing classes using the cluster centers obtained by k-means; thus the trained classes are themselves identified through clustering. Classification is required for multi-class disease detection.
Disadvantages
It does not classify and predict the solutions for the detected diseases.
The existing method makes use of k-means segmentation algorithm whose running time is longer, so different segmentation algorithms need to be used.
Accuracy is very low, and multi-class disease detection is not handled.
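The k-means step of the existing pipeline described above can be sketched in plain Python. This is a minimal illustration with hypothetical 2-D points and fixed initial centers for reproducibility; production systems use optimized library implementations.

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            distances = [math.dist(p, c) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # New center = coordinate-wise mean of its cluster (skip empty clusters).
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            for cluster in clusters if cluster
        ]
    return centers, clusters

# Two well-separated groups of 2-D points (e.g., features of a segmented image).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)  # converged cluster centers, one per group
```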
3.3 PROPOSED SYSTEM
In this paper data analytics is used to predict and diagnose breast cancer. The Wisconsin dataset is used for training and testing the models. Different classification methods are applied, the accuracy of each algorithm is estimated, and the final prediction is made using the algorithm that produces the highest level of accuracy.
Advantages
The method can be enhanced so that it efficiently classifies diseases with different types of identification.
It can deal with all kinds of disease if the model is trained with the corresponding classes.
IV. SYSTEM ARCHITECTURE
Fig.5 System Architecture
The architecture depicts the entire system for classifying the data using different types of classifiers. The data is collected from the Wisconsin dataset, then preprocessed, stored, and features are extracted. The data is split into two parts: training data and test data. All classifier models are then trained on the training data, and accuracy is calculated on the test data.
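The split-train-score flow above can be sketched with scikit-learn (an assumption of this sketch; note that the `load_breast_cancer` data bundled with scikit-learn is the Wisconsin diagnostic variant rather than the Original dataset):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Wisconsin breast cancer data bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# Split into training and test data, as in the architecture.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train one classifier on the training split and score it on the test split.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```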
V. SYSTEM MODULES

5.1 Data Preprocessing
Data preprocessing is a data mining strategy that transforms raw data into a usable format. It is used to eliminate unwanted or noisy data from datasets.
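As one concrete preprocessing step: the Wisconsin Breast Cancer (Original) dataset records missing attribute values as '?'. A minimal sketch of mean imputation in plain Python, using hypothetical sample rows:

```python
# Toy rows in the style of the Wisconsin (Original) dataset, where a missing
# attribute value is recorded as '?' (values here are hypothetical).
rows = [
    [5, 1, 1],
    [5, '?', 2],
    [8, 10, '?'],
]

# Replace each '?' with the mean of the observed values in its column.
columns = list(zip(*rows))
cleaned_columns = []
for col in columns:
    observed = [v for v in col if v != '?']
    mean = sum(observed) / len(observed)
    cleaned_columns.append([mean if v == '?' else float(v) for v in col])
cleaned = [list(r) for r in zip(*cleaned_columns)]
print(cleaned)  # every '?' is now the column mean
```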
5.2 Data Security
It is used to protect data from unauthorized access. Data security also protects data from corruption.
5.3 Data Storage
It is used to store the data in a database.
5.4 Feature Extraction and Selection
Feature extraction is the process of defining a set of features (attributes such as position, shape, size, and texture) that more effectively represent the information relevant for analysis and classification. This phase is used to extract relevant features from the image dataset. Feature extraction is related to dimensionality reduction.
Feature selection is the process of selecting the relevant features. This phase is used to select relevant features from the image dataset.
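A minimal sketch of feature selection, assuming scikit-learn is available. `SelectKBest` keeps the k features whose ANOVA F-score against the class label is highest; the toy matrix below (hypothetical values) has two informative and two uninformative features.

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Toy feature matrix: 6 samples, 4 features; only the first two features
# actually separate the two classes (hypothetical values).
X = [[1, 10, 5, 7],
     [2, 11, 6, 5],
     [1, 10, 5, 6],
     [9, 30, 6, 5],
     [8, 31, 5, 7],
     [9, 30, 6, 6]]
y = [0, 0, 0, 1, 1, 1]

# Keep the k=2 features with the highest ANOVA F-score against the label.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)        # 6 samples, 2 retained features
print(selector.get_support())  # boolean mask of the kept features
```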
Fig.6 Feature Extraction and Selection

5.5 Classification
Different classifiers are used to make the correct decision in time. Different classifiers, such as CART models, are compared here. The goal of classification is to predict the exact outcome for every input.
5.5.1 Support Vector Machine
Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression problems. However, it is mostly used in classification problems.
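A minimal SVM sketch, assuming scikit-learn is available and using a hypothetical, linearly separable toy set. A linear-kernel SVM finds the maximum-margin hyperplane between the two classes:

```python
from sklearn.svm import SVC

# Linearly separable toy data: two clusters in the plane (hypothetical values).
X = [[0, 0], [0, 1], [1, 0], [4, 4], [4, 5], [5, 4]]
y = [0, 0, 0, 1, 1, 1]

# Fit a linear-kernel SVM and classify one point near each cluster.
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[0.5, 0.5], [4.5, 4.5]]))
```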
5.5.2 Decision Tree
A decision tree algorithm builds a tree-like structure in which every node represents a decision on a predefined variable. It works for both categorical and continuous input and output variables in classification problems. Trees must be sufficiently large to fit the training data, so that valid patterns are completely captured.
Advantages:
✓ very quick to train and evaluate
✓ easy to interpret

5.5.3 CART
The Classification and Regression Trees (CART) algorithm is a classification algorithm that builds a decision tree using Gini's impurity index as the splitting criterion. A CART model is represented as a binary tree, with each node split into two child nodes. CART can deal with both numeric and categorical variables and can easily deal with outliers.
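A minimal decision-tree/CART sketch on hypothetical one-feature data, assuming scikit-learn is available (scikit-learn's `DecisionTreeClassifier` implements an optimized CART variant: a binary tree grown by minimizing Gini impurity at each split):

```python
from sklearn.tree import DecisionTreeClassifier

# One-feature toy data, e.g. a normalized tumor measurement (hypothetical).
X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]  # 0 = benign, 1 = malignant

# Grow a binary tree with Gini impurity as the splitting criterion (CART).
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(tree.predict([[2.5], [10.5]]))  # one query on each side of the split
```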
5.5.4 Naive Bayes
The Naive Bayes algorithm assumes that each feature is independent of the other features, following Bayes' rule (Bayes' law). The main aim of the Naive Bayes algorithm is to calculate the conditional probability that an object with a given feature vector belongs to a particular class.
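A minimal Gaussian Naive Bayes sketch on hypothetical two-feature data, assuming scikit-learn is available. The model treats the features as conditionally independent given the class, per Bayes' rule:

```python
from sklearn.naive_bayes import GaussianNB

# Two hypothetical features per sample, two well-separated classes.
X = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
     [6.0, 7.0], [6.2, 6.8], [5.9, 7.1]]
y = [0, 0, 0, 1, 1, 1]

# Fit class-conditional Gaussians per feature, then classify via Bayes' rule.
nb = GaussianNB().fit(X, y)
print(nb.predict([[1.1, 2.0], [6.1, 7.0]]))     # hard class labels
print(nb.predict_proba([[1.1, 2.0]]))           # posterior probabilities
```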
5.5.5 K-Nearest Neighbors
k-NN compares test records with the training data; the stored training data determine the classification (Ko and Seo 2000). It is an instance-based learning algorithm that classifies objects based on the feature space built from the training set. The training data are represented in a multi-dimensional feature space, which is partitioned into regions according to the class of the training data. A data point in the feature space is assigned to a particular class if that class is the most frequent among its k nearest training points. The distance between data points is often calculated using the Euclidean distance measure.
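The majority-vote rule above can be sketched in plain Python with hypothetical labeled points and the Euclidean distance mentioned in the text:

```python
import math
from collections import Counter

def knn_classify(query, training_data, k=3):
    """Classify `query` by majority vote among its k nearest training points
    (Euclidean distance)."""
    neighbors = sorted(training_data, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training set: (feature vector, class label) pairs (hypothetical values).
training_data = [
    ((1, 1), "benign"), ((1, 2), "benign"), ((2, 1), "benign"),
    ((8, 8), "malignant"), ((8, 9), "malignant"), ((9, 8), "malignant"),
]
print(knn_classify((2, 2), training_data))  # the 3 nearest points are all benign
```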
5.5.6 Logistic Regression
Logistic regression can involve more than one independent variable. It is a statistical technique in which the outcome is measured with a dichotomous variable (one with only two possible results). Logistic regression analysis studies the relationship between a categorical dependent variable and a set of independent (explanatory) variables.
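A minimal logistic regression sketch on hypothetical data with one explanatory variable and a dichotomous 0/1 outcome, assuming scikit-learn is available:

```python
from sklearn.linear_model import LogisticRegression

# One explanatory variable versus a 0/1 outcome, e.g. a normalized
# cell-feature score versus a benign/malignant label (hypothetical values).
X = [[1], [2], [3], [7], [8], [9]]
y = [0, 0, 0, 1, 1, 1]

# Fit the logistic (sigmoid) model; predictions are hard 0/1 decisions.
model = LogisticRegression().fit(X, y)
print(model.predict([[2], [8]]))
print(model.predict_proba([[5]]))  # class probabilities near the boundary
```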
The name logistic regression is used when the dependent variable has only two values, for instance 0 and 1, or Yes and No. The accuracy of the different classifiers is shown in Table 1.
Classifier            Accuracy
Decision Tree         100%
SVM                   96%
KNN                   90%
Logistic Regression   89%
CART                  100%
Naive Bayes           95%
Table 1. Accuracy of various classifiers
The decision tree (CART) and the Random Forest classifier give a 100% accuracy result. The produced output is 0 or 1. The Random Forest and decision tree (CART) classifiers are therefore the best classifiers for producing accurate results for cancer patients.
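The comparison behind Table 1 can be sketched as below, assuming scikit-learn is available. Note that the bundled `load_breast_cancer` data is the Wisconsin diagnostic variant, and the exact scores depend on the random split, so this sketch will generally not reproduce the 100% figure reported in Table 1.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# One shared train/test split for a fair comparison.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

classifiers = {
    "Decision Tree (CART)": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Naive Bayes": GaussianNB(),
}

# Train each classifier on the same training split; score on the test split.
scores = {}
for name, clf in classifiers.items():
    scores[name] = clf.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name:22s} {scores[name]:.3f}")
```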
VI. SAMPLE OUTPUT

Fig.7 Output of cancer stage

VII. CONCLUSION
The principal aim of the paper is to propose a classification model using different classifiers (k-nearest neighbor, naive Bayes, support vector machine, random forest, and decision tree) on the collected patients' data. The patients' data consist of breast cancer parameters, and a decision tree classification model is trained and tested using data analytics. The proposed model is also used for early detection of breast cancer. The experimental results demonstrate that CART can be considered the best classifier, as its accuracy is significantly better than that of the other classifiers.
FUTURE ENHANCEMENT
The proposed system can currently be used only by trained professionals to diagnose breast cancer. As a future enhancement, a user interface will be provided so that the system can be used with minimal training.
REFERENCES
[1] Abbasnejad, M.E., Ramachandram, D. and Mandava, R., A survey of the state of the art in learning the kernel, Knowledge and Information Systems, Springer Berlin Heidelberg, Vol. 31, No. 2, pp. 193-221, 2012.
[2] Bellazzi, R. and Zupan, B., Predictive data mining in clinical medicine: current issues and guidelines, Int J Med Inform, Vol. 77, pp. 81-97, 2008. doi: 10.1016/j.ijmedinf.2006.11.006.
[3] Chaurasia, V. and Pal, S., Data Mining Techniques: To Predict and Resolve Breast Cancer Survivability, Vol. 3, No. 1, pp. 10-22, 2014.
[4] Chitta, R., Jin, R., Havens, T.C. and Jain, A.K., Approximate Kernel k-means: Solution to Large Scale Kernel Clustering, 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, USA, pp. 895-903, 2011.
[5] Ghada Ayeldeen and Kareem Kamal A. Ghany, Diagnosis of Breast Cancer using secured classifiers, 978-1-5386-0872-2, 2017.
[6] Murdoch, T.B. and Detsky, A.S., The inevitable application of big data to health care, JAMA, Vol. 309, pp. 1351-1352, 2013. doi: 10.1001/jama.2013.393.
[7] Ruotsalainen, P., Privacy and security in teleradiology, European Journal of Radiology, Elsevier, Vol. 73, pp. 31-35, 2010.
[8] Scruggs, S.B., Watson, K., Su, A.I., Hermjakob, H., Yates, J.R. 3rd, Lindsey, M.L. and Ping, P., Harnessing the heart of big data, Circ Res, Vol. 116, pp. 1115-1119, 2015. doi: 10.1161/CIRCRESAHA.115.306013.
[9] Shao, G.Z., Zhou, R.L. and Zhang, Q., Molecular cloning and characterization of LAPTM4B, a novel gene upregulated in hepatocellular carcinoma, Oncogene, Vol. 22, pp. 5060-5069, 2010.
[10] Srivastava, D. and Bhambhu, L., Data Classification Using Support Vector Machine, Journal of Theoretical and Applied Information Technology, Vol. 12, No. 1, pp. 1-7, 2010.
Authors Profile:

Mr. M. Vengateshwaran, M.E., Assistant Professor
Sri Sai Ram Institute of Technology (Autonomous), Chennai, India
Specialization: BigData, Machine Learning, IR, SNA

Ms. N. Valarmathi, M.Tech., Assistant Professor
M.Kumarasamy College of Engineering, Karur
Specialization: Data Analytics, Image Processing, TOC

V. Sharmila, M.E., Assistant Professor
RMD Engineering College, Thiruvallur, India
Specialization: TOC, Compiler Design, Machine Learning

Dr. P. Ezhumalai, Professor & HOD
RMD Engineering College, Thiruvallur, India
Specialization: Computer Networks, Multicore, TOC