
Academic year: 2022


A Unique and Integrated Approach about the Facts of Big Data Cloud Adoption using Data Mining

M.S. Minu1, Sanjesh Chevanan2, John Vivin Samuel3
SRM Institute of Science and Technology, Ramapuram Campus

1[email protected], 2[email protected], 3[email protected]

ABSTRACT

Journals, research articles, e-records and documents, social media, and similar sources are a rich source of data and play a valuable role in research and development projects. This paper presents an outline of text mining and its use for extracting data from the sources mentioned above. In our study, we used word clouds, term frequency analysis, similarity analysis, cluster analysis, and topic modelling to extract information from a multi-domain Amazon dataset. Cloud computing and big data are emerging technologies, so it is essential to extract important models and data from datasets in these areas and to discover the associations between them. We describe a tool and strategy to classify the censored data, statistically expand the words in the classified data, and label the hidden neutral words with their importance in context. Using computational linguistics tools, adapted to suit our methods, we examine the facts of cloud computing adoption and big data processing.

Keywords

Cloud Computing, Data Mining, Text Mining, Big Data processing

INTRODUCTION

The enormous and growing number of published studies, and their increasing rate of publication, makes the task of identifying relevant studies in an unbiased manner for inclusion in systematic reviews both complex and time-consuming. Text mining has been offered as a possible solution: by automating a portion of the screening cycle, reviewer time can be saved. A great deal of information is available to us in the form of e-documentation, but these sources also contain much information we do not need, which makes it laborious for a user to find what is relevant. These days, most of the data in business, industry, government and other institutions is stored in text form in databases, and these text databases contain semi-structured data. A document may contain largely unstructured text parts, such as the abstract, alongside a few structured fields such as the title, author names, publication date and category. Text mining is a variation on a field called data mining that attempts to discover interesting patterns in large databases. A great deal of work has been done on the modelling and implementation of semi-structured data in recent database research. Based on this research, data retrieval strategies, such as text indexing techniques, have been developed to deal with unstructured documents. To obtain all the information from published e-documentation, we need text mining to convert the e-documents into a processable and readable format. In this paper, we collect published papers from journals and find the topics relevant to our needs using text mining. The purpose of the project is to use an efficient approach to capture facts from texts for Big Data Cloud Adoption using Data Mining.


LITERATURE SURVEY

[1] Jyotiska Nath Khasnabish, Mohammad Firoj Mithani and Shrisha Rao in 2015 proposed T-BICA (Tier-Centric Business Impact and Cost Analysis), a tier-centric optimal asset allocation algorithm, to address the issue of fast provisioning of IT assets in current enterprise cloud environments. [2] Ling Liu, Zijiang Yang and Younes Benslimane in 2014 proposed a method for mining and archiving location data from users' phone applications and from other applications that share their location with the company. [3] Youjin Rong and Yi'an Liu in 2020 showed that a staged text clustering algorithm combining K-means and hierarchical agglomerative clustering performs better than the plain K-means algorithm. [4] Chetna Chand, Amit Thakkar and Amit Ganatra in 2012 presented a systematic survey of sequential pattern mining algorithms for finding sequential patterns in large databases; their paper also discusses the limitations and research challenges of sequential pattern mining. [5] D. K. Singh, Varsha Sharma and S. Sharma in 2012 proposed a new approach for mining web usage data by generating frequent web access patterns from web server logs; they found an innovative way of mining and showed how pattern identification tasks can be performed by capturing complex user browsing behaviour in a graph data structure, in order to obtain previously unknown information about users' access patterns. [6] Madini O. Alassafi, Rayad AlGhamdi, Abdulrahman Alharthi, Abdulwahid Al Abdulwahid and Sheikh Tahir Bakhsh in 2019 investigated the security factors associated with cloud computing that affect organizations' willingness to adopt cloud computing services. [7] Yousef A. M. Qasem, Rusli Abdullah, Yusmadi Yah Jusoh, Rodziah Atan and Shahla Asadi in 2019 proposed a coherent scientific categorization and an outline of the fundamental drivers of, and barriers to, adopting cloud computing in higher education institutions (HEIs), of existing individual and organizational models for understanding the future prerequisites for broadly adopting and utilizing cloud computing in HEIs, and of the factors that influence the adoption of cloud computing in HEIs at individual and corporate levels. [8] Victor Chang and Muthu Ramachandran in 2015 developed the Cloud Computing Adoption Framework (CCAF), which has been customized for securing cloud information; their work explains the rationale and the components in the CCAF that ensure information security. [9] M S Minu Sanjudharan et al. published a paper that uses data analysis to select the top fields in a collected dataset; it presents many comparison graphs showing the large role data analysis plays in all aspects of everyday life. We likewise use data analysis to develop a concept for extracting the important computer-science-related topics in the dataset we have chosen.

EXISTING SYSTEM

The existing system uses multiple techniques to extract data from papers published by IEEE, Springer, ACM, Elsevier and Wiley. These techniques include term frequency analysis, similarity analysis, cluster analysis and topic modelling (performed using LDA). The process starts by gathering all the data needed for information extraction; a classification technique is then used to categorize the information as related or non-related, and the resulting information is released to the user.

PROBLEM STATEMENT

The problem addressed by this paper is that there are a great many products on Amazon and other e-commerce websites, and we need to know how customers feel about a given product, that is, whether a review is positive or negative, based on the review message. To address this, we build a neural network, with the help of some NLP models, to generate a recurrent neural network model. But first we need to find the important fields in the dataset in order to visualize and extract useful data for processing. In this paper, we obtain the dataset from the Kaggle dataset library, which is available for free, and determine the feedback from the product reviews.

PROPOSED SYSTEM

In the proposed system, we use a different dataset, taken from the kaggle.com website, which gives the rating, date, variation, verified_review and feedback for a set of products over the period 16-05-2018 to 31-07-2018. We use a recurrent neural network known as Long Short-Term Memory (LSTM) to build a model that predicts, from the verified_review, whether the feedback is positive or negative. Since verified_review is text data, we need specific text-processing techniques to convert the text into processable numerical values that can be fed into the model for computation.

The library that we use for the NLP is NLTK, where we use word_tokenize to split the reviews into tokens, which are then mapped to numerical values.
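A minimal sketch of this tokenize-then-index step is shown below. It uses a simple regex tokenizer as a stand-in for NLTK's word_tokenize, and the sample reviews and vocabulary are purely illustrative:

```python
import re

def tokenize(text):
    # Lowercase and extract word-like tokens; a stand-in for nltk.word_tokenize.
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocab(reviews):
    # Map each distinct token to a positive integer id; 0 is reserved for padding.
    vocab = {}
    for review in reviews:
        for token in tokenize(review):
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab

def encode(review, vocab):
    # Convert a review into the integer sequence the model consumes,
    # dropping tokens that are not in the vocabulary.
    return [vocab[t] for t in tokenize(review) if t in vocab]

reviews = ["Love my Echo!", "Sound quality is poor"]
vocab = build_vocab(reviews)
print(encode("love the sound", vocab))  # "the" is out of vocabulary
```

In a real pipeline, the vocabulary would be built over the full training split so that test reviews are encoded with the same indices.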

Properties        | Base Paper                                                  | Proposed Paper
Algorithm         | Clustering algorithm                                        | LSTM neural network
Dataset           | Published papers from journals such as IEEE, Springer, ACM  | Amazon dataset from Kaggle
Accuracy          | 36%                                                         | 90%
Flexibility       | Not flexible with legacy systems for deployment             | Highly flexible with legacy systems
Efficiency        | Not efficient for large-scale data                          | Efficient with large-scale data
Computation speed | Low computation speed                                       | High computation speed (cluster computing)
Compatibility     | Not compatible with global clustering                       | Compatible with global clustering
Cost              | High cost for storage                                       | Low cost for storage


ARCHITECTURE DIAGRAM

SYSTEM DESIGN

In this system, we use Google's TensorFlow deep learning library with GPU acceleration to increase training performance. For prolonged training sessions, we use Google Colab, training with a simple single-core configuration. The advantage of a GPU over a CPU is that it has many more processing cores. For example, we use a GTX 1050 Ti to train the model, which has 768 CUDA cores but only a 1.4 GHz clock speed, whereas an i5-8500H is a quad-core processor with a 2.4 GHz clock speed. We use each for a separate purpose. For training the model, we use Google's TensorFlow Docker image on an ubuntu:20.04 operating system. For testing, we use CPU power, because the model needs to run on any system.
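A minimal sketch of this training environment setup is shown below. It assumes Docker with the NVIDIA container toolkit is installed; the image tag and the train.py script name are illustrative, not from the paper:

```shell
# Pull a GPU-enabled TensorFlow image (tag is illustrative).
docker pull tensorflow/tensorflow:latest-gpu

# Run an interactive container with all host GPUs visible, mounting the
# current directory so training scripts and data are available inside it.
docker run --gpus all -it --rm -v "$PWD":/workspace -w /workspace \
    tensorflow/tensorflow:latest-gpu python train.py
```

For CPU-only testing, the same command can be run with the non-GPU image and without the --gpus flag.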

MODULES

Data Extraction

The data extraction module is used to obtain the review data, originally collected from the Amazon website, so that it can be passed on for analysis.


Figure 1. Distribution of product reviews

Exploratory Data Analysis

The exploratory data analysis consists of multiple text mining processes: similarity analysis, term frequency analysis, word clouds, clustering, and the term frequency matrix. In the similarity analysis, we consider all the terms repeated across the papers. Let the set of papers be A = {a1, a2, a3, ..., an} and the set of repeated terms be T = {t1, t2, t3, ..., tn}. Each paper a is then represented by the vector ta = {tf(a, t1), ..., tf(a, tn)}, where tf(a, t) is the frequency of term t in paper a, with a ∈ A and t ∈ T, and similarity is computed between these vectors. In the LSTM RNN, we use text data as the input and a probability calculation as the output.
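The term-frequency vectors above can be compared with cosine similarity. A minimal sketch, assuming a simple regex tokenizer and illustrative example documents:

```python
import math
import re
from collections import Counter

def term_frequencies(text):
    # tf(a, t): count of each term t in document a.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(doc_a, doc_b):
    # Cosine of the angle between the two term-frequency vectors;
    # 1.0 means identical term distributions, 0.0 means no shared terms.
    tf_a, tf_b = term_frequencies(doc_a), term_frequencies(doc_b)
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("cloud data mining", "cloud data storage"))
```

Here the two documents share two of three terms, so the similarity is 2/3.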

Pre-processing

In the pre-processing phase, we use NLP to extract the important topics. Since the raw data is unstructured, NLP is used to extract the important topics and their information. The pre-processing module's major role in the text mining is to extract all the relevant labels from the data we have collected. The processed labels are then fed into the RNN unit, and the output of the neural network is the probability of positive or negative feedback.
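Before the encoded reviews can be fed to the RNN, they must be brought to a fixed length. A minimal sketch of this padding step (the padding length and token ids are illustrative; real pipelines typically use a library utility such as Keras's pad_sequences):

```python
def pad_sequences(sequences, maxlen, pad_value=0):
    # Truncate long sequences and left-pad short ones with pad_value,
    # so every review becomes a fixed-length vector for the LSTM.
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # keep the last maxlen tokens
        padded.append([pad_value] * (maxlen - len(seq)) + seq)
    return padded

encoded_reviews = [[4, 7, 2], [9, 1, 5, 3, 8]]
print(pad_sequences(encoded_reviews, maxlen=4))
```

Left-padding is a common choice for LSTMs so that the informative tokens sit closest to the final time step.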


Figure 2. Distribution of ratings for different variants

Prediction

The prediction is done using the LSTM RNN, which models the relationship between the input and the output in the dataset. Here we predict the feedback from the verified_review.
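To illustrate how an LSTM cell carries its previous calculation forward, here is a minimal single-step sketch in pure Python. The weights are illustrative scalars for a one-dimensional hidden state, shared across all four gates for brevity; they are not trained values from the paper's model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w=0.5, u=0.5, b=0.0):
    # One LSTM time step with scalar state. Real LSTMs use separate
    # weight matrices per gate; here one (w, u, b) triple is shared.
    f = sigmoid(w * x + u * h_prev + b)   # forget gate
    i = sigmoid(w * x + u * h_prev + b)   # input gate
    o = sigmoid(w * x + u * h_prev + b)   # output gate
    g = math.tanh(w * x + u * h_prev + b) # candidate cell state
    c = f * c_prev + i * g                # cell state carries long-term memory
    h = o * math.tanh(c)                  # hidden state fed back to the next step
    return h, c

# Feed a short input sequence through the cell; h and c carry context forward.
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c)
print(round(h, 4))
```

The final h would then go through a dense layer to produce the positive/negative probability.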

EXTRACTION AND ANALYSIS OF DATA

The first stage of the process is collecting the Amazon data from Kaggle. The raw data contains rating, date, variation, verified_review and feedback as the features of the dataset. The first step in machine learning is to identify which features and which label are needed, so we compare columns and find the factors that relate to the feedback on the products. We compare different column pairs, such as date and rating, rating and variation, and date and variation.


Figure 3. Rating vs Count

From the data comparison, we understand that the most important feature is verified_review and the label is feedback. We train the recurrent neural network using these two as the input and output respectively. We use a specific type of neural network known as the LSTM (Long Short-Term Memory), which has the capacity to store the previous calculation and feed it back to the neuron. We use word tokenization to create a numerical value for each word in verified_review. The data is then purified into a more filtered and streamlined form using regular expressions and other programming utilities. The sequences are then introduced into the neural network model to calculate the output: a dense layer that gives the probability of each class. We then take the maximum value to place the review into one of the two feedback categories, positive or negative. The trained model has an accuracy of 90 percent.
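A minimal sketch of the last two steps described above, the regex clean-up and turning the dense layer's class probabilities into a feedback label (the regex pattern and the probability values are illustrative):

```python
import re

def clean_review(text):
    # Strip everything except letters, digits, apostrophes and spaces,
    # then collapse repeated whitespace and lowercase the result.
    text = re.sub(r"[^a-zA-Z0-9' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def to_feedback(probabilities):
    # Pick the class with the highest probability from the dense layer output.
    labels = ["negative", "positive"]
    return labels[probabilities.index(max(probabilities))]

print(clean_review("Great speaker!!! Works well :)"))
print(to_feedback([0.12, 0.88]))
```

The argmax over the two output probabilities is what maps the model's continuous output to the discrete positive/negative feedback label.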

CONCLUSION

This paper centres on the use of text mining methods for data extraction in the domain of cloud computing and its application to big data processing. It identifies key factors for efficient adoption and the obstacles to adoption. Assorted text mining strategies are applied, such as term frequency analysis, similarity analysis, cluster analysis and topic modelling (LDA). Using term frequency analysis, we found the high-frequency words in the literature and the words connecting the two categories of articles. Similarity analysis by class shows that these articles are not similar overall but are interlinked in meaning within the context of their areas. The cluster analysis method shows that closely related records aggregate in one cluster, which implies that these articles examine a similar topic. The topic modelling procedure groups papers into logically related topics.

REFERENCES

[1] Jyotiska Nath Khasnabish, Mohammad Firoj Mithani, Shrisha Rao: Tier-Centric Resource Allocation in Multi-Tier Cloud Systems.

[2] Ling Liu, Zijiang Yang, Younes Benslimane: Using Data Mining Techniques to Improve Location Based Services.

[3] Youjin Rong, Yi'an Liu: Staged text clustering algorithm based on K-means and hierarchical agglomeration clustering.

[4] Chetna Chand, Amit Thakkar, Amit Ganatra: Sequential Pattern Mining: Survey and Current Research Challenges.

[5] D. K. Singh, Varsha Sharma, S. Sharma: Graph based Approach for Mining Frequent Sequential Access Patterns of Web pages.

[6] Madini O. Alassafi, Rayad AlGhamdi, Abdulrahman Alharthi, Abdulwahid Al Abdulwahid, Sheikh Tahir Bakhsh: Determining Factors pertaining to Cloud Security Adoption Framework in Government Organisations: An Exploratory Study.

[7] Yousef A. M. Qasem, Rusli Abdullah, Yusmadi Yah Jusoh, Rodziah Atan, Shahla Asadi: Cloud Computing Adoption in Higher Education Institutions: A Systematic Review.

[8] Victor Chang, Muthu Ramachandran: Towards achieving Data Security with the Cloud Computing Adoption Framework.

[9] M S Minu Sanjudharan et al.: Data Analyzer Using the Concept of Machine Learning.

[10] Sukanya, M., Biruntha, S.: Techniques on text mining. In: 2012 IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT), pp. 269–271. IEEE (2012).

[11] Salloum, S.A., Al-Emran, M., Shaalan, K.: A Survey of lexical functional grammar in the Arabic context. Int. J. Com. Net. Tech. 4(3) (2016).

[12] Pazienza, M.T. (Ed.): Information extraction: Towards scalable, adaptable systems. Springer (2003).

[13] Hongming Cai, Boyi Xu, Lihong Jiang, Athanasios V. Vasilakos: IoT-Based Big Data Storage Systems in Cloud Computing: Perspectives and Challenges, 2016.

[14] Byungseok Kang, Daecheon Kim, Hyunseung Choo: Internet of Everything: A Large-Scale Autonomic IoT Gateway, 2017.

[15] Anne H. Ngu, Mario Gutierrez, Vangelis Metsis, Surya Nepal, Quan Z. Sheng: IoT Middleware: A Survey on Issues and Enabling Technologies, 2016.

[16] Gaikwad, S.V., Chaugule, A., Patil, P.: Text mining methods and techniques. Int. J.
