Academic year: 2022


Intrusion Detection and Anomalies in Sensory Data Using Random Forest Algorithm

Susan Zarifi 1*, Reza Azmi 2

1 Department of Information Technology Engineering, Faculty of Computer Engineering, University of Alzahra, Tehran, Iran. [email protected].

2 Department of Computer Engineering, University of Alzahra, Tehran, Iran. [email protected].

ABSTRACT

Security in computer networks is of great importance today. Industrial networks, however, differ from conventional networks in their characteristics and requirements, and these differences change both the priorities and the methods used to secure them. SCADA systems are an integral part of large industrial systems; they collect information and control equipment. Because there is no unified definition of security for industrial systems, guaranteeing absolute security is not a realistic goal, but security assessment strategies make it possible to identify the vulnerabilities of industrial control systems and to reduce or eliminate them as far as possible. This study aimed to detect anomalies in SCADA systems. Because real anomalous data from these networks were not available, artificial data were created by implementing a distribution-based artificial anomaly generation algorithm in the Java programming language. Anomaly detection was then performed with the ARIMA model in R, and with the random forest algorithm both in R and in the Scala programming language on a big data platform.

Keywords

SCADA Systems, Intrusion detection, Random forest algorithm, Spark.

Introduction

Intrusion detection systems are security tools, similar to antivirus software, firewalls, and access control schemes, designed to enhance the security of information and communication systems. They use a variety of methods to detect malicious activity, and each method has its weaknesses, such as difficulty in detecting unknown attacks, the need for large database updates, and a high false-positive rate. When SCADA systems were first created, their main purpose was to ensure that network equipment performs its intended function under all circumstances and can be controlled remotely. Security was therefore not taken seriously at first; as time went on and new risks arose, it emerged as a major concern. In 2009, researchers at the University of California, Berkeley, introduced Spark, a new model for big data processing that performs as much of the computation as possible in memory, within the limits of the available memory capacity. This is a major advance that works particularly well for iterative algorithms.

Problem statement

Published reports of cyber-attacks on SCADA systems indicate that such attacks can occur at any time in industrial control systems, such as those of the oil and gas industry, so intrusions into web-based systems and industrial control communication networks must be prevented. The goal is to detect intrusions and anomalies in SCADA network data before damage occurs. Intrusion and anomaly detection was performed on SCADA system data using the ARIMA time series model and the random forest algorithm. Previous research indicates that the random forest algorithm performs well on very large data volumes. This study therefore generated anomalous SCADA data with a distribution-based artificial anomaly generation algorithm and then performed anomaly detection on the SCADA data using the random forest algorithm on a big data platform.

* Corresponding author

E-mail address: [email protected]


Research background

Research on the security of SCADA industrial control systems has a short history, but it has already produced numerous articles, books, and guidelines. Krutz [1] examined the risks, security problems, industrial protocols, and vulnerabilities of these systems and then introduced security solutions to address those risks. Security issues in industrial control systems, such as the penetration of these systems and the exploitation of their vulnerabilities to gain access and control, have also been studied [2]. Johnson [3] first identified the security problems and risks of SCADA systems at five levels (management, coordination, monitoring, control, and physical networking), then examined the possible cyber-attacks on each layer separately, and finally introduced security solutions. Sayegh et al. [4] investigated internal cyber-attacks on SCADA systems. Their results show that SCADA protocols are highly vulnerable and that attacks such as denial of service (DoS) can be mounted against these protocols on SCADA networks.

Haji et al. [5] focus on how to assess security risks and provide practical measures and strategies that minimize these risks and protect critical infrastructure against cyber-attacks. Taken together, these studies demonstrate the importance of cybersecurity for critical industrial control systems and SCADA infrastructures, such as oil and gas facilities and storage tanks, which are exposed to cyber-attacks by terrorists and malicious insiders.

1.1. Research Method

The raw data were obtained from Khorasan Razavi Gas Company. In the first step, because real anomalous data were not available, an artificial anomaly generation algorithm was implemented to construct synthetic anomalous data for the final analysis.

The goal was to detect intrusions and anomalies using the ARIMA time series model and the random forest algorithm. The analysis predicts the extent to which SCADA data over a given period are subject to intrusion and anomaly. Finally, the error rates of the two algorithms are compared.

Figure 1. How the main project process works

To conduct this project, a distribution-based artificial anomaly generation algorithm was implemented, and the data it produced were stored. The hardware for the big data system was prepared and deployed as a cluster, and Apache Spark was installed on it. Intrusion detection was then performed with the ARIMA time series model and the random forest algorithm, and useful information was extracted in the analysis.

In summary, raw data were obtained from Khorasan Razavi Gas Company. Because anomalous data were not available from SCADA networks, a distribution-based artificial anomaly generation algorithm was used to generate them. The data were then tested with the ARIMA model in RStudio.

Next, the random forest algorithm was applied. It was chosen because, according to previous studies, it is unexcelled in accuracy among current algorithms and is applicable to very large data sets.

In the first stage, Java was installed on one physical machine and three virtual machines, because Spark runs on the Java Virtual Machine. Spark was then installed on all the machines, and the settings needed for communication between them were configured. One machine was manually designated as the driver and the other three nodes as workers. Spark ran sets of processes concurrently on the cluster, coordinated by the SparkContext object, which allocates system resources to the processes [6]. The SparkContext object is defined in the main program, that is, the driver program.

Selecting RStudio software and R programming language

R is a programming language and software environment for statistical computing and data analysis, based on the S and Scheme languages. It is open-source software, released under the GNU General Public License and available free of charge. It is also a powerful tool for creating graphics and charts. R provides a command-line environment for entering and executing commands and can be installed on most platforms. A large number of packages (more than 2000) covering various statistical fields can be installed on it, which gives this seemingly small piece of software great power.

RStudio is an integrated development environment (IDE) for R, which combines a GUI with powerful programming tools to help make the most of R.

Spark Installation and Configuration

Spark can be installed on Windows and Linux, although Linux is recommended. For this project, it was installed on CentOS virtual machines. Because Spark uses the Java Virtual Machine to execute commands and programs, installing Java was the first step. The Scala programming language package was then installed and the PATH variable was set for Scala. The following figure shows the Spark-Shell.

Figure 2. Getting started with Spark-Shell

Pre-processing data

In this section, raw data were obtained from gas well No. 66 in the Khangiran area. The total raw data available for this well was 82 rows. An example of the measured properties of well No. 66 is given in Table 1.

Table 1. An example of the attributes and data received.

Date              Pressure/depth (psi/ft)  Temperature (degF)  Pressure (psi)  Depth (ft)
October 14, 2012  0.0915                   111.55              3544.406        0
October 14, 2012  0.089                    126.28              3818.781        3000
October 14, 2012  0.0883                   168.97              4085.906        6000
October 14, 2012  0.1003                   224.7               4350.716        9000
October 14, 2012  0.1363                   229.07              4370.774        9200
October 14, 2012  0.3748                   234.04              4398.028        9400
October 14, 2012  0.4293                   237.91              4472.992        9600
October 14, 2012  0.4296                   241.29              4558.857        9800
October 14, 2012  0.4632                   244.41              4644.772        10000
October 14, 2012  0.1073                   246.92              4667.932        10050

In the first step, synthetic anomalous data were generated from the above table data using the implemented algorithm.

A total of 223 rows of anomalous data were generated. An example of the synthetic anomalous data is presented in Table 2.

Table 2. An example of synthetic anomalous data generated from the original data

Date               Pressure/depth (psi/ft)  Temperature (degF)  Pressure (psi)  Depth (ft)
October 14, 2012   0.0915                   111.55              3544.406        4000
December 7, 2013   0.0435                   226.93              2314.997        10050
December 7, 2013   0.0495                   54.17               1895.233        9750
November 11, 2015  0.0506                   132.7               2576.158        10125
October 14, 2012   0.0883                   168.97              4085.906        10150
November 11, 2015  0.0506                   132.7               2576.158        4000
August 12, 2015    0.05                     127.69              2356.84         9500
July 6, 2015       0.056                    104.18              3252.67         6000
October 14, 2012   0.1003                   224.7               4350.716        10000
October 14, 2012   0.0915                   111.55              3544.406        2000
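The paper does not list its Java implementation of the distribution-based generator. As a rough illustration only, the Scala sketch below follows one common formulation of the idea: for every feature, a value that occurs less often than the most frequent value is topped up by copying a record that holds it and swapping that value for a different observed one. The names `AnomalySketch`, `Record`, and `generate` are hypothetical, records are simplified to vectors of strings, and the input is assumed to be non-empty.

```scala
import scala.util.Random

object AnomalySketch {
  // Hypothetical representation: one record = one value per feature.
  type Record = Vector[String]

  /** Builds artificial anomalies by swapping single feature values.
    * A value seen c times, where the most frequent value of that feature
    * is seen maxCount times, produces (maxCount - c) candidate anomalies. */
  def generate(data: Seq[Record], rng: Random): Seq[Record] = {
    val numFeatures = data.head.length
    val normal = data.toSet
    val candidates = for {
      f <- 0 until numFeatures
      counts = data.groupBy(_(f)).map { case (v, rs) => v -> rs.size }
      maxCount = counts.values.max
      (v, c) <- counts.toSeq
      others = counts.keys.filter(_ != v).toVector
      if others.nonEmpty
      _ <- 0 until (maxCount - c)
    } yield {
      // Pick a random record holding v and replace v with another observed value.
      val holders = data.filter(_(f) == v)
      val d = holders(rng.nextInt(holders.length))
      d.updated(f, others(rng.nextInt(others.length)))
    }
    // Drop candidates that coincide with a real (normal) record.
    candidates.filterNot(normal.contains)
  }
}
```

Because each anomaly differs from a real record in a single field, the output resembles rows that combine measured values with a depth or date taken from another record.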

These data were then converted from the initial CSV format to the SVM (LIBSVM) format so that they could be processed by the random forest algorithm implemented in Scala. The conversion was done with Scala code written for this purpose (csvToSVM). The pseudo-code of the implemented algorithm is given below.

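Algorithm 1: Convert csvToSVM

Input: csvFile

Output: svmFile

Method:

// Turn one CSV record "label,v1,v2,..." into an SVM line "label 1:v1 2:v2 ...".
def convertToSVM(str: String): String = {
  val tmp = str.split(",")
  val stringBuilder = new StringBuilder
  stringBuilder.append(tmp(0)) // label = first CSV field (the listing appended str(0), the first character only)
  for (a <- 1 until tmp.length)
    stringBuilder.append(" " + a + ":" + tmp(a))
  stringBuilder.append("\n")
  stringBuilder.toString()
}

// Rewrite a date of the form a/b/c; getMonth is a helper that the paper does not list.
def getDate(date: String): String = {
  val tmp = date.split("/")
  tmp(0) + getMonth(tmp(1)) + getMonth(tmp(1))
}

// Replace the value after the last ":" with its rewritten date.
// (The function name convertDate is chosen here; the paper gives only the body.)
def convertDate(oldString: String): String = {
  val idx = oldString.lastIndexOf(":")
  oldString.substring(0, idx + 1) + getDate(oldString.substring(idx + 1)) + "\n"
}

// Convert every line of the CSV file and write it to the SVM file.
csvFile.foreach(s => svmFile.write(convertToSVM(s)))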
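To make the target format concrete, here is a minimal, self-contained sketch of the CSV-to-SVM conversion for a single line. It assumes that the first CSV field is the class label and that all remaining fields are numeric features; the object name `CsvToLibSvm` and the label value used in the example are hypothetical.

```scala
object CsvToLibSvm {
  /** Converts "label,v1,v2,..." into "label 1:v1 2:v2 ...".
    * Feature indices start at 1, as the LIBSVM text format expects. */
  def convert(line: String): String = {
    val fields = line.split(",").map(_.trim)
    val features = fields.tail.zipWithIndex.map { case (v, i) => s"${i + 1}:$v" }
    (fields.head +: features).mkString(" ")
  }
}
```

For example, `CsvToLibSvm.convert("1,0.0915,111.55,3544.406,0")` yields `1 1:0.0915 2:111.55 3:3544.406 4:0`.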

1.2. Random forest algorithm

The random forest algorithm is one of the most efficient techniques available for intrusion detection systems [7]. Random forests can also address the problem of anomalies directly. Trees are constructed from samples of the data, and each sample is assigned to a class. The classes can be normal data versus anomalous data, and the method is unexcelled in accuracy [8]. Different versions of the random forest algorithm offer feature selection options based on the importance of each attribute for separating normal from abnormal data. This matters for large datasets with many features, some of which are irrelevant. In the random forest algorithm, the training dataset is used offline to build the detection model, which is then used to detect anomalies in the online system.

An important property of random forests is that, with a fixed number of trees and a fixed set of features, the computation time and memory requirements of the offline processing phase can be kept almost constant regardless of the number of samples and features in the training dataset. Then, in real time, patterns are looked up in the fixed number of trees built during training. As a result, the time complexity is linear in the number of samples in the online dataset. Another notable feature of random forests is the built-in technique for calculating the importance of the attributes in the input dataset. This technique allows the number of analyzed features to be reduced, which can substantially lower a computational cost that would otherwise be too high for a dataset with irrelevant features or features that add no useful information for model building [7].

Random forest is an ensemble classifier made up of many decision trees. The forest is built by bagging.

Given the independent variable x, a random forest classifier is an ensemble model that combines k decision tree classifiers. Each decision tree votes for one of the classes, and the class with the most votes wins. The procedure is first to draw k samples at random from the original training dataset using bagging, and then to grow one tree on each of the k training samples, which yields k classification results. Finally, the k classifiers decide the final classification by majority vote [9].

A random forest grows many classification trees. To classify a new object, its input vector is passed down each tree in the forest. Each tree gives a classification, and the tree is said to "vote" for that class. The forest selects the class that has the most votes among all the trees in the forest.
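The voting step can be sketched in a few lines of Scala. This is an illustration under simplifying assumptions rather than the paper's implementation: each tree is modelled as a plain function from a feature vector to a class label, and `MajorityVote`, `Tree`, and `classify` are hypothetical names. Ties are broken arbitrarily.

```scala
object MajorityVote {
  // Hypothetical stand-in for a trained decision tree.
  type Tree = Vector[Double] => Int

  /** Returns the class that receives the most votes from the trees. */
  def classify(trees: Seq[Tree], x: Vector[Double]): Int =
    trees.map(tree => tree(x))   // one vote per tree
      .groupBy(identity)         // class -> list of votes for it
      .maxBy(_._2.size)          // class with the most votes
      ._1
}
```

With three stub trees, of which two label a record as anomalous (1) and one as normal (0), `classify` returns 1.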

Each tree was formed as follows:

1. If N is the number of cases in the training dataset, then N cases are sampled at random, with replacement, from the original data. This sample is the training set for growing the tree.

2. If there are M input variables, a number m smaller than M is specified so that at each node m variables are selected at random from the M, and the best split on these m variables is used to split the node. The value of m is held constant while the forest is grown.

3. Each tree is grown as large as possible. There is no pruning.
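The two sources of randomness in the steps above, the per-tree sample and the per-node choice of variables, can be sketched as follows. This is a minimal illustration, not the paper's code; `BaggingSketch`, `bootstrap`, and `featureSubset` are hypothetical names, and growing the trees themselves is left out.

```scala
import scala.util.Random

object BaggingSketch {
  /** Step 1: draw N cases with replacement from N training cases. */
  def bootstrap[A](data: IndexedSeq[A], rng: Random): IndexedSeq[A] =
    IndexedSeq.fill(data.length)(data(rng.nextInt(data.length)))

  /** Step 2: choose m distinct variable indices out of M for one node. */
  def featureSubset(totalFeatures: Int, perNode: Int, rng: Random): Seq[Int] =
    rng.shuffle((0 until totalFeatures).toList).take(perNode).sorted
}
```

Because sampling is done with replacement, a bootstrap sample usually contains some cases more than once and omits others, which is what makes the trees differ from one another.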

The error rate of the forest depends on two things:

• The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate.

• The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of the individual trees decreases the forest error rate.

Decreasing m decreases both the correlation and the strength, and increasing m increases both.
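The dependence on correlation and strength is often summarized by an upper bound on the generalization error that is standard in the random forest literature. It is not stated in this paper and is given here only as background; the symbols are assumptions of this note: \(PE^{*}\) is the generalization error of the forest, \(\bar{\rho}\) the mean correlation between trees, and \(s\) the strength of the individual trees.

```latex
PE^{*} \;\le\; \frac{\bar{\rho}\,\bigl(1 - s^{2}\bigr)}{s^{2}}
```

The bound shrinks when \(\bar{\rho}\) falls or \(s\) rises, so the choice of m is a trade-off: it has to keep the correlation low without weakening the individual trees too much.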

In operation, each tree produces its own prediction, and the final result combines the predictions of all the trees.

Figure 3. Random forest performance

How to run the random forest algorithm

The algorithm was run as follows. The data were first pre-processed. A file was then prepared that contained all the computed attributes, with a final column stating whether each record was anomalous. These data were divided into a training set and a test set in a 70:30 ratio. The training 70% included the label column (anomalous or not), from which the algorithm learned. Then, at the evaluation stage, the model was tested on the remaining 30% of the data and its accuracy was obtained.
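The split-and-evaluate procedure can be sketched without Spark as follows. This is an illustration under stated assumptions: `SplitAndScore`, `Labeled`, `split`, and `errorRate` are hypothetical names, the model is any function from features to a label, and the split here produces exact 70:30 sizes, whereas a sampling-based split on a cluster gives only approximate proportions.

```scala
import scala.util.Random

object SplitAndScore {
  // Hypothetical record type: label 1 = anomalous, 0 = normal.
  final case class Labeled(label: Int, features: Vector[Double])

  /** Shuffles with a fixed seed and cuts the data at trainFraction. */
  def split(data: Seq[Labeled], trainFraction: Double, seed: Long): (Seq[Labeled], Seq[Labeled]) = {
    val shuffled = new Random(seed).shuffle(data.toList)
    shuffled.splitAt(math.round(trainFraction * data.size).toInt)
  }

  /** Share of test points whose predicted label differs from the true label. */
  def errorRate(test: Seq[Labeled], predict: Vector[Double] => Int): Double =
    if (test.isEmpty) 0.0
    else test.count(p => predict(p.features) != p.label).toDouble / test.size
}
```

Fixing the seed makes a single run reproducible; repeating the experiment with different seeds and averaging the error rates corresponds to running it several times and reporting the mean.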

Figure 4. Stages of random forest algorithm implementation and evaluation of forecasting results.

The random forest algorithm was implemented with the Scala programming language to be able to run on Spark Shell.

The pseudo-code of the implemented algorithm is given below.

Algorithm 2: RandomForest

Input: data loaded with MLUtils.loadLibSVMFile

Output: Error

Method:

// Split the data 70:30 into training and test sets.
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Tree settings. The value of numTrees and the training call that
// builds `model` are not listed in the paper.
val treeStrategy = Strategy.defaultStrategy("Regression")
val numTrees = ...

// Evaluate the model on the test instances and compute the test error:
// one minus the share of points whose prediction equals the label.
val testErr = 1 - testData.map { point =>
  val prediction = model.predict(point.features)
  println("Point : " + point.toString() + " Prediction : " + prediction)
  if (point.label == prediction) 1.0 else 0.0
}.mean()

// Return the error.
testErr

Results

The anomalous data were first processed with the random forest algorithm in RStudio. The data included five variables. With 500 trees, the random forest in R had an error rate of 0.37, which is much lower than that of the ARIMA model.

The following figure shows the output graph of this algorithm in terms of error and the number of trees created.

Figure 5. Random forest algorithm output diagram in terms of error and number of trees.

The diagram shows that, for the anomalous data with five variables, the error rate remained constant beyond approximately 150 trees. The number of trees that gives the lowest error can vary with the number of variables and the amount of anomalous data.

1.3. Output results of random forest algorithm in spark

After implementing the random forest algorithm with the Scala programming language, the anomalous data were tested with the algorithm implemented in the Spark platform. The output results are shown in Figure 6.


Figure 6. Random forest algorithm output executed on Spark by error and tree number.

Because both the random forest algorithm and the Spark big data platform involve randomness, the experiment was repeated dozens of times, and the mean results are plotted as three graphs in the figure above.

Although the data volume here is not in the terabyte range (the usual scale of big data), it was observed on this smaller scale that the random forest algorithm implemented in Scala and run on Spark has a much lower error rate than both the ARIMA model and the random forest in R.

The implemented random forest algorithm was then examined in terms of time overhead. The diagram below shows the run time of the random forest algorithm in Spark.

[Figure 6 plot, titled "Random Forest": error rate (0 to 0.25) against number of trees (0 to 600); three series, each labelled "Error rate".]

Figure 7. Run time diagram of the random forest algorithm with increasing data size

As the graph above shows, the run time increases linearly with data size.

1.4. Validation

The error rate was used as the criterion to evaluate and validate the method; it shows that the random forest algorithm performs better than the ARIMA time series model. To further evaluate the quality of the proposed method, the modbusRTU_DoSResponseInjectionV2 dataset from the Vignesh2208 2016 project was used. The traffic in this dataset serves as anomalous data produced by an injection attack; the dataset contains 28,944 rows with 27 attributes. The random forest algorithm was tested on it.

The results are shown in Figure 8.

[Figure 7 plot, titled "Time to run a random forest algorithm in Spark": run time (s), 0 to 3, against data size (MB), 0 to 18.]

Figure 8. The output of random forest algorithm implementation with injection attack data on Spark by error and tree number.

Because of the randomness of the random forest algorithm, the experiment was repeated several times, and the average results are plotted as three graphs above. The proposed algorithm still detects anomalies with an error rate below 0.25 for the different numbers of trees. This suggests that the larger the volume of anomalous data and the greater the number of features, the fewer errors the random forest algorithm makes in detecting anomalies.

The results show that the random forest algorithm gives better accuracy and performance in predicting anomalies and is suitable for running on big data systems with large amounts of data. Its error rate is much lower than that of the ARIMA model, and its implementation on the big data platform has a lower error rate still. The following table compares the results of the random forest algorithm implemented in R and in Spark.

Table 3. Comparison of results in implementation methods

Implemented method  Random forest algorithm in R  Random forest algorithm in Spark
Error rate          0.37                          0.17
Run time (s)        1.19                          0.679

The results show that the implementation of the random forest algorithm in Spark performs better than the other tested methods and predicts anomalies with a shorter run time and a lower error rate.
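The size of the improvement reported in Table 3 can be expressed as a relative reduction, (baseline - improved) / baseline. The short sketch below computes it from the four values in the table; the object and function names are hypothetical.

```scala
object Table3Check {
  /** Relative reduction of a quantity with respect to its baseline. */
  def relativeReduction(baseline: Double, improved: Double): Double =
    (baseline - improved) / baseline

  // Values taken from Table 3 (R as baseline, Spark as improved).
  val errorReduction: Double = relativeReduction(0.37, 0.17)    // about 0.54
  val runtimeReduction: Double = relativeReduction(1.19, 0.679) // about 0.43
}
```

On these figures, the Spark implementation lowers the error rate by roughly 54% and the run time by roughly 43% relative to the R implementation.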

References

[1] Krutz, R.L., Securing SCADA systems. 2005: John Wiley & Sons.

[Figure 8 plot, titled "Random Forest": error rate (0 to 0.3) against number of trees (0 to 600); three series, each labelled "Error rate".]

[2] Knapp, E.D. and J.T. Langill, Industrial Network Security: Securing critical infrastructure networks for smart grid, SCADA, and other Industrial Control Systems. 2014: Syngress.

[3] Johnson, R.E. Survey of SCADA security challenges and potential attack vectors. in Internet Technology and Secured Transactions (ICITST), 2010 International Conference for. 2010. IEEE.

[4] Sayegh, N., et al. Internal security attacks on scada systems. in Communications and Information Technology (ICCIT), 2013 Third International Conference on. 2013. IEEE.

[5] Haji, F., L. Lindsay, and S. Song. Practical security strategy for SCADA automation systems and networks. in Canadian Conference on Electrical and Computer Engineering, 2005. 2005. IEEE.

[6] Cluster Mode Overview. [11/08/2016]; Available from: http://spark.apache.org/docs/latest/cluster-overview

[7] Zhang, J. and M. Zulkernine. Network Intrusion Detection using Random Forests. in PST. 2005. Citeseer.

[8] Carrasquilla, U., Benchmarking algorithms for detecting anomalies in large datasets. MeasureIT, Nov, 2010: p. 1-16.

[9] Han, J., Y. Liu, and X. Sun. A scalable random forest algorithm based on MapReduce. in Software Engineering and Service Science (ICSESS), 2013 4th IEEE International Conference on. 2013. IEEE.
