Academic year: 2022

Cross-Project Defect Prediction based on Cognitive Metrics Using Sampled Boosting

N. Vijayaraj, Research Scholar, PG and Research Department of Computer Science, Periyar EVR College (Autonomous), (Affiliated to Bharathidasan University), Trichy, Tamil Nadu, India.

Dr. T. N. Ravi, Assistant Professor, PG and Research Department of Computer Science, Periyar EVR College (Autonomous), (Affiliated to Bharathidasan University), Trichy, Tamil Nadu, India.

Abstract

Software defect prediction is mandatory to deliver reliable software components, the failure of which might lead to disastrous consequences. Lack of sufficient training data and the presence of imbalance in the data create challenges in developing an effective automated software defect prediction module. This work proposes an effective ensemble model aimed at handling the challenges in the defect prediction process. The proposed model operates on cross-project data to generate a training model. Further, the generation of cognitive metrics has proved to improve the prediction process. The boosting component ensures a reduction in the bias caused by imbalance, hence ensuring effective predictions, which is evident in the experimental results.

Keywords: Software defect prediction; cross-project defect prediction; cognitive metrics; software metrics; boosting; ensemble modelling

1. Introduction

Software defects are issues introduced unintentionally by programmers during the development process. This might be due to several reasons, such as lack of experience, misunderstanding of the problem or an unreasonable development process. Software with defects tends to produce unintended results, which might even lead to huge economic losses [1]. Software testing is the process used to identify software defects in advance so that the necessary corrections can commence. Although this process is highly useful, testing resources are usually limited, and hence performing intensive testing on each module is practically impossible [2].

Automated techniques to identify defects are among the most widely adopted means of checking whether a module is prone to defects and what type of testing is required. These are known as software defect prediction techniques.

Software Defect Prediction (SDP) is the automated identification of software defects. This becomes a highly important task during the maintenance and evolution phases of the software to ensure software quality [3, 4]. Software defect prediction is generally performed in two ways: the first type of model analyzes the existing data to identify additional metrics that can be used to effectively improve the prediction process; the second type proposes machine learning models that can perform better defect predictions [5, 6].

The software defect prediction process is generally composed of two phases: the defect prediction model construction phase and the model application phase. The construction phase identifies and performs the learning process based on the details contained in the software repository. The application phase applies the created model to the current project data to obtain predictions. Defect predictions are usually made at several levels: package level [7], file level [8], method level [9] and change level [10].

The model construction phase tends to use data from the repository, which introduces the issue of data imbalance into the model. Data is considered imbalanced if one of its classes exhibits a large number of instances while the other class exhibits very few [11]. The class with few instances is termed the minority class, while the class with many instances is termed the majority class [12]. Majority classes tend to bias the prediction process. This work proposes a cross-project defect prediction model aimed at handling the bias created by imbalance while ensuring effective predictions.
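As an illustration of the majority/minority terminology above, the imbalance ratio of a binary dataset can be computed as follows (a generic sketch with hypothetical labels, not data from this work):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return (majority_class, minority_class, majority/minority ratio)."""
    counts = Counter(labels)
    (maj, maj_n), (mino, min_n) = counts.most_common(2)
    return maj, mino, maj_n / min_n

# Hypothetical defect labels: 0 = non-defective (majority), 1 = defective (minority)
labels = [0] * 90 + [1] * 10
maj, mino, ratio = imbalance_ratio(labels)
print(maj, mino, ratio)  # 0 1 9.0
```

A ratio far above 1 signals that an unweighted learner will be biased towards the majority class.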

2. Related works

Defect prediction has been a major domain of analysis due to the increase in the usage of software products in many domains. Works dealing with identifying effective metrics for use during the detection process are on the rise, owing to the unavailability of effective metrics in the recorded data. A model presenting a metric suite for defect prediction was proposed by Miholca et al. [13]. The metric suite is conceptual and coupling based, and is named COMET. The model has been tested with both supervised and unsupervised techniques and has been found to exhibit effective performance. A feature selection based model to perform enhanced software defect prediction was proposed by Ni et al. [14]. This model is based on multi-objective feature selection (MOFES) to enhance prediction efficiency. The model considers two optimization objectives: one to minimize the number of selected features, the other to maximize the performance of the defect prediction model. Pareto based multi-objective optimization algorithms are used for this purpose. Another method dealing with identifying and eliminating redundant metrics was proposed by Jiarpakdee et al. [15]. Other similar techniques operating in this domain include works by Nam et al. [16] and Hosseini et al. [17]. A Kernel Spectral Embedding Transfer Ensemble (KSETE) technique for software defect prediction was proposed by Tong et al. [18].

An association rule mining based technique for software defect prediction was proposed by Shao et al. [19]. The work mainly focuses on handling the data imbalance issue in the software defect prediction domain, where association rules are identified based on occurrence frequencies. The model proposed in [19] uses the correlation weighted class association rule mining (CWCAR) technique to handle the problem of data imbalance. It uses a multi-weighted support based framework to ensure effective handling of the minority class data. Techniques using association rule mining for software defect prediction include works by Song et al. [20] and Chang et al. [21]. A forest based data mining technique for software defect prediction was presented by Ding et al. [22]. The model operates by building isolation trees based on features to perform defect predictions. The model also uses an ensemble pruning strategy to improve performance. Other ensemble based techniques used in the software defect prediction process include works by Zheng et al. [23] and Siers et al. [24]. A deep learning model for software defect prediction was proposed by Qiao et al. [25]. This work is composed of two phases: a preprocessing phase that performs data transformation and normalization, and the deep learning model itself.

3. Proposed Methodology

Sampled Boosting (SBoost) based Cross-Project Defect Prediction

Cross-project defect prediction has become a significant component in the defect prediction domain due to the unavailability of data for new projects. This work considers multiple cross-project datasets to create an effective model for the current project. The model also includes a component that can integrate the small amount of data available, if any, from the current project. The proposed Sampled Boosting (SBoost) model is composed of four major components: 1. data preparation, 2. data integration, 3. data sampling and training data preparation, and 4. boosting model creation and prediction.

3.1. Cross-Project based Data Preparation

Data preparation forms the first phase of the process. The work considers two categories of data: the data obtained from the current project, and the data obtained from other similar projects. Data from the current project is limited; however, it has very high correspondence with the data to be predicted. Data from similar projects corresponds to cross-project data, and is used to handle the data insufficiency issue during the training process. The training data is hence an agglomeration of data from multiple projects. Projects contain some common features with various attributes. Integration of multiple datasets is only possible when the data contain the same attributes; otherwise, the attributes that are missing in one dataset are filled with null values.

The first step is to identify attributes that are similar. These are the attributes that can be directly merged. Cognitive metrics, if available, are identified from the data. Cognitive metrics are derived attributes and can be calculated; hence they are calculated for all the datasets. This ensures all the datasets contain the available cognitive attributes. This marks the end of the data preparation process.
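The attribute-alignment step described above can be sketched as follows. The datasets, column names and the derived "cognitive density" metric are purely illustrative assumptions, not the actual cognitive metrics used in this work:

```python
# Each project dataset is represented as a dict of column name -> list of values.

def common_attributes(datasets):
    """Attributes shared by every project dataset (directly mergeable)."""
    cols = [set(d.keys()) for d in datasets]
    return set.intersection(*cols)

def add_cognitive_metric(dataset):
    """Derive a hypothetical cognitive attribute from existing base metrics."""
    loc = dataset["loc"]
    branches = dataset["branches"]
    # Ratio of branching constructs to module size, a stand-in cognitive measure
    dataset["cognitive_density"] = [b / l if l else 0.0 for b, l in zip(branches, loc)]
    return dataset

project_a = {"loc": [100, 50], "branches": [10, 5], "coupling": [3, 1]}
project_b = {"loc": [200], "branches": [40]}
for p in (project_a, project_b):
    add_cognitive_metric(p)

print(sorted(common_attributes([project_a, project_b])))
# ['branches', 'cognitive_density', 'loc']
```

Because the cognitive metric is derived, computing it for every dataset guarantees it lands in the common-attribute set even when the raw datasets differ.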

3.2. Cognitive Metric based Data Integration

Data preparation often results in additional data being added to the existing datasets. The training data is prepared by integrating all the cross-project data with the data available from the current project. The varied nature of projects results in a varied number and type of attributes.

Hence the attributes common to all of the datasets are identified. Most of these attributes will correspond to the recently calculated cognitive attributes. The next level of attribute selection selects all the attributes contained in the current project. The attributes selected at both levels are integrated to form the final attribute set for the training data. Data instances from the current project and the cross-project datasets are vertically integrated to form the final training data. Since some attributes might be missing in certain project data, the training set is prone to contain missing data. Instance based imputation is performed to obtain the final training data.
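The vertical integration and instance based imputation steps can be sketched as below; the row layout and the nearest-neighbour donor rule are assumptions for illustration, not the paper's exact imputation procedure:

```python
# Datasets are lists of row dicts; missing attributes become None on integration
# and are then filled from the most similar complete row (instance based imputation).

def integrate(datasets, attributes):
    """Vertically stack all rows, restricted to the selected attribute set."""
    rows = []
    for d in datasets:
        for r in d:
            rows.append({a: r.get(a) for a in attributes})
    return rows

def impute(rows):
    """Fill each missing value from the nearest row that has that attribute."""
    def dist(a, b):
        shared = [k for k in a if a[k] is not None and b[k] is not None]
        return sum((a[k] - b[k]) ** 2 for k in shared) if shared else float("inf")
    for r in rows:
        for k, v in r.items():
            if v is None:
                donors = [o for o in rows if o[k] is not None]
                if donors:
                    r[k] = min(donors, key=lambda o: dist(r, o))[k]
    return rows

data = integrate(
    [[{"loc": 10, "bugs": 1}], [{"loc": 12}]],  # current + one cross-project set
    attributes=["loc", "bugs"],
)
print(impute(data))  # the missing 'bugs' value is copied from the nearest row
```

For real metric data a scaled distance and an indexed neighbour search would be preferable; the quadratic scan here only keeps the sketch short.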

3.3. Data Sampling and Training Data Preparation

Project data is prone to be imbalanced. Many instances depicting normal code will be available, while the number of instances depicting defective code is very limited. Data sampling is generally used to counter data imbalance. The defect prediction domain generally suffers from data unavailability; hence undersampling is avoided, and oversampling is the sampling model of choice. Oversampling operates by creating additional instances of the minority class to match the majority class levels. Since the training data is composed of cross-project data, oversampling the entire data has a very high probability of introducing bias into the training data. Hence, oversampling is performed only on the current project data. This automatically closes the imbalance gap and also increases the instance levels of the current project, improving the data quality.
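A minimal sketch of oversampling restricted to the current project's data, using random duplication of minority rows (a hypothetical simplification of whichever oversampler the model actually uses):

```python
import random

def oversample_current(current_rows, current_labels, seed=0):
    """Duplicate minority-class rows of the *current* project only, until
    both classes are balanced; cross-project rows are left untouched."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(current_rows, current_labels):
        by_class.setdefault(y, []).append(row)
    target = max(len(v) for v in by_class.values())
    rows, labels = [], []
    for y, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            rows.append(row)
            labels.append(y)
    return rows, labels

# Three non-defective modules, one defective: the defective row is duplicated
rows, labels = oversample_current([[1], [2], [3], [4]], [0, 0, 0, 1])
print(sorted(labels))  # [0, 0, 0, 1, 1, 1]
```

The balanced current-project rows would then be concatenated with the untouched cross-project rows to form the final training set.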

3.4. Boosting based Model Creation and Prediction

Boosting is a type of ensemble model that performs iterative reduction of errors in the trained model to enable improved predictions. The machine learning algorithm used in the training process is known as the base model. The proposed SBoost model uses a tree based machine learning algorithm as the base model. The use of a tree based model aids in dynamic decision rule creation, enabling faster and more effective training.

Let T(x) be the tree based base model used for training. The imputed cross-project training data (x) is passed to T(x) for the first level of training, and the predictions (y′) are obtained. This process is given by

y′ = T(x)

Errors contained in the predictions y′ are determined by identifying the difference between the actual results (y) and the predictions (y′). The process of obtaining the errors is given by

e = y − y′

The next iteration integrates the errors into the training process to increase the weights of the instances that were wrongly predicted. The next training process retrains the base model on x with the error levels e integrated, to obtain the next level predictions (y′′), which is given by

y′′ = T(x, e)

However, y′′ is not free from errors. The errors at the second level (e′) are identified by

e′ = y − y′′

These errors are again integrated into the base model. This process is repeated until the error levels drop below the required threshold. After this, the trained model is used for the prediction process.
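The iterations above can be sketched as a residual boosting loop; the one-split regression stump standing in for the tree base model T(x), and the toy dataset, are illustrative assumptions rather than the paper's exact algorithm:

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature, minimising squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=10):
    """Repeatedly fit the base model to the current errors e = y - y'."""
    preds = [0.0] * len(xs)
    models = []
    for _ in range(rounds):
        errors = [y - p for y, p in zip(ys, preds)]       # e = y - y'
        stump = fit_stump(xs, errors)                      # refit on the errors
        models.append(stump)
        preds = [p + stump(x) for p, x in zip(preds, xs)]  # next-level predictions
    return lambda x: sum(m(x) for m in models)

model = boost([1, 2, 3, 4], [0, 0, 1, 1])
print(round(model(1)), round(model(4)))  # 0 1
```

Each round shrinks the remaining error, mirroring the repeat-until-threshold loop described above.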

The new module data generated from the current project is passed through the data preprocessing phase. The additional cognitive features are generated and integrated into the data. This data is passed to the trained boosted model for prediction.

4. Results and discussion

The SBoost model has been implemented using Python. The performance of the SBoost model is measured using project details from the PROMISE [26, 27] repository. The PROMISE repository contains details of ten varied projects. Each project is measured with 20 metrics, which are a mixture of cognitive, object oriented and standard software performance metrics. Each instance in a dataset represents the defect status of a module.

An analysis of the true prediction metrics, True Positive Rate (TPR) and True Negative Rate (TNR), is shown in figure 1. TPR represents the prediction efficiency for defects, while TNR represents the prediction efficiency for non-defect modules. It can be observed from the figure that the TPR levels of the SBoost model are high, representing high prediction efficiency.

Figure 1: True Prediction Levels of SBoost


An analysis of the false prediction metrics, False Positive Rate (FPR) and False Negative Rate (FNR), is shown in figure 2. Both values are required to be low for an effective model. It can be observed from the figure that the proposed model exhibits very low FPR and FNR levels, exhibiting low error levels.

Figure 2: False Prediction Levels of SBoost
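The TPR, TNR, FPR and FNR values plotted in figures 1 and 2 can be computed from prediction counts as follows (a generic sketch, not tied to the paper's datasets):

```python
def rates(y_true, y_pred):
    """TPR, TNR, FPR, FNR for binary labels (1 = defective module)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {"TPR": tp / (tp + fn), "TNR": tn / (tn + fp),
            "FPR": fp / (tn + fp), "FNR": fn / (tp + fn)}

print(rates([1, 1, 0, 0], [1, 0, 0, 1]))
# {'TPR': 0.5, 'TNR': 0.5, 'FPR': 0.5, 'FNR': 0.5}
```

Note that TPR + FNR = 1 and TNR + FPR = 1, which is why low FPR/FNR in figure 2 corresponds directly to the high TPR/TNR in figure 1.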

A comparison of aggregated metrics is shown in figure 3. The aggregated metrics include the precision of the prediction process and the accuracy levels. High precision and accuracy levels depict effective prediction models, which is evident from figure 3.

Figure 3: Aggregated Prediction Levels of SBoost


5. Performance Comparison and Analysis

A comparison of the performance of SBoost with the KSETE model proposed by Tong et al. [18] is performed, and the results are provided below. A TPR based comparison is shown in figure 4. The TPR levels of SBoost can be observed to be much higher than those of the KSETE model, depicting highly improved defect detection levels.

Figure 4: TPR Comparison of SBoost with KSETE

FPR levels have been compared and are shown in figure 5. Lower FPR levels indicate better performance. The SBoost model can be observed to exhibit lower FPR levels than the KSETE model, exhibiting reduced false prediction levels.

Figure 5: FPR Comparison of SBoost with KSETE


The average performance levels over all the datasets are shown in figure 6. The TPR levels of SBoost indicate an improvement of 23%, and the FPR levels of SBoost indicate a reduction of 25%, when compared to the average performance of the KSETE model. This shows that the SBoost model exhibits highly effective performance in the defect prediction process.

Figure 6: Average Comparison of SBoost with KSETE

6. Conclusion

Automated defect prediction in software components has become the need of the day due to the increased usage of software in several domains. This work proposes an ensemble based automated defect prediction model. The model has been trained using cross-project data; hence it proves to be highly effective on new or relatively new projects with little historical data. Additional metrics are calculated to ensure the presence of cognitive metrics for improving the prediction performance. The boosted model has been observed to effectively handle the data imbalance issue contained in the data. The major advantage of the proposed model is that it can operate on any software data for training, resulting in a high degree of usability. Experiments and comparisons indicate highly improved performance, with an average increase of 23% in TPR levels and a 25% reduction in FPR levels.

7. References

[1] Establishing a software defect prediction model via effective dimension reduction

[2] Yan, M., Fang, Y., Lo, D., Xia, X., Zhang, X., 2017. File-level defect prediction: unsupervised vs. supervised models. In: Empirical Software Engineering and Measurement (ESEM), 2017 ACM/IEEE International Symposium on. IEEE, pp. 344–353.


[3] Czibula, G., Marian, Z., Czibula, I.G., 2014. Software defect prediction using relational association rule mining. Inf. Sci. 264, 260–278.

[4] Miholca, D., 2018. An improved approach to software defect prediction using a hybrid machine learning model, in: 2018 20th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), pp. 443–448. doi:10.1109/SYNASC.2018.00074.

[5] Miholca, D.L., Czibula, G., 2019. Software defect prediction using a hybrid model based on semantic features learned from the source code, in: Douligeris, C., Karagiannis, D., Apostolou, D. (Eds.), Knowledge Science, Engineering and Management, LNCS, volume 11775, Springer International Publishing, Cham, pp. 262–274.

[6] X. Peng, B. Liu, S. Wang, Feedback-based integrated prediction: Defect prediction based on feedback from software testing process, J. Syst. Softw. 143 (2018) 159–171.

[7] Schröter, T. Zimmermann, A. Zeller, Predicting component failures at design time, in: Proceedings of the 2006 ACM/IEEE International Symposium on Empirical Software Engineering, ISESE '06, ACM, New York, NY, USA, 2006, pp. 18–27.

[8] M. Yan, Y. Fang, D. Lo, X. Xia, X. Zhang, File-level defect prediction: Unsupervised vs. supervised models, in: Proceedings of the 11th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM '17, IEEE Press, Piscataway, NJ, USA, 2017, pp. 344–353.

[9] H. Hata, O. Mizuno, T. Kikuno, Bug prediction based on fine-grained module histories, in: Proceedings of the 34th International Conference on Software Engineering, ICSE '12, IEEE Press, Piscataway, NJ, USA, 2012, pp. 200–210.

[10] Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha, N. Ubayashi, A large-scale empirical study of just-in-time quality assurance, IEEE Transactions on Software Engineering 39 (6) (2013) 757–773.

[11] Somasundaram, Akila, and U. Srinivasulu Reddy. "Data imbalance: effects and solutions for classification of large and highly imbalanced data." In International Conference on Research in Engineering, Computers and Technology (ICRECT 2016), pp. 1-16. 2016.

[12] Somasundaram, Akila, and U. Srinivasulu Reddy. "Modelling a stable classifier for handling large scale data with noise and imbalance." In 2017 International Conference on Computational Intelligence in Data Science (ICCIDS), pp. 1-6. IEEE, 2017.

[13] COMET: A conceptual coupling based metrics suite for software defect prediction

[14] An empirical study on pareto based multi-objective feature selection for software defect prediction

[15] Jiarpakdee, J., Tantithamthavorn, C., Ihara, A., Matsumoto, K., 2016. A study of redundant metrics in defect prediction datasets. In: Proceedings of the International Symposium on Software Reliability Engineering Workshops, pp. 51–52.

[16] Nam, J., Fu, W., Kim, S., Menzies, T., Tan, L., 2017. Heterogeneous defect prediction. IEEE Trans. Softw. Eng. PP (99), 1.


[17] Hosseini, S., Turhan, B., Mantyla, M., 2017. A benchmark study on the effectiveness of search-based data selection and feature selection for cross project defect prediction. Inf. Softw. Technol. 95, 296–312.

[18] Tong, Haonan, Bin Liu, and Shihai Wang. “Kernel Spectral Embedding Transfer Ensemble for Heterogeneous Defect Prediction.” IEEE Transactions on Software Engineering, 2019.

[19] Software defect prediction based on correlation weighted class association rule mining

[20] Q. Song, M. Shepperd, M. Cartwright, C. Mair, Software defect association mining and defect correction effort prediction, IEEE Trans. Softw. Eng. 32 (2) (2006) 69–82.

[21] C.P. Chang, C.P. Chu, Y.F. Yeh, Integrating in-process software defect prediction with association mining to discover defect pattern, Inf. Softw. Technol. 51 (2) (2009) 375–384.

[22] Improved software defect prediction using Pruned Histogram-based isolation forest

[23] Zheng J. Cost-sensitive boosting neural networks for software defect prediction. Expert Syst Appl 2010;37:4537–43.

[24] Siers MJ, Islam MZ. Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem. Inf Syst 2015;51:62–71.

[25] Deep Learning Based Software Defect Prediction

[26] Jureczko, Marian, and Lech Madeyski. "Towards identifying software project clusters with regard to defect prediction." In Proceedings of the 6th international conference on predictive models in software engineering, pp. 1-10. 2010.

[27] Ferenc, Rudolf, Zoltán Tóth, Gergely Ladányi, István Siket, and Tibor Gyimóthy. "A public unified bug dataset for Java." In Proceedings of the 14th International Conference on Predictive Models and Data Analytics in Software Engineering, pp. 12-21. 2018.
