CN114064459A

CN114064459A - Software defect prediction method based on generative adversarial network and ensemble learning

Info

Publication number: CN114064459A
Application number: CN202111243350.7A
Authority: CN
Inventors: 孟海宁; 郑毅; 冯锴; 朱磊; 杨哲; 张嘉薇; 黑新宏
Original assignee: Xian University of Technology
Current assignee: Xian University of Technology
Priority date: 2021-10-25
Filing date: 2021-10-25
Publication date: 2022-02-18

Abstract

The invention discloses a software defect prediction method based on a generative adversarial network and ensemble learning, comprising the following steps: step 1, preprocessing a software defect data set, dividing the preprocessed data set into a training set and a test set, and calculating a resampling rate; step 2, constructing a generative adversarial network model; step 3, training the generative adversarial network on the training set to obtain a trained generative adversarial network; step 4, generating new minority-class defect data with the trained generative adversarial network according to the resampling rate to obtain a resampled training set; and step 5, constructing a strong software defect classifier by the AdaBoost method, and inputting the test set into the trained strong classifier to obtain a software defect prediction result. The invention addresses the class imbalance of software defect data and improves the accuracy, recall and F-measure of software defect prediction.

Description

Software defect prediction method based on generative adversarial network and ensemble learning

Technical Field

The invention belongs to the technical field of software defect prediction, and particularly relates to a software defect prediction method based on a generative adversarial network and ensemble learning.

Background

With the spread of information technology, software that is in use or under development may contain defects. Finding and locating these defects in time is important for the normal operation of a software system and for completing its functionality. Software Defect Prediction (SDP) aims to find the modules in a software project that are most likely to contain defects, where a module may be a function, a loop body, a class, and so on. Software defect prediction generally proceeds in three steps. First, each software module is labeled as defective or non-defective according to the bug reports of the module in historical versions of the software. Second, software defect metrics are used as the features of the defect data so that prediction models can be trained. Finally, after data preprocessing, the data is divided into a training set and a test set, and the software defect prediction model is trained and tested.

Many software defect prediction methods have been proposed, based for example on support vector machines, decision trees, random forests and AdaBoost. However, these methods only lightly preprocess the original defect data and make their modifications at the algorithm level, without considering problems at the data level. Because defective modules make up only a small proportion of a software system, software defect data sets are typical class-imbalanced data; for example, PC1, PC2 and PC3 in the NASA MDP software defect data set are severely class-imbalanced. Resampling the sample data can make originally imbalanced data balanced, so resampling has been introduced into software defect prediction. Typical resampling methods include the Synthetic Minority Over-sampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), Random Over-Sampling (ROS) and Random Under-Sampling (RUS). SMOTE and ADASYN reduce class imbalance by artificially synthesizing minority-class samples, but the rules for generating data are specified by hand, and suitable rules are hard to find for real software defect data sets. ROS and RUS directly change the amount of data in the training set (ROS duplicates minority-class samples and RUS discards majority-class samples), and they perform poorly in practice.

Disclosure of Invention

The invention aims to provide a software defect prediction method based on a generative adversarial network and ensemble learning. When dealing with different software defect data sets, the method adaptively resamples the data through the generative adversarial network to alleviate data imbalance, and trains the defect classifier with the AdaBoost ensemble learning method, thereby improving the accuracy, recall and F-measure of software defect prediction.

The technical scheme adopted by the invention is as follows:

the software defect prediction method based on a generative adversarial network and ensemble learning comprises the following steps:

step 1, preprocessing a software defect data set, dividing the preprocessed data set into a training set and a test set, calculating the ratio between defect data and non-defect data in the training set, and recording the ratio as the resampling rate;

step 2, constructing a generative adversarial network model, wherein the model comprises a generator and a discriminator;

step 3, training the generative adversarial network on the training set data until the network converges or the set number of iterations is reached, to obtain the trained generative adversarial network;

step 4, generating new minority-class defect data with the trained generative adversarial network according to the resampling rate calculated in step 1, to obtain the resampled training set;

step 5, training a strong software defect classifier with the AdaBoost ensemble learning method by inputting the training set resampled in step 4 into the classifier; after training, the classifier is evaluated on the test set to obtain the software defect prediction evaluation indexes.

The invention is also characterized in that:

the loss functions of the generator and discriminator are shown in equations (1) and (2):

DLoss＝BCELoss(x_true,y_true)+BCELoss(x_fake,y_fake) (1)

GLoss＝BCELoss(x_fake,y_true) (2)

wherein DLoss is the loss function of the discriminator and GLoss is the loss function of the generator; the generator and the discriminator both use the Adam optimizer, with three hyperparameters: learning rate, Betas and number of iterations; x_fake, x_true, y_true and y_fake denote the forged data, the real data, the label of the real data and the label of the forged data, respectively; BCELoss is the binary cross-entropy loss function, given by:

BCELoss(x, y) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log x_i + (1 - y_i) \log(1 - x_i) \right] (3)

wherein x_i and y_i are the i-th software defect sample and its corresponding label, respectively, and n is the total number of samples.
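As a minimal sketch of how equations (1) and (2) can be evaluated, the following plain-Python example computes the binary cross entropy for a small batch. It assumes that the first argument of BCELoss is the discriminator's output probability for each sample, which is the usual reading; the probability values and the helper name `bce_loss` are illustrative and not taken from the patent.

```python
import math


def bce_loss(preds, labels, eps=1e-12):
    """Mean binary cross entropy between probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(preds, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(preds)


# Hypothetical discriminator outputs (probability that a sample is real).
d_on_real = [0.9, 0.8, 0.95]
d_on_fake = [0.2, 0.1, 0.3]

y_true = [1] * len(d_on_real)  # real data is labeled 1
y_fake = [0] * len(d_on_fake)  # forged data is labeled 0

# Equation (1): the discriminator is penalized for misjudging either kind.
d_loss = bce_loss(d_on_real, y_true) + bce_loss(d_on_fake, y_fake)

# Equation (2): the generator wants its forged data to be judged as real,
# so the forged batch is scored against the "real" label.
g_loss = bce_loss(d_on_fake, [1] * len(d_on_fake))
```

With these values the discriminator is doing well, so its loss is small while the generator's loss is large; as training proceeds the two losses pull against each other.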

The generator in step 2 comprises an initial random noise input layer, a final generated-data output layer, and block structures each consisting of a linear layer, a batch normalization layer and a LeakyReLU activation function. The first block structure after the noise input layer has no batch normalization layer, and neither does the block structure immediately before the data output layer. The data output layer uses a Sigmoid activation function. The noise input has dimension noise_dim; it passes through the 3 block structures and finally through the fully connected layer of the output layer, whose output has dimension out_dim;

the discriminator in step 2 comprises an initial data input layer, a final discrimination result output layer, and block structures each consisting of a linear layer and a LeakyReLU activation function; the input data has dimension out_dim, the same as the data output by the generator; the discrimination result output layer uses a linear layer and a Sigmoid activation function, so the final output of the discriminator is a value between 0 and 1.
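The layer layout described above can be written out explicitly. The sketch below only lists the layers as plain tuples rather than building a trainable network. The hidden widths and the number of discriminator blocks are assumptions made for illustration, since the text does not state them, and the function names are hypothetical. With three generator blocks, the rule that the first and last blocks omit batch normalization leaves it in the middle block only.

```python
def generator_plan(noise_dim, out_dim, hidden=(128, 256, 128)):
    """List the generator layers as (name, shape) tuples.

    `hidden` holds assumed block widths; the text fixes only the
    number of blocks (3), not their sizes.
    """
    if len(hidden) != 3:
        raise ValueError("the text describes exactly 3 block structures")
    layers, in_dim = [], noise_dim
    for idx, width in enumerate(hidden):
        layers.append(("Linear", (in_dim, width)))
        if idx == 1:  # first and last blocks have no batch normalization
            layers.append(("BatchNorm", (width,)))
        layers.append(("LeakyReLU", ()))
        in_dim = width
    layers.append(("Linear", (in_dim, out_dim)))  # output layer
    layers.append(("Sigmoid", ()))                # maps outputs into (0, 1)
    return layers


def discriminator_plan(in_dim, hidden=(128, 64)):
    """List the discriminator layers; block count and widths are assumed."""
    layers = []
    for width in hidden:
        layers.append(("Linear", (in_dim, width)))
        layers.append(("LeakyReLU", ()))
        in_dim = width
    layers.append(("Linear", (in_dim, 1)))  # single real/forged score
    layers.append(("Sigmoid", ()))
    return layers


# Example: 50-dimensional noise, a data set with 37 features (illustrative).
gen = generator_plan(noise_dim=50, out_dim=37)
disc = discriminator_plan(in_dim=37)
```

The discriminator's input width equals the generator's output width, which is what allows real and forged samples to be fed to the same network.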

The step 3 comprises the following steps:

step 3.1, carrying out normalization operation on the training set, and constraining the numerical range of the training set to be between 0 and 1;

step 3.2, inputting random noise in the range 0 to 1 into the generator G to generate forged data, and labeling the forged data 0; labeling the normalized training data from step 3.1 as 1, and inputting it together with the forged data into the discriminator D, which learns to distinguish real data from forged data;

step 3.3, repeating step 3.2 until the generative adversarial network converges or the set number of iterations is reached, to obtain the trained generator G and discriminator D.

The specific operation of step 4 is as follows: step 4.1, calculating the number of minority-class software defect samples to be generated according to the resampling rate k from step 1, inputting random noise in the range 0 to 1 into the generator G of the generative adversarial network trained in step 3, and obtaining the generated minority-class defect data from the generator;

step 4.2, applying to the minority-class defect data generated in step 4.1 the inverse of the normalization performed in step 3.1, and merging the result with the original minority-class software defect training data to obtain the resampled training set.

The specific operation of step 5 is as follows:

step 5.1, applying ten-fold cross-validation to the training set resampled in step 4: the whole training data set is divided into 10 parts, each part in turn serves as the validation set for checking the classifier during training, the process is repeated 10 times, and the results are finally combined by weighted average to obtain the training performance index of the classifier; when training finishes, the trained strong software defect classifier C is obtained;

step 5.2, using the strong classifier C from step 5.1 to classify the test set, and examining the classification performance of C on the test set.
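The ten-fold procedure of step 5.1 can be sketched with index lists alone. The example below is a minimal illustration: the function names are hypothetical, the shuffle seed is arbitrary, and weighting each fold's score by its validation-set size is an assumption, because the text does not say which weights the weighted average uses.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, validation_indices) for each fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices round-robin into n_folds parts.
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val


def weighted_mean(scores, weights):
    """Combine per-fold scores; weights are assumed to be fold sizes."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Example: 103 resampled training samples split into 10 folds.
splits = list(k_fold_indices(103))
fold_sizes = [len(val) for _, val in splits]
```

Each sample lands in exactly one validation fold, so every sample is used for validation once and for training nine times.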

In step 5.1, the strong software defect classifier C is trained with the AdaBoost learning method, as shown in formulas (6), (7) and (8):

wherein α_m is the weight of the m-th decision tree weak classifier, C_m(x) is the m-th decision tree weak classifier, C(x) is the strong classifier obtained by ensemble learning, sign is the sign function applied to the weighted sum of the M weak classifier results, and e_m is the classification error rate of the m-th decision tree weak classifier.
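For reference, the standard AdaBoost relations that match these symbol definitions are given below. This is a sketch of the textbook form, not a transcription of the patent's formulas (6) to (8), and no correspondence between individual lines and formula numbers is claimed. The sample weights w_{m,i}, the labels y_i in {-1, +1} and the indicator function I are notation assumed here.

```latex
% Weighted error rate of the m-th weak classifier over n training samples
e_m = \sum_{i=1}^{n} w_{m,i} \, I\big( C_m(x_i) \neq y_i \big)

% Weight of the m-th weak classifier: lower error gives a larger weight
\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}

% Strong classifier: sign of the weighted vote of the M weak classifiers
C(x) = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m \, C_m(x) \right)
```

A weak classifier with e_m below 0.5 receives a positive weight, so classifiers that do better than chance contribute to the vote in proportion to their reliability.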

The invention has the beneficial effects that:

1) The invention provides a software defect prediction method based on a generative adversarial network and ensemble learning. Software defect data is resampled with the generative adversarial network, which overcomes the shortcomings of hand-crafted synthetic resampling methods and alleviates the class imbalance of software defect data, so that the accuracy, recall and F-measure of the classifier can be improved across various software defect data sets.

2) The method addresses software defect prediction at two levels: at the data level, the generative adversarial network is used to resample the data; at the algorithm level, an ensemble learning method is used to train the classifier. Combining these two angles improves classification performance.

Drawings

FIG. 1 is a general flow chart of the software defect prediction method based on a generative adversarial network and ensemble learning according to the present invention;

FIG. 2 is a structure diagram of the generative adversarial network in the software defect prediction method according to the present invention;

FIG. 3 is a schematic diagram of the structure of the generator and the discriminator in the software defect prediction method according to the present invention;

FIG. 4 is a graph of the accuracy of the model on the test set before and after resampling with the generative adversarial network in the software defect prediction method according to the present invention.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

The invention provides a software defect prediction method based on a generative adversarial network and ensemble learning, which comprises the following steps:

In step 1:

the defect data set used by the method is derived from the NASA software defect data set and comprises 12 sub-data sets; the software defect metrics used include the McCabe metrics, the Halstead metrics, lines-of-code metrics and the CK metrics.

The preprocessing operation removes duplicate data, duplicate attributes and abnormal data. The preprocessed data is then randomly sampled and divided into a training set and a test set.

In step 2, the structures of the generator (Generator) and the discriminator (Discriminator) are as follows:

the generator comprises an initial random noise input layer, a final generated-data output layer, and block structures each consisting of a linear layer, a batch normalization layer and a LeakyReLU activation function. Neither the first block structure after the noise input layer nor the block structure immediately before the data output layer has a batch normalization layer. The data output layer uses a Sigmoid activation function, which maps the generated data into the interval from 0 to 1. The noise input has dimension noise_dim; the noise passes through the 3 block structures and finally through the fully connected layer of the output layer, whose output has dimension out_dim.

The discriminator comprises an initial data input layer, a final discrimination result output layer, and block structures each consisting of a linear layer and a LeakyReLU activation function. The data input layer receives real minority-class defect data and forged data generated by the generator; the input data has dimension out_dim, the same as the data output by the generator. The discrimination result output layer uses a linear layer and a Sigmoid activation function, so the final output of the discriminator lies in the interval from 0 to 1. The input data first passes through the data input layer, then through the block structures, and finally through the discrimination result output layer to give a discriminator output between 0 and 1.

DLoss＝BCELoss(x_true,y_true)+BCELoss(x_fake,y_fake) (1)

GLoss＝BCELoss(x_fake,y_true) (2)

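where DLoss is the loss function of the discriminator and GLoss is the loss function of the generator. The generator and the discriminator both use the Adam optimizer, with three hyperparameters: learning rate, Betas and number of iterations. x_fake, x_true, y_true and y_fake denote the forged data, the real data, the label of the real data and the label of the forged data, respectively. BCELoss is the binary cross-entropy loss function, given by:

BCELoss(x, y) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log x_i + (1 - y_i) \log(1 - x_i) \right] (3)

where x_i and y_i are the i-th software defect sample and its corresponding label, respectively, and n is the total number of samples.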

The specific steps of the step 3 are as follows:

step 3.1, normalizing the training set so that its values lie between 0 and 1, using formula (4):

x = \frac{x_{ij} - x_{j\_min}}{x_{j\_max} - x_{j\_min}} (4)

wherein x_{ij} is the j-th feature value of the i-th software defect sample, x_{j\_min} is the minimum of the j-th feature, x_{j\_max} is the maximum of the j-th feature, and x is the normalized feature value, a floating point number between 0 and 1;

step 3.2, inputting random noise in the range 0 to 1 into the generator G to generate forged data, and labeling the forged data 0; labeling the normalized training data from step 3.1 as 1, and inputting it together with the forged data into the discriminator D, which learns to distinguish real data from forged data;
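The min-max normalization of step 3.1, together with the inverse mapping that step 4.2 later applies to generated data, can be sketched in plain Python as follows. The function names are illustrative, and mapping a constant feature (where the maximum equals the minimum) to 0 is an assumption, since the text does not cover that case.

```python
def fit_min_max(rows):
    """Return per-feature minima and maxima of a list of feature rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]


def normalize(rows, mins, maxs):
    """Map every feature into [0, 1] using min-max scaling."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0  # constant feature -> 0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


def denormalize(rows, mins, maxs):
    """Inverse of `normalize`: map values in [0, 1] back to feature units."""
    return [
        [v * (hi - lo) + lo for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Example with two features on very different scales.
train_rows = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
mins, maxs = fit_min_max(train_rows)
scaled = normalize(train_rows, mins, maxs)
restored = denormalize(scaled, mins, maxs)
```

The minima and maxima are kept after fitting because the same values are needed to bring generated samples, which the Sigmoid output layer confines to the range 0 to 1, back to the original feature scale.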

The specific operation of step 4 is as follows: step 4.1, calculating the number of minority-class defect samples to be generated according to the resampling rate k calculated in step 1, as shown in formula (5), inputting random noise in the range 0 to 1 into the generator G of the generative adversarial network trained in step 3, and obtaining the generated minority-class defect data from the generator:

N＝(k-1)*T (5)

wherein N is the number of defect-class samples to be synthesized, k is the resampling rate, and T is the number of minority-class defect samples in the training set;

step 4.2, applying to the minority-class defect data generated in step 4.1 the inverse of the normalization performed in step 3.1, and merging the result with the original minority-class software defect training data to obtain the resampled training set.
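Formula (5) can be checked with a small worked example. The sketch below assumes that k is the number of non-defect samples divided by the number of defect samples, because that reading makes N = (k - 1) * T bring the defect class up to the size of the non-defect class; the text only says that k is the ratio between the two classes. The function name, the sample counts and the random seed are illustrative, and the noise dimension of 50 follows Example 1.

```python
import random


def samples_to_generate(k, n_minority):
    """Formula (5): N = (k - 1) * T, rounded to a whole number of samples."""
    if k < 1:
        raise ValueError("k is assumed to be >= 1 (majority / minority)")
    return round((k - 1) * n_minority)


# Illustrative training set: 20 defective and 180 non-defective modules.
n_defect, n_clean = 20, 180
k = n_clean / n_defect                    # assumed direction of the ratio
n_new = samples_to_generate(k, n_defect)  # number of samples to synthesize

# One noise vector in [0, 1) per sample the generator has to produce.
noise_dim = 50
rng = random.Random(0)
noise = [[rng.random() for _ in range(noise_dim)] for _ in range(n_new)]
```

After generation the defect class holds T + N samples, which under the stated assumption equals the number of non-defect samples.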

The specific operation of step 5 is as follows:

the software defect classifier uses the AdaBoost ensemble learning method, with M identical decision tree weak classifiers as weak learners, and the final strong software defect classifier is obtained by weighted combination; the training process is as follows:

step 5.1, applying ten-fold cross-validation to the training set resampled in step 4: the whole training data set is divided into 10 parts, each part in turn serves as the validation set for checking the classifier during training, the process is repeated 10 times, and the weighted average is taken as the final performance index of the classifier. When training finishes, the trained software defect classifier C is obtained. The process of training the final classifier with the AdaBoost learning method is shown in formulas (6), (7) and (8):

wherein α_m is the weight of the m-th decision tree weak classifier, C_m(x) is the m-th decision tree weak classifier, C(x) is the strong classifier obtained by ensemble learning, sign is the sign function applied to the weighted sum of the M weak classifier results, and e_m is the classification error rate of the m-th decision tree weak classifier;

step 5.2, using the strong classifier C trained in step 5.1, and evaluating it on the test set with the accuracy, recall and F-measure performance indexes.
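The AdaBoost training loop can be illustrated with a compact plain-Python sketch. It is a simplification under stated assumptions: one-level threshold stumps stand in for the decision tree weak classifiers, labels are taken to be -1 and +1 so that the sign function applies directly, and the toy data and function names are invented for the example.

```python
import math


def train_stump(X, y, w):
    """Return (error, feature, threshold, sign) of the best weighted stump."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[f] <= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best


def stump_predict(stump, row):
    _, f, thr, sign = stump
    return sign if row[f] <= thr else -sign


def adaboost_fit(X, y, n_rounds):
    """Train n_rounds stumps, reweighting samples after each round."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(n_rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)     # weak classifier weight
        model.append((alpha, stump))
        # Misclassified samples gain weight, correct ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def adaboost_predict(model, row):
    """Strong classifier: sign of the weighted vote of all stumps."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in model)
    return 1 if score >= 0 else -1


# Toy 1-D data that no single threshold can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(X, y, n_rounds=3)
predictions = [adaboost_predict(model, row) for row in X]
```

No single stump fits this pattern, but the weighted vote of a few stumps can, which is the point of combining weak classifiers into a strong one. The embodiment described in this document uses M = 200 decision trees rather than three stumps.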

In the method provided by the invention, software defect data is generated with the generative adversarial network at the data level. After the class imbalance of the data has been alleviated, an ensemble learning method is used at the algorithm level, with ten-fold cross-validation during training to ensure the randomness and soundness of the training process. Combining the generative adversarial network with the ensemble learning algorithm ultimately improves the accuracy of software defect classification.

Example 1

Step 1 is executed. The software defect data set used in this embodiment is derived from the NASA software defect data set and comprises 12 sub-data sets; the software defect metrics used include the McCabe metrics, the Halstead metrics, lines-of-code metrics and the CK metrics. Each sub-data set contains a different number of features, as shown in the feature column of Table 1. Data preprocessing is performed on the original NASA software defect data set to remove duplicate data, duplicate attributes and abnormal data; the preprocessed NASA software defect data set is shown in Table 1.

The data set of Table 1 is randomly sampled and divided into a training set and a test set at a ratio of 8:2. The numbers of defect and non-defect samples in the training set are counted, and the ratio between them gives the resampling rate k, which is used to calculate the number of samples to be generated by resampling.
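A random 8:2 split followed by class counting can be sketched as below. The toy data, the seed and the function name are illustrative, and k is again taken as non-defect count over defect count, which is an assumption about the direction of the ratio.

```python
import random
from collections import Counter


def split_train_test(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and cut it at the given fraction."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Toy data set: (features, label) pairs, label 1 marks a defective module.
data = [([i], 1 if i % 10 == 0 else 0) for i in range(200)]
train, test = split_train_test(data)

# Count both classes in the training set only, as step 1 requires.
counts = Counter(label for _, label in train)
k = counts[0] / counts[1] if counts[1] else float("inf")
```

The rate is computed after the split so that the test set keeps its natural class distribution and is never touched by resampling.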

TABLE 1 NASA software Defect dataset

Step 2 is executed. The generative adversarial network constructed in this embodiment consists of two parts, a generator and a discriminator, whose specific structure is shown in fig. 3. The noise input dimension noise_dim of the generator is set to 50. The output dimension of the generator and the input dimension out_dim of the discriminator are set to the number of features of each of the 12 sub-data sets used in this embodiment, which differs from data set to data set. In the Adam optimizer used by the generator and the discriminator, the learning rate hyperparameter is set to 0.05, the Betas hyperparameter is set to (0.5, 0.999), and the number of iterations is set to 5000.

Steps 3 and 4 are then executed.

Step 5 is executed. In this embodiment, the number M of weak classifiers for ensemble learning is set to 200. The 200 decision tree weak classifiers are trained on the training set data, the classifier weights are updated during training, and the weak classifiers are finally combined by weighting into a strong software defect classifier, which is used to evaluate the method on the test set.

To evaluate the software defect prediction model, since the final prediction is a binary classification result, this embodiment calculates the binary confusion matrix to obtain four evaluation indexes: accuracy, precision, recall and F-measure. The confusion matrix is shown in Table 2.

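TABLE 2 Two-class confusion matrix

	Predicted positive	Predicted negative
Actual positive	True Positive (TP)	False Negative (FN)
Actual negative	False Positive (FP)	True Negative (TN)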

According to the true class of a sample and the class predicted by the model, the results fall into four cases: True Positive (TP), False Negative (FN), False Positive (FP) and True Negative (TN). The four evaluation indexes are calculated by formulas (9), (10), (11) and (12).

The accuracy is calculated as shown in formula (9).

Accuracy = \frac{TP + TN}{TP + FN + FP + TN} (9)

The precision is calculated as shown in formula (10).

Precision = \frac{TP}{TP + FP} (10)

The recall is calculated as shown in formula (11).

Recall = \frac{TP}{TP + FN} (11)

The F-measure is calculated as shown in formula (12).

F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall} (12)
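The four evaluation indexes follow directly from the confusion matrix counts. The sketch below uses the standard definitions, with the F-measure taken as the harmonic mean of precision and recall; the counts in the example are invented, and returning 0 when a denominator is empty is a convention assumed here.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return (accuracy, precision, recall, f_measure) from the counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return accuracy, precision, recall, f_measure


# Illustrative test-set counts: 50 defective and 50 non-defective modules.
accuracy, precision, recall, f_measure = classification_metrics(
    tp=40, fn=10, fp=5, tn=45
)
```

On imbalanced defect data, accuracy alone can look good for a classifier that rarely predicts defects, which is why precision, recall and F-measure are reported alongside it.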

For the classification results obtained in this example, the performance of the model on the test set is examined with the evaluation indexes given by formulas (9), (10), (11) and (12). Taking the CM1 data set as an example, the line graph of the accuracy of the software defect prediction model before and after resampling is shown in fig. 4. The evaluation of the classification method on the test set data in terms of accuracy, precision, recall and F-measure (averaged over the 10 cross-validation runs) is shown in Table 3.

TABLE 3 Evaluation of the classification method on the test set

Data set	Accuracy	Precision	Recall	F-measure
CM1	78%	87%	86%	0.87
JM1	73%	83%	82%	0.83
KC1	66%	79%	73%	0.76
KC3	81%	85%	92%	0.89
MC1	85%	87%	88%	0.88
MC2	77%	76%	88%	0.81
MW1	82%	90%	89%	0.89
PC1	88%	94%	92%	0.93
PC2	84%	97%	87%	0.87
PC3	85%	91%	91%	0.91
PC4	86%	92%	91%	0.92
PC5	71%	80%	80%	0.80

Claims

1. The software defect prediction method based on a generative adversarial network and ensemble learning is characterized by comprising the following steps:

2. The software defect prediction method based on a generative adversarial network and ensemble learning of claim 1, wherein the loss functions of the generator and the discriminator are as shown in formula (1) and formula (2):

DLoss＝BCELoss(x_true,y_true)+BCELoss(x_fake,y_fake) (1)

GLoss＝BCELoss(x_fake,y_true) (2)

3. The software defect prediction method based on a generative adversarial network and ensemble learning of claim 2, wherein the generator in step 2 comprises an initial random noise input layer, a final generated-data output layer, and block structures each consisting of a linear layer, a batch normalization layer and a LeakyReLU activation function, wherein the first block structure after the noise input layer has no batch normalization layer, the block structure immediately before the data output layer also has no batch normalization layer, the data output layer uses a Sigmoid activation function, the noise input has dimension noise_dim and passes through 3 block structures, and the final output has dimension out_dim;

4. The software defect prediction method based on a generative adversarial network and ensemble learning of claim 1, wherein the step 3 comprises the steps of:

5. The software defect prediction method based on a generative adversarial network and ensemble learning of claim 1, wherein the specific operation of the step 4 is as follows: step 4.1, calculating the number of minority-class software defect samples to be generated according to the resampling rate k from step 1, inputting random noise in the range 0 to 1 into the generator G of the generative adversarial network trained in step 3, and obtaining the generated minority-class defect data from the generator;

6. The software defect prediction method based on a generative adversarial network and ensemble learning of claim 1, wherein the specific operation of the step 5 is as follows:

7. The software defect prediction method based on a generative adversarial network and ensemble learning of claim 6, wherein the strong software defect classifier C in step 5.1 is trained with the AdaBoost learning method, and the training process is shown in formulas (6), (7) and (8):