CN111198820A - Cross-project software defect prediction method based on shared hidden layer self-encoder

Info

Publication number: CN111198820A
Authority: CN (China)
Prior art keywords: class, samples, theta, encoder, hidden layer
Legal status: Granted; active
Application number: CN202010001850.9A
Other languages: Chinese (zh)
Other versions: CN111198820B (en)
Inventors: 荆晓远, 李娟娟, 吴飞, 孙莹
Current assignee: Nanjing University of Posts and Telecommunications
Original assignee: Nanjing University of Posts and Telecommunications
Application filed by Nanjing University of Posts and Telecommunications; priority to CN202010001850.9A (priority date 2020-01-02); publication of CN111198820A; application granted; publication of CN111198820B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3604Software analysis for verifying properties of programs
    • G06F11/3608Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses a cross-project software defect prediction method based on a shared hidden layer self-encoder, which comprises the following steps: first, preprocess the data set and divide it into a training set and a test set; second, extract the depth features of the training set and the test set with a self-encoder equipped with a sharing mechanism; finally, introduce a focus loss function and train a classifier. The invention solves the feature distribution difference problem in cross-project software defect prediction and proposes, for the first time, a self-encoder technique based on a focus-loss shared hidden layer, which makes different data distributions more similar. The focus loss learning technique assigns different weights to samples of different classes to counter class imbalance, and gives different weights to easy-to-classify and hard-to-classify samples so that the classifier learns the hard-to-classify samples better.

Description

Cross-project software defect prediction method based on shared hidden layer self-encoder
Technical Field
The invention belongs to the field of software engineering, and particularly relates to a cross-project software defect prediction method based on a shared hidden layer self-encoder.
Background
Software defect prediction is a research hotspot in the field of software engineering. Its main aim is to discover defects in software early in the development process and so improve the quality of software products. Most previous studies focused on within-project defect prediction: a prediction model is trained on a portion of a project's historical data, and its ability to predict defects is then tested on the remaining data of the same project. However, a newly launched project does not have enough historical data to train such a model, so within-project defect prediction performs poorly. Cross-project defect prediction is therefore a viable approach when there is not enough historical defect data to build an accurate prediction model: a prediction model is trained on the historical data of other projects and used to predict defects in the new project. Its prediction performance is nevertheless still poor, mainly because the data distributions of the source project and the target project differ; the smaller this distribution difference, the better the cross-project prediction. In addition, the data sets themselves are class-imbalanced: the number of non-defective instances is much larger than the number of defective ones. This class imbalance degrades the prediction performance of the model, which identifies non-defective samples more easily and predicts defective samples poorly. The invention is therefore mainly directed at the data distribution difference and class imbalance problems in software defect prediction.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the defects of the prior art, a cross-project software defect prediction method based on a shared hidden layer self-encoder is provided, in which a shared hidden layer self-encoder is introduced to solve the data distribution difference problem.
The invention content is as follows: the invention relates to a cross-project software defect prediction method based on a shared hidden layer self-encoder, which comprises the following steps:
(1) dividing a pre-acquired data set into a training set and a testing set, and performing data preprocessing;
(2) extracting features by adopting a self-encoder with a sharing mechanism, and respectively extracting depth features of a training set and a test set;
(3) and (4) introducing a focus loss function and training a classifier.
Further, the preprocessing of step (1) is realized by the following formula:

$$P_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{1}$$

wherein $P_i$ is the feature value after normalization preprocessing of a given feature x, max(x) and min(x) are the maximum and minimum values of x, and $x_i$ is each value of the feature x.
Further, the self-encoder with the sharing mechanism in step (2) is a self-encoder to which a shared-parameter mechanism is added to obtain a shared hidden layer. The implementation process is as follows: the depth feature representation of the hidden layer is obtained by minimizing the reconstruction error L(θ_all), which comprises two parts, L(θ_tr) and L(θ_te), defined as follows:

$$L(\theta_{tr}) = \frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} \left\| x_i - \hat{x}_i^{tr} \right\|_2^2 + L_{intra} + L_{inter} \tag{5}$$

$$L_{intra} = \sum_{x_i \in X_{tr}^{0}} \left\| \hat{x}_i - \hat{m}_0 \right\|_2^2 + \sum_{x_j \in X_{tr}^{1}} \left\| \hat{x}_j - \hat{m}_1 \right\|_2^2 \tag{6}$$

$$L_{inter} = L_{inter}^{G} + L_{inter}^{L}, \qquad L_{inter}^{G} = -\left\| \hat{m}_0 - \hat{m}_1 \right\|_2^2, \qquad L_{inter}^{L} = -\sum_{x_i \in X_{tr}^{0}} \left\| \hat{x}_i - \hat{m}_1^{k}(x_i) \right\|_2^2 - \sum_{x_j \in X_{tr}^{1}} \left\| \hat{x}_j - \hat{m}_0^{k}(x_j) \right\|_2^2 \tag{7}$$

$$L(\theta_{te}) = \frac{1}{N_{te}} \sum_{i=1}^{N_{te}} \left\| x_i - \hat{x}_i^{te} \right\|_2^2 \tag{8}$$

wherein L(θ_tr) is the Euclidean distance between the input and output of the training data, representing the reconstruction error between them; L(θ_tr) consists of three parts: the reconstruction error loss term, the intra-class loss term $L_{intra}$, and the inter-class loss term $L_{inter}$, in which $L_{inter}^{G}$ is the global inter-class loss term and $L_{inter}^{L}$ is the local inter-class loss term; L(θ_te) is the Euclidean distance between the input and output of the test data, representing the reconstruction error between them; $\hat{m}_1^{k}(x_i)$ means that for each sample $x_i$ of class 0, the mean of its k nearest neighbor samples of class 1 is selected; $\hat{m}_0^{k}(x_j)$ means that for each sample $x_j$ of class 1, the mean of its k nearest neighbor samples of class 0 is selected; $\hat{x}_i^{tr}$ refers to the decoded features of the training data set and $\hat{x}_i^{te}$ to those of the test data set; $X_{tr}^{0}$ are all samples of class 0 in the training data and $X_{tr}^{1}$ all samples of class 1; $\hat{x}_i$ and $\hat{x}_j$ are the decoded training samples of class 0 and class 1, respectively; $\hat{m}_0$ and $\hat{m}_1$ are the sample means of class 0 and class 1 after decoding the training data. Optimizing L(θ_tr) and L(θ_te) simultaneously, the final objective function is expressed as follows:

$$L(\theta_{all}) = L(\theta_{tr}) + r\, L(\theta_{te}) \tag{9}$$

wherein L(θ_all) is the depth feature objective of the hidden layer, θ_all is the set of all network parameters that need to be optimized, and r is a regularization parameter.
Further, the focus loss function in step (3) is implemented by the following formula:

$$L_{FL} = -\frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} \sum_{c=1}^{k} u_c \left(1 - \hat{p}_i^{c}\right) y_i^{c} \log\left(\hat{p}_i^{c}\right), \qquad \hat{p}_i^{c} = g(\hat{y}_i^{c}) \tag{11}$$

wherein $N_{tr}$ represents the number of training samples, c the label class, and k the number of label classes; $y_i^{c}$ is the true label, $\hat{p}_i^{c}$ the predicted label probability, and g(·) an activation function; $u_c$ is the sample class weight: non-defective samples are given a small weight u (0 < u < 1) and defective samples a large weight 1 − u, the two weights summing to 1. The factor $(1 - \hat{p}_i^{c})$ weights samples by classification difficulty: the easier a sample is to classify in defect prediction, the smaller $(1 - \hat{p}_i^{c})$ is; the harder it is to classify, the larger $(1 - \hat{p}_i^{c})$ is.
Has the advantages that: compared with the prior art, the invention has the following beneficial effects. The invention solves the feature distribution difference problem in cross-project software defect prediction and proposes, for the first time, a self-encoder technique based on a focus-loss shared hidden layer, which makes different data distributions more similar. The focus loss learning technique assigns different weights to samples of different classes to counter class imbalance, and gives different weights to easy-to-classify and hard-to-classify samples so that the classifier learns the hard-to-classify samples better; together, these address the data distribution difference and the class imbalance of the data set in software defect prediction. Experimental results on 10 projects of the PROMISE dataset show that the proposed method achieves the desired defect prediction performance.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings: as shown in fig. 1, a cross-project software defect prediction method based on a shared hidden layer self-encoder includes the following steps:
step 1, dividing a training data set and a testing data set, and performing data preprocessing on the data set, wherein the specific method comprises the following steps: first, a PROMISE data set is selected, which has 20 basic metrics, and these 20 basic metrics are not in the same order of magnitude, so we should use the min-max data normalization method to convert all the metrics to the interval of 0 to 1. Given a feature x, its maximum and minimum values are represented as: max (x) and min (x). For each of the features xCharacteristic value xiThe data preprocessing can be expressed as follows:
Figure BDA0002353780240000041
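For illustration, a minimal NumPy sketch of equation (1) applied column-wise; the guard against constant-valued columns is our addition, not part of the patent:

```python
import numpy as np

def min_max_normalize(X):
    """Min-max normalization of each metric (column) to [0, 1], as in Eq. (1)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard constant columns
    return (X - x_min) / span

# Example: three samples with two metrics on different scales.
X = np.array([[10.0, 0.2], [20.0, 0.5], [30.0, 0.8]])
print(min_max_normalize(X))  # every column now lies in [0, 1]
```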
and 2, extracting features by using an improved self-encoder. The method adopts a shared hidden layer self-encoder to extract features, and adds a sharing mechanism in the original self-encoding to solve the problem of data distribution difference in cross-project defect prediction. Suppose that
Figure BDA0002353780240000042
XtrAnd XteRespectively a training data set and a test data set. X is an element of { X ∈ }tr∪XteIs the set of shuffled training and test data. Where N is the number of features, NtrAnd NteThe number of instances in the training set and test set, respectively. Conventional self-encoders attempt to find a common depth signature representation from the input data, thereby making the output as equal as possible to the input. Usually comprising two stages, encoding and decoding, given input data xi∈XtrThen the encoding and decoding stages are represented respectively as follows:
and (3) an encoding stage: y (x)i)=f(w1xi+b1) (2)
And a decoding stage:
Figure BDA0002353780240000043
wherein xiIs an input to the computer system that is,
Figure BDA0002353780240000044
is the output, y (x)i) Is the output of the hidden layer, f (-) is a non-linear activation function, usually a sigmoid function, w1∈Rm×nAnd w2∈Rn×mIs a weight matrix, b1∈RmAnd b2∈RnIf it is a deviation, the network parameters from the encoder can be expressed as: θ ═ w1,b1,w2,b2The updated optimization of the parameters can be achieved by minimizing the reconstruction error function L (θ), which is minimized by Adam optimizer during the training of the self-encoder, and is expressed as follows:
Figure BDA0002353780240000051
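For illustration, a minimal PyTorch sketch of the conventional self-encoder of equations (2)-(4); the dimensions and the random mini-batch are placeholders, not values from the patent:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Conventional single-hidden-layer self-encoder of Eqs. (2)-(4)."""
    def __init__(self, n_features, n_hidden):
        super().__init__()
        self.encoder = nn.Linear(n_features, n_hidden)  # w1, b1
        self.decoder = nn.Linear(n_hidden, n_features)  # w2, b2

    def forward(self, x):
        y = torch.sigmoid(self.encoder(x))      # encoding stage, Eq. (2)
        x_hat = torch.sigmoid(self.decoder(y))  # decoding stage, Eq. (3)
        return y, x_hat

# Minimize the reconstruction error L(theta) of Eq. (4) with Adam.
model = AutoEncoder(n_features=20, n_hidden=15)
optimizer = torch.optim.Adam(model.parameters())
x = torch.rand(64, 20)  # a mini-batch of normalized metric vectors
for _ in range(100):
    _, x_hat = model(x)
    loss = ((x - x_hat) ** 2).sum(dim=1).mean()  # mean squared Euclidean distance
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```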
in order to solve the problem of data distribution difference in cross-project defect prediction, the invention improves the original self-encoder, and adds a shared parameter mechanism to obtain the self-encoder of a shared hidden layer. By minimizing the reconstruction error L (theta)all) To obtain a depth characterization L (theta) of the hidden layerall),L(θall) Comprises 2 parts: l (theta)tr) And L (theta)te)。L(θtr) The reconstruction errors of training data are expressed by calculating Euclidean distances of input and output of the training data, and meanwhile, in order to fully utilize label information in source data, intra-class loss, global inter-class loss and local inter-class loss are added, so that the aim of maximizing the inter-class distance and minimizing the intra-class distance of the data in a source domain is achieved in the characteristic learning process. L (theta)tr) Is composed of three parts including reconstruction error loss term and intra-class loss term
Figure BDA0002353780240000052
And inter-class loss term
Figure BDA0002353780240000053
Wherein the reconstruction error loss term is to reconstruct the input with better output; the intra-class loss term is to keep samples of the same class in the source data sufficiently close to the class center to achieve intra-class minimization; fully considering global inter-class loss terms
Figure BDA0002353780240000054
And local inter-class loss terms
Figure BDA0002353780240000055
Figure BDA0002353780240000056
In order to have the class centers of the two classes sufficiently distant,
Figure BDA0002353780240000057
in order to make each sample with the class of 0(1) as far as the center of the nearest k adjacent samples with the class of 1(0), the distance between the samples is as far as possible, so that the purpose of maximizing the inter-class distance is achieved. L (theta)te) The reconstruction error between the input and output of the test data is represented by calculating the euclidean distance between them. L (theta)tr) And L (theta)te) Is defined as follows:
Figure BDA0002353780240000058
Figure BDA0002353780240000059
Figure BDA00023537802400000510
Figure BDA0002353780240000061
wherein
Figure BDA0002353780240000062
Means that for each sample with class 0, the mean of the k nearest neighbor samples from the sample with class 1 is selected; in the same way
Figure BDA0002353780240000063
Similar to the previous meaning.
Figure BDA0002353780240000064
Refers to the features of the training data set after decoding,
Figure BDA0002353780240000065
is a bit of the test data set after decodingAnd (5) carrying out characterization.
Figure BDA0002353780240000066
Are samples of all classes 0 in the training data,
Figure BDA0002353780240000067
are samples of all classes 1 in the training data.
Figure BDA0002353780240000068
And
Figure BDA0002353780240000069
the decoded training data have a class of 0 and a class of 1, respectively.
Figure BDA00023537802400000610
And
Figure BDA00023537802400000611
the sample mean of class 0 and the sample mean of class 1, respectively, after decoding the training data. Combining the above two formulas, optimizing L (theta) simultaneouslytr) And L (theta)te) The final objective function is expressed as follows:
L(θall)=L(θtr)+rL(θte) (9)
all parameters theta of the network that need to be optimizedallComprises the following steps:
Figure BDA00023537802400000612
r is a regularization parameter that facilitates regularization of the behavior of the self-encoder. The purpose of adding the regularization term is to make the feature distributions of the training data and the test data more and more similar by changing the value of r.
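For illustration, a minimal PyTorch sketch of the objective in equations (5)-(9), reusing the AutoEncoder sketch above; the patent does not state the relative weights of the intra- and inter-class terms or how batches lacking one class are handled, so the unweighted sum and the nearest-neighbor helper below are our assumptions:

```python
import torch

def shared_hidden_layer_loss(model, x_tr, y_tr, x_te, r=10.0, k=5):
    """Objective L(theta_all) = L(theta_tr) + r * L(theta_te), Eq. (9).

    model is an AutoEncoder whose hidden layer is shared by both domains;
    y_tr holds labels (0 = non-defective, 1 = defective), and the training
    batch is assumed to contain samples of both classes.
    """
    _, xhat_tr = model(x_tr)
    _, xhat_te = model(x_te)

    # Reconstruction errors (Euclidean distances), Eqs. (5) and (8).
    l_tr = ((x_tr - xhat_tr) ** 2).sum(dim=1).mean()
    l_te = ((x_te - xhat_te) ** 2).sum(dim=1).mean()

    xhat0, xhat1 = xhat_tr[y_tr == 0], xhat_tr[y_tr == 1]
    m0, m1 = xhat0.mean(dim=0), xhat1.mean(dim=0)

    # Intra-class loss, Eq. (6): decoded samples stay close to their class mean.
    l_intra = ((xhat0 - m0) ** 2).sum() + ((xhat1 - m1) ** 2).sum()

    # Global inter-class loss: push the two class centers apart.
    l_inter_g = -((m0 - m1) ** 2).sum()

    def local_term(a, b):
        # For each row of a, mean of its k nearest rows of the opposite class b.
        d = torch.cdist(a, b)
        idx = d.topk(min(k, b.shape[0]), largest=False).indices
        return -((a - b[idx].mean(dim=1)) ** 2).sum()

    # Local inter-class loss, Eq. (7).
    l_inter_l = local_term(xhat0, xhat1) + local_term(xhat1, xhat0)

    return l_tr + l_intra + l_inter_g + l_inter_l + r * l_te
```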
Step 3, introduce the focus loss technique and train the improved focus-loss classifier. Because of the class imbalance of the data set itself, the number of defective modules is small; for the network to distinguish defective from non-defective modules, it must be enabled to discover and learn the characteristics of defective modules. The invention introduces the focus loss technique: during training, samples of different classes are assigned different weights to balance them, and whether a sample is easy to classify is also considered; easy-to-classify samples are given smaller weights, while samples that are easily misclassified get larger weights, thereby alleviating the class imbalance. Finally, the classifier is trained with the depth feature representation of the training data obtained in step 2. The classifier loss C may use a cross-entropy loss function to compute the similarity between the true labels and the predicted labels, defined as follows:
$$C = -\frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} \sum_{c=1}^{k} y_i^{c} \log\left(g(\hat{y}_i^{c})\right) \tag{10}$$

wherein $N_{tr}$ represents the number of training samples, c the label class, and k the number of label classes (here, the number of classes is 2); $y_i^{c}$ is the true label, $g(\hat{y}_i^{c})$ the predicted label probability, and g(·) the activation function.
Based on this classifier, we add two weights, u and $(1 - \hat{p})$, to propose the focus loss function. u mainly addresses the class imbalance problem: the class with many samples (the non-defective class) is given a small weight u (0 < u < 1), while the class with few samples (the defective class) is given a large weight 1 − u, the two weights summing to 1, so that the sample numbers of the two classes are balanced. $(1 - \hat{p})$ mainly addresses the classification difficulty of samples during defect prediction learning: for the class with many samples, i.e. the non-defective class, the classifier can more easily learn to judge which class a sample belongs to and thus obtains a relatively large probability value $\hat{p}$; hence, the easier a sample is to classify in defect prediction, the smaller its weight $(1 - \hat{p})$. Conversely, the harder a sample is to classify, the larger its weight $(1 - \hat{p})$, so the classifier pays more attention to hard-to-classify samples and learns their characteristics better. The final focus loss function can thus be expressed in the following form:

$$L_{FL} = -\frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} \sum_{c=1}^{k} u_c \left(1 - \hat{p}_i^{c}\right) y_i^{c} \log\left(\hat{p}_i^{c}\right), \qquad \hat{p}_i^{c} = g(\hat{y}_i^{c}) \tag{11}$$

wherein $u_c$ is u for the non-defective class and 1 − u for the defective class.
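For illustration, a minimal PyTorch sketch of the focus loss of equation (11) for the two-class case; the concrete value of u in the example is illustrative, not from the patent:

```python
import torch

def focus_loss(p_hat, y_true, u=0.25):
    """Focus loss of Eq. (11) for k = 2 classes.

    p_hat : (N, 2) predicted class probabilities (e.g. softmax outputs).
    y_true: (N, 2) one-hot labels; column 0 = non-defective, column 1 = defective.
    u     : weight of the abundant non-defective class (0 < u < 1);
            the defective class receives 1 - u.
    """
    class_w = torch.tensor([u, 1.0 - u])            # u_c per class
    difficulty = 1.0 - p_hat                        # small for easy samples
    ce = y_true * torch.log(p_hat.clamp(min=1e-8))  # cross-entropy terms
    return -(class_w * difficulty * ce).sum(dim=1).mean()

# Usage: an easy correct prediction and a harder one.
p_hat = torch.tensor([[0.9, 0.1], [0.3, 0.7]])
y = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
print(focus_loss(p_hat, y))
```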
in order to verify whether the algorithm has good superiority or not, the cross-project software defect prediction algorithm based on the focus loss shared hidden layer self-encoder is compared with other 5 cross-project defect prediction methods, namely TCA +, TDS, Dycom, LT and SHLA (cross-project defect prediction algorithm of the shared hidden layer self-encoder without focus loss). The 10 items of the project were compared and verified as experimental data, respectively, as shown in table 1: where # instance represents the number of instances, # defect represents the number of defective instances, and% defect represents the proportion of defective instances to all instances.
Table 1 The 10 projects of the PROMISE data set used in the experiments
Datasets #instance #defect %defect
ant-1.7 745 166 22.28
camel-1.6 965 188 19.48
jedit-3.2 272 90 33.09
log4j-1.0 135 34 25.19
lucene-2.0 195 91 46.67
poi-1.5 237 141 59.49
redaktor 176 27 15.34
synapse-1.0 157 16 10.19
xalan-2.6 885 411 46.44
xerces-1.3 453 69 15.23
The evaluation indexes of the prediction model are mainly F-measure and Accuracy. They can be expressed in terms of the TP, FN, FP, and TN defined by the confusion matrix in Table 2:

TABLE 2 Confusion matrix

                         Predicted defective    Predicted non-defective
Actual defective         TP                     FN
Actual non-defective     FP                     TN

recall: the proportion of defective samples that the classifier predicts as defective among all defective samples, i.e. recall = TP / (TP + FN). precision: the proportion of correct predictions among the samples the classifier predicts as defective, i.e. precision = TP / (TP + FP); this ratio evaluates how correctly the model predicts defective modules. The F-measure index is the weighted harmonic mean of recall and precision, i.e. F-measure = (2 × recall × precision) / (recall + precision). The Accuracy index evaluates the degree to which both defective and non-defective modules are correctly classified, i.e. Accuracy = (TP + TN) / (TP + TN + FP + FN). Larger values of F-measure and Accuracy indicate better prediction performance of the software defect prediction model.
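These four indexes follow directly from the confusion matrix; a small Python helper (the counts in the usage example are made up):

```python
def evaluate(tp, fn, fp, tn):
    """F-measure and Accuracy from the confusion matrix of Table 2."""
    recall = tp / (tp + fn)              # share of defective samples found
    precision = tp / (tp + fp)           # correctness of defect predictions
    f_measure = 2 * recall * precision / (recall + precision)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return f_measure, accuracy

# Usage with made-up counts:
print(evaluate(tp=30, fn=10, fp=15, tn=100))  # ~ (0.706, 0.839)
```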
The experimental setup here selects 1 of the 10 PROMISE projects as the test data (target project), with the remaining 9 projects serving in turn as source projects (training projects). Thus there are 9 cross-project combinations for each target project, and 90 possible cross-project combinations over the 10 projects. In the training of the self-encoder, the model has 4 hidden layers, with the number of nodes per layer set as 20-15-10-10-2, where 20 is the feature dimension of the input data and 2 is the feature dimension of the data fed into the softmax classifier. During model training, a ReLU activation function is adopted for each layer, the number of layers is set empirically, and the Adam optimizer is used for parameter optimization. Each mini-batch in the experiment was set to 64, and the range of the hyperparameter r was set as r ∈ {0.1, 0.5, 1, 5, 10, 15}; a good effect was obtained when r = 10.
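A hedged sketch of this experimental network: the patent gives the node counts 20-15-10-10-2, ReLU activations, Adam, and mini-batch 64, but not the exact wiring between encoder and classifier, so the stacked reading below is our assumption:

```python
import torch
import torch.nn as nn

# 20 input metrics, hidden layers of 15, 10, 10 nodes with ReLU, and a
# 2-dimensional feature fed to the softmax classifier (assumed wiring).
encoder = nn.Sequential(
    nn.Linear(20, 15), nn.ReLU(),
    nn.Linear(15, 10), nn.ReLU(),
    nn.Linear(10, 10), nn.ReLU(),
    nn.Linear(10, 2),               # features passed to the softmax classifier
)
classifier = nn.Sequential(encoder, nn.Softmax(dim=1))
optimizer = torch.optim.Adam(classifier.parameters())  # mini-batch size 64
```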
To verify whether the algorithm herein performs well among the comparison algorithms, experiments were performed on the 10 PROMISE projects; the F-measure results are shown in Table 3 and the Accuracy results in Table 4:
TABLE 3 Experimental results of our model and the 5 comparison algorithms on F-measure
As can be seen from the experimental results in Table 3: the F-measure values of our model exceed those of the other 5 comparison algorithms, ranging from 0.257 to 0.649, and our model improves the F-measure results by at least 0.019 and by up to 0.418.
TABLE 4 results of our model experiments with 5 comparison algorithms on Accuracy
target TDS TCA+ Dycom LT SHLA Ours
ant-1.7 0.680 0.684 0.674 0.675 0.631 0.721
camel-1.6 0.742 0.618 0.769 0.722 0.731 0.639
jedit-3.2 0.593 0.663 0.710 0.599 0.702 0.722
log4j-1.0 0.715 0.657 0.763 0.726 0.711 0.716
lucene-2.0 0.538 0.621 0.600 0.533 0.621 0.637
poi-1.5 0.559 0.576 0.435 0.527 0.611 0.618
redaktor 0.579 0.556 0.386 0.648 0.361 0.495
synapse-1.0 0.761 0.641 0.796 0.643 0.592 0.613
xalan-2.6 0.417 0.591 0.603 0.531 0.582 0.611
xerces-1.3 0.714 0.627 0.764 0.757 0.810 0.814
average 0.630 0.623 0.650 0.636 0.635 0.659
improved 0.029 0.036 0.009 0.023 0.024 -
As can be seen from the experimental results in Table 4: the Accuracy values of our model improve somewhat over the other 5 comparison algorithms; the Accuracy mean of our model is 0.659, improving the result by at least 0.009 (0.659 − 0.650).
The above experiments show that the TCA+, TDS, Dycom, LT, and SHLA algorithms can achieve better F-measure and Accuracy values on certain projects, but the model provided by the invention has better F-measure and Accuracy averages overall, outperforming the 5 previous algorithms; this indicates the superiority of the proposed method.
In addition to the above embodiments, the present invention may have other embodiments. All technical solutions formed by adopting equivalent substitutions or equivalent transformations fall within the protection scope of the claims of the present invention.

Claims (4)

1. A cross-project software defect prediction method based on a shared hidden layer self-encoder is characterized by comprising the following steps:
(1) dividing a pre-acquired data set into a training set and a testing set, and performing data preprocessing;
(2) extracting features by adopting a self-encoder with a sharing mechanism, and respectively extracting depth features of a training set and a test set;
(3) and (4) introducing a focus loss function and training a classifier.
2. The method for cross-project software defect prediction based on shared hidden layer self-encoder as claimed in claim 1, wherein the preprocessing of step (1) is implemented by the following formula:
$$P_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{1}$$

wherein $P_i$ is the feature value after normalization preprocessing of a given feature x, max(x) and min(x) are the maximum and minimum values of x, and $x_i$ is each value of the feature x.
3. The method according to claim 1, wherein the self-encoder with the sharing mechanism in step (2) is a self-encoder to which a shared-parameter mechanism is added to obtain a shared hidden layer, implemented as follows: the depth feature representation of the hidden layer is obtained by minimizing the reconstruction error L(θ_all), which comprises two parts, L(θ_tr) and L(θ_te), defined as follows:

$$L(\theta_{tr}) = \frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} \left\| x_i - \hat{x}_i^{tr} \right\|_2^2 + L_{intra} + L_{inter} \tag{5}$$

$$L_{intra} = \sum_{x_i \in X_{tr}^{0}} \left\| \hat{x}_i - \hat{m}_0 \right\|_2^2 + \sum_{x_j \in X_{tr}^{1}} \left\| \hat{x}_j - \hat{m}_1 \right\|_2^2 \tag{6}$$

$$L_{inter} = L_{inter}^{G} + L_{inter}^{L}, \qquad L_{inter}^{G} = -\left\| \hat{m}_0 - \hat{m}_1 \right\|_2^2, \qquad L_{inter}^{L} = -\sum_{x_i \in X_{tr}^{0}} \left\| \hat{x}_i - \hat{m}_1^{k}(x_i) \right\|_2^2 - \sum_{x_j \in X_{tr}^{1}} \left\| \hat{x}_j - \hat{m}_0^{k}(x_j) \right\|_2^2 \tag{7}$$

$$L(\theta_{te}) = \frac{1}{N_{te}} \sum_{i=1}^{N_{te}} \left\| x_i - \hat{x}_i^{te} \right\|_2^2 \tag{8}$$

wherein L(θ_tr) is the Euclidean distance between the input and output of the training data, representing the reconstruction error between them, and consists of three parts: the reconstruction error loss term, the intra-class loss term $L_{intra}$, and the inter-class loss term $L_{inter}$, in which $L_{inter}^{G}$ is the global inter-class loss term and $L_{inter}^{L}$ the local inter-class loss term; L(θ_te) is the Euclidean distance between the input and output of the test data, representing the reconstruction error between them; $\hat{m}_1^{k}(x_i)$ means that for each sample $x_i$ of class 0, the mean of its k nearest neighbor samples of class 1 is selected; $\hat{m}_0^{k}(x_j)$ means that for each sample $x_j$ of class 1, the mean of its k nearest neighbor samples of class 0 is selected; $\hat{x}_i^{tr}$ refers to the decoded features of the training data set and $\hat{x}_i^{te}$ to those of the test data set; $X_{tr}^{0}$ are all samples of class 0 in the training data and $X_{tr}^{1}$ all samples of class 1; $\hat{x}_i$ and $\hat{x}_j$ are the decoded training samples of class 0 and class 1, respectively; $\hat{m}_0$ and $\hat{m}_1$ are the sample means of class 0 and class 1 after decoding the training data; optimizing L(θ_tr) and L(θ_te) simultaneously, the final objective function is expressed as follows:

$$L(\theta_{all}) = L(\theta_{tr}) + r\, L(\theta_{te}) \tag{9}$$

wherein L(θ_all) is the depth feature objective of the hidden layer, θ_all is the set of all network parameters that need to be optimized, and r is a regularization parameter.
4. The method for cross-project software defect prediction based on shared hidden layer self-encoder as claimed in claim 1, wherein the focus loss function in step (3) is implemented by the following formula:

$$L_{FL} = -\frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} \sum_{c=1}^{k} u_c \left(1 - \hat{p}_i^{c}\right) y_i^{c} \log\left(\hat{p}_i^{c}\right), \qquad \hat{p}_i^{c} = g(\hat{y}_i^{c}) \tag{11}$$

wherein $N_{tr}$ represents the number of training samples, c the label class, and k the number of label classes; $y_i^{c}$ is the true label, $\hat{p}_i^{c}$ the predicted label probability, and g(·) an activation function; $u_c$ is the sample class weight: non-defective samples are given a small weight u (0 < u < 1) and defective samples a large weight 1 − u, the two weights summing to 1; $(1 - \hat{p}_i^{c})$ weights samples by classification difficulty: the easier a sample is to classify in defect prediction, the smaller $(1 - \hat{p}_i^{c})$; the harder it is to classify, the larger $(1 - \hat{p}_i^{c})$.
CN202010001850.9A 2020-01-02 2020-01-02 Cross-project software defect prediction method based on shared hidden layer self-encoder Active CN111198820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010001850.9A CN111198820B (en) 2020-01-02 2020-01-02 Cross-project software defect prediction method based on shared hidden layer self-encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010001850.9A CN111198820B (en) 2020-01-02 2020-01-02 Cross-project software defect prediction method based on shared hidden layer self-encoder

Publications (2)

Publication Number Publication Date
CN111198820A true CN111198820A (en) 2020-05-26
CN111198820B CN111198820B (en) 2022-08-26

Family

ID=70746714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010001850.9A Active CN111198820B (en) 2020-01-02 2020-01-02 Cross-project software defect prediction method based on shared hidden layer self-encoder

Country Status (1)

Country Link
CN (1) CN111198820B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108446711A (en) * 2018-02-01 2018-08-24 南京邮电大学 A kind of Software Defects Predict Methods based on transfer learning
CN109710512A (en) * 2018-12-06 2019-05-03 南京邮电大学 Neural network software failure prediction method based on geodesic curve stream core

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015659A (en) * 2020-09-02 2020-12-01 三维通信股份有限公司 Prediction method and device based on network model
CN112199280A (en) * 2020-09-30 2021-01-08 三维通信股份有限公司 Defect prediction method and apparatus, storage medium, and electronic apparatus
WO2022068200A1 (en) * 2020-09-30 2022-04-07 三维通信股份有限公司 Defect prediction method and apparatus, storage medium, and electronic device

Also Published As

Publication number Publication date
CN111198820B (en) 2022-08-26


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210003 Gulou District, Jiangsu, Nanjing new model road, No. 66

Applicant after: NANJING University OF POSTS AND TELECOMMUNICATIONS

Address before: Yuen Road Qixia District of Nanjing City, Jiangsu Province, No. 9 210046

Applicant before: NANJING University OF POSTS AND TELECOMMUNICATIONS

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant