CN111198820B - Cross-project software defect prediction method based on shared hidden layer self-encoder


Info

Publication number
CN111198820B
Authority
CN
China
Prior art keywords
class
samples
encoder
theta
training
Prior art date
Legal status
Active
Application number
CN202010001850.9A
Other languages
Chinese (zh)
Other versions
CN111198820A (en)
Inventor
Jing Xiaoyuan
Li Juanjuan
Wu Fei
Sun Ying
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202010001850.9A priority Critical patent/CN111198820B/en
Publication of CN111198820A publication Critical patent/CN111198820A/en
Application granted granted Critical
Publication of CN111198820B publication Critical patent/CN111198820B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3608 Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a cross-project software defect prediction method based on a shared hidden layer self-encoder, which comprises the following steps: first, preprocess the data set and divide it into a training set and a test set; second, extract features with a self-encoder having a sharing mechanism, extracting the depth features of the training set and the test set respectively; finally, introduce a focal loss function and train a classifier. The invention solves the problem of feature distribution difference in cross-project software defect prediction and, for the first time, proposes a shared-hidden-layer self-encoder technique based on focal loss, so that different data distributions become more similar. Focal loss learning assigns different weights to samples of different classes to address class imbalance, and assigns different weights to easy-to-classify and hard-to-classify samples so that the classifier can better learn the hard-to-classify samples.

Description

Cross-project software defect prediction method based on shared hidden layer self-encoder
Technical Field
The invention belongs to the field of software engineering, and particularly relates to a cross-project software defect prediction method based on a shared hidden layer self-encoder.
Background
Software defect prediction is a research hotspot in the field of software engineering. Its main aim is to discover defects in software early in the development process and thereby improve the quality of software products. Most previous studies focused on within-project defect prediction: a prediction model is trained on part of the historical data of a project, and its ability to predict defects is then tested on the remaining data of the same project. However, a newly launched project does not have enough historical data to train the model, so within-project defect prediction performs poorly. Cross-project defect prediction is therefore a viable approach when there is not enough historical defect data to build an accurate prediction model: a prediction model is trained on the historical data of other projects and used to predict defects in the new project. Its prediction performance is nevertheless still poor, the main reason being the data distribution difference between the source project and the target project; the smaller this difference, the better the cross-project prediction effect. In addition, the data sets themselves are class-imbalanced, that is, the number of non-defective samples is much larger than the number of defective samples. This class imbalance degrades the prediction performance of the model: the model identifies non-defective samples more easily and predicts defective samples poorly. The invention is therefore mainly aimed at solving the problems of data distribution difference and class imbalance in software defect prediction.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the defects in the prior art, a cross-project software defect prediction method based on a shared hidden layer self-encoder is provided, and a shared hidden layer self-encoder method is introduced to solve the problem of data distribution difference.
The invention content is as follows: the invention relates to a cross-project software defect prediction method based on a shared hidden layer self-encoder, which comprises the following steps:
(1) dividing a pre-acquired data set into a training set and a testing set, and performing data preprocessing;
(2) extracting features by adopting a self-encoder with a sharing mechanism, and respectively extracting depth features of a training set and a test set;
(3) introducing a focal loss function and training a classifier.
Further, the preprocessing of step (1) is realized by the following formula:

P_i = (x_i − min(x)) / (max(x) − min(x))   (1)

where P_i is the feature value after normalization preprocessing of a given feature x, max(x) and min(x) are the maximum and minimum values of x, and x_i is each value of feature x.
Further, the self-encoder with a sharing mechanism in step (2) is a self-encoder to which a shared-parameter mechanism is added to obtain a shared hidden layer, implemented as follows: the depth features of the hidden layer are obtained by minimizing the reconstruction error L(θ_all), which comprises two parts, L(θ_tr) and L(θ_te), defined as follows:

L(θ_tr) = (1/N_tr) Σ_{i=1}^{N_tr} ||x_i^tr − x̂_i^tr||² + L_w + L_b   (5)

L_w = Σ_{j=1}^{n_0} ||x̂_j^0 − m̂_0||² + Σ_{j=1}^{n_1} ||x̂_j^1 − m̂_1||²   (6)

L_b = L_b^g + L_b^l = −||m̂_0 − m̂_1||² − Σ_{j=1}^{n_0} ||x̂_j^0 − m̂_k^1(x_j^0)||² − Σ_{j=1}^{n_1} ||x̂_j^1 − m̂_k^0(x_j^1)||²   (7)

L(θ_te) = (1/N_te) Σ_{i=1}^{N_te} ||x_i^te − x̂_i^te||²   (8)
where L(θ_tr) uses the Euclidean distance between the input and the output of the training data to represent their reconstruction error; L(θ_tr) consists of three parts, namely the reconstruction error loss term, the intra-class loss term L_w, and the inter-class loss term L_b, in which L_b^g is a global inter-class loss term and L_b^l is a local inter-class loss term; L(θ_te) is the Euclidean distance between the input and output of the test data, representing their reconstruction error; m̂_k^1(x_j^0) means that for each sample of class 0, the mean of its k nearest neighbour samples among the samples of class 1 is selected, and m̂_k^0(x_j^1) means that for each sample of class 1, the mean of its k nearest neighbour samples among the samples of class 0 is selected; x̂^tr refers to the decoded features of the training data set and x̂^te to the decoded features of the test data set; x_j^0 are all class-0 samples and x_j^1 all class-1 samples in the training data, with n_0 and n_1 their respective numbers; x̂_j^0 and x̂_j^1 are the decoded training samples of class 0 and class 1, respectively; and m̂_0 and m̂_1 are the sample means of class 0 and class 1 after decoding the training data. Optimizing L(θ_tr) and L(θ_te) simultaneously, the final objective function is expressed as follows:
L(θ_all) = L(θ_tr) + r·L(θ_te)   (9)
where minimizing L(θ_all) yields the depth features of the hidden layer; the set of all network parameters to be optimized is θ_all = {w_1, b_1, w_2, b_2}, and r is a regularization parameter.
Further, the focal loss function in step (3) is implemented by the following formula:

FL = −(1/N_tr) Σ_{i=1}^{N_tr} Σ_{c=1}^{k} u_c (1 − g(ŷ_i^c))^γ y_i^c log(g(ŷ_i^c))   (11)

where N_tr represents the number of training samples, c the label class, and k the number of label classes; y_i^c is the true label, g(ŷ_i^c) the predicted probability of the label, and g(·) an activation function; u_c is the sample class weight, with a small weight u (0 < u < 1) given to non-defective samples and a large weight 1 − u to defective samples, the two weights summing to 1; and γ ≥ 0 is the focusing exponent: the modulating factor (1 − g(ŷ_i^c))^γ is smaller for samples that are easier to classify in defect prediction and larger for samples that are more difficult to classify.
Beneficial effects: compared with the prior art, the invention solves the problem of feature distribution difference in cross-project software defect prediction and proposes, for the first time, a shared-hidden-layer self-encoder technique based on focal loss, so that different data distributions become more similar. Focal loss learning assigns different weights to samples of different classes to address class imbalance, and assigns different weights to easy-to-classify and hard-to-classify samples so that the classifier learns hard-to-classify samples better, thereby solving both the data distribution difference in software defect prediction and the class imbalance of the data set. Experimental results on 10 projects of the PROMISE data set show that the proposed method achieves the desired defect prediction performance.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings: as shown in fig. 1, a cross-project software defect prediction method based on a shared hidden layer self-encoder includes the following steps:
step 1, dividing a training data set and a testing data set, and performing data preprocessing on the data set, wherein the specific method comprises the following steps: first, a PROMISE data set is selected, which has 20 basic metrics, and these 20 basic metrics are not in the same order of magnitude, so we should use the min-max data normalization method to convert all the metrics to the interval of 0 to 1. Given a feature x, its maximum and minimum values are represented as: max (x) and min (x). For each eigenvalue x of the eigenvalue x i The data preprocessing can be expressed as follows:
Figure BDA0002353780240000041
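As a small illustration (a Python sketch, not part of the patent; the array contents are made up), Equation (1) can be applied column-wise to a metrics matrix as follows:

import numpy as np

def min_max_normalize(X):
    """Scale every metric (column) of X into [0, 1], as in Equation (1).

    X: (n_samples, n_features) array of raw software metrics.
    """
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Guard against constant columns, where max(x) == min(x).
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span

# Example: three modules described by two metrics.
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 300.0]])
P = min_max_normalize(X)   # every column now lies in [0, 1]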
Step 2, extracting features with the improved self-encoder. A shared hidden layer self-encoder is adopted to extract features; a sharing mechanism is added to the original self-encoder to solve the problem of data distribution difference in cross-project defect prediction. Assume X_tr and X_te are the training data set and the test data set, respectively, and X ∈ {X_tr ∪ X_te} is the set of shuffled training and test data, where N is the number of features and N_tr and N_te are the numbers of instances in the training and test sets. A conventional self-encoder attempts to find a common depth feature representation of the input data, making the output as close as possible to the input. It usually comprises two stages, encoding and decoding. Given input data x_i ∈ X_tr, the encoding and decoding stages are expressed respectively as follows:

Encoding stage: y(x_i) = f(w_1 x_i + b_1)   (2)

Decoding stage: x̂_i = f(w_2 y(x_i) + b_2)   (3)
where x_i is the input, x̂_i is the output, y(x_i) is the output of the hidden layer, f(·) is a nonlinear activation function, usually a sigmoid function, w_1 ∈ R^{m×n} and w_2 ∈ R^{n×m} are weight matrices, and b_1 ∈ R^m and b_2 ∈ R^n are biases. The network parameters of the self-encoder can then be expressed as θ = {w_1, b_1, w_2, b_2}. The parameters are updated by minimizing the reconstruction error function L(θ), which is minimized with the Adam optimizer during training of the self-encoder and is expressed as follows:

L(θ) = (1/N_tr) Σ_{i=1}^{N_tr} ||x_i − x̂_i||²   (4)
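For concreteness, below is a minimal PyTorch sketch of such a conventional self-encoder; it is an illustration under assumed layer sizes and training settings, not the patent's exact network:

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """One-hidden-layer self-encoder: y = f(w1 x + b1), x_hat = f(w2 y + b2)."""
    def __init__(self, n_in=20, n_hidden=10):
        super().__init__()
        self.encoder = nn.Linear(n_in, n_hidden)   # w1, b1
        self.decoder = nn.Linear(n_hidden, n_in)   # w2, b2
        self.f = nn.Sigmoid()                      # non-linear activation f(.)

    def forward(self, x):
        y = self.f(self.encoder(x))        # encoding stage, Eq. (2)
        x_hat = self.f(self.decoder(y))    # decoding stage, Eq. (3)
        return y, x_hat

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 20)                     # a mini-batch of normalized metrics

for _ in range(100):
    _, x_hat = model(x)
    loss = ((x - x_hat) ** 2).sum(dim=1).mean()   # reconstruction error L(theta), Eq. (4)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()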
In order to solve the problem of data distribution difference in cross-project defect prediction, the invention improves the original self-encoder by adding a shared-parameter mechanism to obtain a self-encoder with a shared hidden layer. The depth features of the hidden layer are obtained by minimizing the reconstruction error L(θ_all), which comprises 2 parts: L(θ_tr) and L(θ_te). L(θ_tr) expresses the reconstruction error of the training data through the Euclidean distance between its input and output; at the same time, to make full use of the label information in the source data, an intra-class loss, a global inter-class loss and a local inter-class loss are added so that, during feature learning, the inter-class distance of the source-domain data is maximized and the intra-class distance minimized. L(θ_tr) thus consists of three parts: the reconstruction error loss term, the intra-class loss term L_w, and the inter-class loss term L_b. The reconstruction error loss term makes the output reconstruct the input as well as possible; the intra-class loss term keeps samples of the same class in the source data sufficiently close to their class center, achieving intra-class minimization; the inter-class loss term considers both a global term L_b^g, which keeps the class centers of the two classes sufficiently far apart, and a local term L_b^l, which keeps each sample of class 0 (respectively class 1) as far as possible from the center of its k nearest neighbour samples of class 1 (respectively class 0), thereby maximizing the inter-class distance. L(θ_te) represents the reconstruction error of the test data through the Euclidean distance between its input and output. L(θ_tr) and L(θ_te) are defined as follows:

L(θ_tr) = (1/N_tr) Σ_{i=1}^{N_tr} ||x_i^tr − x̂_i^tr||² + L_w + L_b   (5)

L_w = Σ_{j=1}^{n_0} ||x̂_j^0 − m̂_0||² + Σ_{j=1}^{n_1} ||x̂_j^1 − m̂_1||²   (6)

L_b = L_b^g + L_b^l = −||m̂_0 − m̂_1||² − Σ_{j=1}^{n_0} ||x̂_j^0 − m̂_k^1(x_j^0)||² − Σ_{j=1}^{n_1} ||x̂_j^1 − m̂_k^0(x_j^1)||²   (7)

L(θ_te) = (1/N_te) Σ_{i=1}^{N_te} ||x_i^te − x̂_i^te||²   (8)
where m̂_k^1(x_j^0) means that for each sample of class 0, the mean of its k nearest neighbour samples among the samples of class 1 is selected; likewise, m̂_k^0(x_j^1) means that for each sample of class 1, the mean of its k nearest neighbour samples among the samples of class 0 is selected. x̂^tr refers to the decoded features of the training data set, and x̂^te to the decoded features of the test data set. x_j^0 are all class-0 samples and x_j^1 all class-1 samples in the training data, with n_0 and n_1 their respective numbers. x̂_j^0 and x̂_j^1 are the decoded training samples of class 0 and class 1, respectively, and m̂_0 and m̂_1 are the sample means of class 0 and class 1 after decoding the training data. Combining the above formulas and optimizing L(θ_tr) and L(θ_te) simultaneously, the final objective function is expressed as follows:
L(θ_all) = L(θ_tr) + r·L(θ_te)   (9)
All network parameters to be optimized are θ_all = {w_1, b_1, w_2, b_2}. r is a regularization parameter that regularizes the behaviour of the self-encoder; the purpose of adding the regularization term is to make the feature distributions of the training data and the test data more and more similar by adjusting the value of r.
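The following sketch shows one way the combined objective of Equation (9) could be computed, reusing the AutoEncoder sketch above; the helper knn_mean, the scaling of the loss terms, and the default values of r and k are illustrative assumptions, not the patent's exact formulation:

import torch

def knn_mean(anchor, others, k=5):
    """Mean of the k nearest neighbours of each row of `anchor` among `others`."""
    d = torch.cdist(anchor, others)                 # pairwise Euclidean distances
    idx = d.topk(k, largest=False).indices          # k nearest neighbours
    return others[idx].mean(dim=1)

def shared_hidden_layer_loss(model, x_tr, x_te, y_tr, r=10.0, k=5):
    _, xhat_tr = model(x_tr)                        # shared parameters: the same
    _, xhat_te = model(x_te)                        # network encodes both sets
    rec_tr = ((x_tr - xhat_tr) ** 2).sum(1).mean()  # training reconstruction error
    rec_te = ((x_te - xhat_te) ** 2).sum(1).mean()  # L(theta_te), Eq. (8)

    xhat0, xhat1 = xhat_tr[y_tr == 0], xhat_tr[y_tr == 1]
    m0, m1 = xhat0.mean(0), xhat1.mean(0)
    # Intra-class term L_w: keep samples close to their decoded class centre.
    l_w = ((xhat0 - m0) ** 2).sum() + ((xhat1 - m1) ** 2).sum()
    # Global inter-class term: push the two class centres apart.
    l_b_g = -((m0 - m1) ** 2).sum()
    # Local inter-class term: push each sample away from the mean of its
    # k nearest neighbours of the opposite class.
    l_b_l = -(((xhat0 - knn_mean(xhat0, xhat1, k)) ** 2).sum()
              + ((xhat1 - knn_mean(xhat1, xhat0, k)) ** 2).sum())

    loss_tr = rec_tr + l_w + l_b_g + l_b_l          # L(theta_tr), Eq. (5)
    return loss_tr + r * rec_te                     # L(theta_all), Eq. (9)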
Step 3, introducing the focal loss technique and training the improved focal-loss classifier. Because of the class imbalance of the data set itself, the number of defective modules is small; to enable the network to discover and learn the characteristics of defective modules and thus distinguish defective modules from non-defective modules, the invention introduces the focal loss technique. During training, samples are balanced by assigning different weights to the samples of different classes; whether a sample is easy to classify is also considered, with easy-to-classify samples given smaller weights and easily misclassified samples given larger weights, thereby alleviating the class imbalance. Finally, the classifier is trained with the depth feature representation of the training data obtained in step 2. The classifier loss C may use a cross-entropy loss function to compute the similarity between the true label and the predicted label, defined as follows:

C = −(1/N_tr) Σ_{i=1}^{N_tr} Σ_{c=1}^{k} y_i^c log(g(ŷ_i^c))   (10)

where N_tr represents the number of training samples, c represents the label class, and k represents the number of label classes, here 2; y_i^c is the true label, g(ŷ_i^c) is the predicted probability of the label, and g(·) is the activation function.
Based on this classifier, we add a two-part weight, u and the modulating factor (1 − g(ŷ_i^c))^γ, to propose the focal loss function. u mainly solves the class imbalance problem: a small weight u (0 < u < 1) is given to the class with many samples (the non-defective class) and a large weight 1 − u to the class with few samples (the defective class), the two weights summing to 1, so that the sample numbers of the two classes are balanced. The modulating factor mainly addresses samples that are difficult to classify during defect-prediction learning: for the class with many samples, that is, the non-defective class, the classifier more easily learns to judge which class a sample belongs to and therefore obtains a larger probability value g(ŷ_i^c). Thus the easier a sample is to classify in defect prediction, the smaller the value of (1 − g(ŷ_i^c))^γ; conversely, the harder a sample is to classify, the larger the value, so the classifier pays more attention to hard-to-classify samples and can better learn their characteristics. The final focal loss function can thus be expressed in the form:

FL = −(1/N_tr) Σ_{i=1}^{N_tr} Σ_{c=1}^{k} u_c (1 − g(ŷ_i^c))^γ y_i^c log(g(ŷ_i^c))   (11)

where u_c equals u for the non-defective class and 1 − u for the defective class, and γ ≥ 0 is the focusing exponent.
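A minimal sketch of this focal loss for the binary case follows; the function signature and the default values of u and gamma are illustrative assumptions, since the text does not fix them:

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, u=0.25, gamma=2.0):
    """Class-weighted focal loss in the spirit of Eq. (11).

    logits:  (N, 2) raw classifier outputs.
    targets: (N,) integer labels, 0 = non-defective, 1 = defective.
    """
    p = F.softmax(logits, dim=1)                           # g(y_hat): class probabilities
    p_true = p.gather(1, targets.unsqueeze(1)).squeeze(1)  # probability of the true class
    # u_c: weight u for the non-defective class, 1 - u for the defective class.
    class_w = u + (1.0 - 2.0 * u) * targets.float()
    modulator = (1.0 - p_true) ** gamma                    # small for easy samples
    return -(class_w * modulator * torch.log(p_true + 1e-12)).mean()

# Example: two samples, one per class.
logits = torch.tensor([[2.0, -1.0], [0.3, 0.2]])
targets = torch.tensor([0, 1])
loss = focal_loss(logits, targets)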
To verify the superiority of the algorithm, the cross-project software defect prediction algorithm based on the focal-loss shared hidden layer self-encoder is compared with 5 other cross-project defect prediction methods: TCA+, TDS, Dycom, LT, and SHLA (the cross-project defect prediction algorithm of the shared hidden layer self-encoder without focal loss). The 10 PROMISE projects used as experimental data are shown in Table 1, where #instance represents the number of instances, #defect the number of defective instances, and %defect the proportion of defective instances among all instances.
Table 1: the 10 projects of the PROMISE data set used in the experiments
Datasets #instance #defect %defect
ant-1.7 745 166 22.28
camel-1.6 965 188 19.48
jedit-3.2 272 90 33.09
log4j-1.0 135 34 25.19
lucene-2.0 195 91 46.67
poi-1.5 237 141 59.49
redaktor 176 27 15.34
synapse-1.0 157 16 10.19
xalan-2.6 885 411 46.44
xerces-1.3 453 69 15.23
The evaluation indices of the prediction model are mainly F-measure and Accuracy. They can be expressed in terms of the TP, FN, FP, and TN defined in the confusion matrix of Table 2:

TABLE 2 confusion matrix

                        Predicted defective   Predicted non-defective
Actually defective              TP                     FN
Actually non-defective          FP                     TN

Recall: the proportion of defective samples predicted as defective among all defective samples, i.e., recall = TP/(TP + FN). Precision: the proportion of correct predictions among the samples predicted as defective, i.e., precision = TP/(TP + FP); this ratio evaluates the correctness of the model's predictions of defective modules. The F-measure index is the harmonic mean of recall and precision, i.e., F-measure = (2 × recall × precision)/(recall + precision). The Accuracy index evaluates the degree to which both defective and non-defective modules are correctly classified, i.e., Accuracy = (TP + TN)/(TP + TN + FP + FN). The larger the values of F-measure and Accuracy, the better the prediction performance of the software defect prediction model.
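These definitions translate directly into code; a small sketch with made-up counts:

def defect_metrics(tp, fn, fp, tn):
    recall = tp / (tp + fn)                       # defects found / all defects
    precision = tp / (tp + fp)                    # defects found / all flagged
    f_measure = 2 * recall * precision / (recall + precision)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return recall, precision, f_measure, accuracy

# Example: 30 defects found, 10 missed, 20 false alarms, 140 correct rejections.
r, p, f, a = defect_metrics(tp=30, fn=10, fp=20, tn=140)
print(f"recall={r:.3f} precision={p:.3f} F-measure={f:.3f} Accuracy={a:.3f}")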
The experimental setup selects 1 of the 10 PROMISE projects as the test data (target project), with the remaining 9 projects serving in turn as the source project (training project). There are therefore 9 cross-project combinations for each target project, and 90 possible cross-project combinations over the 10 projects. In training the self-encoder, the model has 4 hidden layers and the number of nodes per layer is set as 20-15-10-10-2, where 20 is the feature dimension of the input data and 2 is the feature dimension fed into the softmax classifier. During training, a ReLU activation function is adopted for each layer, the number of layers is set empirically, and the Adam optimizer is used for parameter optimization. Each mini-batch in the experiment is set to 64, and the range of the hyperparameter r is r ∈ {0.1, 0.5, 1, 5, 10, 15}; good results are obtained when r = 10.
To verify whether the proposed algorithm performs well against the comparison algorithms, experiments were performed on the 10 PROMISE projects; the F-measure results are shown in Table 3 and the Accuracy results in Table 4:
TABLE 3 experimental results of our model and the 5 comparison algorithms on F-measure

(Table 3 appears only as an image in the original publication; the per-project F-measure values are not recoverable from the text.)
As can be seen from the experimental results in Table 3, the F-measure values of our model exceed those of the other 5 comparison algorithms; the F-measure values range from 0.257 to 0.649, and our model improves the F-measure results by at least 0.019 and at most 0.418.
TABLE 4 experimental results of our model and the 5 comparison algorithms on Accuracy
target TDS TCA+ Dycom LT SHLA Ours
ant-1.7 0.680 0.684 0.674 0.675 0.631 0.721
camel-1.6 0.742 0.618 0.769 0.722 0.731 0.639
jedit-3.2 0.593 0.663 0.710 0.599 0.702 0.722
log4j-1.0 0.715 0.657 0.763 0.726 0.711 0.716
lucene-2.0 0.538 0.621 0.600 0.533 0.621 0.637
poi-1.5 0.559 0.576 0.435 0.527 0.611 0.618
redaktor 0.579 0.556 0.386 0.648 0.361 0.495
synapse-1.0 0.761 0.641 0.796 0.643 0.592 0.613
xalan-2.6 0.417 0.591 0.603 0.531 0.582 0.611
xerces-1.3 0.714 0.627 0.764 0.757 0.810 0.814
average 0.630 0.623 0.650 0.636 0.635 0.659
improved 0.029 0.036 0.009 0.023 0.024 -
As can be seen from the experimental results in Table 4, the Accuracy values of our model improve on the other 5 comparison algorithms: the Accuracy mean of our model is 0.659, an improvement of at least 0.009 (0.659 − 0.650) over the best comparison method.
The above experiments show that although the TCA+, TDS, Dycom, LT, and SHLA algorithms achieve better F-measure and Accuracy values on certain individual projects, the model proposed by the invention has better average F-measure and average Accuracy overall, outperforming the five comparison algorithms and demonstrating the superiority of the proposed method.
In addition to the above embodiments, the present invention may have other embodiments. All technical solutions formed by adopting equivalent substitutions or equivalent transformations fall within the protection scope of the claims of the present invention.

Claims (2)

1. A cross-project software defect prediction method based on a shared hidden layer self-encoder is characterized by comprising the following steps:
(1) dividing a pre-acquired data set into a training set and a testing set, and performing data preprocessing;
(2) extracting features by adopting a self-encoder with a sharing mechanism, and respectively extracting depth features of a training set and a test set;
(3) introducing a focus loss function, and training a classifier;
the self-encoder with a sharing mechanism in step (2) is a self-encoder to which a shared-parameter mechanism is added to obtain a shared hidden layer, implemented as follows: the depth features of the hidden layer are obtained by minimizing the reconstruction error L(θ_all), which comprises two parts, L(θ_tr) and L(θ_te), defined as follows:

L(θ_tr) = (1/N_tr) Σ_{i=1}^{N_tr} ||x_i^tr − x̂_i^tr||² + L_w + L_b   (5)

L_w = Σ_{j=1}^{n_0} ||x̂_j^0 − m̂_0||² + Σ_{j=1}^{n_1} ||x̂_j^1 − m̂_1||²   (6)

L_b = L_b^g + L_b^l = −||m̂_0 − m̂_1||² − Σ_{j=1}^{n_0} ||x̂_j^0 − m̂_k^1(x_j^0)||² − Σ_{j=1}^{n_1} ||x̂_j^1 − m̂_k^0(x_j^1)||²   (7)

L(θ_te) = (1/N_te) Σ_{i=1}^{N_te} ||x_i^te − x̂_i^te||²   (8)
where L(θ_tr) uses the Euclidean distance between the input and the output of the training data to represent their reconstruction error; L(θ_tr) consists of three parts, namely the reconstruction error loss term, the intra-class loss term L_w, and the inter-class loss term L_b, in which L_b^g is a global inter-class loss term and L_b^l is a local inter-class loss term; L(θ_te) is the Euclidean distance between the input and output of the test data, representing their reconstruction error; m̂_k^1(x_j^0) means that for each sample of class 0, the mean of its k nearest neighbour samples among the samples of class 1 is selected, and m̂_k^0(x_j^1) means that for each sample of class 1, the mean of its k nearest neighbour samples among the samples of class 0 is selected; x̂^tr refers to the decoded features of the training data set and x̂^te to the decoded features of the test data set; x_j^0 are all class-0 samples and x_j^1 all class-1 samples in the training data, with n_0 and n_1 their respective numbers; x̂_j^0 and x̂_j^1 are the decoded training samples of class 0 and class 1, respectively; and m̂_0 and m̂_1 are the sample means of class 0 and class 1 after decoding the training data; optimizing L(θ_tr) and L(θ_te) simultaneously, the final objective function is expressed as follows:
L(θ_all) = L(θ_tr) + r·L(θ_te)   (9)
where minimizing L(θ_all) yields the depth features of the hidden layer, the set of all network parameters to be optimized is θ_all = {w_1, b_1, w_2, b_2}, and r is a regularization parameter;
the focus loss function in step (3) is realized by the following formula:

FL = −(1/N_tr) Σ_{i=1}^{N_tr} Σ_{c=1}^{k} u_c (1 − g(ŷ_i^c))^γ y_i^c log(g(ŷ_i^c))   (11)

where N_tr represents the number of training samples, c the label class, and k the number of label classes; y_i^c is the true label, g(ŷ_i^c) the predicted probability of the label, and g(·) an activation function; u_c is the sample class weight, with a small weight u (0 < u < 1) given to non-defective samples and a large weight 1 − u to defective samples, the two weights summing to 1; and γ ≥ 0 is the focusing exponent: the modulating factor (1 − g(ŷ_i^c))^γ is smaller for samples that are easier to classify in defect prediction and larger for samples that are more difficult to classify.
2. The method for cross-project software defect prediction based on shared hidden layer self-encoder as claimed in claim 1, wherein the preprocessing of step (1) is implemented by the following formula:

P_i = (x_i − min(x)) / (max(x) − min(x))   (1)

where P_i is the feature value after normalization preprocessing of a given feature x, max(x) and min(x) are the maximum and minimum values of x, and x_i is each value of feature x.
CN202010001850.9A 2020-01-02 2020-01-02 Cross-project software defect prediction method based on shared hidden layer self-encoder Active CN111198820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010001850.9A CN111198820B (en) 2020-01-02 2020-01-02 Cross-project software defect prediction method based on shared hidden layer self-encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010001850.9A CN111198820B (en) 2020-01-02 2020-01-02 Cross-project software defect prediction method based on shared hidden layer self-encoder

Publications (2)

Publication Number Publication Date
CN111198820A CN111198820A (en) 2020-05-26
CN111198820B true CN111198820B (en) 2022-08-26

Family

ID=70746714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010001850.9A Active CN111198820B (en) 2020-01-02 2020-01-02 Cross-project software defect prediction method based on shared hidden layer self-encoder

Country Status (1)

Country Link
CN (1) CN111198820B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015659A (en) * 2020-09-02 2020-12-01 三维通信股份有限公司 Prediction method and device based on network model
CN112199280B (en) * 2020-09-30 2022-05-20 三维通信股份有限公司 Method and apparatus for predicting software defects, storage medium, and electronic apparatus
CN117421244B (en) * 2023-11-17 2024-05-24 北京邮电大学 Multi-source cross-project software defect prediction method, device and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108446711A (en) * 2018-02-01 2018-08-24 南京邮电大学 A kind of Software Defects Predict Methods based on transfer learning
CN109710512A (en) * 2018-12-06 2019-05-03 南京邮电大学 Neural network software failure prediction method based on geodesic curve stream core

Also Published As

Publication number Publication date
CN111198820A (en) 2020-05-26

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210003 Gulou District, Jiangsu, Nanjing, new model road, No. 66
Applicant after: NANJING UNIVERSITY OF POSTS AND TELECOMMUNICATIONS
Address before: No. 9, Wenyuan Road, Qixia District, Nanjing City, Jiangsu Province, 210046
Applicant before: NANJING UNIVERSITY OF POSTS AND TELECOMMUNICATIONS

GR01 Patent grant