CN110751186B - Cross-project software defect prediction method based on supervised expression learning - Google Patents
Cross-project software defect prediction method based on supervised expression learning
- Publication number
- CN110751186B CN201910915935.5A
- Authority
- CN
- China
- Prior art keywords
- training
- project
- encoder
- migration
- samples
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3447—Performance evaluation by modeling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
Abstract
The invention discloses a cross-project software defect prediction method based on supervised representation learning, which comprises the following steps: (1) selecting a defect data set and preprocessing the defect data; (2) training a migration self-encoder in an unsupervised pre-training mode, the migration self-encoder comprising a feature coding layer and a label coding layer; (3) using a migration cross-validation method, selecting from the hidden-layer feature representations of the source-project samples those closest in distribution to the hidden-layer features of the target-project samples as a validation set, with the remaining samples as a training set; (4) performing oversampling on the training-set samples; (5) fine-tuning the migration self-encoder, selecting the model hyper-parameters and an early-stopping strategy; (6) inputting the preprocessed data of the target project into the migration self-encoder and obtaining the final prediction result from the output of the label coding layer. The method introduces the label information of the source-project samples into the feature representation learning process and improves the prediction performance of the cross-project software defect prediction model.
Description
Technical Field
The invention belongs to the technical field of software defect prediction of software engineering application, and particularly relates to a cross-project software defect prediction method based on supervised expression learning.
Background
Software defect prediction techniques predict defects that may exist in the current software project by learning and building a prediction model from historical defect data. They help testers find defects quickly and greatly improve software testing efficiency, and have therefore become a research hotspot in the field of software engineering.
The general approach to software defect prediction is to extract various features from software code (such as the Halstead, McCabe, and CK metrics, the MOOD metrics, code-change metrics, and other object-oriented metrics), represent every code segment as a feature vector, label each segment according to whether it actually contains a defect, feed the feature vectors and labels to a machine learning model for training, and finally obtain a software defect prediction model that predicts possible defects in new software code.
Most past software defect prediction methods build the prediction model with traditional machine learning. To perform well, a traditional machine learning method requires that the training and test samples follow the same or similar data distributions, that positive and negative samples be relatively balanced, and that labeled training samples be sufficient. In practice, however, manual labeling is so costly that labeled samples available for training are very scarce; moreover, because software defects occur with very low probability, most labeled samples are non-defective and defective samples account for only a tiny fraction. Scarce labeled data and class imbalance have therefore become the two biggest challenges for software defect prediction technology.
For class imbalance, most current work relies on data resampling, such as random oversampling or synthesizing artificial minority-class samples. For the scarcity of training data, one current solution is to train the prediction model with defect data from other projects, namely the cross-project defect prediction technique. Because labeled samples are scarce, the labeled data collected from a single project is not enough to train a machine learning model; the basic idea of cross-project defect prediction is to train the model on defect data from other projects (also called the source project or source domain) and then apply the trained model to the software project to be predicted (also called the target project or target domain), thereby alleviating the scarcity of training data to some extent.
However, one difficulty of cross-project software defect prediction is that the training data and the test data often do not follow the same or similar distributions, contrary to the assumption of traditional machine learning models, so those models cannot be used directly for cross-project defect prediction. In recent years, transfer learning methods have gradually been applied to the cross-project defect prediction task. One of the most widely used is Transfer Component Analysis (TCA), an unsupervised representation learning method that cannot exploit the label information of the source-domain samples while learning the representation. In addition, such methods split unsupervised feature learning and classifier training into separate stages in a divide-and-conquer manner: first learn hidden-layer representations of the source- and target-project samples, then retrain a machine learning classifier in the new feature space. The divide-and-conquer strategy has an inherent problem: solving each sub-problem optimally does not guarantee an optimal solution to the global problem. Features learned in the early stage may be ill-suited to training the classifier in the later stage, degrading the actual predictive ability of the final defect prediction model.
Disclosure of Invention
One object of the present invention is to remedy the above shortcomings by providing a cross-project software defect prediction method based on supervised representation learning. The method uses a migration self-encoder (a transfer autoencoder) with a dual-coding-layer structure, which can exploit the label information of the source-domain samples while learning the hidden-layer feature representation, and thus constitutes supervised representation learning. In addition, by adjusting the loss function of the network, unsupervised pre-training and supervised fine-tuning of the network are realized respectively; after unsupervised pre-training yields a preliminary hidden-layer feature representation, a migration cross-validation method provides a reasonable division of the training and validation sets under the transfer learning setting, and the model hyper-parameters are selected according to the prediction performance on the validation set.
Another object of the invention is to provide a deep learning model named the migration self-encoder. The model offers an end-to-end learning mode: the learning process is not manually split into sub-problems, and the deep model directly learns the mapping from raw input to desired output. Compared with a divide-and-conquer strategy, end-to-end learning has the advantage of joint optimization and is more likely to reach the global optimum. Experiments show that this supervised representation learning method improves the effectiveness of cross-project software defect prediction.
The technical scheme of the invention is as follows: a cross-project software defect prediction method based on supervised expression learning comprises the following steps:
step 1), defining a target item to be predicted and a source item used for training a model, and carrying out preprocessing operations such as standardization or normalization on the original data of the source item and the target item;
step 2), inputting the feature vectors of all samples in the source project and the target project into a migration self-encoder, preliminarily training the migration self-encoder in an unsupervised pre-training mode, and obtaining preliminary hidden layer feature representations of all samples in the source project and the target project through a feature coding layer of the migration self-encoder;
The migration self-encoder is a novel autoencoder with a dual-coding-layer structure. The two coding layers are a feature coding layer and a label coding layer. The first coding layer, the feature coding layer, encodes the feature vectors of all samples in the source and target projects into a hidden-layer feature representation; the label coding layer classifies the samples on the basis of that representation. During training, the supervised learning of the source-project samples is realized by minimizing the label loss term of the source-project samples. Meanwhile, the model weights are shared between the source and target projects, so samples of the target project can be fed directly into the trained model, and the final prediction result is obtained from the output of the label coding layer, achieving the goal of transfer learning.
Step 3), using the preliminary hidden-layer feature representations obtained in step 2), selecting by means of the migration cross-validation method the portion of source-project samples (for example, 1/3) whose hidden-layer feature distribution is closest to that of the target-project samples as the validation set, with the remaining source-project samples as the training set;
Step 4), considering that the defective and non-defective classes in the training set are severely imbalanced, performing oversampling (such as random oversampling or synthetic-minority oversampling) on the training-set samples;
Step 5), further fine-tuning the migration self-encoder on the oversampled training set obtained in step 4), and completing the training of the model by selecting the model hyper-parameters and an early-stopping strategy according to the prediction performance on the validation set;
Step 6), after training of the migration self-encoder is complete, inputting the preprocessed data of the target project into it and obtaining the final prediction result from the label coding layer of the network.
The migration self-encoder in steps 2) and 5) adopts different forms of the loss function. Step 2) is unsupervised: no label information is introduced during training, and the loss function consists of a reconstruction error term and a hidden-layer feature distribution difference term. By minimizing this loss, the network learns hidden-layer feature representations of all samples that reconstruct the input well and bring the hidden-layer feature distribution of the source-project samples close to that of the target-project samples. Step 5) is supervised: label information of the source-project samples is introduced, and the loss function consists of four terms: the reconstruction error term, the hidden-layer feature distribution difference term, the label loss term of the source-project samples, and a regularization term. The pre-training and fine-tuning stages thus realize the unsupervised and supervised training modes respectively by excluding or including the label loss term in the loss function.
Compared with existing software defect prediction methods, the cross-project software defect prediction method based on supervised representation learning has the following advantages: it breaks the assumption of traditional machine learning that training and test sets must share the same or similar distributions, and can transfer information from related projects to improve learning on the current project's data. Moreover, unlike current cross-project defect prediction methods based on unsupervised representation learning, the migration self-encoder adopted by the invention fully exploits the label information of the source-domain samples while learning the hidden-layer feature representation, and realizes feature learning and model construction in a more thorough end-to-end manner, further improving cross-project software defect prediction performance.
Drawings
FIG. 1 supervised representation learning method based on migratory autocoder
FIG. 2 cross-project software defect prediction method based on supervised expression learning
Detailed Description
The invention will be further described with reference to the accompanying drawings. First, a migration auto-encoder used in the present invention will be described in detail with reference to fig. 1.
The described migration self-encoder is a novel autoencoder with a dual-coding-layer structure. The two coding layers are a feature coding layer and a label coding layer. The first coding layer, the feature coding layer, encodes the feature vectors of all samples in the source and target projects into a hidden-layer feature representation; the label coding layer classifies the samples on the basis of that representation. During training, the supervised learning of the source-project samples is realized by minimizing the label loss term of the source-project samples. Meanwhile, the model weights are shared between the source and target projects, so samples of the target project can be fed directly into the trained model, and the final prediction result is obtained from the output of the label coding layer, achieving the goal of transfer learning.
The specific structure of the migration self-encoder is as follows:
Given a labeled source-domain data set D_s = {(x_i^(s), y_i^(s))}, i = 1, ..., n_s, with x_i^(s) ∈ R^{m×1} and y_i^(s) ∈ {0, 1}, and a target-domain data set to be predicted D_t = {x_j^(t)}, j = 1, ..., n_t, m represents the number of features of an input sample, 0 indicates the non-defect class and 1 indicates the defect class, and n_s and n_t represent the numbers of samples of the source and target domains respectively. The loss function for the migration self-encoder is as follows:
L = L_rec(x, x̂) + α·Γ(ξ^(s), ξ^(t)) + β·L_label + γ·L_reg
where L_rec is the reconstruction error term, Γ(ξ^(s), ξ^(t)) is the hidden-layer feature distribution difference term, L_label is the label loss term of the source-domain samples, and L_reg is the regularization term.
The 1st hidden layer of the model is the feature coding layer, which has k (k ≤ m) nodes; its output is ξ = f(W_1 x + b_1). The weight parameter of this layer is W_1 ∈ R^{k×m} and its bias parameter is b_1 ∈ R^{k×1}. The 2nd layer of the network is the label coding layer, which has 2 nodes; its output is z ∈ R^{2×1}, with weight parameter W_2 ∈ R^{2×k} and bias parameter b_2 ∈ R^{2×1}. For a test sample x, the probability that it belongs to a certain class can be estimated through a softmax over z:
p(y = c | x) = exp(z_c) / (exp(z_0) + exp(z_1)), c ∈ {0, 1}
Thus, after model training is complete, the output of the label coding layer can be used to predict the target-domain samples. The output of the 3rd hidden layer, ξ̂ = f(W'_2 z + b'_2), is the reconstruction of the feature coding layer; its weight parameter is W'_2 ∈ R^{k×2} and its bias parameter is b'_2 ∈ R^{k×1}. The output of the last layer is the reconstruction of the input sample, x̂ = f(W'_1 ξ̂ + b'_1), with weight parameter W'_1 ∈ R^{m×k} and bias parameter b'_1 ∈ R^{m×1}. Further, f is the nonlinear sigmoid activation function.
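The forward pass of the dual-coding-layer structure described above can be sketched in NumPy as follows. This is a minimal illustration with randomly initialized weights; the class and function names are our own and are not taken from the patent.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

class MigrationAutoencoderSketch:
    """Forward pass of the dual-coding-layer chain: x -> xi -> z -> xi_hat -> x_hat."""
    def __init__(self, m, k, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.W1 = rng.normal(0, 0.1, (k, m));  self.b1 = np.zeros((k, 1))   # feature coding layer
        self.W2 = rng.normal(0, 0.1, (2, k));  self.b2 = np.zeros((2, 1))   # label coding layer
        self.W2p = rng.normal(0, 0.1, (k, 2)); self.b2p = np.zeros((k, 1))  # reconstructs xi
        self.W1p = rng.normal(0, 0.1, (m, k)); self.b1p = np.zeros((m, 1))  # reconstructs x

    def forward(self, x):
        xi = sigmoid(self.W1 @ x + self.b1)            # hidden-layer feature representation
        z = softmax(self.W2 @ xi + self.b2)            # class-probability output (2 nodes)
        xi_hat = sigmoid(self.W2p @ z + self.b2p)      # reconstructed hidden features
        x_hat = sigmoid(self.W1p @ xi_hat + self.b1p)  # reconstructed input sample
        return xi, z, xi_hat, x_hat
```

Because the same weights serve both source- and target-project samples, a trained instance can score target samples directly through the `z` output.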
The 2nd term of the loss function is the distribution difference term, defined here via the KL divergence:
Γ(ξ^(s), ξ^(t)) = D_KL(P_s || P_t) + D_KL(P_t || P_s) (6)
where P_s ∈ R^{k×1} and P_t ∈ R^{k×1} denote the hidden-layer feature distributions of the source-domain and target-domain samples. The KL divergence is an asymmetric measure of the difference between two probability distributions. Given two different probability distributions P ∈ R^{k×1} and Q ∈ R^{k×1}, the information loss when P is approximated by Q is defined as D_KL(P || Q) = Σ_i P_i log(P_i / Q_i). Here the symmetrized form D_KL(P || Q) + D_KL(Q || P) is used to measure the distribution difference between the source and target domains. By shrinking the value of this term, the distribution difference between the source and target domains in the new representation space can be minimized. The label loss term is defined as the cross-entropy between the output of the label coding layer and the true labels of the source-domain samples:
L_label = -(1/n_s) Σ_{i=1}^{n_s} [ y_i^(s) log p(y = 1 | x_i^(s)) + (1 - y_i^(s)) log p(y = 0 | x_i^(s)) ]
there are 3 coefficients to be selected for the entire loss function: α, β and γ. Together with the number n of hidden layer neuron nodes of the encoder, these belong to the hyper-parameters of the model. The value range of n is not set to be [10,50], and the value interval is 5; the value range of alpha is not set to [10,20,50,100,200], the value range of beta is set to [50,100,200,500,1000], and the value range of gamma is [0.0001,0.001,0.01,0.1 ]. In order to improve the efficiency of searching the hyper-parameters, a random searching mode is adopted, and the maximum searching frequency is 200.
The selection of the hyper-parameters is determined by cross-validation. The following describes the specific process by which migration cross-validation partitions the training set and the validation set. The feature transformation adopted by the invention is obtained from the self-encoder network, so the Nonlinear Distribution Diversity (NDD) is defined on the hidden-layer features:
NDD = || (1/n_s) Σ_{i=1}^{n_s} α_i ξ(x_i^(s)) − (1/n_t) Σ_{j=1}^{n_t} ξ(x_j^(t)) ||²
Here the weights {α_i : x_i ∈ X_s} of the source-domain samples are adjusted to minimize the NDD distance:
min_α NDD subject to 0 ≤ α_i ≤ B
where B is an upper bound (set to 1) to avoid α diverging to infinity; the optimal α is the value that minimizes the NDD. Finally, the {α_i} are sorted from large to small, and in that order 1/3 of the source-domain samples are selected as the validation set, with the remaining 2/3 used as the training set. After the data set is divided, random oversampling is performed on the training-set samples to mitigate the effect of class imbalance.
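The division into validation and training sets can be approximated as below. This is a simplified, hypothetical stand-in that ranks source samples by the Euclidean distance of their hidden features to the target-domain centroid, rather than solving the weighted NDD optimization described above.

```python
import numpy as np

def trcv_split(h_src, h_tgt, val_frac=1/3):
    """Split source samples: the val_frac of samples closest (in hidden-feature
    space) to the target-domain centroid become the validation set; the rest
    form the training set. Simplified stand-in for the weighted-NDD scheme."""
    centroid = h_tgt.mean(axis=0)
    dist = np.linalg.norm(h_src - centroid, axis=1)
    order = np.argsort(dist)                 # closest to target first
    n_val = int(round(len(h_src) * val_frac))
    return order[n_val:], order[:n_val]      # (train indices, validation indices)
```

The intent matches the text: the validation set should resemble the target-project distribution so that validation performance is a meaningful proxy for cross-project performance.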
A cross-project software defect prediction method based on supervised expression learning is shown in fig. 2. The technical scheme of the invention is explained in detail below with reference to fig. 2, and the specific implementation steps are as follows:
1. Define the target project and the source projects, and preprocess them. The invention is a cross-project defect prediction method: the project currently to be predicted is the target project, and the other projects used for training are the source projects. To unify dimensions, min-max normalization preprocessing is applied to the source project and the target project respectively, so that every dimension of their input samples lies within [0, 1]. min(x_·j) and max(x_·j) denote the minimum and maximum values of the j-th feature dimension:
x'_ij = (x_ij − min(x_·j)) / (max(x_·j) − min(x_·j))
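The column-wise min-max normalization of step 1 can be sketched directly; the guard for constant columns (to avoid division by zero) is our own addition.

```python
import numpy as np

def min_max_normalize(X):
    """Column-wise min-max scaling to [0, 1], applied separately to the
    source-project and target-project feature matrices."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard constant columns
    return (X - lo) / span
```

Normalizing source and target projects separately keeps each project's features on a comparable [0, 1] scale even when their raw metric ranges differ.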
2. Unsupervised pre-training of the network to obtain the preliminary hidden-layer feature representations of the samples. All raw data of the source and target projects are input to the migration self-encoder, which is preliminarily trained in an unsupervised pre-training mode; the loss function at this stage has no regularization term and no label loss term, only the reconstruction error term and the distribution difference term. The initial learning rate during pre-training is fixed at 0.01 and the number of iterations at 500.
3. The training set and validation set are partitioned by a migration cross validation method. And dividing the data set by a migration cross-validation method on the basis of the preliminary hidden layer feature representation of all the samples of the source item and the target item obtained in the step 2, wherein 1/3 source item samples closest to the hidden layer feature distribution of the target domain serve as a validation set, and the rest 2/3 samples serve as a training set. The details of migration cross-validation are as described above.
4. Oversample the training-set samples. Considering that defective and non-defective samples in the training set are severely imbalanced, oversampling is applied to the training-set samples. The invention mainly adopts random oversampling: minority-class samples are randomly selected and simply duplicated until the total number of minority-class samples equals or approaches that of the majority class. This oversampling alleviates the class imbalance problem.
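Random oversampling as described in step 4 can be sketched as follows (the function name is illustrative).

```python
import numpy as np

def random_oversample(X, y, rng=None):
    """Randomly duplicate minority-class samples until both classes are equal in size."""
    if rng is None:
        rng = np.random.default_rng(0)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    need = counts.max() - counts.min()
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=need, replace=True)   # sample duplicates with replacement
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]
```

Because the duplicates are exact copies, this changes only the class proportions seen by the loss, not the feature space itself (unlike synthetic-minority methods such as SMOTE).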
5. Supervised fine-tuning of the self-encoder network. The migration self-encoder is further fine-tuned on the oversampled training set, and the training of the model is completed by selecting the model hyper-parameters and an early-stopping strategy with the help of the validation set. The learning rate during supervised fine-tuning is 0.001 and the maximum number of training iterations is 5000. Every fixed number of iterations during training, the classification performance of the current model (chiefly the Bal value) is checked to determine whether training should be stopped early. The Bal value is a composite index that balances the detection rate and the false-alarm rate in a classification problem. Take the binary confusion matrix of Table 1 as an example:
TABLE 1
| | Predicted defective | Predicted non-defective |
|---|---|---|
| Actually defective | TP | FN |
| Actually non-defective | FP | TN |
Based on Table 1, the detection rate is pd = TP / (TP + FN) and the false-alarm rate is pf = FP / (FP + TN); the Bal value is commonly defined as Bal = 1 − sqrt((pf² + (1 − pd)²) / 2), which is highest when pd is high and pf is low.
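The Bal value can be computed from the confusion matrix as follows. The formula used here is the one commonly adopted in defect-prediction studies (distance from the ideal point pf = 0, pd = 1); if the patent's original image defines Bal differently, the sketch should be adjusted accordingly.

```python
import math

def bal(tp, fn, fp, tn):
    """Balance metric combining detection rate pd = TP/(TP+FN) and
    false-alarm rate pf = FP/(FP+TN), as commonly used in defect prediction:
    Bal = 1 - sqrt((pf^2 + (1 - pd)^2) / 2)."""
    pd = tp / (tp + fn)
    pf = fp / (fp + tn)
    return 1.0 - math.sqrt((pf ** 2 + (1.0 - pd) ** 2) / 2.0)
```

A perfect classifier (pd = 1, pf = 0) scores Bal = 1, while an inverted one (pd = 0, pf = 1) scores 0, which is why the value suits early stopping on the validation set.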
The whole fine-tuning process covers not only the model parameters but also the hyper-parameters. Hyper-parameter selection uses random search, with a set maximum number of random search trials (for example, 200). After each hyper-parameter selection, the network is retrained, the classification performance (Bal value) of the current model on the validation set is checked at fixed intervals, and the current best model is saved. When the maximum number of trials is reached, the best model is selected as the final defect prediction model.
6. And inputting the target item data to the self-encoder to obtain a prediction result. And inputting the target item after preprocessing to the migration self-encoder network, and obtaining a final prediction result by a label coding layer of the network.
The above steps can be organized into a complete process as shown in table 2 below:
TABLE 2
The foregoing describes the cross-project software defect prediction method based on supervised representation learning of the present invention in detail, but the specific implementation of the invention is obviously not limited thereto. Various obvious modifications may be made by those skilled in the art without departing from the spirit of the invention and the scope of the appended claims.
Claims (4)
1. A cross-project software defect prediction method based on supervised expression learning is characterized by comprising the following steps: the method comprises the following steps:
step 1), defining a target item to be predicted and a source item used for training a model, and carrying out standardization or normalization preprocessing operation on original data of the source item and the target item;
step 2), inputting the feature vectors of all samples in the source project and the target project into a migration self-encoder, preliminarily training the migration self-encoder in an unsupervised pre-training mode, and obtaining preliminary hidden layer feature representations of all samples in the source project and the target project through a feature coding layer of the migration self-encoder;
step 3), on the basis of obtaining the preliminary feature representation in the step 2), selecting a part of samples which are distributed most closely to the hidden layer feature representation of the target project sample from the hidden layer feature representation of the source project sample as a verification set by means of a migration cross-validation method, and taking the rest source project samples as a training set;
step 4), oversampling processing is carried out on the training set samples;
step 5), continuing supervised fine-tuning on the training set oversampled in step 4), and completing the training of the model by selecting the model hyper-parameters and an early-stopping strategy according to the prediction performance on the validation set;
step 6), after training of the migration self-encoder is finished, inputting the preprocessed sample data of the target project into the migration self-encoder and obtaining the final prediction result from its label coding layer;
the migration self-encoder is a self-encoder with a double-encoding layer structure; the double coding layers are a characteristic coding layer and a label coding layer; the first layer of coding layer is a feature coding layer and is responsible for coding the feature vectors of all samples in the source project and the target project into hidden layer feature representation, and the label coding layer realizes the classification of the samples on the basis of the hidden layer feature representation;
the migration self-encoder adopts different forms of loss functions; the model pre-training process and the fine-tuning process respectively realize two training modes of no supervision and supervised by adjusting the loss function to enable the loss function to contain or not contain the label loss item.
2. The method of claim 1, wherein the cross-project software defect prediction method based on supervised expression learning comprises: in the unsupervised training mode, label information is not introduced in the training process, and the loss function consists of a reconstruction error term and a hidden layer characteristic distribution difference term; by minimizing the loss function, the network can learn the hidden layer signature representation of all samples.
3. The method of claim 1, wherein in the supervised training mode the label information of the source project samples is introduced during training, and the loss function then consists of four terms: a reconstruction error term, a hidden-layer feature distribution difference term, a label loss term for the source project samples, and a regularization loss term.
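The four-term supervised loss of claim 3 can be sketched as follows. Binary cross-entropy for the label loss and an L2 weight penalty for the regularization term are assumptions; the claim names the terms but not their exact form:

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    # label loss on the labelled source project samples
    # (binary cross-entropy assumed)
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def l2_penalty(weights):
    # regularization loss term (L2 weight decay assumed)
    return sum(float(np.sum(W ** 2)) for W in weights)

def supervised_loss(recon, dist_diff, y_src, p_src, weights,
                    lam_dist=1.0, lam_label=1.0, lam_reg=1e-4):
    # the four terms of claim 3: reconstruction + distribution difference
    # + source-sample label loss + regularization
    return (recon
            + lam_dist * dist_diff
            + lam_label * cross_entropy(y_src, p_src)
            + lam_reg * l2_penalty(weights))

y = np.array([0.0, 1.0])          # source-sample labels
p = np.array([0.1, 0.9])          # predicted defect probabilities
loss = supervised_loss(recon=0.5, dist_diff=0.2,
                       y_src=y, p_src=p, weights=[np.ones((2, 2))])
```

During fine-tuning, only the trade-off weights (`lam_*`, hypothetical names here) distinguish how strongly each term steers the network.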
4. The method of claim 1, wherein the migration cross-validation method of step 3) selects, according to the feature distribution difference, the part of the training data that is closest to the target project data distribution as a validation set, and takes the remaining training data as a training set; the feature transformation used is derived from the migration self-encoder.
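One concrete reading of claim 4's split, sketched below: "closest to the target distribution" is interpreted as the Euclidean distance from each source sample's hidden-layer features to the mean of the target samples' hidden-layer features. Both the distance measure and the validation fraction are assumptions:

```python
import numpy as np

def transfer_cross_validation_split(H_src, H_tgt, val_fraction=0.2):
    """Split source samples by closeness to the target distribution.

    H_src, H_tgt: hidden-layer features produced by the migration
    self-encoder for source and target samples, respectively.
    """
    center = H_tgt.mean(axis=0)
    dist = np.linalg.norm(H_src - center, axis=1)
    n_val = max(1, int(len(H_src) * val_fraction))
    order = np.argsort(dist)
    val_idx = order[:n_val]     # most target-like samples -> validation set
    train_idx = order[n_val:]   # remaining samples -> training set
    return train_idx, val_idx

# target hidden features cluster at the origin; source sample 0 is closest
H_src = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.1], [9.0, 9.0], [0.2, 0.0]])
H_tgt = np.zeros((3, 2))
train_idx, val_idx = transfer_cross_validation_split(H_src, H_tgt, 0.2)
```

Because the validation set mimics the target distribution, validation performance becomes a usable proxy for cross-project performance when tuning hyperparameters and early stopping.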
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910915935.5A CN110751186B (en) | 2019-09-26 | 2019-09-26 | Cross-project software defect prediction method based on supervised expression learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110751186A CN110751186A (en) | 2020-02-04 |
CN110751186B true CN110751186B (en) | 2022-04-08 |
Family
ID=69277087
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910915935.5A Active CN110751186B (en) | 2019-09-26 | 2019-09-26 | Cross-project software defect prediction method based on supervised expression learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110751186B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111582325B (en) * | 2020-04-20 | 2023-04-07 | 华南理工大学 | Multi-order feature combination method based on automatic feature coding |
CN111860592A (en) * | 2020-06-16 | 2020-10-30 | 江苏大学 | Solar cell defect classification detection method under condition of few samples |
CN112148605B (en) * | 2020-09-22 | 2022-05-20 | 华南理工大学 | Software defect prediction method based on spectral clustering and semi-supervised learning |
CN112199280B (en) * | 2020-09-30 | 2022-05-20 | 三维通信股份有限公司 | Method and apparatus for predicting software defects, storage medium, and electronic apparatus |
CN112346974B (en) * | 2020-11-07 | 2023-08-22 | 重庆大学 | Depth feature embedding-based cross-mobile application program instant defect prediction method |
CN112527670B (en) * | 2020-12-18 | 2022-06-03 | 武汉理工大学 | Method for predicting software aging defects in project based on Active Learning |
CN113673251B (en) * | 2021-08-09 | 2024-07-26 | 浙江浙能数字科技有限公司 | Multi-coding system mutual migration method based on unsupervised generation network |
CN113778811A (en) * | 2021-09-28 | 2021-12-10 | 重庆邮电大学 | Fault monitoring method and system based on deep convolution migration learning software system |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109416719A (en) * | 2016-04-22 | 2019-03-01 | 谭琳 | Method for determining the defects of software code He loophole |
CN108459955B (en) * | 2017-09-29 | 2020-12-22 | 重庆大学 | Software defect prediction method based on deep self-coding network |
US10521224B2 (en) * | 2018-02-28 | 2019-12-31 | Fujitsu Limited | Automatic identification of relevant software projects for cross project learning |
CN108984613A (en) * | 2018-06-12 | 2018-12-11 | 北京航空航天大学 | A kind of defect report spanned item mesh classification method based on transfer learning |
CN110162475B (en) * | 2019-05-27 | 2023-04-18 | 浙江工业大学 | Software defect prediction method based on deep migration |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110751186B (en) | Cross-project software defect prediction method based on supervised expression learning | |
CN109408389B (en) | Code defect detection method and device based on deep learning | |
CN111914644B (en) | Dual-mode cooperation based weak supervision time sequence action positioning method and system | |
CN112069310B (en) | Text classification method and system based on active learning strategy | |
CN110349597A (en) | A kind of speech detection method and device | |
CN105740984A (en) | Product concept performance evaluation method based on performance prediction | |
CN110647830A (en) | Bearing fault diagnosis method based on convolutional neural network and Gaussian mixture model | |
CN111290947B (en) | Cross-software defect prediction method based on countermeasure judgment | |
CN111680788A (en) | Equipment fault diagnosis method based on deep learning | |
CN113342597B (en) | System fault prediction method based on Gaussian mixture hidden Markov model | |
Wan et al. | Supervised representation learning approach for cross-project aging-related bug prediction | |
CN115049627B (en) | Steel surface defect detection method and system based on domain self-adaptive depth migration network | |
CN117171700A (en) | Drilling overflow prediction combined model based on deep learning and model timely silence updating and migration learning method | |
Yang et al. | Zte-predictor: Disk failure prediction system based on lstm | |
CN111723021B (en) | Defect report automatic allocation method based on knowledge base and representation learning | |
CN116089894B (en) | Unknown fault diagnosis method for water chilling unit based on semi-supervised countermeasure variation automatic coding | |
CN117056226A (en) | Cross-project software defect number prediction method based on transfer learning | |
CN117171713A (en) | Cross self-adaptive deep migration learning method and system based on bearing service life | |
CN115599698A (en) | Software defect prediction method and system based on class association rule | |
WO2023172270A1 (en) | Platform for automatic production of machine learning models and deployment pipelines | |
CN113592028A (en) | Method and system for identifying logging fluid by using multi-expert classification committee machine | |
CN114330500A (en) | Storm platform-based online parallel diagnosis method and system for power grid power equipment | |
Yao et al. | Defect Prediction Technology of Aerospace Software Based on Deep Neural Network and Process Measurement | |
CN109919464B (en) | Aging screening method applied to high-power laser | |
CN105354201B (en) | The method and system screened and eliminate false positive results |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||