CN111198820A - Cross-project software defect prediction method based on shared hidden layer self-encoder

Info

Publication number: CN111198820A
Authority: CN (China)
Prior art keywords: class, samples, theta, encoder, hidden layer
Legal status: Granted; active
Application number: CN202010001850.9A
Other languages: Chinese (zh)
Other versions: CN111198820B (en)
Inventors: 荆晓远, 李娟娟, 吴飞, 孙莹
Current assignee: Nanjing University of Posts and Telecommunications
Original assignee: Nanjing University of Posts and Telecommunications
Application filed by Nanjing University of Posts and Telecommunications; priority to CN202010001850.9A (priority date 2020-01-02); publication of CN111198820A; application granted; publication of CN111198820B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3604Software analysis for verifying properties of programs
    • G06F11/3608Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses a cross-project software defect prediction method based on a shared hidden layer self-encoder, which comprises the following steps: first, preprocess the data set and divide it into a training set and a test set; second, extract the depth features of the training set and the test set with a self-encoder equipped with a sharing mechanism; finally, introduce a focus loss function and train a classifier. The invention solves the feature distribution difference problem in cross-project software defect prediction and proposes, for the first time, a self-encoder technique based on a focus-loss shared hidden layer, which makes different data distributions more similar. The focus loss learning technique assigns different weights to samples of different classes to counter class imbalance, and gives different weights to easy-to-classify and hard-to-classify samples so that the classifier learns the hard-to-classify samples better.

Description

Cross-project software defect prediction method based on shared hidden layer self-encoder
Technical Field
The invention belongs to the field of software engineering, and particularly relates to a cross-project software defect prediction method based on a shared hidden layer self-encoder.
Background
Software defect prediction is a research hotspot in the field of software engineering. Its main aim is to discover defects in software early in the development process and so improve the quality of software products. Most previous studies focused on within-project defect prediction: a prediction model is trained on a portion of a project's historical data, and its ability to predict defects is then tested on the remaining data of the same project. However, a newly launched project does not have enough historical data to train such a model, so within-project defect prediction performs poorly. Cross-project defect prediction is therefore a viable approach when there is not enough historical defect data to build an accurate prediction model: a prediction model is trained on the historical data of other projects and used to predict defects in the new project. Its prediction performance is nevertheless still poor, mainly because the data distributions of the source project and the target project differ; the smaller this distribution difference, the better the cross-project prediction. In addition, the data sets themselves are class-imbalanced: the number of non-defective instances is much larger than the number of defective ones. This class imbalance degrades the prediction performance of the model, which identifies non-defective samples more easily and predicts defective samples poorly. The invention is therefore mainly directed at the data distribution difference and class imbalance problems in software defect prediction.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the defects of the prior art, a cross-project software defect prediction method based on a shared hidden layer self-encoder is provided, in which a shared hidden layer self-encoder is introduced to solve the data distribution difference problem.
The invention content is as follows: the invention relates to a cross-project software defect prediction method based on a shared hidden layer self-encoder, which comprises the following steps:
(1) dividing a pre-acquired data set into a training set and a testing set, and performing data preprocessing;
(2) extracting features by adopting a self-encoder with a sharing mechanism, and respectively extracting depth features of a training set and a test set;
(3) and (4) introducing a focus loss function and training a classifier.
Further, the preprocessing of step (1) is realized by the following formula:

$$P_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{1}$$

wherein $P_i$ is the feature value after normalization preprocessing of a given feature x, max(x) and min(x) are the maximum and minimum values of x, and $x_i$ is each value of the feature x.
Further, the self-encoder with the sharing mechanism in step (2) is a self-encoder to which a shared-parameter mechanism is added to obtain a shared hidden layer. The implementation process is as follows: the depth feature representation of the hidden layer is obtained by minimizing the reconstruction error L(θ_all), which comprises two parts, L(θ_tr) and L(θ_te), defined as follows:

$$L(\theta_{tr}) = \frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} \left\| x_i - \hat{x}_i^{tr} \right\|_2^2 + L_{intra} + L_{inter} \tag{5}$$

$$L_{intra} = \sum_{x_i \in X_{tr}^{0}} \left\| \hat{x}_i - \hat{m}_0 \right\|_2^2 + \sum_{x_j \in X_{tr}^{1}} \left\| \hat{x}_j - \hat{m}_1 \right\|_2^2 \tag{6}$$

$$L_{inter} = L_{inter}^{G} + L_{inter}^{L}, \qquad L_{inter}^{G} = -\left\| \hat{m}_0 - \hat{m}_1 \right\|_2^2, \qquad L_{inter}^{L} = -\sum_{x_i \in X_{tr}^{0}} \left\| \hat{x}_i - \hat{m}_1^{k}(x_i) \right\|_2^2 - \sum_{x_j \in X_{tr}^{1}} \left\| \hat{x}_j - \hat{m}_0^{k}(x_j) \right\|_2^2 \tag{7}$$

$$L(\theta_{te}) = \frac{1}{N_{te}} \sum_{i=1}^{N_{te}} \left\| x_i - \hat{x}_i^{te} \right\|_2^2 \tag{8}$$

wherein L(θ_tr) is the Euclidean distance between the input and output of the training data, representing the reconstruction error between them; L(θ_tr) consists of three parts: the reconstruction error loss term, the intra-class loss term $L_{intra}$, and the inter-class loss term $L_{inter}$, in which $L_{inter}^{G}$ is the global inter-class loss term and $L_{inter}^{L}$ is the local inter-class loss term; L(θ_te) is the Euclidean distance between the input and output of the test data, representing the reconstruction error between them; $\hat{m}_1^{k}(x_i)$ means that for each sample $x_i$ of class 0, the mean of its k nearest neighbor samples of class 1 is selected; $\hat{m}_0^{k}(x_j)$ means that for each sample $x_j$ of class 1, the mean of its k nearest neighbor samples of class 0 is selected; $\hat{x}_i^{tr}$ refers to the decoded features of the training data set and $\hat{x}_i^{te}$ to those of the test data set; $X_{tr}^{0}$ are all samples of class 0 in the training data and $X_{tr}^{1}$ all samples of class 1; $\hat{x}_i$ and $\hat{x}_j$ are the decoded training samples of class 0 and class 1, respectively; $\hat{m}_0$ and $\hat{m}_1$ are the sample means of class 0 and class 1 after decoding the training data. Optimizing L(θ_tr) and L(θ_te) simultaneously, the final objective function is expressed as follows:

$$L(\theta_{all}) = L(\theta_{tr}) + r\, L(\theta_{te}) \tag{9}$$

wherein L(θ_all) is the depth feature objective of the hidden layer, θ_all is the set of all network parameters that need to be optimized, and r is a regularization parameter.
Further, the focus loss function in step (3) is implemented by the following formula:

$$L_{FL} = -\frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} \sum_{c=1}^{k} u_c \left(1 - \hat{p}_i^{c}\right) y_i^{c} \log\left(\hat{p}_i^{c}\right), \qquad \hat{p}_i^{c} = g(\hat{y}_i^{c}) \tag{11}$$

wherein $N_{tr}$ represents the number of training samples, c the label class, and k the number of label classes; $y_i^{c}$ is the true label, $\hat{p}_i^{c}$ the predicted label probability, and g(·) an activation function; $u_c$ is the sample class weight: non-defective samples are given a small weight u (0 < u < 1) and defective samples a large weight 1 − u, the two weights summing to 1. The factor $(1 - \hat{p}_i^{c})$ weights samples by classification difficulty: the easier a sample is to classify in defect prediction, the smaller $(1 - \hat{p}_i^{c})$ is; the harder it is to classify, the larger $(1 - \hat{p}_i^{c})$ is.
Has the advantages that: compared with the prior art, the invention has the following beneficial effects. The invention solves the feature distribution difference problem in cross-project software defect prediction and proposes, for the first time, a self-encoder technique based on a focus-loss shared hidden layer, which makes different data distributions more similar. The focus loss learning technique assigns different weights to samples of different classes to counter class imbalance, and gives different weights to easy-to-classify and hard-to-classify samples so that the classifier learns the hard-to-classify samples better; together, these address the data distribution difference and the class imbalance of the data set in software defect prediction. Experimental results on 10 projects of the PROMISE dataset show that the proposed method achieves the desired defect prediction performance.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings: as shown in fig. 1, a cross-project software defect prediction method based on a shared hidden layer self-encoder includes the following steps:
step 1, dividing a training data set and a testing data set, and performing data preprocessing on the data set, wherein the specific method comprises the following steps: first, a PROMISE data set is selected, which has 20 basic metrics, and these 20 basic metrics are not in the same order of magnitude, so we should use the min-max data normalization method to convert all the metrics to the interval of 0 to 1. Given a feature x, its maximum and minimum values are represented as: max (x) and min (x). For each of the features xCharacteristic value xiThe data preprocessing can be expressed as follows:
Figure BDA0002353780240000041
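For illustration, a minimal NumPy sketch of equation (1) applied column-wise; the guard against constant-valued columns is our addition, not part of the patent:

```python
import numpy as np

def min_max_normalize(X):
    """Min-max normalization of each metric (column) to [0, 1], as in Eq. (1)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard constant columns
    return (X - x_min) / span

# Example: three samples with two metrics on different scales.
X = np.array([[10.0, 0.2], [20.0, 0.5], [30.0, 0.8]])
print(min_max_normalize(X))  # every column now lies in [0, 1]
```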
and 2, extracting features by using an improved self-encoder. The method adopts a shared hidden layer self-encoder to extract features, and adds a sharing mechanism in the original self-encoding to solve the problem of data distribution difference in cross-project defect prediction. Suppose that
Figure BDA0002353780240000042
XtrAnd XteRespectively a training data set and a test data set. X is an element of { X ∈ }tr∪XteIs the set of shuffled training and test data. Where N is the number of features, NtrAnd NteThe number of instances in the training set and test set, respectively. Conventional self-encoders attempt to find a common depth signature representation from the input data, thereby making the output as equal as possible to the input. Usually comprising two stages, encoding and decoding, given input data xi∈XtrThen the encoding and decoding stages are represented respectively as follows:
and (3) an encoding stage: y (x)i)=f(w1xi+b1) (2)
And a decoding stage:
Figure BDA0002353780240000043
wherein xiIs an input to the computer system that is,
Figure BDA0002353780240000044
is the output, y (x)i) Is the output of the hidden layer, f (-) is a non-linear activation function, usually a sigmoid function, w1∈Rm×nAnd w2∈Rn×mIs a weight matrix, b1∈RmAnd b2∈RnIf it is a deviation, the network parameters from the encoder can be expressed as: θ ═ w1,b1,w2,b2The updated optimization of the parameters can be achieved by minimizing the reconstruction error function L (θ), which is minimized by Adam optimizer during the training of the self-encoder, and is expressed as follows:
Figure BDA0002353780240000051
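For illustration, a minimal PyTorch sketch of the conventional self-encoder of equations (2)-(4); the dimensions and the random mini-batch are placeholders, not values from the patent:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Conventional single-hidden-layer self-encoder of Eqs. (2)-(4)."""
    def __init__(self, n_features, n_hidden):
        super().__init__()
        self.encoder = nn.Linear(n_features, n_hidden)  # w1, b1
        self.decoder = nn.Linear(n_hidden, n_features)  # w2, b2

    def forward(self, x):
        y = torch.sigmoid(self.encoder(x))      # encoding stage, Eq. (2)
        x_hat = torch.sigmoid(self.decoder(y))  # decoding stage, Eq. (3)
        return y, x_hat

# Minimize the reconstruction error L(theta) of Eq. (4) with Adam.
model = AutoEncoder(n_features=20, n_hidden=15)
optimizer = torch.optim.Adam(model.parameters())
x = torch.rand(64, 20)  # a mini-batch of normalized metric vectors
for _ in range(100):
    _, x_hat = model(x)
    loss = ((x - x_hat) ** 2).sum(dim=1).mean()  # mean squared Euclidean distance
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```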
in order to solve the problem of data distribution difference in cross-project defect prediction, the invention improves the original self-encoder, and adds a shared parameter mechanism to obtain the self-encoder of a shared hidden layer. By minimizing the reconstruction error L (theta)all) To obtain a depth characterization L (theta) of the hidden layerall),L(θall) Comprises 2 parts: l (theta)tr) And L (theta)te)。L(θtr) The reconstruction errors of training data are expressed by calculating Euclidean distances of input and output of the training data, and meanwhile, in order to fully utilize label information in source data, intra-class loss, global inter-class loss and local inter-class loss are added, so that the aim of maximizing the inter-class distance and minimizing the intra-class distance of the data in a source domain is achieved in the characteristic learning process. L (theta)tr) Is composed of three parts including reconstruction error loss term and intra-class loss term
Figure BDA0002353780240000052
And inter-class loss term
Figure BDA0002353780240000053
Wherein the reconstruction error loss term is to reconstruct the input with better output; the intra-class loss term is to keep samples of the same class in the source data sufficiently close to the class center to achieve intra-class minimization; fully considering global inter-class loss terms
Figure BDA0002353780240000054
And local inter-class loss terms
Figure BDA0002353780240000055
Figure BDA0002353780240000056
In order to have the class centers of the two classes sufficiently distant,
Figure BDA0002353780240000057
in order to make each sample with the class of 0(1) as far as the center of the nearest k adjacent samples with the class of 1(0), the distance between the samples is as far as possible, so that the purpose of maximizing the inter-class distance is achieved. L (theta)te) The reconstruction error between the input and output of the test data is represented by calculating the euclidean distance between them. L (theta)tr) And L (theta)te) Is defined as follows:
Figure BDA0002353780240000058
Figure BDA0002353780240000059
Figure BDA00023537802400000510
Figure BDA0002353780240000061
wherein
Figure BDA0002353780240000062
Means that for each sample with class 0, the mean of the k nearest neighbor samples from the sample with class 1 is selected; in the same way
Figure BDA0002353780240000063
Similar to the previous meaning.
Figure BDA0002353780240000064
Refers to the features of the training data set after decoding,
Figure BDA0002353780240000065
is a bit of the test data set after decodingAnd (5) carrying out characterization.
Figure BDA0002353780240000066
Are samples of all classes 0 in the training data,
Figure BDA0002353780240000067
are samples of all classes 1 in the training data.
Figure BDA0002353780240000068
And
Figure BDA0002353780240000069
the decoded training data have a class of 0 and a class of 1, respectively.
Figure BDA00023537802400000610
And
Figure BDA00023537802400000611
the sample mean of class 0 and the sample mean of class 1, respectively, after decoding the training data. Combining the above two formulas, optimizing L (theta) simultaneouslytr) And L (theta)te) The final objective function is expressed as follows:
L(θall)=L(θtr)+rL(θte) (9)
all parameters theta of the network that need to be optimizedallComprises the following steps:
Figure BDA00023537802400000612
r is a regularization parameter that facilitates regularization of the behavior of the self-encoder. The purpose of adding the regularization term is to make the feature distributions of the training data and the test data more and more similar by changing the value of r.
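For illustration, a minimal PyTorch sketch of the objective in equations (5)-(9), reusing the AutoEncoder sketch above; the patent does not state the relative weights of the intra- and inter-class terms or how batches lacking one class are handled, so the unweighted sum and the nearest-neighbor helper below are our assumptions:

```python
import torch

def shared_hidden_layer_loss(model, x_tr, y_tr, x_te, r=10.0, k=5):
    """Objective L(theta_all) = L(theta_tr) + r * L(theta_te), Eq. (9).

    model is an AutoEncoder whose hidden layer is shared by both domains;
    y_tr holds labels (0 = non-defective, 1 = defective), and the training
    batch is assumed to contain samples of both classes.
    """
    _, xhat_tr = model(x_tr)
    _, xhat_te = model(x_te)

    # Reconstruction errors (Euclidean distances), Eqs. (5) and (8).
    l_tr = ((x_tr - xhat_tr) ** 2).sum(dim=1).mean()
    l_te = ((x_te - xhat_te) ** 2).sum(dim=1).mean()

    xhat0, xhat1 = xhat_tr[y_tr == 0], xhat_tr[y_tr == 1]
    m0, m1 = xhat0.mean(dim=0), xhat1.mean(dim=0)

    # Intra-class loss, Eq. (6): decoded samples stay close to their class mean.
    l_intra = ((xhat0 - m0) ** 2).sum() + ((xhat1 - m1) ** 2).sum()

    # Global inter-class loss: push the two class centers apart.
    l_inter_g = -((m0 - m1) ** 2).sum()

    def local_term(a, b):
        # For each row of a, mean of its k nearest rows of the opposite class b.
        d = torch.cdist(a, b)
        idx = d.topk(min(k, b.shape[0]), largest=False).indices
        return -((a - b[idx].mean(dim=1)) ** 2).sum()

    # Local inter-class loss, Eq. (7).
    l_inter_l = local_term(xhat0, xhat1) + local_term(xhat1, xhat0)

    return l_tr + l_intra + l_inter_g + l_inter_l + r * l_te
```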
Step 3, introduce the focus loss technique and train the improved focus-loss classifier. Because of the class imbalance of the data set itself, the number of defective modules is small; for the network to distinguish defective from non-defective modules, it must be enabled to discover and learn the characteristics of defective modules. The invention introduces the focus loss technique: during training, samples of different classes are assigned different weights to balance them, and whether a sample is easy to classify is also considered; easy-to-classify samples are given smaller weights, while samples that are easily misclassified get larger weights, thereby alleviating the class imbalance. Finally, the classifier is trained with the depth feature representation of the training data obtained in step 2. The classifier loss C may use a cross-entropy loss function to compute the similarity between the true labels and the predicted labels, defined as follows:
$$C = -\frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} \sum_{c=1}^{k} y_i^{c} \log\left(g(\hat{y}_i^{c})\right) \tag{10}$$

wherein $N_{tr}$ represents the number of training samples, c the label class, and k the number of label classes (here, the number of classes is 2); $y_i^{c}$ is the true label, $g(\hat{y}_i^{c})$ the predicted label probability, and g(·) the activation function.
Based on this classifier, we add two weights, u and $(1 - \hat{p})$, to propose the focus loss function. u mainly addresses the class imbalance problem: the class with many samples (the non-defective class) is given a small weight u (0 < u < 1), while the class with few samples (the defective class) is given a large weight 1 − u, the two weights summing to 1, so that the sample numbers of the two classes are balanced. $(1 - \hat{p})$ mainly addresses the classification difficulty of samples during defect prediction learning: for the class with many samples, i.e. the non-defective class, the classifier can more easily learn to judge which class a sample belongs to and thus obtains a relatively large probability value $\hat{p}$; hence, the easier a sample is to classify in defect prediction, the smaller its weight $(1 - \hat{p})$. Conversely, the harder a sample is to classify, the larger its weight $(1 - \hat{p})$, so the classifier pays more attention to hard-to-classify samples and learns their characteristics better. The final focus loss function can thus be expressed in the following form:

$$L_{FL} = -\frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} \sum_{c=1}^{k} u_c \left(1 - \hat{p}_i^{c}\right) y_i^{c} \log\left(\hat{p}_i^{c}\right), \qquad \hat{p}_i^{c} = g(\hat{y}_i^{c}) \tag{11}$$

wherein $u_c$ is u for the non-defective class and 1 − u for the defective class.
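For illustration, a minimal PyTorch sketch of the focus loss of equation (11) for the two-class case; the concrete value of u in the example is illustrative, not from the patent:

```python
import torch

def focus_loss(p_hat, y_true, u=0.25):
    """Focus loss of Eq. (11) for k = 2 classes.

    p_hat : (N, 2) predicted class probabilities (e.g. softmax outputs).
    y_true: (N, 2) one-hot labels; column 0 = non-defective, column 1 = defective.
    u     : weight of the abundant non-defective class (0 < u < 1);
            the defective class receives 1 - u.
    """
    class_w = torch.tensor([u, 1.0 - u])            # u_c per class
    difficulty = 1.0 - p_hat                        # small for easy samples
    ce = y_true * torch.log(p_hat.clamp(min=1e-8))  # cross-entropy terms
    return -(class_w * difficulty * ce).sum(dim=1).mean()

# Usage: an easy correct prediction and a harder one.
p_hat = torch.tensor([[0.9, 0.1], [0.3, 0.7]])
y = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
print(focus_loss(p_hat, y))
```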
in order to verify whether the algorithm has good superiority or not, the cross-project software defect prediction algorithm based on the focus loss shared hidden layer self-encoder is compared with other 5 cross-project defect prediction methods, namely TCA +, TDS, Dycom, LT and SHLA (cross-project defect prediction algorithm of the shared hidden layer self-encoder without focus loss). The 10 items of the project were compared and verified as experimental data, respectively, as shown in table 1: where # instance represents the number of instances, # defect represents the number of defective instances, and% defect represents the proportion of defective instances to all instances.
Table 1 The 10 projects of the PROMISE data set used in the experiments
Datasets #instance #defect %defect
ant-1.7 745 166 22.28
camel-1.6 965 188 19.48
jedit-3.2 272 90 33.09
log4j-1.0 135 34 25.19
lucene-2.0 195 91 46.67
poi-1.5 237 141 59.49
redaktor 176 27 15.34
synapse-1.0 157 16 10.19
xalan-2.6 885 411 46.44
xerces-1.3 453 69 15.23
The evaluation indexes of the prediction model are mainly F-measure and Accuracy. They can be expressed in terms of the TP, FN, FP, and TN defined by the confusion matrix in Table 2:

TABLE 2 Confusion matrix

                         Predicted defective    Predicted non-defective
Actual defective         TP                     FN
Actual non-defective     FP                     TN

recall: the proportion of defective samples that the classifier predicts as defective among all defective samples, i.e. recall = TP / (TP + FN). precision: the proportion of correct predictions among the samples the classifier predicts as defective, i.e. precision = TP / (TP + FP); this ratio evaluates how correctly the model predicts defective modules. The F-measure index is the weighted harmonic mean of recall and precision, i.e. F-measure = (2 × recall × precision) / (recall + precision). The Accuracy index evaluates the degree to which both defective and non-defective modules are correctly classified, i.e. Accuracy = (TP + TN) / (TP + TN + FP + FN). Larger values of F-measure and Accuracy indicate better prediction performance of the software defect prediction model.
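These four indexes follow directly from the confusion matrix; a small Python helper (the counts in the usage example are made up):

```python
def evaluate(tp, fn, fp, tn):
    """F-measure and Accuracy from the confusion matrix of Table 2."""
    recall = tp / (tp + fn)              # share of defective samples found
    precision = tp / (tp + fp)           # correctness of defect predictions
    f_measure = 2 * recall * precision / (recall + precision)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return f_measure, accuracy

# Usage with made-up counts:
print(evaluate(tp=30, fn=10, fp=15, tn=100))  # ~ (0.706, 0.839)
```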
The experimental setup here selects 1 of the 10 PROMISE projects as the test data (target project), with the remaining 9 projects serving in turn as source projects (training projects). Thus there are 9 cross-project combinations for each target project, and 90 possible cross-project combinations over the 10 projects. In the training of the self-encoder, the model has 4 hidden layers, with the number of nodes per layer set as 20-15-10-10-2, where 20 is the feature dimension of the input data and 2 is the feature dimension of the data fed into the softmax classifier. During model training, a ReLU activation function is adopted for each layer, the number of layers is set empirically, and the Adam optimizer is used for parameter optimization. Each mini-batch in the experiment was set to 64, and the range of the hyperparameter r was set as r ∈ {0.1, 0.5, 1, 5, 10, 15}; a good effect was obtained when r = 10.
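A hedged sketch of this experimental network: the patent gives the node counts 20-15-10-10-2, ReLU activations, Adam, and mini-batch 64, but not the exact wiring between encoder and classifier, so the stacked reading below is our assumption:

```python
import torch
import torch.nn as nn

# 20 input metrics, hidden layers of 15, 10, 10 nodes with ReLU, and a
# 2-dimensional feature fed to the softmax classifier (assumed wiring).
encoder = nn.Sequential(
    nn.Linear(20, 15), nn.ReLU(),
    nn.Linear(15, 10), nn.ReLU(),
    nn.Linear(10, 10), nn.ReLU(),
    nn.Linear(10, 2),               # features passed to the softmax classifier
)
classifier = nn.Sequential(encoder, nn.Softmax(dim=1))
optimizer = torch.optim.Adam(classifier.parameters())  # mini-batch size 64
```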
To verify whether the algorithm herein performs well among the comparison algorithms, experiments were performed on the 10 PROMISE projects; the F-measure results are shown in Table 3 and the Accuracy results in Table 4:
TABLE 3 Experimental results of our model and the 5 comparison algorithms on F-measure
As can be seen from the experimental results in Table 3: the F-measure values of our model exceed those of the other 5 comparison algorithms, ranging from 0.257 to 0.649, and our model improves the F-measure results by at least 0.019 and by up to 0.418.
TABLE 4 results of our model experiments with 5 comparison algorithms on Accuracy
target TDS TCA+ Dycom LT SHLA Ours
ant-1.7 0.680 0.684 0.674 0.675 0.631 0.721
camel-1.6 0.742 0.618 0.769 0.722 0.731 0.639
jedit-3.2 0.593 0.663 0.710 0.599 0.702 0.722
log4j-1.0 0.715 0.657 0.763 0.726 0.711 0.716
lucene-2.0 0.538 0.621 0.600 0.533 0.621 0.637
poi-1.5 0.559 0.576 0.435 0.527 0.611 0.618
redaktor 0.579 0.556 0.386 0.648 0.361 0.495
synapse-1.0 0.761 0.641 0.796 0.643 0.592 0.613
xalan-2.6 0.417 0.591 0.603 0.531 0.582 0.611
xerces-1.3 0.714 0.627 0.764 0.757 0.810 0.814
average 0.630 0.623 0.650 0.636 0.635 0.659
improved 0.029 0.036 0.009 0.023 0.024 -
As can be seen from the experimental results in Table 4: the Accuracy values of our model improve somewhat over the other 5 comparison algorithms; the Accuracy mean of our model is 0.659, improving the result by at least 0.009 (0.659 − 0.650).
The above experiments show that the TCA+, TDS, Dycom, LT, and SHLA algorithms can achieve better F-measure and Accuracy values on certain projects, but the model provided by the invention has better F-measure and Accuracy averages overall, outperforming the 5 previous algorithms; this indicates the superiority of the proposed method.
In addition to the above embodiments, the present invention may have other embodiments. All technical solutions formed by adopting equivalent substitutions or equivalent transformations fall within the protection scope of the claims of the present invention.

Claims (4)

1. A cross-project software defect prediction method based on a shared hidden layer self-encoder is characterized by comprising the following steps:
(1) dividing a pre-acquired data set into a training set and a testing set, and performing data preprocessing;
(2) extracting features by adopting a self-encoder with a sharing mechanism, and respectively extracting depth features of a training set and a test set;
(3) and (4) introducing a focus loss function and training a classifier.
2. The method for cross-project software defect prediction based on shared hidden layer self-encoder as claimed in claim 1, wherein the preprocessing of step (1) is implemented by the following formula:
$$P_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{1}$$

wherein $P_i$ is the feature value after normalization preprocessing of a given feature x, max(x) and min(x) are the maximum and minimum values of x, and $x_i$ is each value of the feature x.
3. The method according to claim 1, wherein the self-encoder with the sharing mechanism in step (2) is a self-encoder to which a shared-parameter mechanism is added to obtain a shared hidden layer, implemented as follows: the depth feature representation of the hidden layer is obtained by minimizing the reconstruction error L(θ_all), which comprises two parts, L(θ_tr) and L(θ_te), defined as follows:

$$L(\theta_{tr}) = \frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} \left\| x_i - \hat{x}_i^{tr} \right\|_2^2 + L_{intra} + L_{inter} \tag{5}$$

$$L_{intra} = \sum_{x_i \in X_{tr}^{0}} \left\| \hat{x}_i - \hat{m}_0 \right\|_2^2 + \sum_{x_j \in X_{tr}^{1}} \left\| \hat{x}_j - \hat{m}_1 \right\|_2^2 \tag{6}$$

$$L_{inter} = L_{inter}^{G} + L_{inter}^{L}, \qquad L_{inter}^{G} = -\left\| \hat{m}_0 - \hat{m}_1 \right\|_2^2, \qquad L_{inter}^{L} = -\sum_{x_i \in X_{tr}^{0}} \left\| \hat{x}_i - \hat{m}_1^{k}(x_i) \right\|_2^2 - \sum_{x_j \in X_{tr}^{1}} \left\| \hat{x}_j - \hat{m}_0^{k}(x_j) \right\|_2^2 \tag{7}$$

$$L(\theta_{te}) = \frac{1}{N_{te}} \sum_{i=1}^{N_{te}} \left\| x_i - \hat{x}_i^{te} \right\|_2^2 \tag{8}$$

wherein L(θ_tr) is the Euclidean distance between the input and output of the training data, representing the reconstruction error between them, and consists of three parts: the reconstruction error loss term, the intra-class loss term $L_{intra}$, and the inter-class loss term $L_{inter}$, in which $L_{inter}^{G}$ is the global inter-class loss term and $L_{inter}^{L}$ the local inter-class loss term; L(θ_te) is the Euclidean distance between the input and output of the test data, representing the reconstruction error between them; $\hat{m}_1^{k}(x_i)$ means that for each sample $x_i$ of class 0, the mean of its k nearest neighbor samples of class 1 is selected; $\hat{m}_0^{k}(x_j)$ means that for each sample $x_j$ of class 1, the mean of its k nearest neighbor samples of class 0 is selected; $\hat{x}_i^{tr}$ refers to the decoded features of the training data set and $\hat{x}_i^{te}$ to those of the test data set; $X_{tr}^{0}$ are all samples of class 0 in the training data and $X_{tr}^{1}$ all samples of class 1; $\hat{x}_i$ and $\hat{x}_j$ are the decoded training samples of class 0 and class 1, respectively; $\hat{m}_0$ and $\hat{m}_1$ are the sample means of class 0 and class 1 after decoding the training data; optimizing L(θ_tr) and L(θ_te) simultaneously, the final objective function is expressed as follows:

$$L(\theta_{all}) = L(\theta_{tr}) + r\, L(\theta_{te}) \tag{9}$$

wherein L(θ_all) is the depth feature objective of the hidden layer, θ_all is the set of all network parameters that need to be optimized, and r is a regularization parameter.
4. The method for cross-project software defect prediction based on shared hidden layer self-encoder as claimed in claim 1, wherein the focus loss function in step (3) is implemented by the following formula:

$$L_{FL} = -\frac{1}{N_{tr}} \sum_{i=1}^{N_{tr}} \sum_{c=1}^{k} u_c \left(1 - \hat{p}_i^{c}\right) y_i^{c} \log\left(\hat{p}_i^{c}\right), \qquad \hat{p}_i^{c} = g(\hat{y}_i^{c}) \tag{11}$$

wherein $N_{tr}$ represents the number of training samples, c the label class, and k the number of label classes; $y_i^{c}$ is the true label, $\hat{p}_i^{c}$ the predicted label probability, and g(·) an activation function; $u_c$ is the sample class weight: non-defective samples are given a small weight u (0 < u < 1) and defective samples a large weight 1 − u, the two weights summing to 1; $(1 - \hat{p}_i^{c})$ weights samples by classification difficulty: the easier a sample is to classify in defect prediction, the smaller $(1 - \hat{p}_i^{c})$; the harder it is to classify, the larger $(1 - \hat{p}_i^{c})$.
CN202010001850.9A 2020-01-02 2020-01-02 Cross-project software defect prediction method based on shared hidden layer self-encoder Active CN111198820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010001850.9A CN111198820B (en) 2020-01-02 2020-01-02 Cross-project software defect prediction method based on shared hidden layer self-encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010001850.9A CN111198820B (en) 2020-01-02 2020-01-02 Cross-project software defect prediction method based on shared hidden layer self-encoder

Publications (2)

Publication Number Publication Date
CN111198820A true CN111198820A (en) 2020-05-26
CN111198820B CN111198820B (en) 2022-08-26

Family

ID=70746714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010001850.9A Active CN111198820B (en) 2020-01-02 2020-01-02 Cross-project software defect prediction method based on shared hidden layer self-encoder

Country Status (1)

Country Link
CN (1) CN111198820B (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108446711A (en) * 2018-02-01 2018-08-24 南京邮电大学 A kind of Software Defects Predict Methods based on transfer learning
CN109710512A (en) * 2018-12-06 2019-05-03 南京邮电大学 Neural network software failure prediction method based on geodesic curve stream core

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015659A (en) * 2020-09-02 2020-12-01 三维通信股份有限公司 Prediction method and device based on network model
CN112199280A (en) * 2020-09-30 2021-01-08 三维通信股份有限公司 Defect prediction method and apparatus, storage medium, and electronic apparatus
WO2022068200A1 (en) * 2020-09-30 2022-04-07 三维通信股份有限公司 Defect prediction method and apparatus, storage medium, and electronic device

Also Published As

Publication number Publication date
CN111198820B (en) 2022-08-26


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210003 Gulou District, Jiangsu, Nanjing new model road, No. 66

Applicant after: NANJING University OF POSTS AND TELECOMMUNICATIONS

Address before: Yuen Road Qixia District of Nanjing City, Jiangsu Province, No. 9 210046

Applicant before: NANJING University OF POSTS AND TELECOMMUNICATIONS

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant