CN111198820B - Cross-project software defect prediction method based on shared hidden layer self-encoder


Info

Publication number
CN111198820B
Authority
CN
China
Prior art keywords
class
samples
encoder
theta
training
Prior art date
Legal status
Active
Application number
CN202010001850.9A
Other languages
Chinese (zh)
Other versions
CN111198820A (en)
Inventor
Jing Xiaoyuan
Li Juanjuan
Wu Fei
Sun Ying
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202010001850.9A priority Critical patent/CN111198820B/en
Publication of CN111198820A publication Critical patent/CN111198820A/en
Application granted granted Critical
Publication of CN111198820B publication Critical patent/CN111198820B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3608 Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a cross-project software defect prediction method based on a shared hidden layer self-encoder, which comprises the following steps: first, preprocess the data set and divide it into a training set and a test set; second, extract features with a self-encoder having a sharing mechanism, extracting the depth features of the training set and the test set respectively; finally, introduce a focal loss function and train a classifier. The invention solves the problem of feature distribution difference in cross-project software defect prediction and, for the first time, proposes a shared-hidden-layer self-encoder technique based on focal loss, so that different data distributions become more similar. Focal loss learning assigns different weights to samples of different classes to address class imbalance, and assigns different weights to easy-to-classify and hard-to-classify samples so that the classifier can better learn the hard-to-classify samples.

Description

Cross-project software defect prediction method based on shared hidden layer self-encoder
Technical Field
The invention belongs to the field of software engineering, and particularly relates to a cross-project software defect prediction method based on a shared hidden layer self-encoder.
Background
Software defect prediction is a research hotspot in the field of software engineering. Its main aim is to discover defects in software early in the development process and thereby improve the quality of software products. Most previous studies focused on within-project defect prediction: a prediction model is trained on part of the historical data of a project, and its ability to predict defects is then tested on the remaining data of the same project. However, a newly launched project does not have enough historical data to train the model, so within-project defect prediction performs poorly. Cross-project defect prediction is therefore a viable approach when there is not enough historical defect data to build an accurate prediction model: a prediction model is trained on the historical data of other projects and used to predict defects in the new project. Its prediction performance is nevertheless still poor, the main reason being the data distribution difference between the source project and the target project; the smaller this difference, the better the cross-project prediction effect. In addition, the data sets themselves are class-imbalanced, that is, the number of non-defective samples is much larger than the number of defective samples. This class imbalance degrades the prediction performance of the model: the model identifies non-defective samples more easily and predicts defective samples poorly. The invention is therefore mainly aimed at solving the problems of data distribution difference and class imbalance in software defect prediction.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the defects in the prior art, a cross-project software defect prediction method based on a shared hidden layer self-encoder is provided, and a shared hidden layer self-encoder method is introduced to solve the problem of data distribution difference.
The invention content is as follows: the invention relates to a cross-project software defect prediction method based on a shared hidden layer self-encoder, which comprises the following steps:
(1) dividing a pre-acquired data set into a training set and a testing set, and performing data preprocessing;
(2) extracting features by adopting a self-encoder with a sharing mechanism, and respectively extracting depth features of a training set and a test set;
(3) introducing a focal loss function and training a classifier.
Further, the preprocessing of step (1) is realized by the following formula:

P_i = (x_i − min(x)) / (max(x) − min(x))   (1)

where P_i is the feature value after normalization preprocessing of a given feature x, max(x) and min(x) are the maximum and minimum values of x, and x_i is each value of feature x.
Further, the self-encoder with a sharing mechanism in step (2) is a self-encoder to which a shared-parameter mechanism is added to obtain a shared hidden layer, implemented as follows: the depth features of the hidden layer are obtained by minimizing the reconstruction error L(θ_all), which comprises two parts, L(θ_tr) and L(θ_te), defined as follows:

L(θ_tr) = (1/N_tr) Σ_{i=1}^{N_tr} ||x_i^tr − x̂_i^tr||² + L_w + L_b   (5)

L_w = Σ_{j=1}^{n_0} ||x̂_j^0 − m̂_0||² + Σ_{j=1}^{n_1} ||x̂_j^1 − m̂_1||²   (6)

L_b = L_b^g + L_b^l = −||m̂_0 − m̂_1||² − Σ_{j=1}^{n_0} ||x̂_j^0 − m̂_k^1(x_j^0)||² − Σ_{j=1}^{n_1} ||x̂_j^1 − m̂_k^0(x_j^1)||²   (7)

L(θ_te) = (1/N_te) Σ_{i=1}^{N_te} ||x_i^te − x̂_i^te||²   (8)
where L(θ_tr) uses the Euclidean distance between the input and the output of the training data to represent their reconstruction error; L(θ_tr) consists of three parts, namely the reconstruction error loss term, the intra-class loss term L_w, and the inter-class loss term L_b, in which L_b^g is a global inter-class loss term and L_b^l is a local inter-class loss term; L(θ_te) is the Euclidean distance between the input and output of the test data, representing their reconstruction error; m̂_k^1(x_j^0) means that for each sample of class 0, the mean of its k nearest neighbour samples among the samples of class 1 is selected, and m̂_k^0(x_j^1) means that for each sample of class 1, the mean of its k nearest neighbour samples among the samples of class 0 is selected; x̂^tr refers to the decoded features of the training data set and x̂^te to the decoded features of the test data set; x_j^0 are all class-0 samples and x_j^1 all class-1 samples in the training data, with n_0 and n_1 their respective numbers; x̂_j^0 and x̂_j^1 are the decoded training samples of class 0 and class 1, respectively; and m̂_0 and m̂_1 are the sample means of class 0 and class 1 after decoding the training data. Optimizing L(θ_tr) and L(θ_te) simultaneously, the final objective function is expressed as follows:
L(θ_all) = L(θ_tr) + r·L(θ_te)   (9)
where minimizing L(θ_all) yields the depth features of the hidden layer; the set of all network parameters to be optimized is θ_all = {w_1, b_1, w_2, b_2}, and r is a regularization parameter.
Further, the focal loss function in step (3) is implemented by the following formula:

FL = −(1/N_tr) Σ_{i=1}^{N_tr} Σ_{c=1}^{k} u_c (1 − g(ŷ_i^c))^γ y_i^c log(g(ŷ_i^c))   (11)

where N_tr represents the number of training samples, c the label class, and k the number of label classes; y_i^c is the true label, g(ŷ_i^c) the predicted probability of the label, and g(·) an activation function; u_c is the sample class weight, with a small weight u (0 < u < 1) given to non-defective samples and a large weight 1 − u to defective samples, the two weights summing to 1; and γ ≥ 0 is the focusing exponent: the modulating factor (1 − g(ŷ_i^c))^γ is smaller for samples that are easier to classify in defect prediction and larger for samples that are more difficult to classify.
Beneficial effects: compared with the prior art, the invention solves the problem of feature distribution difference in cross-project software defect prediction and proposes, for the first time, a shared-hidden-layer self-encoder technique based on focal loss, so that different data distributions become more similar. Focal loss learning assigns different weights to samples of different classes to address class imbalance, and assigns different weights to easy-to-classify and hard-to-classify samples so that the classifier learns hard-to-classify samples better, thereby solving both the data distribution difference in software defect prediction and the class imbalance of the data set. Experimental results on 10 projects of the PROMISE data set show that the proposed method achieves the desired defect prediction performance.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings: as shown in fig. 1, a cross-project software defect prediction method based on a shared hidden layer self-encoder includes the following steps:
step 1, dividing a training data set and a testing data set, and performing data preprocessing on the data set, wherein the specific method comprises the following steps: first, a PROMISE data set is selected, which has 20 basic metrics, and these 20 basic metrics are not in the same order of magnitude, so we should use the min-max data normalization method to convert all the metrics to the interval of 0 to 1. Given a feature x, its maximum and minimum values are represented as: max (x) and min (x). For each eigenvalue x of the eigenvalue x i The data preprocessing can be expressed as follows:
Figure BDA0002353780240000041
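As a small illustration (a Python sketch, not part of the patent; the array contents are made up), Equation (1) can be applied column-wise to a metrics matrix as follows:

import numpy as np

def min_max_normalize(X):
    """Scale every metric (column) of X into [0, 1], as in Equation (1).

    X: (n_samples, n_features) array of raw software metrics.
    """
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Guard against constant columns, where max(x) == min(x).
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span

# Example: three modules described by two metrics.
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 300.0]])
P = min_max_normalize(X)   # every column now lies in [0, 1]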
Step 2, extracting features with the improved self-encoder. A shared hidden layer self-encoder is adopted to extract features; a sharing mechanism is added to the original self-encoder to solve the problem of data distribution difference in cross-project defect prediction. Assume X_tr and X_te are the training data set and the test data set, respectively, and X ∈ {X_tr ∪ X_te} is the set of shuffled training and test data, where N is the number of features and N_tr and N_te are the numbers of instances in the training and test sets. A conventional self-encoder attempts to find a common depth feature representation of the input data, making the output as close as possible to the input. It usually comprises two stages, encoding and decoding. Given input data x_i ∈ X_tr, the encoding and decoding stages are expressed respectively as follows:

Encoding stage: y(x_i) = f(w_1 x_i + b_1)   (2)

Decoding stage: x̂_i = f(w_2 y(x_i) + b_2)   (3)
where x_i is the input, x̂_i is the output, y(x_i) is the output of the hidden layer, f(·) is a nonlinear activation function, usually a sigmoid function, w_1 ∈ R^{m×n} and w_2 ∈ R^{n×m} are weight matrices, and b_1 ∈ R^m and b_2 ∈ R^n are biases. The network parameters of the self-encoder can then be expressed as θ = {w_1, b_1, w_2, b_2}. The parameters are updated by minimizing the reconstruction error function L(θ), which is minimized with the Adam optimizer during training of the self-encoder and is expressed as follows:

L(θ) = (1/N_tr) Σ_{i=1}^{N_tr} ||x_i − x̂_i||²   (4)
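For concreteness, below is a minimal PyTorch sketch of such a conventional self-encoder; it is an illustration under assumed layer sizes and training settings, not the patent's exact network:

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """One-hidden-layer self-encoder: y = f(w1 x + b1), x_hat = f(w2 y + b2)."""
    def __init__(self, n_in=20, n_hidden=10):
        super().__init__()
        self.encoder = nn.Linear(n_in, n_hidden)   # w1, b1
        self.decoder = nn.Linear(n_hidden, n_in)   # w2, b2
        self.f = nn.Sigmoid()                      # non-linear activation f(.)

    def forward(self, x):
        y = self.f(self.encoder(x))        # encoding stage, Eq. (2)
        x_hat = self.f(self.decoder(y))    # decoding stage, Eq. (3)
        return y, x_hat

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 20)                     # a mini-batch of normalized metrics

for _ in range(100):
    _, x_hat = model(x)
    loss = ((x - x_hat) ** 2).sum(dim=1).mean()   # reconstruction error L(theta), Eq. (4)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()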
In order to solve the problem of data distribution difference in cross-project defect prediction, the invention improves the original self-encoder by adding a shared-parameter mechanism to obtain a self-encoder with a shared hidden layer. The depth features of the hidden layer are obtained by minimizing the reconstruction error L(θ_all), which comprises 2 parts: L(θ_tr) and L(θ_te). L(θ_tr) expresses the reconstruction error of the training data through the Euclidean distance between its input and output; at the same time, to make full use of the label information in the source data, an intra-class loss, a global inter-class loss and a local inter-class loss are added so that, during feature learning, the inter-class distance of the source-domain data is maximized and the intra-class distance minimized. L(θ_tr) thus consists of three parts: the reconstruction error loss term, the intra-class loss term L_w, and the inter-class loss term L_b. The reconstruction error loss term makes the output reconstruct the input as well as possible; the intra-class loss term keeps samples of the same class in the source data sufficiently close to their class center, achieving intra-class minimization; the inter-class loss term considers both a global term L_b^g, which keeps the class centers of the two classes sufficiently far apart, and a local term L_b^l, which keeps each sample of class 0 (respectively class 1) as far as possible from the center of its k nearest neighbour samples of class 1 (respectively class 0), thereby maximizing the inter-class distance. L(θ_te) represents the reconstruction error of the test data through the Euclidean distance between its input and output. L(θ_tr) and L(θ_te) are defined as follows:

L(θ_tr) = (1/N_tr) Σ_{i=1}^{N_tr} ||x_i^tr − x̂_i^tr||² + L_w + L_b   (5)

L_w = Σ_{j=1}^{n_0} ||x̂_j^0 − m̂_0||² + Σ_{j=1}^{n_1} ||x̂_j^1 − m̂_1||²   (6)

L_b = L_b^g + L_b^l = −||m̂_0 − m̂_1||² − Σ_{j=1}^{n_0} ||x̂_j^0 − m̂_k^1(x_j^0)||² − Σ_{j=1}^{n_1} ||x̂_j^1 − m̂_k^0(x_j^1)||²   (7)

L(θ_te) = (1/N_te) Σ_{i=1}^{N_te} ||x_i^te − x̂_i^te||²   (8)
where m̂_k^1(x_j^0) means that for each sample of class 0, the mean of its k nearest neighbour samples among the samples of class 1 is selected; likewise, m̂_k^0(x_j^1) means that for each sample of class 1, the mean of its k nearest neighbour samples among the samples of class 0 is selected. x̂^tr refers to the decoded features of the training data set, and x̂^te to the decoded features of the test data set. x_j^0 are all class-0 samples and x_j^1 all class-1 samples in the training data, with n_0 and n_1 their respective numbers. x̂_j^0 and x̂_j^1 are the decoded training samples of class 0 and class 1, respectively, and m̂_0 and m̂_1 are the sample means of class 0 and class 1 after decoding the training data. Combining the above formulas and optimizing L(θ_tr) and L(θ_te) simultaneously, the final objective function is expressed as follows:
L(θ_all) = L(θ_tr) + r·L(θ_te)   (9)
All network parameters to be optimized are θ_all = {w_1, b_1, w_2, b_2}. r is a regularization parameter that regularizes the behaviour of the self-encoder; the purpose of adding the regularization term is to make the feature distributions of the training data and the test data more and more similar by adjusting the value of r.
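The following sketch shows one way the combined objective of Equation (9) could be computed, reusing the AutoEncoder sketch above; the helper knn_mean, the scaling of the loss terms, and the default values of r and k are illustrative assumptions, not the patent's exact formulation:

import torch

def knn_mean(anchor, others, k=5):
    """Mean of the k nearest neighbours of each row of `anchor` among `others`."""
    d = torch.cdist(anchor, others)                 # pairwise Euclidean distances
    idx = d.topk(k, largest=False).indices          # k nearest neighbours
    return others[idx].mean(dim=1)

def shared_hidden_layer_loss(model, x_tr, x_te, y_tr, r=10.0, k=5):
    _, xhat_tr = model(x_tr)                        # shared parameters: the same
    _, xhat_te = model(x_te)                        # network encodes both sets
    rec_tr = ((x_tr - xhat_tr) ** 2).sum(1).mean()  # training reconstruction error
    rec_te = ((x_te - xhat_te) ** 2).sum(1).mean()  # L(theta_te), Eq. (8)

    xhat0, xhat1 = xhat_tr[y_tr == 0], xhat_tr[y_tr == 1]
    m0, m1 = xhat0.mean(0), xhat1.mean(0)
    # Intra-class term L_w: keep samples close to their decoded class centre.
    l_w = ((xhat0 - m0) ** 2).sum() + ((xhat1 - m1) ** 2).sum()
    # Global inter-class term: push the two class centres apart.
    l_b_g = -((m0 - m1) ** 2).sum()
    # Local inter-class term: push each sample away from the mean of its
    # k nearest neighbours of the opposite class.
    l_b_l = -(((xhat0 - knn_mean(xhat0, xhat1, k)) ** 2).sum()
              + ((xhat1 - knn_mean(xhat1, xhat0, k)) ** 2).sum())

    loss_tr = rec_tr + l_w + l_b_g + l_b_l          # L(theta_tr), Eq. (5)
    return loss_tr + r * rec_te                     # L(theta_all), Eq. (9)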
Step 3, introducing the focal loss technique and training the improved focal-loss classifier. Because of the class imbalance of the data set itself, the number of defective modules is small; to enable the network to discover and learn the characteristics of defective modules and thus distinguish defective modules from non-defective modules, the invention introduces the focal loss technique. During training, samples are balanced by assigning different weights to the samples of different classes; whether a sample is easy to classify is also considered, with easy-to-classify samples given smaller weights and easily misclassified samples given larger weights, thereby alleviating the class imbalance. Finally, the classifier is trained with the depth feature representation of the training data obtained in step 2. The classifier loss C may use a cross-entropy loss function to compute the similarity between the true label and the predicted label, defined as follows:

C = −(1/N_tr) Σ_{i=1}^{N_tr} Σ_{c=1}^{k} y_i^c log(g(ŷ_i^c))   (10)

where N_tr represents the number of training samples, c represents the label class, and k represents the number of label classes, here 2; y_i^c is the true label, g(ŷ_i^c) is the predicted probability of the label, and g(·) is the activation function.
Based on this classifier, we add a two-part weight, u and the modulating factor (1 − g(ŷ_i^c))^γ, to propose the focal loss function. u mainly solves the class imbalance problem: a small weight u (0 < u < 1) is given to the class with many samples (the non-defective class) and a large weight 1 − u to the class with few samples (the defective class), the two weights summing to 1, so that the sample numbers of the two classes are balanced. The modulating factor mainly addresses samples that are difficult to classify during defect-prediction learning: for the class with many samples, that is, the non-defective class, the classifier more easily learns to judge which class a sample belongs to and therefore obtains a larger probability value g(ŷ_i^c). Thus the easier a sample is to classify in defect prediction, the smaller the value of (1 − g(ŷ_i^c))^γ; conversely, the harder a sample is to classify, the larger the value, so the classifier pays more attention to hard-to-classify samples and can better learn their characteristics. The final focal loss function can thus be expressed in the form:

FL = −(1/N_tr) Σ_{i=1}^{N_tr} Σ_{c=1}^{k} u_c (1 − g(ŷ_i^c))^γ y_i^c log(g(ŷ_i^c))   (11)

where u_c equals u for the non-defective class and 1 − u for the defective class, and γ ≥ 0 is the focusing exponent.
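A minimal sketch of this focal loss for the binary case follows; the function signature and the default values of u and gamma are illustrative assumptions, since the text does not fix them:

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, u=0.25, gamma=2.0):
    """Class-weighted focal loss in the spirit of Eq. (11).

    logits:  (N, 2) raw classifier outputs.
    targets: (N,) integer labels, 0 = non-defective, 1 = defective.
    """
    p = F.softmax(logits, dim=1)                           # g(y_hat): class probabilities
    p_true = p.gather(1, targets.unsqueeze(1)).squeeze(1)  # probability of the true class
    # u_c: weight u for the non-defective class, 1 - u for the defective class.
    class_w = u + (1.0 - 2.0 * u) * targets.float()
    modulator = (1.0 - p_true) ** gamma                    # small for easy samples
    return -(class_w * modulator * torch.log(p_true + 1e-12)).mean()

# Example: two samples, one per class.
logits = torch.tensor([[2.0, -1.0], [0.3, 0.2]])
targets = torch.tensor([0, 1])
loss = focal_loss(logits, targets)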
To verify the superiority of the algorithm, the cross-project software defect prediction algorithm based on the focal-loss shared hidden layer self-encoder is compared with 5 other cross-project defect prediction methods: TCA+, TDS, Dycom, LT, and SHLA (the cross-project defect prediction algorithm of the shared hidden layer self-encoder without focal loss). The 10 PROMISE projects used as experimental data are shown in Table 1, where #instance represents the number of instances, #defect the number of defective instances, and %defect the proportion of defective instances among all instances.
Table 1: the 10 projects of the PROMISE data set used in the experiments
Datasets #instance #defect %defect
ant-1.7 745 166 22.28
camel-1.6 965 188 19.48
jedit-3.2 272 90 33.09
log4j-1.0 135 34 25.19
lucene-2.0 195 91 46.67
poi-1.5 237 141 59.49
redaktor 176 27 15.34
synapse-1.0 157 16 10.19
xalan-2.6 885 411 46.44
xerces-1.3 453 69 15.23
The evaluation indices of the prediction model are mainly F-measure and Accuracy. They can be expressed in terms of the TP, FN, FP, and TN defined in the confusion matrix of Table 2:

TABLE 2 confusion matrix

                        Predicted defective   Predicted non-defective
Actually defective              TP                     FN
Actually non-defective          FP                     TN

Recall: the proportion of defective samples predicted as defective among all defective samples, i.e., recall = TP/(TP + FN). Precision: the proportion of correct predictions among the samples predicted as defective, i.e., precision = TP/(TP + FP); this ratio evaluates the correctness of the model's predictions of defective modules. The F-measure index is the harmonic mean of recall and precision, i.e., F-measure = (2 × recall × precision)/(recall + precision). The Accuracy index evaluates the degree to which both defective and non-defective modules are correctly classified, i.e., Accuracy = (TP + TN)/(TP + TN + FP + FN). The larger the values of F-measure and Accuracy, the better the prediction performance of the software defect prediction model.
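These definitions translate directly into code; a small sketch with made-up counts:

def defect_metrics(tp, fn, fp, tn):
    recall = tp / (tp + fn)                       # defects found / all defects
    precision = tp / (tp + fp)                    # defects found / all flagged
    f_measure = 2 * recall * precision / (recall + precision)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return recall, precision, f_measure, accuracy

# Example: 30 defects found, 10 missed, 20 false alarms, 140 correct rejections.
r, p, f, a = defect_metrics(tp=30, fn=10, fp=20, tn=140)
print(f"recall={r:.3f} precision={p:.3f} F-measure={f:.3f} Accuracy={a:.3f}")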
The experimental setup selects 1 of the 10 PROMISE projects as the test data (target project), with the remaining 9 projects serving in turn as the source project (training project). There are therefore 9 cross-project combinations for each target project, and 90 possible cross-project combinations over the 10 projects. In training the self-encoder, the model has 4 hidden layers and the number of nodes per layer is set as 20-15-10-10-2, where 20 is the feature dimension of the input data and 2 is the feature dimension fed into the softmax classifier. During training, a ReLU activation function is adopted for each layer, the number of layers is set empirically, and the Adam optimizer is used for parameter optimization. Each mini-batch in the experiment is set to 64, and the range of the hyperparameter r is r ∈ {0.1, 0.5, 1, 5, 10, 15}; good results are obtained when r = 10.
To verify whether the proposed algorithm performs well against the comparison algorithms, experiments were performed on the 10 PROMISE projects; the F-measure results are shown in Table 3 and the Accuracy results in Table 4:
TABLE 3 experimental results of our model and the 5 comparison algorithms on F-measure

(Table 3 appears only as an image in the original publication; the per-project F-measure values are not recoverable from the text.)
As can be seen from the experimental results in Table 3, the F-measure values of our model exceed those of the other 5 comparison algorithms; the F-measure values range from 0.257 to 0.649, and our model improves the F-measure results by at least 0.019 and at most 0.418.
TABLE 4 experimental results of our model and the 5 comparison algorithms on Accuracy
target TDS TCA+ Dycom LT SHLA Ours
ant-1.7 0.680 0.684 0.674 0.675 0.631 0.721
camel-1.6 0.742 0.618 0.769 0.722 0.731 0.639
jedit-3.2 0.593 0.663 0.710 0.599 0.702 0.722
log4j-1.0 0.715 0.657 0.763 0.726 0.711 0.716
lucene-2.0 0.538 0.621 0.600 0.533 0.621 0.637
poi-1.5 0.559 0.576 0.435 0.527 0.611 0.618
redaktor 0.579 0.556 0.386 0.648 0.361 0.495
synapse-1.0 0.761 0.641 0.796 0.643 0.592 0.613
xalan-2.6 0.417 0.591 0.603 0.531 0.582 0.611
xerces-1.3 0.714 0.627 0.764 0.757 0.810 0.814
average 0.630 0.623 0.650 0.636 0.635 0.659
improved 0.029 0.036 0.009 0.023 0.024 -
As can be seen from the experimental results in Table 4, the Accuracy values of our model improve on the other 5 comparison algorithms: the Accuracy mean of our model is 0.659, an improvement of at least 0.009 (0.659 − 0.650) over the best comparison method.
The above experiments show that although the TCA+, TDS, Dycom, LT, and SHLA algorithms achieve better F-measure and Accuracy values on certain individual projects, the model proposed by the invention has better average F-measure and average Accuracy overall, outperforming the five comparison algorithms and demonstrating the superiority of the proposed method.
In addition to the above embodiments, the present invention may have other embodiments. All technical solutions formed by adopting equivalent substitutions or equivalent transformations fall within the protection scope of the claims of the present invention.

Claims (2)

1. A cross-project software defect prediction method based on a shared hidden layer self-encoder is characterized by comprising the following steps:
(1) dividing a pre-acquired data set into a training set and a testing set, and performing data preprocessing;
(2) extracting features by adopting a self-encoder with a sharing mechanism, and respectively extracting depth features of a training set and a test set;
(3) introducing a focus loss function, and training a classifier;
the self-encoder with a sharing mechanism in step (2) is a self-encoder to which a shared-parameter mechanism is added to obtain a shared hidden layer, implemented as follows: the depth features of the hidden layer are obtained by minimizing the reconstruction error L(θ_all), which comprises two parts, L(θ_tr) and L(θ_te), defined as follows:

L(θ_tr) = (1/N_tr) Σ_{i=1}^{N_tr} ||x_i^tr − x̂_i^tr||² + L_w + L_b   (5)

L_w = Σ_{j=1}^{n_0} ||x̂_j^0 − m̂_0||² + Σ_{j=1}^{n_1} ||x̂_j^1 − m̂_1||²   (6)

L_b = L_b^g + L_b^l = −||m̂_0 − m̂_1||² − Σ_{j=1}^{n_0} ||x̂_j^0 − m̂_k^1(x_j^0)||² − Σ_{j=1}^{n_1} ||x̂_j^1 − m̂_k^0(x_j^1)||²   (7)

L(θ_te) = (1/N_te) Σ_{i=1}^{N_te} ||x_i^te − x̂_i^te||²   (8)
where L(θ_tr) uses the Euclidean distance between the input and the output of the training data to represent their reconstruction error; L(θ_tr) consists of three parts, namely the reconstruction error loss term, the intra-class loss term L_w, and the inter-class loss term L_b, in which L_b^g is a global inter-class loss term and L_b^l is a local inter-class loss term; L(θ_te) is the Euclidean distance between the input and output of the test data, representing their reconstruction error; m̂_k^1(x_j^0) means that for each sample of class 0, the mean of its k nearest neighbour samples among the samples of class 1 is selected, and m̂_k^0(x_j^1) means that for each sample of class 1, the mean of its k nearest neighbour samples among the samples of class 0 is selected; x̂^tr refers to the decoded features of the training data set and x̂^te to the decoded features of the test data set; x_j^0 are all class-0 samples and x_j^1 all class-1 samples in the training data, with n_0 and n_1 their respective numbers; x̂_j^0 and x̂_j^1 are the decoded training samples of class 0 and class 1, respectively; and m̂_0 and m̂_1 are the sample means of class 0 and class 1 after decoding the training data; optimizing L(θ_tr) and L(θ_te) simultaneously, the final objective function is expressed as follows:
L(θ_all) = L(θ_tr) + r·L(θ_te)   (9)
where minimizing L(θ_all) yields the depth features of the hidden layer, the set of all network parameters to be optimized is θ_all = {w_1, b_1, w_2, b_2}, and r is a regularization parameter;
the focus loss function in step (3) is realized by the following formula:

FL = −(1/N_tr) Σ_{i=1}^{N_tr} Σ_{c=1}^{k} u_c (1 − g(ŷ_i^c))^γ y_i^c log(g(ŷ_i^c))   (11)

where N_tr represents the number of training samples, c the label class, and k the number of label classes; y_i^c is the true label, g(ŷ_i^c) the predicted probability of the label, and g(·) an activation function; u_c is the sample class weight, with a small weight u (0 < u < 1) given to non-defective samples and a large weight 1 − u to defective samples, the two weights summing to 1; and γ ≥ 0 is the focusing exponent: the modulating factor (1 − g(ŷ_i^c))^γ is smaller for samples that are easier to classify in defect prediction and larger for samples that are more difficult to classify.
2. The method for cross-project software defect prediction based on shared hidden layer self-encoder as claimed in claim 1, wherein the preprocessing of step (1) is implemented by the following formula:

P_i = (x_i − min(x)) / (max(x) − min(x))   (1)

where P_i is the feature value after normalization preprocessing of a given feature x, max(x) and min(x) are the maximum and minimum values of x, and x_i is each value of feature x.
CN202010001850.9A 2020-01-02 2020-01-02 Cross-project software defect prediction method based on shared hidden layer self-encoder Active CN111198820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010001850.9A CN111198820B (en) 2020-01-02 2020-01-02 Cross-project software defect prediction method based on shared hidden layer self-encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010001850.9A CN111198820B (en) 2020-01-02 2020-01-02 Cross-project software defect prediction method based on shared hidden layer self-encoder

Publications (2)

Publication Number Publication Date
CN111198820A CN111198820A (en) 2020-05-26
CN111198820B true CN111198820B (en) 2022-08-26

Family

ID=70746714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010001850.9A Active CN111198820B (en) 2020-01-02 2020-01-02 Cross-project software defect prediction method based on shared hidden layer self-encoder

Country Status (1)

Country Link
CN (1) CN111198820B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015659A (en) * 2020-09-02 2020-12-01 三维通信股份有限公司 Prediction method and device based on network model
CN112199280B (en) * 2020-09-30 2022-05-20 三维通信股份有限公司 Method and apparatus for predicting software defects, storage medium, and electronic apparatus
CN117421244B (en) * 2023-11-17 2024-05-24 北京邮电大学 Multi-source cross-project software defect prediction method, device and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108446711A (en) * 2018-02-01 2018-08-24 南京邮电大学 A kind of Software Defects Predict Methods based on transfer learning
CN109710512A (en) * 2018-12-06 2019-05-03 南京邮电大学 Neural network software failure prediction method based on geodesic curve stream core

Also Published As

Publication number Publication date
CN111198820A (en) 2020-05-26

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210003 Gulou District, Jiangsu, Nanjing, new model road, No. 66
Applicant after: NANJING UNIVERSITY OF POSTS AND TELECOMMUNICATIONS
Address before: No. 9, Wenyuan Road, Qixia District, Nanjing City, Jiangsu Province, 210046
Applicant before: NANJING UNIVERSITY OF POSTS AND TELECOMMUNICATIONS

GR01 Patent grant