CN117421244B - Multi-source cross-project software defect prediction method, device and storage medium - Google Patents
Multi-source cross-project software defect prediction method, device and storage medium
- Publication number
- CN117421244B (application CN202311540803.1A)
- Authority
- CN
- China
- Prior art keywords
- source
- item
- sample
- encoder
- discriminator
- Prior art date: 2023-11-17
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3604—Software analysis for verifying properties of programs
- G06F11/3608—Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
Abstract
The invention discloses a multi-source cross-project software defect prediction method, device and storage medium. The method comprises the following steps: inputting a plurality of source item data sets and a target item data set; extracting sample features from all data sets using an encoder; training a discriminator on the item labels after reversing the gradient of the sample features; calculating the maximum mean difference between the features of each source item and the target item, taking the correlation between the target samples and each source item output by the discriminator as attention scores, and taking the weighted sum of the maximum mean differences as the loss of the encoder; establishing a defect-class classifier; integrally training the encoder, the discriminator and the classifier; and performing feature extraction and defect classification on the target item data set with the trained encoder and classifier. The device comprises an input module, an encoder G, a discriminator D, a classifier C and a gradient inversion module. The method realizes multi-source cross-project software defect prediction, and experimental verification shows high defect identification accuracy.
Description
Technical Field
The invention belongs to the technical field of software testing, and particularly relates to a multi-source cross-project software defect prediction method, a multi-source cross-project software defect prediction device and a storage medium thereof.
Background
Defect prediction is a key problem in the field of software engineering. However, in real-world engineering projects, several problems must be solved before defect class prediction can be performed. First, the number of defect categories and of defect samples in a project is limited, and the sample size may not be sufficient to support model training. Second, the distribution of defects in a project exhibits a long-tail effect: tail defect categories, although small in number, may cause serious problems. Traditional cross-project software defect prediction involves only one source project and one target project. Multi-source cross-project software defect prediction uses a plurality of source projects together with one target project and therefore has wide practical value. Both, however, are typically weaker than within-project defect prediction, mainly because the features of the source and target projects differ and their distributions are inconsistent.
Existing multi-source cross-project software defect prediction methods suffer from low prediction accuracy and reliability because the feature distributions of the source projects and the target project differ, and it is difficult to meet the requirements on multi-source cross-project software defect prediction results.
Disclosure of Invention
The invention provides a multi-source cross-project software defect prediction method, a multi-source cross-project software defect prediction device and a storage medium thereof, which are used for solving the problem that the characteristic distribution difference of a source project and a target project in the prior art has a great negative influence on a prediction result.
The invention provides a multi-source cross-project software defect prediction method, which comprises the following steps:
Step 1, inputting a plurality of source project data sets and a target project data set;
The data set of the target item or each source item comprises samples of K types of defect categories, and each sample is provided with a defect category label and an item label; the defect type label of the source item sample is known, and the defect type label of the target item is unknown; the item tag is used for marking which source item or target item the sample belongs to;
step 2, extracting features from samples in all project data sets by using an encoder G;
each sample is a source code file, firstly, word segmentation is carried out on a source code to obtain a corresponding token id, and then the token id is input into an encoder G for feature extraction;
step 3, performing gradient inversion operation on the extracted sample characteristics, and training a discriminator D by using the sample characteristics with the gradient inversion;
The input of the discriminator D comprises the gradient-inverted sample features of all source items and the target item, and the output is the probability that each sample belongs to the different items; in the training process, the probability that a sample feature of the target item belongs to each of the different source items is obtained as the attention score, and the adversarial training loss of the discriminator D is calculated at the same time;
Step 4, calculating the maximum mean value difference of the sample characteristics of each source item and each target item, and carrying out weighted summation on the maximum mean value difference of the sample characteristics of each source item and each target item by using the attention score to obtain the coding loss of the encoder G;
step 5, a classifier C is established, the input of the classifier C is the sample characteristics of the source item extracted in the step 2, and the probability that the sample belongs to various defect types is output;
Step 6, comprehensively training the encoder G, the discriminator D and the classifier C, and updating model parameters;
the loss during the comprehensive training is: $Loss = L_{adv} + L_{mmd} + L_{cls}$, wherein $L_{adv}$ is the adversarial training loss of the discriminator D, $L_{mmd}$ is the coding loss of the encoder G, and $L_{cls}$ is the classification loss of the classifier C; $L_{adv}$ acts on the updates of the encoder G and the discriminator D, $L_{mmd}$ acts on the update of the encoder G, and $L_{cls}$ acts on the updates of the encoder G and the classifier C;
and 7, obtaining a trained encoder G and a classifier C, segmenting a source code file of the target item, inputting the segmented source code file into the encoder G to extract sample characteristics, and inputting the sample characteristics into the classifier C to perform defect type recognition.
Correspondingly, the invention provides a multi-source cross-project software defect prediction device, which comprises the following functional modules:
An input module that receives input of a plurality of source item datasets and a target item dataset; the data set of the target item or each source item comprises samples of K types of defect categories, and each sample is provided with a defect category label and an item label; the defect type label of the source item sample is known, and the defect type label of the target item is unknown; the item tag is used for marking which source item or target item the sample belongs to;
An encoder G for extracting features from samples in all item data sets; each sample is a source code file, firstly, the source code is segmented to obtain a corresponding token id, and then the token id is input into an encoder G for feature extraction;
the gradient inversion module is used for performing gradient inversion operation on the extracted sample characteristics;
A discriminator D for inputting sample characteristics including all source items and target items after gradient inversion and outputting probabilities that each sample belongs to different items;
the classifier C inputs the sample characteristics of the extracted source items and outputs the probability that the sample belongs to various defect types;
wherein the encoder G, the discriminator D and the classifier C are comprehensively trained as a whole, with the loss $Loss = L_{adv} + L_{mmd} + L_{cls}$, where $L_{adv}$ is the adversarial training loss of the discriminator D, $L_{mmd}$ is the coding loss of the encoder G, and $L_{cls}$ is the classification loss of the classifier C; $L_{adv}$ acts on the updates of the encoder G and the discriminator D, $L_{mmd}$ acts on the update of the encoder G, and $L_{cls}$ acts on the updates of the encoder G and the classifier C; the coding loss $L_{mmd}$ of the encoder G is obtained as follows: the probability that a sample feature of the target item belongs to each of the different source items is obtained from the discriminator D as an attention score, the maximum mean difference between the sample features of each source item and the target item is calculated, and the maximum mean differences are weighted and summed using the attention scores to obtain the coding loss of the encoder G;
After the trained encoder G and classifier C are obtained, the source code file of the target item is segmented, then the source code file is input into the encoder G to extract sample characteristics, and then the source code file is input into the classifier C to carry out defect type recognition.
Further, the present invention provides a readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements a multi-source cross-project software defect prediction method of the present invention.
The invention has the following advantages and positive effects: the disclosed multi-source cross-project software defect prediction method, device and storage medium alleviate the feature distribution difference problem by reducing the maximum mean difference between the feature distributions of the source projects and the target project; project-domain correlations are obtained by introducing adversarial learning into the discriminator, and the maximum mean differences between the different source projects and the target project are weighted with these correlations, which further alleviates the feature distribution difference problem. On this basis the model is trained as a whole, yielding a trained feature-extraction encoder and a defect prediction classifier. Experiments show that multi-source cross-project software defect prediction is realized with high defect prediction accuracy.
Drawings
FIG. 1 is a flow chart of an implementation of the multi-source cross-project software defect prediction method of the present invention.
Detailed Description
The invention will be described in further detail with reference to the drawings and examples.
The invention provides a multi-source cross-project software defect prediction method, device and storage medium, realized on the basis of adversarial training and an attention mechanism. The method comprises the following steps: inputting a plurality of source item data sets and a target item data set; extracting features from all data sets using an encoder; training a discriminator on the item labels of the source items and the target item through adversarial training; calculating the maximum mean difference between the features of each source item and the target item; taking the correlation between the target samples and the source items output by the discriminator as attention scores; weighting the plurality of maximum mean differences with the attention scores; training a classifier on the class labels of the source items; training and updating the parameters of the encoder, the discriminator and the classifier; inputting the target item data set; classifying the target item data set with the classifier; and outputting the classification result.
As shown in FIG. 1, the multi-source cross-project software defect prediction method implemented by the embodiment of the invention comprises the following steps S200 to S216.
S200, inputting a source project data set and a target project data set.
In the present invention, a source item is a software item whose software defects are known, so the category labels $Y^s$ of a source item dataset $X^s$ are known. The target item is a software item whose defects need to be judged, so the category labels $Y^t$ of the target item dataset $X^t$ are unknown. Furthermore, the item labels d of the source items and the target item are considered to exist, i.e. it is known whether each sample belongs to a source item or to the target item. Training uses, as the training set, the source item data with defect category labels and item labels together with the target item data that have item labels but no defect category labels.
In one scenario of the present invention, a set of M source item datasets $S_1, S_2, \ldots, S_M$ is given, where the j-th source item $S_j$ is represented by the dataset $S_j = \{(X_i^{s_j}, Y_i^{s_j})\}_{i=1}^{n_{s_j}}$, in which $n_{s_j}$ is the number of samples contained in the source item $S_j$, $X_i^{s_j}$ is the i-th labeled sample of the source item $S_j$, and $Y_i^{s_j}$ is the defect category label of that sample. A target item dataset T is also given, expressed as $T = \{X_i^{t}\}_{i=1}^{n_t}$, where $X_i^{t}$ is the i-th unlabeled sample, whose defect class label $Y^t$ is unknown, and $n_t$ is the number of samples in the target item T. The number of defect-labeled samples over all source items is $n_s = \sum_{j=1}^{M} n_{s_j}$, and the total number of source and target samples is $n = n_s + n_t$. All samples are additionally marked with an item label $d_i$, giving the sample set $\{(X_i, d_i)\}_{i=1}^{n}$. The item label $d_i$ is an (M+1)-dimensional vector: if sample $X_i$ belongs to source item $S_j$, the data bit of $d_i$ corresponding to $S_j$ takes the value 1 and the remaining bits are 0. Each source item dataset $S_j$, $j \in [1, M]$, contains the same defect categories as the target item dataset T, i.e. $Y^t$ is composed of the same defect classes as any $Y^{s_j}$. For example, each of the source item datasets and the target item dataset contains samples of 44 types of defects.
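For illustration only, the following minimal sketch shows one way the (M+1)-dimensional one-hot item labels $d_i$ described above could be constructed; placing the target item in the last position is an assumption of this sketch, not a detail fixed by the patent.

```python
# Illustrative sketch: build the (M+1)-dimensional one-hot item labels d_i.
# Convention assumed here: indices 0..M-1 are the source items, index M is the target item.
import numpy as np

def make_item_labels(item_indices, num_sources):
    """item_indices: iterable of ints in [0, num_sources]; num_sources = M."""
    labels = np.zeros((len(item_indices), num_sources + 1), dtype=np.float32)
    for row, item in enumerate(item_indices):
        labels[row, item] = 1.0   # set the bit of the item the sample belongs to
    return labels

# Example: samples from source item 0, source item 2, and the target item (index M = 7).
print(make_item_labels([0, 2, 7], num_sources=7))
```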
Similar to single source project software defect identification, Y t is not available during the training process and is only used for evaluation. The object of the invention is therefore to train a model with minimal prediction error in the target domain using the labeled source data. Note that the distribution of samples in the source and target domains is different, and thus, a classifier trained on the source domain typically does not predict the target domain well.
S201, extracting features for all item data sets by the encoder G.
In the source and target item datasets there is a multidimensional feature space in which feature extraction is performed on the samples of each source item and the target item, using the pre-trained model CodeT, which is based on the Transformer architecture, as the encoder G. In the embodiment of the invention, the source code of each sample of the source items and the target item is taken as input: the source code is first segmented into tokens to obtain the corresponding token ids, and the token ids are then fed to CodeT, which extracts the features. The purpose of feature extraction is to classify the samples in the subsequent steps.
In the embodiment of the invention, each sample in the source items and the target item is a source code file; the code file is first segmented, features are then extracted through CodeT, and one feature is extracted for every code file contained in the source items and the target item. For convenience of subsequent processing, the length of the feature extracted for each sample may be set to be the same, and the number of features extracted from each item dataset may be set to be the same. If the number of code files contained in a source item dataset differs from that of the target item dataset, repeated feature extraction can be performed on some code files of the item with fewer files, so that the numbers of features output by the source item and the target item are equal.
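As an illustration of step S201, the sketch below assumes the encoder G is a CodeT5-style pre-trained model loaded through HuggingFace transformers; the checkpoint name, the truncation length and the mean pooling over token embeddings are assumptions of this sketch, not details prescribed by the patent.

```python
# Hedged sketch of S201: tokenize a source code file and extract one fixed-length feature.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")   # assumed checkpoint
encoder = T5EncoderModel.from_pretrained("Salesforce/codet5-base")

def extract_feature(source_code: str) -> torch.Tensor:
    # Segment the source code into token ids, truncating long files (length is an assumption).
    inputs = tokenizer(source_code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state       # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)                   # one feature per code file

feature = extract_feature("public int add(int a, int b) { return a + b; }")
print(feature.shape)   # e.g. torch.Size([768])
```

During training the `torch.no_grad()` guard would be dropped so that gradients can reach the encoder; it is kept here only because the snippet illustrates plain feature extraction.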
S202, performing gradient inversion operation on the extracted features.
A gradient inversion operation is performed on the extracted sample features of the source items and the target item. Gradient inversion automatically reverses the direction of the gradient during back-propagation. The gradient inversion layer sits between the feature extractor G and the discriminator D: during back-propagation, the gradient of the item (domain) classification loss of the discriminator is automatically inverted before it is propagated back to the parameters of the feature extractor, which realizes the adversarial loss. After the operation of step S202, both the sample features of the source items and the target item and their gradient-inverted counterparts are available.
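A gradient inversion layer of this kind is commonly realised in PyTorch as an autograd function that is the identity in the forward pass and multiplies the gradient by −λ in the backward pass; the following is a minimal sketch of that standard construction, not the patent's own code.

```python
# Minimal gradient inversion (reversal) layer between the encoder G and the discriminator D.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                 # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the encoder; no grad for lam.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```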
S203, training the discriminator D with the item labels.
In the present invention, the item label of a source item or the target item indicates which item a sample comes from; training the discriminator with the item labels therefore gives it the ability to identify which item a sample comes from. The present invention trains the discriminator D only with the gradient-inverted sample features. The input of the discriminator D contains the sample features of the M source items and the target item, and it outputs the item label corresponding to each sample. Each item label is a vector of dimension M+1.
In the embodiment of the invention, the discriminator D is implemented with a fully connected layer. For features that have not undergone gradient inversion, the discriminator loss is $L_D = -\frac{1}{n}\sum_{i=1}^{n} d_i^{T}\log D(G(X_i))$, where n is the total number of samples of the source and target items, $d_i$ is the item label of sample $X_i$, $G(\cdot)$ denotes the encoder, $D(\cdot)$ denotes the discriminator, and the superscript T denotes the transpose.
In the embodiment of the invention, only the gradient-inverted features are used to train the discriminator D. Feeding the gradient-inverted sample features to the discriminator D pushes training toward mis-discriminating which item a sample comes from, which confuses the discriminator D and smooths the domain correlations it produces. The adversarial training loss of the discriminator D is expressed as $L_{adv} = -\frac{\lambda}{n}\sum_{i=1}^{n} d_i^{T}\log D(G(X_i))$, computed on the gradient-inverted features.
The parameter λ is gradually changed from 0 to 1 as a monotonically increasing function of the parameter p, where p grows with the number of training rounds and p ∈ [0, 1].
In the embodiment of the invention, during training there are, for example, 7 source items and 1 target item in total; each of the 8 items is taken in turn as the target item with the remaining 7 as source items. In each batch, 1 sample feature is taken from each of the 8 items, and the discriminator D outputs an 8-dimensional item label for each of the 8 samples, where the value in each dimension represents the probability that the sample belongs to the corresponding item.
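The sketch below illustrates a discriminator built from a single fully connected layer, the adversarial item-classification loss on gradient-reversed features, and a λ schedule that rises from 0 towards 1; the specific annealing formula shown is the commonly used one for gradient-reversal training and is an assumption of this sketch, since the patent's own formula is not reproduced here.

```python
# Hedged sketch of S203/S205: discriminator, lambda schedule, adversarial loss.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    def __init__(self, feat_dim: int, num_items: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_items)     # num_items = M source items + 1 target

    def forward(self, features):
        return self.fc(features)                      # unnormalised item logits

def lambda_schedule(p: float, gamma: float = 10.0) -> float:
    """p in [0, 1] grows with the training round; lambda rises from 0 towards 1."""
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0

def adversarial_loss(discriminator, reversed_features, item_labels):
    """Cross-entropy between the predicted item and the one-hot item label d_i."""
    logits = discriminator(reversed_features)
    return F.cross_entropy(logits, item_labels.argmax(dim=1))
```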
S204, obtaining the attention scores of the target item and the source item.
By discriminating the samples of the target item with the discriminator D, the similarity between the target item and each source item can be obtained and used as an attention score for weighting the loss later. Specifically, the probability that a sample feature of the target item belongs to each of the different source items is taken as the attention score. The output w of the discriminator D for an input target sample $X_i$ is the probability that it belongs to the different source items, $w = \mathrm{softmax}(D(G(X_i)))$. A larger probability means that the input target sample is more similar to that source item, and vice versa. Taking seven source items and one target item as an example, if for a certain target item sample the discriminator D outputs the probabilities w = [0.1, 0.6, 0.1, 0.1, 0.05, 0.05, 0] for the 7 source items, the input target sample is most similar to the second source item and least similar to the last one.
S205, obtaining the adversarial training loss.
In S202, the gradient inversion reverses the gradient direction during the update: a gradient that should have increased is decreased, and a gradient that should have decreased is increased, which confuses the discriminator D's ability to discriminate the target samples and drives the encoder to extract more domain-invariant features.
From the discriminator's predictions of which source item each sample belongs to, the loss is calculated through S204 and S205, where the item classification loss is computed as a cross-entropy loss.
S206, calculating the maximum mean difference for the characteristics of each source item and each target item.
For each source item, the maximum mean difference between the sample features of the source item and the sample features of the target item is calculated as follows:
$MMD(S_j, T) = \frac{1}{n_{s_j}^2}\sum_{i=1}^{n_{s_j}}\sum_{i'=1}^{n_{s_j}} k\big(X_i^{s_j}, X_{i'}^{s_j}\big) - \frac{2}{n_{s_j} n_t}\sum_{i=1}^{n_{s_j}}\sum_{h=1}^{n_t} k\big(X_i^{s_j}, X_h^{t}\big) + \frac{1}{n_t^2}\sum_{h=1}^{n_t}\sum_{h'=1}^{n_t} k\big(X_h^{t}, X_{h'}^{t}\big)$
where $MMD(S_j, T)$ represents the maximum mean difference between the source item $S_j$ and the target item T, $n_{s_j}$ is the number of samples of the source item $S_j$, $n_t$ is the number of samples of the target item, $X_i^{s_j}$ is the feature of the i-th sample of the source item $S_j$, $X_h^{t}$ is the feature of the h-th sample of the target item, and the Gaussian kernel is $k(x, x') = \exp\!\big(-\|x - x'\|^2 / (2\sigma^2)\big)$, where σ represents the standard deviation of the Gaussian distribution and x, x' are two sample features.
Reducing the maximum mean difference between the source item and the target item may reduce the feature distribution difference between the source item and the target item.
S207, weighting the plurality of maximum mean differences.
In combination with the attention scores obtained in S204, the plurality of maximum mean differences calculated in S206 are weighted; the advantage is that the influence of similar source items on the target item is amplified while the influence of irrelevant source items is reduced:
$L_{mmd} = \sum_{j=1}^{M} w_j \cdot MMD(S_j, T)$
where $w_j$ denotes the similarity between the target samples and the source item $S_j$, and $L_{mmd}$ represents the weighted sum of the maximum mean differences over all source items and the target item.
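A possible PyTorch rendering of S206 and S207 is sketched below: a Gaussian-kernel maximum mean difference per source item, weighted by the attention scores w. Using a single kernel bandwidth σ is an assumption of this sketch.

```python
# Hedged sketch of S206-S207: Gaussian-kernel MMD per source item, attention-weighted sum.
import torch

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), computed for all pairs.
    dist2 = torch.cdist(x, y) ** 2
    return torch.exp(-dist2 / (2 * sigma ** 2))

def mmd(source_feats, target_feats, sigma=1.0):
    k_ss = gaussian_kernel(source_feats, source_feats, sigma).mean()
    k_tt = gaussian_kernel(target_feats, target_feats, sigma).mean()
    k_st = gaussian_kernel(source_feats, target_feats, sigma).mean()
    return k_ss + k_tt - 2 * k_st

def weighted_mmd_loss(source_feat_list, target_feats, attention_scores, sigma=1.0):
    """attention_scores: length-M tensor, one weight w_j per source item."""
    losses = torch.stack([mmd(s, target_feats, sigma) for s in source_feat_list])
    return (attention_scores * losses).sum()
```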
S208, the weighted maximum mean difference loss is obtained as the loss of the encoder G.
The weighted maximum mean difference obtained in S207 is taken as the loss term $L_{mmd}$ of the encoder G, further reducing the feature distribution difference between the source items and the target item.
S209, training a classifier C for the class labels of the source items.
In the present invention, besides aligning the feature distributions of the target item and the source items, the representation learned on each source item is also important for the classification result. The classifier C is trained by combining the sample features of the source items extracted in step S201 with the class labels of the samples; the input of the classifier C is the sample features of the source items that have not undergone gradient inversion, and its output is the specific defect class.
S210, obtaining the classification loss.
The classification loss obtained in S209 is used as a loss term, i.e. $L_{cls} = -\frac{1}{n_s}\sum_{i=1}^{n_s} Y_i^{T}\log C\big(G(X_i^{s})\big)$, where $Y_i$ represents the defect class label of the source sample $X_i^{s}$ and $C(\cdot)$ is the classifier. If there are 44 defect classes, the defect class vector is a 44-dimensional vector, and each data bit corresponds to the probability of one defect class.
S211, training and updating parameters of the encoder G, the discriminator D and the classifier C.
In the present invention, a complete model comprising the encoder G, the discriminator D and the classifier C is trained with the three losses from S205, S208 and S210, i.e. $Loss = L_{adv} + L_{mmd} + L_{cls}$.
In the present invention, the loss $L_{adv}$ relates to the update of the encoder G and the discriminator D, $L_{mmd}$ to the update of the encoder G, and $L_{cls}$ to the update of the encoder G and the classifier C. The whole model is trained on the training dataset and the model parameters are updated.
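A hedged sketch of one joint training step is given below: the three losses are summed and back-propagated once, so that $L_{adv}$ reaches G and D through the gradient inversion layer, $L_{mmd}$ reaches G, and $L_{cls}$ reaches G and C. The helper functions refer to the earlier sketches and are passed in as arguments; treating the attention scores as constants and placing the target item in the last label dimension are illustrative assumptions, not details fixed by the patent text.

```python
# Hedged sketch of S211: one joint update of encoder features, discriminator and classifier.
import torch
import torch.nn.functional as F

def training_step(src_feat_list, tgt_feats, item_labels, defect_labels,
                  discriminator, classifier, lam, optimizer,
                  grad_reverse_fn, adversarial_loss_fn, weighted_mmd_fn):
    src_feats = torch.cat(src_feat_list)                 # all labelled source features
    all_feats = torch.cat([src_feats, tgt_feats])        # same order as item_labels

    # L_adv: item classification on gradient-reversed features.
    loss_adv = adversarial_loss_fn(discriminator, grad_reverse_fn(all_feats, lam), item_labels)

    # Attention scores: average probability that target samples belong to each source item
    # (target item assumed to occupy the last label dimension; scores treated as constants).
    with torch.no_grad():
        w = torch.softmax(discriminator(tgt_feats).mean(dim=0)[:-1], dim=0)
    loss_mmd = weighted_mmd_fn(src_feat_list, tgt_feats, w)

    # L_cls: defect-class cross-entropy on the (non-reversed) source features.
    loss_cls = F.cross_entropy(classifier(src_feats), defect_labels)

    loss = loss_adv + loss_mmd + loss_cls
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```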
S212, outputting the trained encoder G and the classifier C.
In the test stage of the present invention, the target samples are tested using the encoder G and the classifier C trained in steps S201 to S211.
S213, inputting the target item data set.
In practice, the class label of the target item is empty, and in the present invention, the class label of the target item dataset is considered to be present but not labeled. After the source code file in the target item is segmented, the encoder G extracts the characteristics.
S214, predicting on the target item data set by using the classifier C.
The classifier C classifies the samples in the target item dataset based on the representations learned during training.
S215, calculating a prediction result of the classifier C. And summarizing the results of the classifier C, and counting the number according to different defect categories.
S216, outputting a classification result.
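For illustration, the test stage S212 to S216 could be sketched as follows, assuming `extract_feature` is the encoder sketch given earlier and `classifier` is the trained classifier C: each target source code file is tokenised, encoded, classified, and the predictions are tallied per defect class.

```python
# Hedged sketch of S212-S216: predict defect classes for a target item and count them.
import collections
import torch

def predict_target_item(code_files, classifier, extract_feature):
    predictions = []
    counts = collections.Counter()
    for code in code_files:
        feature = extract_feature(code)                    # encoder G applied to token ids
        with torch.no_grad():
            probs = torch.softmax(classifier(feature.unsqueeze(0)), dim=1)
        label = int(probs.argmax(dim=1).item())
        predictions.append(label)                          # S214: per-sample prediction
        counts[label] += 1                                 # S215: tally per defect category
    return predictions, counts                             # S216: classification result
```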
Correspondingly, the invention can also realize a multi-source cross-project software defect prediction device according to the method, which comprises the following steps: an input module, an encoder G, a discriminator D, a classifier C and a gradient inversion module. Wherein the input module receives input of a plurality of source item datasets and a target item dataset; the data set of the target item or each source item comprises samples of K types of defect categories, and each sample is provided with a defect category label and an item label; the defect class labels of the source item samples are known and the defect class labels of the target items are unknown. The encoder G extracts characteristics from samples in all project data sets; each sample is a source code file, firstly, the source code is segmented to obtain a corresponding token id, and then the token id is input into the encoder G for feature extraction. The gradient inversion module performs gradient inversion operation on the extracted sample features. The discriminator D trains by using sample characteristics including all source items and target items after gradient inversion, and outputs probabilities that each sample belongs to different items. The classifier C inputs the extracted sample features of the source item and outputs probabilities that the samples belong to various defect types.
The encoder G, the discriminator D and the classifier C are comprehensively trained as a whole and the model parameters are updated; the loss function Loss during training is as described in S211. The encoder G uses the attention scores to reduce the maximum mean difference between the source items and the target item, thereby reducing their feature distribution difference. The coding loss $L_{mmd}$ of the encoder G is obtained as follows: the probability that a sample feature of the target item belongs to each of the different source items is obtained from the discriminator D as an attention score, the maximum mean difference between the sample features of each source item and the target item is calculated, and these maximum mean differences are weighted and summed with the attention scores to give the coding loss $L_{mmd}$ of the encoder G. After training, the trained encoder G and classifier C are output.
And (3) utilizing the trained encoder G and classifier C, segmenting the source code file of the target item, inputting the segmented source code file into the encoder G to extract sample characteristics, and inputting the sample characteristics into the classifier C to perform defect type recognition.
Furthermore, the present invention is based on the above method, and further realizes a readable storage medium having a computer program stored thereon, which when executed by a processor, implements a multi-source cross-project software defect prediction method as described above.
The multi-source cross-project software defect prediction method is subjected to experimental verification, and the defect classification accuracy ACC index is shown in table 1.
TABLE 1 accuracy of multi-source cross-project software defect prediction using the method of the present invention
As shown in Table 1, the first row lists 8 items: Apache JMeter, Apache Jena, Apache Lenya, … …, JTree. In the experiments, each item is used in turn as the target item with the remaining 7 as source items, and defect classification is performed using the method of the invention. The accuracy of defect classification on the target item samples in each experiment is shown in the second row of Table 1. The results show that the method of the invention realizes multi-source cross-project software defect prediction with ACC above 0.9, i.e. the prediction accuracy for target item sample defects is high.
Apart from the technical features described in the specification, the remaining features are known to those skilled in the art. Descriptions of well-known components and techniques are omitted so as not to obscure the present application unnecessarily. The embodiments described above are not intended to represent all embodiments consistent with the present application; various modifications or variations may be made by those skilled in the art on the basis of the technical solutions of the present application, without inventive effort, while remaining within the scope of the present application.
Claims (6)
1. A multi-source cross-project software defect prediction method is characterized by comprising the following steps:
Step 1, inputting a plurality of source project data sets and a target project data set;
The data set of the target item or each source item comprises samples of K types of defect categories, and each sample is provided with a defect category label and an item label; the defect type label of the source item sample is known, and the defect type label of the target item is unknown; the item tag is used for marking which source item or target item the sample belongs to;
step 2, extracting features from samples in all project data sets by using an encoder G;
each sample is a source code file, firstly, word segmentation is carried out on a source code to obtain a corresponding token id, and then the token id is input into an encoder G for feature extraction;
step 3, performing gradient inversion operation on the extracted sample characteristics, and training a discriminator D by using the sample characteristics with the gradient inversion;
The input of the discriminator D comprises the gradient-inverted sample features of all source items and the target item, and the output is the probability that each sample belongs to the different items; in the training process, the probability that a sample feature of the target item belongs to each of the different source items is obtained as the attention score, and the adversarial training loss of the discriminator D is calculated at the same time;
the discriminator D is implemented with a fully connected layer, and the adversarial training loss $L_{adv}$ of the discriminator D is expressed as: $L_{adv} = -\frac{\lambda}{n}\sum_{i=1}^{n} d_i^{T}\log D\big(G(X_i)\big)$, computed on the gradient-inverted sample features,
wherein n represents the number of all sample features of all source items and target items, $X_i$ represents the i-th sample feature, $d_i$ is the item label of the i-th sample, and the superscript T represents the transpose; G represents the encoder and D represents the discriminator; the parameter λ is gradually changed from 0 to 1 as a monotonically increasing function of the parameter p, where p increases with the number of training rounds and p ∈ [0, 1];
Step 4, calculating the maximum mean value difference of the sample characteristics of each source item and each target item, and carrying out weighted summation on the maximum mean value difference of the sample characteristics of each source item and each target item by using the attention score to obtain the coding loss of the encoder G;
let the j-th source item $S_j$ contain $n_{s_j}$ samples; the maximum mean difference $MMD(S_j, T)$ between the sample features of the source item $S_j$ and the target item T is calculated as follows:
$MMD(S_j, T) = \frac{1}{n_{s_j}^2}\sum_{i=1}^{n_{s_j}}\sum_{i'=1}^{n_{s_j}} k\big(X_i^{s_j}, X_{i'}^{s_j}\big) - \frac{2}{n_{s_j} n_t}\sum_{i=1}^{n_{s_j}}\sum_{h=1}^{n_t} k\big(X_i^{s_j}, X_h^{t}\big) + \frac{1}{n_t^2}\sum_{h=1}^{n_t}\sum_{h'=1}^{n_t} k\big(X_h^{t}, X_{h'}^{t}\big)$
where $n_t$ is the number of samples of the target item T, $X_i^{s_j}$ is the feature of the i-th sample of the source item $S_j$, $X_h^{t}$ is the feature of the h-th sample of the target item T, and the Gaussian kernel is $k(x, x') = \exp\!\big(-\|x - x'\|^2 / (2\sigma^2)\big)$, with σ the standard deviation of the Gaussian distribution and x, x' two sample features;
the coding loss $L_{mmd}$ of the encoder G is obtained by weighted summation of the maximum mean differences of all source items and the target item as follows:
$L_{mmd} = \sum_{S_j \in S} w_j \cdot MMD(S_j, T)$
where $L_{mmd}$ represents the weighted sum of the maximum mean differences over all source items and the target item, S represents the set of source items, comprising M source items, and $w_j$ is the attention score of the target item for the source item $S_j$;
step 5, a classifier C is established, the input of the classifier C is the sample characteristics of the source item extracted in the step 2, and the probability that the sample belongs to various defect types is output;
the classification loss $L_{cls}$ of the classifier C is as follows:
$L_{cls} = -\frac{1}{n_s}\sum_{i=1}^{n_s} Y_i^{T}\log C\big(G(X_i^{s})\big)$, wherein $Y_i$ represents the defect class label of the source sample $X_i^{s}$, $n_s$ is the number of source samples, and $C(\cdot)$ is the classifier;
Step 6, comprehensively training the encoder G, the discriminator D and the classifier C, and updating model parameters;
the loss during the comprehensive training is: $Loss = L_{adv} + L_{mmd} + L_{cls}$, wherein $L_{adv}$ is the adversarial training loss of the discriminator D, $L_{mmd}$ is the coding loss of the encoder G, and $L_{cls}$ is the classification loss of the classifier C; $L_{adv}$ acts on the updates of the encoder G and the discriminator D, $L_{mmd}$ acts on the update of the encoder G, and $L_{cls}$ acts on the updates of the encoder G and the classifier C;
and 7, obtaining a trained encoder G and a classifier C, segmenting a source code file of the target item, inputting the segmented source code file into the encoder G to extract sample characteristics, and inputting the sample characteristics into the classifier C to perform defect type recognition.
2. The method according to claim 1, wherein in the step 2, a pre-trained model CodeT based on a Transformer architecture is used as the encoder G.
3. The method according to claim 1 or 2, wherein in the step 2, the feature length of each extracted sample is set to be the same; and setting that the feature quantity extracted by each source item data set and each target item data set is the same, and carrying out repeated feature extraction of samples on the data sets with few samples so as to increase the feature quantity output by the data sets.
4. The method according to claim 1, wherein in step 3, during each training round, 1 sample feature is taken from each source item and target item as input of a discriminator D, the discriminator D outputs an item label corresponding to each sample, the item label is a vector in m+1 dimensions, M is the number of source items, and a value in each dimension indicates a probability that the sample belongs to the corresponding item.
5. A multi-source cross-project software defect prediction device, characterized in that the device implements a multi-source cross-project software defect prediction method according to any one of claims 1to 4, the device comprising the following functional modules:
An input module that receives input of a plurality of source item datasets and a target item dataset; the data set of the target item or each source item comprises samples of K types of defect categories, and each sample is provided with a defect category label and an item label; the defect type label of the source item sample is known, and the defect type label of the target item is unknown; the item tag is used for marking which source item or target item the sample belongs to;
An encoder G for extracting features from samples in all item data sets; each sample is a source code file, firstly, the source code is segmented to obtain a corresponding token id, and then the token id is input into an encoder G for feature extraction;
the gradient inversion module is used for performing gradient inversion operation on the extracted sample characteristics;
A discriminator D for inputting sample characteristics including all source items and target items after gradient inversion and outputting probabilities that each sample belongs to different items;
the classifier C inputs the sample characteristics of the extracted source items and outputs the probability that the sample belongs to various defect types;
wherein the encoder G, the discriminator D and the classifier C are comprehensively trained as a whole, with the loss $Loss = L_{adv} + L_{mmd} + L_{cls}$, where $L_{adv}$ is the adversarial training loss of the discriminator D, $L_{mmd}$ is the coding loss of the encoder G, and $L_{cls}$ is the classification loss of the classifier C; $L_{adv}$ acts on the updates of the encoder G and the discriminator D, $L_{mmd}$ acts on the update of the encoder G, and $L_{cls}$ acts on the updates of the encoder G and the classifier C; the coding loss $L_{mmd}$ of the encoder G is obtained as follows: the probability that a sample feature of the target item belongs to each of the different source items is obtained from the discriminator D as an attention score, the maximum mean difference between the sample features of each source item and the target item is calculated, and the maximum mean differences are weighted and summed using the attention scores to obtain the coding loss of the encoder G;
After the trained encoder G and classifier C are obtained, the source code file of the target item is segmented, then the source code file is input into the encoder G to extract sample characteristics, and then the source code file is input into the classifier C to carry out defect type recognition.
6. A readable storage medium having stored thereon a computer program, which when executed by a processor implements a multi-source cross-project software defect prediction method as claimed in any of claims 1-2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311540803.1A CN117421244B (en) | 2023-11-17 | 2023-11-17 | Multi-source cross-project software defect prediction method, device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311540803.1A CN117421244B (en) | 2023-11-17 | 2023-11-17 | Multi-source cross-project software defect prediction method, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117421244A CN117421244A (en) | 2024-01-19 |
CN117421244B true CN117421244B (en) | 2024-05-24 |
Family
ID=89526493
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311540803.1A Active CN117421244B (en) | 2023-11-17 | 2023-11-17 | Multi-source cross-project software defect prediction method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117421244B (en) |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11087174B2 (en) * | 2018-09-25 | 2021-08-10 | Nec Corporation | Deep group disentangled embedding and network weight generation for visual inspection |
EP3767536A1 (en) * | 2019-07-17 | 2021-01-20 | Naver Corporation | Latent code for unsupervised domain adaptation |
US11580425B2 (en) * | 2020-06-30 | 2023-02-14 | Microsoft Technology Licensing, Llc | Managing defects in a model training pipeline using synthetic data sets associated with defect types |
CN112508300B (en) * | 2020-12-21 | 2023-04-18 | 北京百度网讯科技有限公司 | Method for establishing risk prediction model, regional risk prediction method and corresponding device |
-
2023
- 2023-11-17 CN CN202311540803.1A patent/CN117421244B/en active Active
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108304316A (en) * | 2017-12-25 | 2018-07-20 | 浙江工业大学 | A kind of Software Defects Predict Methods based on collaboration migration |
CN108446711A (en) * | 2018-02-01 | 2018-08-24 | 南京邮电大学 | A kind of Software Defects Predict Methods based on transfer learning |
CN111198820A (en) * | 2020-01-02 | 2020-05-26 | 南京邮电大学 | Cross-project software defect prediction method based on shared hidden layer self-encoder |
CN113157564A (en) * | 2021-03-17 | 2021-07-23 | 江苏师范大学 | Cross-project defect prediction method based on feature distribution alignment and neighborhood instance selection |
CN113419948A (en) * | 2021-06-17 | 2021-09-21 | 北京邮电大学 | Method for predicting defects of deep learning cross-project software based on GAN network |
CN114328174A (en) * | 2021-11-10 | 2022-04-12 | 三维通信股份有限公司 | Multi-view software defect prediction method and system based on counterstudy |
CN114548152A (en) * | 2022-01-17 | 2022-05-27 | 上海交通大学 | Method for predicting residual life of marine sliding bearing based on transfer learning |
KR20230122370A (en) * | 2022-02-14 | 2023-08-22 | 한국과학기술원 | Method and system for predicting heterogeneous defect through correlation-based selection of multiple source projects and ensemble learning |
CN114564410A (en) * | 2022-03-21 | 2022-05-31 | 南通大学 | Software defect prediction method based on class level source code similarity |
CN114968774A (en) * | 2022-05-17 | 2022-08-30 | 北京航空航天大学 | Multi-source heterogeneous cross-project software defect prediction method |
CN115293057A (en) * | 2022-10-10 | 2022-11-04 | 深圳先进技术研究院 | Wind driven generator fault prediction method based on multi-source heterogeneous data |
CN116756041A (en) * | 2023-07-19 | 2023-09-15 | 中山大学 | Code defect prediction and positioning method and device, storage medium and computer equipment |
CN117056226A (en) * | 2023-08-18 | 2023-11-14 | 郑州轻工业大学 | Cross-project software defect number prediction method based on transfer learning |
Non-Patent Citations (4)
Title |
---|
Cross-Project Defect Prediction Considering Multiple Data Distribution Simultaneously; Yu Zhao, Yi Zhu, Qiao Yu, Xiaoying Chen; SYMMETRY-BASEL; 2022-02-17; Vol. 14, No. 2; full text *
A cross-project software defect prediction method based on domain adaptation; Chen Shu; Ye Junmin; Liu Tong; Journal of Software; 2020-02-15 (02); full text *
A cross-project defect prediction method using adversarial learning; Xing Ying, Qian Xiaomeng, Guan Yu, Zhang Shihao, Zhao Mengci, Lin Wanting; Computer Software and Computer Applications; 2022-06-09; Vol. 33, No. 6; 2097-2112 *
Research on a cross-project defect prediction method based on adversarial domain adaptation; Wu Guobin; Computer Software and Computer Applications; 2022-12-16; full text *
Also Published As
Publication number | Publication date |
---|---|
CN117421244A (en) | 2024-01-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109492230B (en) | Method for extracting insurance contract key information based on interested text field convolutional neural network | |
CN113672931B (en) | Software vulnerability automatic detection method and device based on pre-training | |
CN115482418B (en) | Semi-supervised model training method, system and application based on pseudo-negative labels | |
CN113742733B (en) | Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type | |
CN116910571B (en) | Open-domain adaptation method and system based on prototype comparison learning | |
CN114139676A (en) | Training method of domain adaptive neural network | |
CN116432655B (en) | Method and device for identifying named entities with few samples based on language knowledge learning | |
CN111325264A (en) | Multi-label data classification method based on entropy | |
CN113282714B (en) | Event detection method based on differential word vector representation | |
CN110008699B (en) | Software vulnerability detection method and device based on neural network | |
CN113158777B (en) | Quality scoring method, training method of quality scoring model and related device | |
CN115099358A (en) | Open world target detection training method based on dictionary creation and field self-adaptation | |
CN111191033A (en) | Open set classification method based on classification utility | |
CN117421244B (en) | Multi-source cross-project software defect prediction method, device and storage medium | |
CN111832463A (en) | Deep learning-based traffic sign detection method | |
CN116935411A (en) | Radical-level ancient character recognition method based on character decomposition and reconstruction | |
CN115098681A (en) | Open service intention detection method based on supervised contrast learning | |
Thakur et al. | Offline handwritten mathematical recognition using adversarial learning and transformers | |
CN109543571A (en) | A kind of intelligent recognition and search method of Complex Product abnormity machining feature | |
CN115329872A (en) | Sensitive attribute identification method and device based on comparison learning | |
CN115345248A (en) | Deep learning-oriented data depolarization method and device | |
CN110059180B (en) | Article author identity recognition and evaluation model training method and device and storage medium | |
CN114610953A (en) | Data classification method, device, equipment and storage medium | |
CN114943229B (en) | Multi-level feature fusion-based software defect named entity identification method | |
CN111753084A (en) | Short text feature extraction and classification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||