CN117421244B - Multi-source cross-project software defect prediction method, device and storage medium - Google Patents

Multi-source cross-project software defect prediction method, device and storage medium Download PDF

Info

Publication number
CN117421244B
Authority
CN
China
Prior art keywords
source
project
sample
encoder
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311540803.1A
Other languages
Chinese (zh)
Other versions
CN117421244A (en)
Inventor
邢颖
李文瑾
高东
袁军
顾佳伟
赵梦赐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Nsfocus Technologies Group Co Ltd
Original Assignee
Beijing University of Posts and Telecommunications
Nsfocus Technologies Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications, Nsfocus Technologies Group Co Ltd filed Critical Beijing University of Posts and Telecommunications
Priority to CN202311540803.1A priority Critical patent/CN117421244B/en
Publication of CN117421244A publication Critical patent/CN117421244A/en
Application granted granted Critical
Publication of CN117421244B publication Critical patent/CN117421244B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Prevention of errors by analysis, debugging or testing of software
    • G06F11/3604Analysis of software for verifying properties of programs
    • G06F11/3608Analysis of software for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Stored Programmes (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-source cross-project software defect prediction method, a corresponding device, and a storage medium. The method comprises the following steps: inputting a plurality of source project data sets and a target project data set; extracting sample features from all data sets using an encoder; training a discriminator on the project labels after reversing the gradient of the sample features; calculating the maximum mean difference between the features of each source project and the target project, taking the correlation between target samples and source samples output by the discriminator as attention scores, and using the attention-weighted sum of the maximum mean differences as the loss of the encoder; establishing a defect class classifier; jointly training the encoder, the discriminator and the classifier; and performing feature extraction and defect classification on the target project data set with the trained encoder and classifier. The device comprises an input module, an encoder G, a discriminator D, a classifier C and a gradient reversal module. The method realizes multi-source cross-project software defect prediction and, as verified experimentally, achieves high defect identification accuracy.

Description

Multi-source cross-project software defect prediction method, device and storage medium
Technical Field
The invention belongs to the technical field of software testing, and in particular relates to a multi-source cross-project software defect prediction method, a corresponding device, and a storage medium.
Background
Defect prediction is a key problem in the field of software engineering. In real-world engineering projects, however, several obstacles stand in the way of defect class prediction. First, the number of defect categories and of defective samples in a single project is limited, and the defect sample size may not be sufficient to support model training. Second, the distribution of defects within a project exhibits a long-tail effect: tail defect categories, although few in number, can cause serious problems. Traditional cross-project software defect prediction involves only one source project and one target project. Multi-source cross-project software defect prediction instead exploits a plurality of source projects together with one target project, and therefore has wide practical value. Both, however, are typically weaker than within-project defect prediction, mainly because the features of the source and target projects differ and their distributions are inconsistent.
Existing multi-source cross-project software defect prediction methods suffer from low prediction accuracy and reliability because the feature distributions of the source projects and the target project differ, and it is difficult to meet the requirements placed on multi-source cross-project defect prediction results.
Disclosure of Invention
The invention provides a multi-source cross-project software defect prediction method, a corresponding device, and a storage medium, which are used for solving the problem in the prior art that the feature distribution difference between source projects and the target project has a strongly negative influence on the prediction result.
The invention provides a multi-source cross-project software defect prediction method, which comprises the following steps:
Step 1, inputting a plurality of source project data sets and a target project data set;
The data set of the target project and of each source project contains samples of K defect categories, and each sample carries a defect category label and a project label; the defect category labels of the source project samples are known, while those of the target project are unknown; the project label marks which source project or target project a sample belongs to;
Step 2, extracting features from the samples in all project data sets with an encoder G;
Each sample is a source code file; the source code is first tokenized to obtain the corresponding token ids, which are then input into the encoder G for feature extraction;
Step 3, performing a gradient reversal operation on the extracted sample features, and training a discriminator D with the gradient-reversed sample features;
The input of the discriminator D comprises the gradient-reversed sample features of all source projects and the target project, and its output is the probability of each sample belonging to the different projects; during training, the probability that a sample feature of the target project belongs to each of the source projects is obtained as the attention score, and the adversarial training loss of the discriminator D is computed at the same time;
Step 4, calculating the maximum mean difference between the sample features of each source project and the target project, and weighting and summing these maximum mean differences with the attention scores to obtain the coding loss of the encoder G;
Step 5, establishing a classifier C, whose input is the source project sample features extracted in step 2 and whose output is the probability that a sample belongs to each defect type;
Step 6, jointly training the encoder G, the discriminator D and the classifier C, and updating the model parameters;
The loss during joint training is Loss = L_D + L_G + L_C, where L_D is the adversarial training loss of the discriminator D, L_G is the coding loss of the encoder G, and L_C is the classification loss of the classifier C; L_D acts on the updates of the encoder G and the discriminator D, L_G acts on the updates of the encoder G, and L_C acts on the updates of the encoder G and the classifier C;
Step 7, obtaining the trained encoder G and classifier C, tokenizing the source code files of the target project, inputting them into the encoder G to extract sample features, and then inputting the features into the classifier C for defect category identification.
Correspondingly, the invention provides a multi-source cross-project software defect prediction device, which comprises the following functional modules:
An input module, which receives the input of a plurality of source project data sets and a target project data set; the data set of the target project and of each source project contains samples of K defect categories, and each sample carries a defect category label and a project label; the defect category labels of the source project samples are known, while those of the target project are unknown; the project label marks which source project or target project a sample belongs to;
An encoder G, which extracts features from the samples in all project data sets; each sample is a source code file, the source code is first tokenized to obtain the corresponding token ids, and the token ids are then input into the encoder G for feature extraction;
A gradient reversal module, which performs the gradient reversal operation on the extracted sample features;
A discriminator D, whose input comprises the gradient-reversed sample features of all source projects and the target project, and which outputs the probability of each sample belonging to the different projects;
A classifier C, whose input is the extracted source project sample features and whose output is the probability that a sample belongs to each defect type;
wherein the encoder G, the discriminator D and the classifier C are trained jointly with the loss Loss = L_D + L_G + L_C, where L_D is the adversarial training loss of the discriminator D, L_G is the coding loss of the encoder G, and L_C is the classification loss of the classifier C; L_D acts on the updates of the encoder G and the discriminator D, L_G acts on the updates of the encoder G, and L_C acts on the updates of the encoder G and the classifier C; the coding loss L_G of the encoder G is obtained as follows: the discriminator D is used to obtain the probability that a sample feature of the target project belongs to each of the source projects as the attention score, the maximum mean difference between the sample features of each source project and the target project is calculated, and these maximum mean differences are weighted and summed with the attention scores to give the coding loss of the encoder G;
After the trained encoder G and classifier C are obtained, the source code files of the target project are tokenized and input into the encoder G to extract sample features, which are then input into the classifier C for defect category identification.
Further, the invention provides a readable storage medium on which a computer program is stored, characterized in that, when executed by a processor, the program implements the multi-source cross-project software defect prediction method of the present invention.
The invention has the following advantages and positive effects: the disclosed multi-source cross-project software defect prediction method, device and storage medium alleviate the feature distribution difference by reducing the maximum mean difference between the feature distributions of the source projects and the target project; adversarial learning is introduced in the discriminator to obtain the project-domain correlations, and the maximum mean differences between the individual source projects and the target project are weighted by these correlations, further alleviating the feature distribution difference. On this basis the model is trained jointly, yielding a trained feature extraction encoder and a defect prediction classifier. Experiments show that the method realizes multi-source cross-project software defect prediction with high defect prediction accuracy.
Drawings
FIG. 1 is a flow chart of an implementation of the multi-source cross-project software defect prediction method of the present invention.
Detailed Description
The invention will be described in further detail with reference to the drawings and examples.
The invention provides a multi-source cross-project software defect prediction method, a corresponding device, and a storage medium, implemented on the basis of adversarial training and an attention mechanism. The method comprises the following steps: inputting a plurality of source project data sets and a target project data set; extracting features from all data sets with an encoder; training a discriminator on the project labels of the source and target projects by adversarial training; calculating the maximum mean difference between the features of each source project and the target project; outputting, by the discriminator, the correlation between target samples and source samples as attention scores; weighting the plurality of maximum mean differences with the attention scores; training a classifier on the class labels of the source projects; training and updating the parameters of the encoder, the discriminator and the classifier; inputting the target project data set; classifying the target project data set with the classifier; and outputting the classification result.
As shown in FIG. 1, the multi-source cross-project software defect prediction method implemented by the embodiment of the invention comprises the following steps S200 to S216.
S200, inputting the source project data sets and the target project data set.
In the present invention, a source project is a software project whose software defects are known, so the class labels Y_s of a source project data set X_s are known. The target project is the software project whose defects are to be judged, and the class labels Y_t of the target project data set X_t are unknown. In addition, the project labels d of the source and target projects are considered to exist, i.e. it is known for every sample which source project or the target project it belongs to. Training uses the source project data, which carry defect class labels and project labels, together with the target project data, which carry project labels but no defect class labels, as the training set.
In one scenario of the invention, a set of M source project data sets S_1, S_2, …, S_M is given, where the j-th source project S_j is represented by the data set S_j = {(X_i^{s_j}, Y_i^{s_j})}_{i=1}^{n_{s_j}}, in which n_{s_j} is the number of samples contained in the source project S_j, X_i^{s_j} denotes a labelled sample of the source project S_j, and Y_i^{s_j} denotes the defect category label corresponding to that sample. A target project data set T is also given, expressed as T = {X_i^t}_{i=1}^{n_t}, where X_i^t is the i-th unlabelled sample whose defect class label is unknown, and n_t is the number of samples in the target project T. The total number of labelled source samples is n_s = Σ_{j=1}^{M} n_{s_j}, and the total number of source and target samples is n = n_s + n_t. All samples are additionally given a project label d_i, yielding the sample set {(X_i, d_i)}_{i=1}^{n}. The project label d_i is an (M+1)-dimensional vector: if sample X_i belongs to source project S_j, the data bit of d_i corresponding to S_j takes the value 1 and the remaining bits are 0. Each source project data set S_j, j ∈ [1, M], contains the same defect categories as the target project data set T, i.e. Y_t is composed of the same defect classes as any Y_{s_j}. For example, each of the source project data sets and the target project data set contains samples of 44 defect categories.
Similar to single-source cross-project software defect identification, Y_t is not available during training and is used only for evaluation. The object of the invention is therefore to train, using the labelled source data, a model whose prediction error on the target domain is minimal. Note that the sample distributions of the source and target domains differ, so a classifier trained only on the source domains typically does not predict the target domain well.
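For illustration only, the following is a minimal sketch (not part of the patent) of how the combined sample set could receive its (M+1)-dimensional one-hot project labels d_i; all variable names and the example sample counts are assumptions.

```python
import numpy as np

def build_project_labels(source_sizes, target_size):
    """Build one-hot project labels d_i for M source projects plus one target project.

    source_sizes: list of sample counts n_sj for the M source projects.
    target_size:  sample count n_t of the target project.
    Returns an (n, M+1) array; row i has a 1 in the column of the project sample i belongs to.
    """
    m = len(source_sizes)
    blocks = []
    for j, n_sj in enumerate(source_sizes):        # source project S_j -> column j
        block = np.zeros((n_sj, m + 1))
        block[:, j] = 1.0
        blocks.append(block)
    target_block = np.zeros((target_size, m + 1))  # target project -> last column
    target_block[:, m] = 1.0
    blocks.append(target_block)
    return np.concatenate(blocks, axis=0)

# Hypothetical example: 7 source projects and 1 target project
d = build_project_labels([120, 95, 80, 150, 60, 110, 70], 100)
print(d.shape)  # (785, 8)
```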
S201, extracting features from all project data sets with the encoder G.
The source and target project data sets share a multi-dimensional feature space in which feature extraction is performed on the samples of each source project and of the target project, using the pre-trained CodeT5 model, which is based on the Transformer architecture, as the encoder G. In the embodiment of the invention, the source code of the source and target project samples is used as input: the source code is first tokenized to obtain the corresponding token ids, and the token ids are then fed to CodeT5, which extracts the features. The purpose of feature extraction is to classify the samples in the subsequent steps.
In the embodiment of the invention, every sample in the source and target projects is a source code file; each code file is first tokenized and its features are then extracted by CodeT5, so that one feature vector is obtained for every code file contained in the source and target projects. For convenience of later processing, the extracted feature length of every sample can be set to be the same, as can the number of features extracted from every project data set. If the number of code files in a source project data set differs from the number in the target project data set, features can be extracted repeatedly from some code files of the smaller project so that the source and target projects output the same number of features.
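As an illustration of step S201, the following sketch shows how a CodeT5 encoder could tokenize a source file and produce a fixed-length feature vector. The checkpoint name "Salesforce/codet5-base" and the mean-pooling of token states are assumptions made for the example; the patent only specifies that a Transformer-based pre-trained CodeT5 model serves as encoder G.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Assumed checkpoint; the patent only states that a pre-trained CodeT5 model is used.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
encoder_g = T5EncoderModel.from_pretrained("Salesforce/codet5-base")

def extract_feature(source_code: str) -> torch.Tensor:
    """Tokenize one source code file and return a pooled feature vector."""
    tokens = tokenizer(source_code, truncation=True, max_length=512, return_tensors="pt")
    hidden = encoder_g(**tokens).last_hidden_state   # (1, seq_len, hidden_dim)
    return hidden.mean(dim=1).squeeze(0)              # mean-pool to one fixed-length vector

feature = extract_feature("public class Foo { void bar() {} }")
print(feature.shape)  # torch.Size([768]) for the base checkpoint
```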
S202, performing a gradient reversal operation on the extracted features.
A gradient reversal operation is applied to the extracted sample features of the source and target projects. Gradient reversal automatically reverses the direction of the gradient during back-propagation. The gradient reversal layer sits between the feature extractor G and the discriminator D: during back-propagation, the gradient of the discriminator's domain classification loss is automatically reversed before it is propagated back to the parameters of the feature extractor, thereby realizing the adversarial loss. After step S202, both the original sample features of the source and target projects and their gradient-reversed counterparts are available.
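A minimal sketch of a gradient reversal layer as commonly implemented in PyTorch; the class and function names are placeholders, and the patent itself does not prescribe a particular implementation.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient that flows back towards the encoder G.
        return grad_output.neg() * ctx.lambd, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: reversed_features = grad_reverse(encoder_features, lambd)
# The discriminator D is then trained on reversed_features, so its gradient opposes G.
```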
S203, training the discriminator D with the project labels.
In the present invention, the project label of a source or target sample indicates which project the sample comes from; training the discriminator with the project labels gives it the ability to identify which project a sample comes from. The invention trains the discriminator D only with the gradient-reversed sample features. The input of the discriminator D contains the sample features of the M source projects and the target project, and its output is the project label corresponding to each sample; every project label is an (M+1)-dimensional vector.
In the embodiment of the invention, the discriminator D is implemented as a single fully connected layer. For features that have not undergone gradient reversal, the discriminator loss is L_D⁰ = −(1/n) Σ_{i=1}^{n} d_i^T log D(G(X_i)), where n is the total number of samples of the source and target projects, d_i is the project label of sample X_i, G(·) denotes the encoder, D(·) denotes the discriminator, and the superscript T denotes the transpose.
In the embodiment of the invention, only the gradient-reversed features are used to train the discriminator D. When the gradient-reversed sample features are fed to the discriminator D, the discriminator is pushed towards mis-identifying which project a sample comes from, which serves to confuse the discriminator D and to smooth the domain correlations it produces. The adversarial training loss of the discriminator D is expressed as L_D = −(λ/n) Σ_{i=1}^{n} d_i^T log D(G(X_i)), computed on the gradient-reversed sample features.
For the parameter λ, it is gradually changed from 0 to 1 using a monotone schedule of the form commonly used with gradient reversal layers, λ = 2/(1 + exp(−10·p)) − 1, where the parameter p increases with the number of training rounds, p ∈ [0, 1].
In the embodiment of the invention, suppose for instance that there are 7 source projects and 1 target project during training; each of the 8 projects is taken in turn as the target project, with the remaining 7 as source projects. In each batch, 1 sample feature is taken from each of the 8 projects, and the discriminator D outputs the 8-dimensional project labels corresponding to the 8 samples, the value in each dimension representing the probability that the sample belongs to the corresponding project.
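The following sketch illustrates a single-fully-connected-layer discriminator, the cross-entropy loss over one-hot project labels, and the extraction of attention scores for target samples, consistent with the description above. The module and function names, and the choice to keep only the M source-project columns of the softmax output, are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectDiscriminator(nn.Module):
    """Single fully connected layer mapping a sample feature to M+1 project logits."""
    def __init__(self, feature_dim: int, num_projects: int):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_projects)

    def forward(self, features):
        return self.fc(features)                  # (batch, M+1) logits

def discriminator_loss(discriminator, reversed_features, project_labels):
    """Cross entropy between predicted project probabilities and one-hot project labels d_i."""
    log_probs = F.log_softmax(discriminator(reversed_features), dim=1)
    return -(project_labels * log_probs).sum(dim=1).mean()

def attention_scores(discriminator, target_features, num_sources: int):
    """Probability that each target sample belongs to each source project (attention weights w)."""
    probs = F.softmax(discriminator(target_features), dim=1)
    return probs[:, :num_sources]                  # keep only the M source-project columns
```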
S204, obtaining the attention scores between the target project and the source projects.
By discriminating a sample of the target project with the discriminator D, the similarity between the target project and each source project can be identified and used as an attention score for weighting the loss later on. The probability that a sample feature of the target project belongs to each of the source projects is taken as the attention score. The output w of the discriminator D is the probability that the input target sample X_i belongs to the different source projects, w = softmax(D(G(X_i))). A larger probability means that the input target sample is more similar to the corresponding source project, and vice versa. Taking seven source projects and one target project as an example, if for a certain target project sample the discriminator D outputs the probabilities w = [0.1, 0.6, 0.1, 0.1, 0.05, 0.05, 0] for the 7 source projects, then the input target sample is most similar to the second source project and least similar to the last one.
S205, obtaining the adversarial training loss.
The gradient reversal in S202 makes gradients that should have increased during the update decrease, and gradients that should have decreased increase, thereby confusing the discrimination ability of the discriminator D on the target samples and allowing more domain-invariant features to be extracted.
Based on the predicted probabilities that the target project samples belong to the different source projects, the losses in S204 and S205 are computed with the cross-entropy loss.
S206, calculating the maximum mean difference between the features of each source project and the target project.
For each source project S_j, the maximum mean difference (MMD) between the sample features of the source project and the sample features of the target project T is calculated as
MMD(S_j, T) = (1/n_{s_j}²) Σ_{i=1}^{n_{s_j}} Σ_{i'=1}^{n_{s_j}} k(X_i^{s_j}, X_{i'}^{s_j}) + (1/n_t²) Σ_{h=1}^{n_t} Σ_{h'=1}^{n_t} k(X_h^t, X_{h'}^t) − (2/(n_{s_j} n_t)) Σ_{i=1}^{n_{s_j}} Σ_{h=1}^{n_t} k(X_i^{s_j}, X_h^t),
where MMD(S_j, T) denotes the maximum mean difference between the source project S_j and the target project T, n_{s_j} is the number of samples of the source project S_j, n_t is the number of samples of the target project, X_i^{s_j} is the i-th sample of the source project S_j, X_h^t denotes the h-th sample of the target project, and the Gaussian kernel is k(X, X') = exp(−‖X − X'‖² / (2σ²)), with σ the standard deviation of the Gaussian distribution and X, X' two sample features.
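A sketch of the Gaussian-kernel maximum mean difference between one source project's features and the target project's features, following the formula above; using a single fixed bandwidth σ is an assumption, since the patent does not state how σ is chosen.

```python
import torch

def gaussian_kernel(x, y, sigma: float = 1.0):
    """k(X, X') = exp(-||X - X'||^2 / (2 * sigma^2)) for every pair of rows in x and y."""
    dist_sq = torch.cdist(x, y, p=2).pow(2)
    return torch.exp(-dist_sq / (2.0 * sigma ** 2))

def mmd(source_features, target_features, sigma: float = 1.0):
    """Empirical maximum mean difference MMD(S_j, T) between two sets of sample features."""
    k_ss = gaussian_kernel(source_features, source_features, sigma).mean()
    k_tt = gaussian_kernel(target_features, target_features, sigma).mean()
    k_st = gaussian_kernel(source_features, target_features, sigma).mean()
    return k_ss + k_tt - 2.0 * k_st
```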
Reducing the maximum mean difference between a source project and the target project reduces the feature distribution difference between them.
S207, weighting the plurality of maximum mean differences.
Using the attention scores obtained in S204, the plurality of maximum mean differences calculated in S206 are weighted. This has the advantage of amplifying the influence of source projects that are similar to the target project and reducing the influence of irrelevant source projects, giving
L_G = Σ_{j=1}^{M} w_j · MMD(S_j, T),
where w_j denotes the similarity between the target samples and source project S_j, and L_G is the weighted sum of the maximum mean differences over all source projects and the target project.
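A short sketch of the attention-weighted sum that forms the coding loss L_G, reusing the hypothetical mmd() helper above; the weights w_j are assumed to be given, e.g. as discriminator probabilities obtained for the target samples.

```python
def encoder_mmd_loss(source_feature_sets, target_features, weights, sigma: float = 1.0):
    """Attention-weighted sum of MMD terms over all M source projects.

    source_feature_sets: list of M tensors, one per source project S_j.
    target_features:     tensor of target-project sample features.
    weights:             length-M tensor of attention scores w_j.
    """
    loss = 0.0
    for w_j, s_j in zip(weights, source_feature_sets):
        loss = loss + w_j * mmd(s_j, target_features, sigma)
    return loss
```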
S208, taking the weighted maximum mean difference as the loss of the encoder G.
The weighted maximum mean difference obtained in S207 is used as the loss term L_G of the encoder G, further reducing the feature distribution difference between the source projects and the target project.
S209, training the classifier C on the class labels of the source projects.
In the present invention, besides aligning the feature distributions of the target project and the source projects, representation learning on each source project is also important for the classification result. A classifier C is trained with the source project sample features extracted in S201 together with the class labels of those samples; the input of the classifier C is the source project sample features that have not undergone gradient reversal, and its output is a specific defect class.
S210, obtaining the classification loss.
The classification loss obtained in S209 is used as a loss term, i.e. L_C = −(1/n_s) Σ_{i=1}^{n_s} Y_i^T log C(G(X_i^s)), where Y_i denotes the defect class label of sample X_i^s and C denotes the classifier. If there are 44 defect classes, the defect class label is a 44-dimensional vector, each data bit corresponding to the probability of one defect class.
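For illustration, a sketch of a defect classifier and its cross-entropy loss over one-hot defect labels Y_i; the single linear layer is an assumption, as the patent does not specify the classifier architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class DefectClassifier(nn.Module):
    """Maps a source-project sample feature to K defect-class logits (K = 44 in the example above)."""
    def __init__(self, feature_dim: int, num_classes: int = 44):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, features):
        return self.fc(features)

def classification_loss(classifier, source_features, defect_labels):
    """Cross entropy between predicted class probabilities and one-hot defect labels Y_i."""
    log_probs = F.log_softmax(classifier(source_features), dim=1)
    return -(defect_labels * log_probs).sum(dim=1).mean()
```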
S211, training and updating the parameters of the encoder G, the discriminator D and the classifier C.
In the present invention, the complete model comprising the encoder G, the discriminator D and the classifier C is trained with the three losses from S205, S208 and S210, i.e. Loss = L_D + L_G + L_C.
In the present invention, the loss L_D is involved in the updates of the encoder G and the discriminator D, L_G in the updates of the encoder G, and L_C in the updates of the encoder G and the classifier C. The whole model is trained on the training data set and the model parameters are updated.
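The following sketch puts the pieces together into one joint training step with Loss = L_D + L_G + L_C, reusing the hypothetical helpers sketched in the previous steps (grad_reverse, discriminator_loss, attention_scores, encoder_mmd_loss, classification_loss). Here encoder_g is assumed to be a differentiable module mapping already-tokenized batches to feature vectors, and detaching the attention weights from the gradient is an assumption not stated in the patent.

```python
import torch

def train_step(encoder_g, discriminator_d, classifier_c, optimizer,
               source_batches, source_defect_labels, target_batch,
               project_labels, lambd, sigma=1.0):
    """One joint update of encoder G, discriminator D and classifier C."""
    optimizer.zero_grad()

    # Encode every source project batch and the target project batch
    # (project_labels must follow the same sample order as this concatenation).
    source_feats = [encoder_g(batch) for batch in source_batches]
    target_feats = encoder_g(target_batch)
    all_feats = torch.cat(source_feats + [target_feats], dim=0)

    # L_D: project discrimination loss on gradient-reversed features (adversarial part).
    loss_d = discriminator_loss(discriminator_d, grad_reverse(all_feats, lambd), project_labels)

    # L_G: attention-weighted MMD between each source project and the target project.
    with torch.no_grad():
        weights = attention_scores(discriminator_d, target_feats, len(source_feats)).mean(dim=0)
    loss_g = encoder_mmd_loss(source_feats, target_feats, weights, sigma)

    # L_C: defect classification loss on the labelled source samples.
    loss_c = classification_loss(classifier_c, torch.cat(source_feats, dim=0), source_defect_labels)

    loss = loss_d + loss_g + loss_c
    loss.backward()
    optimizer.step()
    return loss.item()
```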
S212, outputting the trained encoder G and classifier C.
In the test stage of the invention, the target samples are processed with the encoder G and the classifier C trained in steps S201 to S211.
S213, inputting the target project data set.
In practice, the class labels of the target project are empty; in the present invention, the class labels of the target project data set are considered to exist but to be unlabelled. The source code files in the target project are tokenized and the encoder G then extracts their features.
S214, predicting on the target project data set with the classifier C.
The classifier C classifies the samples in the target project data set based on what was learned during training.
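A minimal inference sketch for S213 to S216: each target-project source file is tokenized, encoded with the trained encoder G, and assigned a defect class by the trained classifier C. The tokenizer/encoder interface mirrors the hypothetical CodeT5 sketch in S201 and is likewise an assumption.

```python
import torch

def predict_defects(encoder_g, classifier_c, tokenizer, target_source_files):
    """Return the predicted defect class index for every target-project source file."""
    predictions = []
    encoder_g.eval()
    classifier_c.eval()
    with torch.no_grad():
        for code in target_source_files:
            tokens = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
            feature = encoder_g(**tokens).last_hidden_state.mean(dim=1)  # pooled feature, as in training
            probs = torch.softmax(classifier_c(feature), dim=1)
            predictions.append(int(probs.argmax(dim=1)))
    return predictions
```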
S215, collecting the prediction results of the classifier C. The results of the classifier C are summarized and counted by defect category.
S216, outputting the classification results.
Correspondingly, on the basis of the above method the invention also realizes a multi-source cross-project software defect prediction device, comprising an input module, an encoder G, a discriminator D, a classifier C and a gradient reversal module. The input module receives the input of a plurality of source project data sets and a target project data set; the data set of the target project and of each source project contains samples of K defect categories, each sample carrying a defect category label and a project label; the defect category labels of the source project samples are known, while those of the target project are unknown. The encoder G extracts features from the samples in all project data sets; each sample is a source code file, the source code is first tokenized to obtain the corresponding token ids, and the token ids are then input into the encoder G for feature extraction. The gradient reversal module performs the gradient reversal operation on the extracted sample features. The discriminator D is trained with the gradient-reversed sample features of all source projects and the target project and outputs the probability of each sample belonging to the different projects. The classifier C takes the extracted source project sample features as input and outputs the probability that a sample belongs to each defect type.
The encoder G, the discriminator D and the classifier C are trained jointly as a whole and the model parameters are updated; the loss function Loss during training is as described in S211. The encoder G uses the attention scores to reduce the maximum mean differences between the source projects and the target project, so as to reduce their feature distribution difference. The coding loss L_G of the encoder G is obtained as follows: the discriminator D is used to obtain the probability that a sample feature of the target project belongs to each of the source projects as the attention score, the maximum mean difference between the sample features of each source project and the target project is calculated, and these maximum mean differences are weighted and summed with the attention scores to give the coding loss L_G of the encoder G. After training, the trained encoder G and classifier C are output.
With the trained encoder G and classifier C, the source code files of the target project are tokenized and input into the encoder G to extract sample features, which are then input into the classifier C for defect category identification.
Furthermore, on the basis of the above method, the invention also realizes a readable storage medium on which a computer program is stored; when executed by a processor, the program implements the multi-source cross-project software defect prediction method described above.
The multi-source cross-project software defect prediction method was verified experimentally; the defect classification accuracy (ACC) is shown in Table 1.
TABLE 1 Accuracy of multi-source cross-project software defect prediction using the method of the present invention
As shown in Table 1, the first row lists 8 projects: Apache JMeter, Apache Jena, Apache Lenya, …, JTree. In the experiments, each project is taken in turn as the target project with the remaining 7 as source projects, and defect classification is performed with the method of the invention. The accuracy of defect classification for the target project samples in each experiment is shown in the second row of Table 1; the results show that the method of the invention realizes multi-source cross-project software defect prediction with ACC above 0.9, i.e. a high prediction accuracy for the defects of the target project samples.
Features not described in this specification are known to those skilled in the art. Descriptions of well-known components and techniques are omitted so as not to obscure the present application unnecessarily. The embodiments described above do not represent all embodiments consistent with the present application; those skilled in the art may make various modifications or variations on the basis of the technical solutions of the application without inventive effort while remaining within its scope of protection.

Claims (6)

1. A multi-source cross-project software defect prediction method, characterized by comprising the following steps:
Step 1, inputting a plurality of source project data sets and a target project data set; the data set of the target project and of each source project contains samples of K defect categories, and each sample carries a defect category label and a project label; the defect category labels of the source project samples are known, while those of the target project are unknown; the project label marks which source project or target project a sample belongs to;
Step 2, extracting features from the samples in all project data sets with an encoder G; each sample is a source code file, the source code is first tokenized to obtain the corresponding token ids, and the token ids are then input into the encoder G for feature extraction;
Step 3, performing a gradient reversal operation on the extracted sample features, and training a discriminator D with the gradient-reversed sample features; the input of the discriminator D comprises the gradient-reversed sample features of all source projects and the target project, and its output is the probability of each sample belonging to the different projects; during training, the probability that a sample feature of the target project belongs to each of the source projects is obtained as the attention score, and the adversarial training loss of the discriminator D is computed at the same time;
the discriminator D is implemented as a single fully connected layer, and the adversarial training loss L_D of the discriminator D is expressed as L_D = −(λ/n) Σ_{i=1}^{n} d_i^T log D(G(X_i)), where n denotes the number of sample features of all source projects and the target project, X_i denotes the i-th sample feature, d_i is the project label of the i-th sample, the superscript T denotes the transpose, G denotes the encoder and D denotes the discriminator; the parameter λ is gradually changed from 0 to 1 using the schedule λ = 2/(1 + exp(−10·p)) − 1, where the parameter p increases with the number of training rounds, p ∈ [0, 1];
Step 4, calculating the maximum mean difference between the sample features of each source project and the target project, and weighting and summing these maximum mean differences with the attention scores to obtain the coding loss of the encoder G;
letting the j-th source project S_j contain n_{s_j} samples, the maximum mean difference MMD(S_j, T) between the sample features of the source project S_j and the target project T is calculated as MMD(S_j, T) = (1/n_{s_j}²) Σ_{i=1}^{n_{s_j}} Σ_{i'=1}^{n_{s_j}} k(X_i^{s_j}, X_{i'}^{s_j}) + (1/n_t²) Σ_{h=1}^{n_t} Σ_{h'=1}^{n_t} k(X_h^t, X_{h'}^t) − (2/(n_{s_j} n_t)) Σ_{i=1}^{n_{s_j}} Σ_{h=1}^{n_t} k(X_i^{s_j}, X_h^t), where n_t is the number of samples of the target project T, X_i^{s_j} is the i-th sample of the source project S_j, X_h^t denotes the h-th sample of the target project T, the Gaussian kernel function is k(X, X') = exp(−‖X − X'‖² / (2σ²)), σ denotes the standard deviation of the Gaussian distribution, and X, X' are two sample features;
the coding loss L_G of the encoder G is obtained as the weighted sum of the maximum mean differences of all source projects and the target project, L_G = Σ_{S_j ∈ S} w_j · MMD(S_j, T), where S denotes the set of the M source projects and w_j is the attention score of the target project for the source project S_j;
Step 5, establishing a classifier C, whose input is the source project sample features extracted in step 2 and whose output is the probability that a sample belongs to each defect type; the classification loss L_C of the classifier C is L_C = −(1/n_s) Σ_{i=1}^{n_s} Y_i^T log C(G(X_i^s)), where Y_i denotes the defect category label of sample X_i^s and C is the classifier;
Step 6, jointly training the encoder G, the discriminator D and the classifier C, and updating the model parameters; the loss during joint training is Loss = L_D + L_G + L_C, where L_D is the adversarial training loss of the discriminator D, L_G is the coding loss of the encoder G, and L_C is the classification loss of the classifier C; L_D acts on the updates of the encoder G and the discriminator D, L_G acts on the updates of the encoder G, and L_C acts on the updates of the encoder G and the classifier C;
Step 7, obtaining the trained encoder G and classifier C, tokenizing the source code files of the target project, inputting them into the encoder G to extract sample features, and then inputting the features into the classifier C for defect category identification.
2. The method according to claim 1, characterized in that, in step 2, the pre-trained CodeT5 model based on the Transformer architecture is used as the encoder G.
3. The method according to claim 1 or 2, characterized in that, in step 2, the extracted feature length of each sample is set to be the same, and the number of features extracted from each source project data set and from the target project data set is set to be the same; for data sets with few samples, repeated feature extraction of samples is performed to increase the number of features output by the data set.
4. The method according to claim 1, characterized in that, in step 3, during training each round takes 1 sample feature from each source project and the target project as the input of the discriminator D, and the discriminator D outputs the project label corresponding to each sample; the project label is an (M+1)-dimensional vector, M being the number of source projects, and the value in each dimension represents the probability that the sample belongs to the corresponding project.
5. A multi-source cross-project software defect prediction device, characterized in that the device implements the multi-source cross-project software defect prediction method according to any one of claims 1 to 4 and comprises the following functional modules:
an input module, which receives the input of a plurality of source project data sets and a target project data set; the data set of the target project and of each source project contains samples of K defect categories, each sample carrying a defect category label and a project label; the defect category labels of the source project samples are known, while those of the target project are unknown; the project label marks which source project or target project a sample belongs to;
an encoder G, which extracts features from the samples in all project data sets; each sample is a source code file, the source code is first tokenized to obtain the corresponding token ids, and the token ids are then input into the encoder G for feature extraction;
a gradient reversal module, which performs the gradient reversal operation on the extracted sample features;
a discriminator D, whose input comprises the gradient-reversed sample features of all source projects and the target project and which outputs the probability of each sample belonging to the different projects;
a classifier C, which takes the extracted source project sample features as input and outputs the probability that a sample belongs to each defect type;
wherein the encoder G, the discriminator D and the classifier C are trained jointly with the loss Loss = L_D + L_G + L_C, where L_D is the adversarial training loss of the discriminator D, L_G is the coding loss of the encoder G, and L_C is the classification loss of the classifier C; L_D acts on the updates of the encoder G and the discriminator D, L_G acts on the updates of the encoder G, and L_C acts on the updates of the encoder G and the classifier C; the coding loss L_G of the encoder G is obtained as follows: the discriminator D is used to obtain the probability that a sample feature of the target project belongs to each of the source projects as the attention score, the maximum mean difference between the sample features of each source project and the target project is calculated, and these maximum mean differences are weighted and summed with the attention scores to give the coding loss of the encoder G;
after the trained encoder G and classifier C are obtained, the source code files of the target project are tokenized and input into the encoder G to extract sample features, which are then input into the classifier C for defect category identification.
6. A readable storage medium on which a computer program is stored, characterized in that, when executed by a processor, the program implements the multi-source cross-project software defect prediction method according to any one of claims 1 to 2.
CN202311540803.1A 2023-11-17 2023-11-17 Multi-source cross-project software defect prediction method, device and storage medium Active CN117421244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311540803.1A CN117421244B (en) 2023-11-17 2023-11-17 Multi-source cross-project software defect prediction method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311540803.1A CN117421244B (en) 2023-11-17 2023-11-17 Multi-source cross-project software defect prediction method, device and storage medium

Publications (2)

Publication Number Publication Date
CN117421244A CN117421244A (en) 2024-01-19
CN117421244B true CN117421244B (en) 2024-05-24

Family

ID=89526493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311540803.1A Active CN117421244B (en) 2023-11-17 2023-11-17 Multi-source cross-project software defect prediction method, device and storage medium

Country Status (1)

Country Link
CN (1) CN117421244B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304316A (en) * 2017-12-25 2018-07-20 浙江工业大学 A kind of Software Defects Predict Methods based on collaboration migration
CN108446711A (en) * 2018-02-01 2018-08-24 南京邮电大学 A kind of Software Defects Predict Methods based on transfer learning
CN111198820A (en) * 2020-01-02 2020-05-26 南京邮电大学 A Cross-Project Software Defect Prediction Method Based on Shared Hidden Autoencoder
CN113157564A (en) * 2021-03-17 2021-07-23 江苏师范大学 Cross-project defect prediction method based on feature distribution alignment and neighborhood instance selection
CN113419948A (en) * 2021-06-17 2021-09-21 北京邮电大学 Method for predicting defects of deep learning cross-project software based on GAN network
CN114328174A (en) * 2021-11-10 2022-04-12 三维通信股份有限公司 Multi-view software defect prediction method and system based on counterstudy
CN114548152A (en) * 2022-01-17 2022-05-27 上海交通大学 A transfer learning-based residual life prediction method for marine sliding bearings
CN114564410A (en) * 2022-03-21 2022-05-31 南通大学 Software defect prediction method based on class level source code similarity
CN114968774A (en) * 2022-05-17 2022-08-30 北京航空航天大学 Multi-source heterogeneous cross-project software defect prediction method
CN115293057A (en) * 2022-10-10 2022-11-04 深圳先进技术研究院 Wind driven generator fault prediction method based on multi-source heterogeneous data
KR20230122370A (en) * 2022-02-14 2023-08-22 한국과학기술원 Method and system for predicting heterogeneous defect through correlation-based selection of multiple source projects and ensemble learning
CN116756041A (en) * 2023-07-19 2023-09-15 中山大学 Code defect prediction and positioning method and device, storage medium and computer equipment
CN117056226A (en) * 2023-08-18 2023-11-14 郑州轻工业大学 Cross-project software defect number prediction method based on transfer learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11087174B2 (en) * 2018-09-25 2021-08-10 Nec Corporation Deep group disentangled embedding and network weight generation for visual inspection
EP3767536B1 (en) * 2019-07-17 2025-02-19 Naver Corporation Latent code for unsupervised domain adaptation
US11580425B2 (en) * 2020-06-30 2023-02-14 Microsoft Technology Licensing, Llc Managing defects in a model training pipeline using synthetic data sets associated with defect types
CN112508300B (en) * 2020-12-21 2023-04-18 北京百度网讯科技有限公司 Method for establishing risk prediction model, regional risk prediction method and corresponding device

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304316A (en) * 2017-12-25 2018-07-20 浙江工业大学 A kind of Software Defects Predict Methods based on collaboration migration
CN108446711A (en) * 2018-02-01 2018-08-24 南京邮电大学 A kind of Software Defects Predict Methods based on transfer learning
CN111198820A (en) * 2020-01-02 2020-05-26 南京邮电大学 A Cross-Project Software Defect Prediction Method Based on Shared Hidden Autoencoder
CN113157564A (en) * 2021-03-17 2021-07-23 江苏师范大学 Cross-project defect prediction method based on feature distribution alignment and neighborhood instance selection
CN113419948A (en) * 2021-06-17 2021-09-21 北京邮电大学 Method for predicting defects of deep learning cross-project software based on GAN network
CN114328174A (en) * 2021-11-10 2022-04-12 三维通信股份有限公司 Multi-view software defect prediction method and system based on counterstudy
CN114548152A (en) * 2022-01-17 2022-05-27 上海交通大学 A transfer learning-based residual life prediction method for marine sliding bearings
KR20230122370A (en) * 2022-02-14 2023-08-22 한국과학기술원 Method and system for predicting heterogeneous defect through correlation-based selection of multiple source projects and ensemble learning
CN114564410A (en) * 2022-03-21 2022-05-31 南通大学 Software defect prediction method based on class level source code similarity
CN114968774A (en) * 2022-05-17 2022-08-30 北京航空航天大学 Multi-source heterogeneous cross-project software defect prediction method
CN115293057A (en) * 2022-10-10 2022-11-04 深圳先进技术研究院 Wind driven generator fault prediction method based on multi-source heterogeneous data
CN116756041A (en) * 2023-07-19 2023-09-15 中山大学 Code defect prediction and positioning method and device, storage medium and computer equipment
CN117056226A (en) * 2023-08-18 2023-11-14 郑州轻工业大学 Cross-project software defect number prediction method based on transfer learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Cross-Project Defect Prediction Considering Multiple Data Distribution Simultaneously; Yu Zhao, Yi Zhu, Qiao Yu, Xiaoying Chen; Symmetry-Basel; 2022-02-17; Vol. 14, No. 2; full text *
A cross-project software defect prediction method based on domain adaptation; 陈曙, 叶俊民, 刘童; Journal of Software; 2020-02-15 (02); full text *
A cross-project defect prediction method using adversarial learning; 邢颖, 钱晓萌, 管宇, 章世豪, 赵梦赐, 林婉婷; Computer Software and Computer Applications; 2022-06-09; Vol. 33, No. 6; 2097-2112 *
Research on cross-project defect prediction methods based on adversarial domain adaptation; 吴国斌; Computer Software and Computer Applications; 2022-12-16; full text *

Also Published As

Publication number Publication date
CN117421244A (en) 2024-01-19

Similar Documents

Publication Publication Date Title
CN111738172B (en) Cross-domain object re-identification method based on feature adversarial learning and self-similarity clustering
CN114860930A (en) A text classification method, device and storage medium
CN111325264A (en) Multi-label data classification method based on entropy
CN115482418B (en) Semi-supervised model training method, system and application based on pseudo-negative labels
CN118113849B (en) Information consulting service system and method based on big data
CN112966068A (en) Resume identification method and device based on webpage information
CN113158777B (en) Quality scoring method, training method of quality scoring model and related device
CN116910571B (en) Open-domain adaptation method and system based on prototype comparison learning
CN110348227A (en) A kind of classification method and system of software vulnerability
CN113836896A (en) Patent text abstract generation method and device based on deep learning
CN111191033A (en) Open set classification method based on classification utility
CN113987174A (en) Core statement extraction method, system, equipment and storage medium for classification label
CN113282714B (en) An event detection method based on discriminative word vector representation
CN117516937A (en) Unknown fault detection method of rolling bearing based on multi-modal feature fusion enhancement
CN116663539A (en) Chinese entity and relationship joint extraction method and system based on RoBERTa and pointer network
CN114943229A (en) Software defect named entity identification method based on multi-level feature fusion
CN112417147B (en) Method and device for selecting training samples
CN119128076A (en) A judicial case retrieval method and system based on course learning
CN119067683A (en) A fake review detection method integrating text features and aspect features
CN117421244B (en) Multi-source cross-project software defect prediction method, device and storage medium
Singh et al. Facial Emotion Detection Using CNN-Based Neural Network
CN110909547A (en) Judicial entity identification method based on improved deep learning
CN117556152A (en) Video social relationship recognition method and system based on salient information and tag correlation mining
CN116186423A (en) Personality detection method based on social text and links
Thakur et al. Offline handwritten mathematical recognition using adversarial learning and transformers

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant