CN117421244B

CN117421244B - Multi-source cross-project software defect prediction method, device and storage medium

Info

Publication number: CN117421244B
Application number: CN202311540803.1A
Authority: CN
Inventors: 邢颖; 李文瑾; 高东; 袁军; 顾佳伟; 赵梦赐
Original assignee: Beijing University of Posts and Telecommunications; Nsfocus Technologies Group Co Ltd
Current assignee: Beijing University of Posts and Telecommunications; Nsfocus Technologies Group Co Ltd
Priority date: 2023-11-17
Filing date: 2023-11-17
Publication date: 2024-05-24
Anticipated expiration: 2043-11-17
Also published as: CN117421244A

Abstract

The invention discloses a multi-source cross-project software defect prediction method, a multi-source cross-project software defect prediction device and a storage medium. The method comprises the following steps: inputting a plurality of source item data sets and a target item data set; extracting sample features for all data sets using an encoder; training a discriminator on the project label after reversing the sample characteristic gradient; calculating the maximum mean value difference of the characteristics of each source item and each target item, taking the correlation between the target sample output by the discriminator and the source sample as an attention score, and taking weighted summation of a plurality of maximum mean value differences as the loss of the encoder; establishing a defect class classifier; integrally training an encoder, a discriminator and a classifier; and performing feature extraction and defect classification on the target item data set by using the trained encoder and classifier. The device comprises an input module, an encoder G, a discriminator D, a classifier C and a gradient inversion module. The method realizes the prediction of the multi-source cross-project software defects, and has high defect identification accuracy through experimental verification.

Description

Multi-source cross-project software defect prediction method, device and storage medium

Technical Field

The invention belongs to the technical field of software testing, and particularly relates to a multi-source cross-project software defect prediction method, a multi-source cross-project software defect prediction device and a storage medium thereof.

Background

Defect prediction is a key problem in the field of software engineering. However, in real-world engineering projects, there are several problems to be solved in order to perform defect class prediction. First, the number of defect categories and defects of a project are limited, and the sample size of defects may not be sufficient to support model training. Second, the distribution of defects in the project exhibits long tail effects. For tail defects, although small in number, serious problems may be caused. Traditional cross-project software bug predictions involve only one source project and one target project. The multi-source cross-project software defect prediction can be performed by utilizing a plurality of source projects and one target project, and has wide practical value. But both are typically weaker than intra-project defect predictions due mainly to differences in features in the source and target projects, and inconsistent distributions.

The existing multi-source cross-project software defect prediction method is low in prediction accuracy and reliability due to the fact that feature distribution in a source project and feature distribution in a target project are different, and the requirement on the multi-source cross-project software defect prediction result is difficult to meet.

Disclosure of Invention

The invention provides a multi-source cross-project software defect prediction method, a multi-source cross-project software defect prediction device and a storage medium thereof, which are used for solving the problem that the characteristic distribution difference of a source project and a target project in the prior art has a great negative influence on a prediction result.

The invention provides a multi-source cross-project software defect prediction method, which comprises the following steps:

Step 1, inputting a plurality of source project data sets and a target project data set;

The data set of the target item or each source item comprises samples of K types of defect categories, and each sample is provided with a defect category label and an item label; the defect type label of the source item sample is known, and the defect type label of the target item is unknown; the item tag is used for marking which source item or target item the sample belongs to;

step 2, extracting features from samples in all project data sets by using an encoder G;

each sample is a source code file, firstly, word segmentation is carried out on a source code to obtain a corresponding token id, and then the token id is input into an encoder G for feature extraction;

step 3, performing gradient inversion operation on the extracted sample characteristics, and training a discriminator D by using the sample characteristics with the gradient inversion;

The input of the discriminator D comprises sample characteristics of all source items and target items after gradient inversion, and the probability of each sample belonging to different items is output; in the training process, the probability that one sample feature of the target item belongs to different source items is obtained as the attention score, and meanwhile, the countermeasure training loss of the discriminator D is calculated;

Step 4, calculating the maximum mean value difference of the sample characteristics of each source item and each target item, and carrying out weighted summation on the maximum mean value difference of the sample characteristics of each source item and each target item by using the attention score to obtain the coding loss of the encoder G;

step 5, a classifier C is established, the input of the classifier C is the sample characteristics of the source item extracted in the step 2, and the probability that the sample belongs to various defect types is output;

Step 6, comprehensively training the encoder G, the discriminator D and the classifier C, and updating model parameters;

loss during comprehensive training is: Wherein/> Is the challenge training penalty of discriminator D,/>Is the coding loss of encoder G,/>Is the classification loss of classifier C; /(I)Update acting on encoder G and discriminator D,/>Update acting on encoder G,/>Updates acting on encoder G and classifier C;

and 7, obtaining a trained encoder G and a classifier C, segmenting a source code file of the target item, inputting the segmented source code file into the encoder G to extract sample characteristics, and inputting the sample characteristics into the classifier C to perform defect type recognition.

Correspondingly, the invention provides a multi-source cross-project software defect prediction device, which comprises the following functional modules:

An input module that receives input of a plurality of source item datasets and a target item dataset; the data set of the target item or each source item comprises samples of K types of defect categories, and each sample is provided with a defect category label and an item label; the defect type label of the source item sample is known, and the defect type label of the target item is unknown; the item tag is used for marking which source item or target item the sample belongs to;

An encoder G for extracting features from samples in all item data sets; each sample is a source code file, firstly, the source code is segmented to obtain a corresponding token id, and then the token id is input into an encoder G for feature extraction;

the gradient inversion module is used for performing gradient inversion operation on the extracted sample characteristics;

A discriminator D for inputting sample characteristics including all source items and target items after gradient inversion and outputting probabilities that each sample belongs to different items;

the classifier C inputs the sample characteristics of the extracted source items and outputs the probability that the sample belongs to various defect types;

wherein, the encoder G, the discriminator D and the classifier C are comprehensively trained, and the loss is reduced Is the challenge training penalty of discriminator D,/>Is the coding loss of encoder G,/>Is the classification loss of classifier C; Update acting on encoder G and discriminator D,/> Update acting on encoder G,/>Updates acting on encoder G and classifier C; coding loss/>, of encoder GThe acquisition mode is as follows: obtaining the probability that one sample feature of the target item corresponds to the different source items by using the discriminator D as an attention score, calculating the maximum mean value difference of the sample features of each source item and target item, and carrying out weighted summation on the maximum mean value difference of the sample features of each source item and target item by using the attention score to obtain the coding loss of the encoder G;

After the trained encoder G and classifier C are obtained, the source code file of the target item is segmented, then the source code file is input into the encoder G to extract sample characteristics, and then the source code file is input into the classifier C to carry out defect type recognition.

Further, the present invention provides a readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements a multi-source cross-project software defect prediction method of the present invention.

The invention has the advantages and positive effects that: the invention discloses a multi-source cross-project software defect prediction method, a multi-source cross-project software defect prediction device and a storage medium thereof, wherein the problem of characteristic distribution difference is relieved by reducing the maximum mean difference of characteristic distribution of source projects and target projects, the project domain correlation is acquired by introducing counterlearning in a discriminator, and the maximum mean difference between different source projects and target projects is weighted by combining the project domain correlation, so that the problem of characteristic distribution difference is further relieved. Based on the comprehensive training model on the basis of relieving the characteristic distribution difference, a trained characteristic extraction encoder and a defect prediction classifier are obtained, and experiments prove that the multi-source cross-project software defect prediction can be realized, and the defect prediction accuracy is high.

Drawings

FIG. 1 is a flow chart of an implementation of the multi-source cross-project software defect prediction method of the present invention.

Detailed Description

The invention will be described in further detail with reference to the drawings and examples.

The invention provides a multi-source cross-project software defect prediction method, a multi-source cross-project software defect prediction device and a storage medium thereof, which are realized based on countermeasure training and attention mechanisms. The method comprises the following steps: inputting a plurality of source item data sets and a target item data set; extracting features for all data sets using an encoder; training a discriminator on the item tags of the source item and the target item by countermeasure training; respectively calculating the maximum mean difference of the characteristics of each source item and each target item; outputting, by the discriminator, a correlation between the target sample and the source sample as a concentration score; weighting the plurality of maximum mean differences in combination with the attention score; training a classifier on the class labels of the source items; training and updating parameters of an encoder, a discriminator and a classifier; inputting a target item data set; classifying on the target item data set by using a classifier; and outputting the classification result.

As shown in FIG. 1, the multi-source cross-project software defect prediction method implemented by the embodiment of the invention comprises the following steps S200 to S216.

S200, inputting a source project data set and a target project data set.

In the present invention, the source item is a software item for which a software defect is known, and the category label Y of the source item dataset X ^s is known. The target item is a software item that needs to be judged for defects, and the category label Y ^t of the target item dataset X ^t is unknown. Furthermore, the item tags d of the source item and the target item are considered to exist, i.e. each sample is known to belong to either the source item or the target item. Training is performed using source item data of defective category labels and item labels, and target items of non-defective category labels and item labels as training sets.

In one scenario of the present invention, a set of M source item datasets S ₁,S₂,…S_M are provided, where the j-th source item S _j is a datasetRepresentation, wherein/>Is the number of samples contained in the source item S _*,/>A markup sample representing a source item S _j,/>And representing the defect type label corresponding to the sample. Also given a target item data set T, expressed asWherein/>Is the i-th unlabeled specimen whose defect class label Y ^t is unknown, and n _t is the number of specimens in the target item T. Sample number/>, of all source items marking defectsAll samples of the source and target samples n=n _s+n_t, and all samples are labeled with item label d _i to obtain sample set/>Item tag d _i is a vector of M+1, if sample X _i belongs to source item S _j, then the data bit of item tag d _i corresponding to source item S _j takes a value of 1, and the rest are 0. Each source item dataset S _j, j e1, m contains the same defect categories as in the target item dataset T, i.e. Y ^t consists of the same source domain as any/> The same defect class composition. For example, each of the source item dataset and the target item dataset contains a sample of 44 types of defects.

Similar to single source project software defect identification, Y ^t is not available during the training process and is only used for evaluation. The object of the invention is therefore to train a model with minimal prediction error in the target domain using the labeled source data. Note that the distribution of samples in the source and target domains is different, and thus, a classifier trained on the source domain typically does not predict the target domain well.

S201, extracting features for all item data sets by the encoder G.

In the source and target item data sets, there is a multidimensional feature space in which feature extraction is performed on samples in each source and target item using a pre-trained model CodeT based on a transducer architecture as encoder G. In the embodiment of the invention, the source codes of the samples of the source item and the target item are used as input, the source codes are firstly segmented to obtain the corresponding token ids, then the token ids are used as the input of CodeT, and the CodeT is used for extracting the characteristics. The purpose of feature extraction is to classify the sample in a subsequent step.

In the embodiment of the invention, each sample in the source item and the target item is a source code file, firstly, the code file is segmented, then, the feature is extracted through CodeT, and a feature is extracted for each code file contained in the source item and the target item. The length of the extracted features of each sample may be set to be the same as the number of extracted features of each item data set for convenience of subsequent processing. If the number of code files contained in the source item data set is different from the number of code files contained in the target item data set, repeated feature extraction can be performed on some code files in the items with small numbers so as to ensure that the feature numbers output by the source item and the target item are the same.

S202, performing gradient inversion operation on the extracted features.

And carrying out gradient inversion operation on the sample characteristics of the extracted source item and target item. The feature of gradient inversion automatically reverses the gradient direction during the back propagation. The gradient inversion layer exists between the feature extractor G and the discriminator D, and in the back propagation process, the gradient of the domain classification loss of the domain classifier is automatically inverted before being back propagated to the parameters of the feature extractor, so that the countering loss is realized. After the operation of step S202, sample features of the source item and the target item and sample features after gradient inversion are obtained.

S203, training the discriminator D by using the item target.

In the present invention, the item tags of the source item and the target item refer to which item the sample comes from, and the identifier can be provided with the capability of identifying which item the sample comes from by training the identifier through the item tag. The present invention uses only the gradient-inverted sample features to train the discriminator D. The input of the discriminator D contains sample features of M source items and target items, and outputs item tags corresponding to the samples. Each item tag is a vector of dimension m+1.

In the embodiment of the invention, the discriminator D is realized by a full-connection layer, and the loss of the discriminator of the characteristics which are not subjected to gradient inversion is thatWhere n represents the total number of samples of the source and target items, D _i represents the item tag of sample X _i, G () represents the encoder, D () represents the discriminator, superscript/>Representing the transpose.

In the embodiment of the invention, only the gradient inversion feature is used to train the discriminator D. Inputting the sample characteristics after gradient inversion into the discriminator D, the discriminator D trains towards the direction of erroneously discriminating from which item the sample comes, thereby achieving the purpose of confusing the discriminator D, and smoothing the domain correlation obtained by the discriminator D, wherein the challenge training loss of the discriminator D is expressed as:

for parameter λ, it is gradually changed from 0 to 1 using the following formula:

The parameter p increases with the number of training rounds, p.epsilon.0, 1.

In the embodiment of the invention, during training, for example, 7 source items and 1 target item are shared, 1 item in the 8 items is sequentially taken as the target item, the rest 7 items are the source items, each batch takes 1 sample feature from the 8 items, the discriminator D outputs 8-dimension item labels corresponding to 8 samples, and the value in each dimension represents the probability that the sample belongs to the corresponding item.

S204, obtaining the attention scores of the target item and the source item.

By discriminating a sample of the target item by the discriminator D, the similarity of the target item and the source item can be identified as a attention score for the purpose of weighting the loss at a later time. The probability that one sample feature of the target item corresponds to a different source item is obtained as the attention score. The output w of discriminator D is the probability that the input target sample X _i belongs to a different source item, w=softmax (D (G (X _i))). A larger probability means that the input target sample is more similar to the source item and vice versa. Taking seven source items and one target item as an example, assuming that for a certain target item sample, the discriminator D outputs probabilities w= [0.1,0.6,0.1,0.1,0.05,0.05,0] of corresponding 7 source items, the similarity between the input target sample and the second source item is the highest, and the similarity between the input target sample and the last item is the lowest.

S205, obtaining the countermeasure training loss.

In S202, the gradient inversion decreases the gradient that should have been increased at the time of updating, and the gradient that should have been decreased increases, thereby confusing the discrimination capability of the discriminator D for the target sample, and extracting more domain invariant features.

From the prediction results of the sample predictions for the target item, which belong to different source items, classification losses are calculated by S204 and S205, where classification losses are calculated using cross entropy losses.

S206, calculating the maximum mean difference for the characteristics of each source item and each target item.

For each source item, a maximum mean difference of the sample features of the source item and the sample features of the target item is calculated as follows:

wherein, Representing the maximum mean difference between the source item S _j and the target item T,/>For the sample number of the source item S _j, n _t is the sample number of the target item,/>For the ith sample of source item S _j,/>An h sample representing the target item, gaussian kernel/>Sigma represents the standard deviation of the gaussian distribution.

Reducing the maximum mean difference between the source item and the target item may reduce the feature distribution difference between the source item and the target item.

S207, weighting the plurality of maximum mean differences.

In combination with the attention score obtained in S204, the plurality of maximum mean differences calculated in S206 are weighted, which has the advantage that the influence of similar source items on target items is amplified, and the influence of irrelevant source items on target items is reduced, so that:

Where w _j denotes the similarity of the target sample and the source item j, Representing a weighted sum of the maximum mean differences for all source and target items.

S208, the weighted maximum mean difference loss is obtained as the loss of the encoder G.

In S207, the weighted maximum mean difference is obtained as a loss term for the encoder GFurther reducing the feature distribution difference of the source item and the target item.

S209, training a classifier C for the class labels of the source items.

In the present invention, in addition to aligning the feature distribution of the target item and the source item, the representation learning on each source item is also important to the classification result. And training a classifier C by combining the sample characteristics of the source item extracted in the step S201 and the class labels of the samples, wherein the input of the classifier C is the sample characteristics of the source item which are not subjected to gradient inversion, and the input of the classifier C is a specific defect class.

S210, obtaining the classification loss.

The classification loss obtained by classifying in S209 is used as a loss term, i.eWherein Y _i represents sample/>If the defect class has 44 classes, the defect class vector is a 44-dimensional vector, and each data bit corresponds to a probability of a defect.

S211, training and updating parameters of the encoder G, the discriminator D and the classifier C.

In the present invention, a complete model comprising encoder G, discriminator D and classifier C is trained by three losses from S205, S208 and S210, respectively, i.e

In the present invention, lossUpdate related to encoder G and discriminator D,/>Involving updating of encoder G,/>To the updating of the encoder G and the classifier C. The training data set is used for training the whole model, and model parameters are updated.

S212, outputting the trained encoder G and the classifier C.

In the test stage of the present invention, the target samples are tested using the encoder G and the classifier C trained in steps S201 to S211.

S213, inputting the target item data set.

In practice, the class label of the target item is empty, and in the present invention, the class label of the target item dataset is considered to be present but not labeled. After the source code file in the target item is segmented, the encoder G extracts the characteristics.

S214, predicting on the target item data set by using the classifier C.

The classifier C classifies the samples in the target item dataset based on the information learned on the target item dataset.

S215, calculating a prediction result of the classifier C. And summarizing the results of the classifier C, and counting the number according to different defect categories.

S216, outputting a classification result.

Correspondingly, the invention can also realize a multi-source cross-project software defect prediction device according to the method, which comprises the following steps: an input module, an encoder G, a discriminator D, a classifier C and a gradient inversion module. Wherein the input module receives input of a plurality of source item datasets and a target item dataset; the data set of the target item or each source item comprises samples of K types of defect categories, and each sample is provided with a defect category label and an item label; the defect class labels of the source item samples are known and the defect class labels of the target items are unknown. The encoder G extracts characteristics from samples in all project data sets; each sample is a source code file, firstly, the source code is segmented to obtain a corresponding token id, and then the token id is input into the encoder G for feature extraction. The gradient inversion module performs gradient inversion operation on the extracted sample features. The discriminator D trains by using sample characteristics including all source items and target items after gradient inversion, and outputs probabilities that each sample belongs to different items. The classifier C inputs the extracted sample features of the source item and outputs probabilities that the samples belong to various defect types.

The encoder G, the discriminator D and the classifier C are comprehensively trained as a whole, model parameters are updated, and a Loss function Loss during training is as described in S211. The encoder G uses the attention score to reduce the maximum mean difference of the source item and the target item so as to achieve the purpose of reducing the characteristic distribution difference of the source item and the target item. Coding loss of encoder GThe acquisition mode is as follows: obtaining the probability that one sample feature of the target item corresponds to the belonging to different source items by using the discriminator D as an attention score, calculating the maximum mean difference of the sample features of each source item and target item, and weighting and summing the maximum mean differences of the sample features of each source item and target item by using the attention score as the coding loss/>, of the encoder GAfter training, the trained encoder G and classifier C are output.

And (3) utilizing the trained encoder G and classifier C, segmenting the source code file of the target item, inputting the segmented source code file into the encoder G to extract sample characteristics, and inputting the sample characteristics into the classifier C to perform defect type recognition.

Furthermore, the present invention is based on the above method, and further realizes a readable storage medium having a computer program stored thereon, which when executed by a processor, implements a multi-source cross-project software defect prediction method as described above.

The multi-source cross-project software defect prediction method is subjected to experimental verification, and the defect classification accuracy ACC index is shown in table 1.

TABLE 1 accuracy of multi-source cross-project software defect prediction using the method of the present invention

As shown in table 1, apache JMeter, apache Jena, APACHE LENYA, … … JTree of the first row are 8 items. In the experiment, each item is used as a target item, the rest 7 are used as source items, the defect classification is carried out by using the method of the invention, the accuracy of sample defect classification of the target item in each experiment is shown in the second row in the table 1, and the result shows that the method of the invention realizes the prediction of multi-source cross-item software defects, ACC is above 0.9, and the prediction accuracy of sample defects of the target item is high.

Other than the technical features described in the specification, all are known to those skilled in the art. Descriptions of well-known components and well-known techniques are omitted so as to not unnecessarily obscure the present application. The embodiments described in the above examples are not intended to represent all the embodiments consistent with the present application, and various modifications or variations may be made by those skilled in the art without the need for inventive effort on the basis of the technical solutions of the present application while remaining within the scope of the present application.

Claims

1. A multi-source cross-project software defect prediction method, characterized in that it includes the following steps:

Step 1: input multiple source project datasets and one target project dataset;

The dataset of the target project or each source project contains samples of K defect categories, each of which has a defect category label and a project label. The defect category labels of the source project samples are known, while the defect category labels of the target project are unknown. The project label is used to mark which source project or target project the sample belongs to.

Step 2: Use encoder G to extract features from samples in all project datasets;

Each sample is a source code file. First, the source code is segmented to obtain the corresponding token id, and then the token id is input into the encoder G for feature extraction;

Step 3, perform a gradient reversal operation on the extracted sample features, and use the gradient-reversed sample features to train the discriminator D;

The input of the discriminator D contains the sample features of all source items and target items after gradient inversion, and the output is the probability that each sample belongs to a different item. During the training process, the probability that a sample feature of the target item belongs to a different source item is obtained as the attention score, and the adversarial training loss of the discriminator D is calculated at the same time.

The discriminator D is implemented using a fully connected layer, and the adversarial training loss of the discriminator D is Expressed as:

Where n represents the number of sample features of all source items and target items, _Xi represents the i-th sample feature, _di is the item label of the i-th sample, and the superscript represents transpose; G represents encoder, D represents discriminator; parameter λ gradually changes from 0 to 1 using the following formula,/> The parameter p increases with the number of training rounds, p∈[0,1];

Step 4, calculating the maximum mean difference of the sample features of each source item and the target item, and using the attention score to perform a weighted summation of the maximum mean difference of the sample features of each source item and the target item as the encoding loss of the encoder G;

Suppose the jth source item _Sj contains The number of samples is calculated, and the maximum mean difference of the sample features of the source project S _j and the target project T is calculated/> as follows:

Among them, n _t is the number of samples of target item T, is the i-th sample of source item S _j ,/> Represents the hth sample of target item T, Gaussian kernel function/> σ represents the standard deviation of Gaussian distribution, X and X′ are two sample features;

The encoding loss of encoder G The weighted summation is based on the maximum mean difference between all source items and target items, as follows:

in, represents the weighted sum of the maximum mean differences between all source items and target items; S represents the source item set, which contains M source items; _wj is the attention score of the target item for the source item _Sj ;

Step 5, establish classifier C, the input of classifier C is the sample features of the source project extracted in step 2, and the output is the probability that the sample belongs to various defect types;

Classification loss of classifier C as follows:

Where _Yi represents the sample/> The defect category label, C is the classifier;

Step 6, comprehensively train the encoder G, discriminator D and classifier C, and update the model parameters;

The loss during comprehensive training is: Among them,/> is the adversarial training loss of the discriminator D,/> is the coding loss of encoder G, /> is the classification loss of classifier C; /> Updates applied to encoder G and discriminator D,/> Updates acting on encoder G, /> Updates to the encoder G and classifier C;

Step 7: Obtain the trained encoder G and classifier C, segment the source code file of the target project, input it into the encoder G to extract sample features, and then input it into the classifier C to identify the defect category.

2. The method according to claim 1 is characterized in that in the step 2, a pre-trained model CodeT5 based on the Transformer architecture is used as the encoder G.

3. The method according to claim 1 or 2 is characterized in that, in the step 2, the feature length of each extracted sample is set to be the same; the number of features extracted from each source project data set and the target project data set is set to be the same, and repeated feature extraction of samples is performed for data sets with fewer samples to increase the number of features output by the data set.

4. The method according to claim 1 is characterized in that in the step 3, during the training process, one sample feature is taken from each source project and target project in each round of training as the input of the discriminator D, and the discriminator D outputs the project label corresponding to each sample. The project label is a vector of M+1 dimensions, where M is the number of source projects and the value on each dimension represents the probability that the sample belongs to the corresponding project.

5. A multi-source cross-project software defect prediction device, characterized in that the device implements a multi-source cross-project software defect prediction method according to any one of claims 1 to 4, and the device comprises the following functional modules:

The input module receives inputs of multiple source project data sets and one target project data set; the target project or each source project data set contains samples of K defect categories, each sample has a defect category label and a project label; the defect category label of the source project sample is known, while the defect category label of the target project is unknown; the project label is used to mark which source project or target project the sample belongs to;

Encoder G is used to extract features from samples in all project data sets. Each sample is a source code file. The source code is first segmented to obtain the corresponding token id, and then the token id is input into encoder G for feature extraction.

A gradient reversal module is used to perform a gradient reversal operation on the extracted sample features;

The discriminator D takes as input the sample features of all source items and target items after gradient inversion, and outputs the probability that each sample belongs to a different item;

Classifier C, which inputs the sample features of the extracted source items and outputs the probability that the sample belongs to various defect types;

Among them, the encoder G, discriminator D and classifier C are trained comprehensively, and the loss is the adversarial training loss of the discriminator D,/> is the coding loss of encoder G, /> is the classification loss of classifier C; /> Updates applied to encoder G and discriminator D,/> Updates acting on encoder G, /> Updates applied to encoder G and classifier C; encoding loss of encoder G/> The acquisition method is: use the discriminator D to obtain the probability that a sample feature of the target project belongs to different source projects as the attention score, calculate the maximum mean difference between the sample features of each source project and the target project, and use the attention score to perform weighted summation of the maximum mean difference between the sample features of each source project and the target project as the encoding loss of the encoder G;

After obtaining the trained encoder G and classifier C, the source code file of the target project is segmented and input into the encoder G to extract sample features, and then input into the classifier C for defect category identification.

6. A readable storage medium having a computer program stored thereon, characterized in that when the program is executed by a processor, a multi-source cross-project software defect prediction method as described in any one of claims 1-2 is implemented.