CN114328174A

CN114328174A - Multi-view software defect prediction method and system based on adversarial learning

Info

Publication number: CN114328174A
Application number: CN202111329931.2A
Authority: CN
Inventors: 韩璐; 严军荣; 潘方
Original assignee: Sunwave Communications Co Ltd
Current assignee: Sunwave Communications Co Ltd
Priority date: 2021-11-10
Filing date: 2021-11-10
Publication date: 2022-04-12

Abstract

The invention discloses a multi-view software defect prediction method and system based on adversarial learning. The method comprises the following steps: constructing a first network model from multi-view software module sample data, wherein the first network model constructs, through deep metric learning, an inter-view discriminant analysis loss function for distinguishing homogeneous views from heterogeneous views; constructing a second network model from the multi-view software module sample data, wherein the second network model constructs, through adversarial learning, an adversarial loss function for distinguishing different software module views in a common subspace; constructing a third network model from the first network model and the second network model; and inputting multi-view software test data into the third network model to obtain a prediction result. The invention addresses the poor prediction performance and low prediction accuracy of existing single-view software defect prediction techniques.

Description

Multi-view software defect prediction method and system based on adversarial learning

Technical Field

The invention belongs to the technical field of software defect prediction, and particularly relates to a multi-view software defect prediction method and system based on adversarial learning.

Background

Existing software defect prediction methods generally construct a set of software modules described by metric elements, then build a prediction model from the historical data of existing software modules, and finally predict the defect proneness of new software modules.

In recent years, with the development of deep neural networks (DNNs), software defect prediction methods based on generative adversarial networks (GANs) have become a new research focus. For example, Chinese patent publication No. CN113419948A, "a prediction method for deep learning cross-project software defects based on GAN network", proposes: representing the code of each program module extracted from the target project and the source project with a simplified abstract syntax tree; extracting token vectors by depth-first traversal of the abstract syntax tree; applying word embedding to the token vectors to obtain a word vector for each word, replacing the tokens in the token vectors with the word vectors, and thereby converting the token vectors into numerical vectors; training a source encoder and a source classifier with the numerical vectors of the source project as input; taking the numerical vectors of the target project as input and setting the initial parameters of a target encoder to those of the trained source encoder; treating the output features of the trained source encoder as real data in the GAN and the output features of the target encoder as fake data, and training through the GAN discriminator; classifying the output features of the target encoder with the trained source classifier; and outputting the classification result.
Chinese patent CN110162475A, "a software defect prediction method based on deep transfer", proposes: converting the source code files of the source project and the target project into image files by a visualization method; constructing a deep transfer network; constructing a loss function from the maximum mean discrepancy between the training-sample features and the test-sample features extracted with a self-attention mechanism, together with the cross entropy between the prediction output of the deep transfer network and the ground-truth labels of the samples, and training the deep transfer network until the loss function converges to obtain a software defect prediction model; in application, the source code file to be checked is converted into an image by the visualization method, the image is input into the software defect prediction model, and the defect prediction result for that source code file is output.

Existing software defect prediction techniques based on adversarial networks are mainly single-view: the extracted metric attributes are directly concatenated into one sample vector for subsequent feature learning. However, when the metric attributes of software module samples are extracted, the metric elements can be divided, from the perspectives of static metrics and dynamic metrics, into a static software module view and a dynamic software module view, and single-view data often lacks the complementary information available in multi-view data. Therefore, existing single-view software defect prediction techniques have poor prediction performance and low prediction accuracy.

At present, no multi-view software defect prediction technique based on adversarial networks exists; a multi-view software defect prediction method based on adversarial learning is therefore proposed.

Disclosure of Invention

In order to solve the above problems, the present invention provides a multi-view software defect prediction method and system based on adversarial learning.

The multi-view software defect prediction method based on adversarial learning according to the invention comprises the following steps:

constructing a first network model from multi-view software module sample data, wherein the first network model constructs, through deep metric learning, an inter-view discriminant analysis loss function for distinguishing homogeneous views from heterogeneous views;

constructing a second network model from the multi-view software module sample data, wherein the second network model constructs, through adversarial learning, an adversarial loss function for distinguishing different software module views in a common subspace;

constructing a third network model according to the first network model and the second network model;

and inputting the multi-view software test data into the third network model to obtain a prediction result.

Preferably, before the first network model is constructed according to the multi-view software module sample data, the method further comprises the following steps:

normalizing the metric elements of the multi-view software module;

and non-linearly projecting the normalized multi-view software module sample data into a common subspace.

Further preferably, normalizing the metric elements of the multi-view software module comprises the steps of:

representing any software module sample in the software repository as a static software module view consisting of static metric elements and a dynamic software module view consisting of dynamic metric elements, wherein the static metric elements represent attribute information counted after development of the project is completed, and the dynamic metric elements represent attribute information recorded during development;

and bringing the metric elements in the static software module view and the dynamic software module view to the same scale, namely the interval [0, 1], by min-max normalization.
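As an illustration of the normalization step, the following is a minimal sketch in Python with NumPy. It assumes each view is stored as a matrix with one row per software module sample and one column per metric element; the function name `min_max_normalize` and the sample values are hypothetical and do not come from the patent.

```python
import numpy as np

def min_max_normalize(view):
    """Scale every metric column of one view to the interval [0, 1]."""
    view = np.asarray(view, dtype=float)
    col_min = view.min(axis=0)
    col_max = view.max(axis=0)
    span = col_max - col_min
    # Assumption: a constant column carries no information and is mapped to 0.
    span[span == 0] = 1.0
    return (view - col_min) / span

# Hypothetical metric values: rows are samples, columns are metric elements.
static_view = np.array([[120.0, 3.0], [40.0, 1.0], [80.0, 2.0]])
dynamic_view = np.array([[5.0, 10.0], [1.0, 30.0], [3.0, 20.0]])

static_norm = min_max_normalize(static_view)
dynamic_norm = min_max_normalize(dynamic_view)
```

Normalizing each view column by column keeps metric elements with large raw ranges from dominating the later distance computations.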

Further preferably, the non-linearly projecting the sample data of the normalized multi-view software module to the common subspace includes:

extracting the initial features of the static software module view from the static software module view dataset;

extracting the initial features of the dynamic software module view from the dynamic software module view dataset;

constructing a parameter-sharing two-channel network;

inputting the initial features of the static software module view into a four-layer FNN to obtain the view-specific features of the static software module view;

inputting the initial features of the dynamic software module view into a four-layer FNN to obtain the view-specific features of the dynamic software module view;

wherein the four-layer FNN mapping function of the static software module view and the four-layer FNN mapping function of the dynamic software module view share the network parameters θ_FNN.
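The parameter-sharing two-channel projection can be sketched as follows. This is a minimal NumPy illustration under several assumptions that the text does not specify: both views have initial features of the same dimensionality, "four-layer" is read as four fully connected layers, the activation is tanh, and the layer widths are arbitrary. The names `init_fnn`, `fnn_forward`, and `theta_fnn` are hypothetical.

```python
import numpy as np

def init_fnn(layer_sizes, seed=0):
    """Create (weights, bias) pairs for a fully connected network."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def fnn_forward(x, params):
    """Non-linearly project features through every layer in turn."""
    h = np.asarray(x, dtype=float)
    for weights, bias in params:
        h = np.tanh(h @ weights + bias)  # assumed activation
    return h

# One parameter set, used by both channels (the shared theta_FNN).
theta_fnn = init_fnn([6, 16, 16, 8, 4])  # four weight layers

rng = np.random.default_rng(1)
static_init = rng.random((5, 6))   # hypothetical initial static-view features
dynamic_init = rng.random((5, 6))  # hypothetical initial dynamic-view features

static_specific = fnn_forward(static_init, theta_fnn)
dynamic_specific = fnn_forward(dynamic_init, theta_fnn)
```

Because both channels read the same `theta_fnn`, identical inputs in the two views are mapped to identical points, which is what places the two views in one common subspace.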

Preferably, constructing, through deep metric learning, the inter-view discriminant analysis loss function for distinguishing homogeneous views from heterogeneous views comprises the steps of:

calculating the distance d(S(i), D(j)) between corresponding sample features of the static software module view and the dynamic software module view in the common subspace, wherein S(i) denotes the sample features of the static software module view and D(j) denotes the sample features of the dynamic software module view;

using the value 1 to denote that the sample class is defective and the value 0 to denote that it is non-defective, wherein samples that consist of the same metric elements and have the same class belong to a homogeneous view, and samples that consist of the same metric elements but have different classes belong to a heterogeneous view;

constructing the inter-view discriminant analysis loss function L_G:

wherein the function h(t) = max(0, t) denotes the hinge loss function, γ is a preset hyperparameter, τ is a preset positive threshold, l(·) denotes the sample class, and the remaining symbols denote the original samples of the static software module view and of the dynamic software module view, respectively.
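The formula for L_G is not reproduced in this text, so the following Python sketch shows only one plausible margin-based form built from the stated ingredients (l₂ distance, hinge function h, hyperparameter γ, threshold τ, class function l). The specific combination used here, pulling same-class pairs inside τ and pushing different-class pairs beyond τ with weight γ, is an assumption and not the patent's exact loss; all function names are hypothetical.

```python
import numpy as np

def hinge(t):
    """h(t) = max(0, t), applied element-wise."""
    return np.maximum(0.0, t)

def pairwise_l2(static_feat, dynamic_feat):
    """Distance d(S(i), D(j)) for every static/dynamic feature pair."""
    diff = static_feat[:, None, :] - dynamic_feat[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def inter_view_loss(static_feat, dynamic_feat, labels_s, labels_d,
                    gamma=1.0, tau=1.0):
    """Assumed form: same-class pairs are penalized when farther than tau,
    different-class pairs when closer than tau (weighted by gamma)."""
    dist = pairwise_l2(static_feat, dynamic_feat)
    same = labels_s[:, None] == labels_d[None, :]
    pull = hinge(dist - tau)[same].sum()
    push = hinge(tau - dist)[~same].sum()
    return pull + gamma * push

# Hypothetical features in the common subspace; 1 = defective, 0 = non-defective.
S = np.array([[0.0, 0.0], [3.0, 4.0]])
D = np.array([[0.0, 0.0], [0.0, 0.0]])
loss = inter_view_loss(S, D, np.array([1, 0]), np.array([1, 0]),
                       gamma=0.5, tau=1.0)
```

In this toy case the same-class pair at distance 5 contributes 4 and the different-class pair at distance 0 contributes 0.5 × 1, so minimizing the loss would move the former together and the latter apart.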

Preferably, constructing, through adversarial learning, the adversarial loss function for distinguishing different software module views in a common subspace comprises the steps of:

constructing a static software module view discriminator and a dynamic software module view discriminator in the common subspace;

taking the view features of the static software module view as real samples and the view features of the dynamic software module view as generated samples, and establishing an adversarial loss function based on the static software module view:

wherein Pdata is the view features of the static software module view in the common subspace, PG is the view features of the dynamic software module view in the common subspace, the loss depends on the network parameters of the static software module view discriminator, E_x~Pdata denotes the expectation over the data distribution of the static software module view, and the corresponding term over PG denotes the expectation over the data distribution of the dynamic software module view;

taking the view features of the dynamic software module view as real samples and the view features of the static software module view as generated samples, and establishing an adversarial loss function based on the dynamic software module view:

wherein the loss depends on the network parameters of the dynamic software module view discriminator;

obtaining the discriminant loss function of the adversarial network from the adversarial loss function based on the static software module view and the adversarial loss function based on the dynamic software module view:
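Since the loss formulas themselves are not reproduced in this text, the sketch below assumes the standard GAN discriminator objective, E[log D(x)] over real samples plus E[log(1 − D(x))] over generated samples, and assumes the two view-specific losses are simply summed. The logistic-regression discriminator and all names are hypothetical stand-ins for the discriminator networks.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator(x, w, b):
    """Probability that each row of x is a 'real' sample (toy linear model)."""
    return sigmoid(x @ w + b)

def adversarial_loss(real, fake, w, b, eps=1e-12):
    """Assumed standard GAN form: E[log D(real)] + E[log(1 - D(fake))]."""
    d_real = discriminator(real, w, b)
    d_fake = discriminator(fake, w, b)
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

def discriminant_loss(static_feat, dynamic_feat, params_s, params_d):
    """Assumed combination: sum of the static-based and dynamic-based losses."""
    # Static-view discriminator: static features real, dynamic generated.
    loss_s = adversarial_loss(static_feat, dynamic_feat, *params_s)
    # Dynamic-view discriminator: dynamic features real, static generated.
    loss_d = adversarial_loss(dynamic_feat, static_feat, *params_d)
    return loss_s + loss_d

# Hypothetical common-subspace features and untrained discriminators.
feat_s = np.array([[1.0, 2.0], [0.5, -1.0]])
feat_d = np.array([[0.0, 1.0], [2.0, 2.0]])
untrained = (np.zeros(2), 0.0)
total = discriminant_loss(feat_s, feat_d, untrained, untrained)
```

With zero weights each discriminator outputs 0.5 for every sample, so the value equals 4·log(0.5); a discriminator that separates the views better would raise it.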

Preferably, constructing the third network model from the first network model and the second network model comprises combining the inter-view discriminant analysis loss function L_G of the first network model with the discriminant loss function L_D(θ_D) of the adversarial network of the second network model, and training with a min-max game strategy, expressed as:

and

and obtaining the parameters of the third network model with a stochastic gradient descent optimization algorithm.

Preferably, the inputting the multi-view software test data into the third network model to obtain the prediction result includes the steps of:

inputting the multi-view software test data into a third network model to obtain sample characteristics;

inputting sample features into a classifier;

and the classifier outputs a classification result according to the sample characteristics, namely a prediction result.
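The prediction step can be sketched as below, assuming a softmax classifier (the classifier type named later in the embodiment) applied to features that the third network model has already produced. The weights, bias, and feature values are hypothetical; class index 1 is taken to mean defective and 0 non-defective, following the labelling convention above.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def predict_defects(features, weights, bias):
    """Return the predicted class per sample and the class probabilities."""
    probs = softmax(features @ weights + bias)
    return probs.argmax(axis=1), probs

# Hypothetical sample features and classifier parameters.
features = np.array([[2.0, 0.0], [0.0, 2.0]])
weights = np.array([[1.0, -1.0], [-1.0, 1.0]])
bias = np.zeros(2)

labels, probs = predict_defects(features, weights, bias)
```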

A computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program, when executed by a processor, causes a computer to perform the above-mentioned method.

A multi-view software defect prediction system based on adversarial learning, comprising:

an input-output device;

a processor;

a memory;

and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs causing the computer to perform the above-described method.

The method and the system have the following advantages:

(1) Deep metric learning is used to constrain the nonlinear features in the subspace, and an inter-view discriminant analysis loss function is designed so that different samples in homogeneous views are pulled close together and different samples in heterogeneous views are pushed apart, which improves the inter-view discriminant analysis capability and effectively mines the structural relationships of the data between views.

(2) Discriminators are constructed, and an adversarial loss function for distinguishing different software module views in the common subspace is built through adversarial learning, so that static software module view features and dynamic software module view features can be effectively distinguished given an unknown feature projection in the common subspace.

(3) The overall network is constructed from the inter-view discriminant analysis loss function and the adversarial loss function, so that the structural relationships of the inter-view data can be mined while the feature structure is preserved, which effectively strengthens the discriminative capability of the network model and improves both the classification prediction performance and the accuracy of the prediction results.

Drawings

FIG. 1 is a flowchart of the multi-view software defect prediction method based on adversarial learning according to an embodiment of the present invention.

FIG. 2 is a schematic structural diagram of the multi-view software defect prediction system based on adversarial learning according to an embodiment of the present invention.

Detailed Description

The present invention will be described in detail with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of protection of the present invention.

A flowchart of the multi-view software defect prediction method based on adversarial learning according to an embodiment of the present invention is shown in FIG. 1 and includes:

In a preferred embodiment, before the first network model is constructed according to the multi-view software module sample data, the method further includes the following steps:

normalizing the metric elements of the multi-view software module;

In a preferred embodiment, the normalizing the metric of the multi-view software module includes the steps of:

In this embodiment, any software module sample v_i in the software repository is jointly represented by two vector components: one corresponds to the static software module view of sample v_i and consists of static metric elements, and the other corresponds to the dynamic software module view of sample v_i and consists of dynamic metric elements. Min-max normalization brings both components to the same scale, namely the interval [0, 1], thereby normalizing the sample metric elements.

In a preferred embodiment, the non-linearly projecting the sample data of the normalized multi-view software module to the common subspace includes the steps of:

Constructing a parameter-shared two-channel network;

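inputting the initial features of the static software module view into a four-layer FNN;

inputting the initial features of the dynamic software module view into a four-layer FNN;

wherein the two four-layer FNN mapping functions share the network parameters θ_FNN.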

In this embodiment, the initial features of the static software module view are extracted from the static software module view dataset, and the initial features of the dynamic software module view are extracted from the dynamic software module view dataset.

A parameter-sharing dual-channel network is constructed, and the shared network parameters are denoted θ_FNN;

from the initial features of the static software module view, the four-layer FNN mapping function of the static software module view computes the view-specific features of the static software module view;

from the initial features of the dynamic software module view, the four-layer FNN mapping function of the dynamic software module view computes the view-specific features of the dynamic software module view.

In a preferred embodiment, constructing, through deep metric learning, the inter-view discriminant analysis loss function for distinguishing homogeneous views from heterogeneous views comprises the steps of:

calculating the distances between corresponding sample features of the static software module view and the dynamic software module view in the common subspace;

using the value 1 to denote that the sample class is defective and the value 0 to denote that it is non-defective;

constructing the inter-view discriminant analysis loss function L_G over the original samples of the static software module view and of the dynamic software module view.

In this embodiment, deep metric learning is used to constrain the nonlinear features in the subspace, and an inter-view discriminant analysis loss function is designed so that different samples in homogeneous views are compact and different samples in heterogeneous views are far from each other, thereby improving the inter-view discriminant analysis capability. Among the original samples of the static software module view and the dynamic software module view, samples that consist of the same metric elements and have the same class belong to a homogeneous view, and samples that consist of the same metric elements but have different classes belong to a heterogeneous view.

Using the l₂ norm, the distance between any two corresponding sample features of the static software module view and the dynamic software module view in the common subspace is calculated as:

in the above formula (1), S(i) denotes the sample features of the static software module view, D(j) denotes the sample features of the dynamic software module view, and the remaining two symbols denote the sample features of the static software module view and of the dynamic software module view mapped into the common subspace, respectively;

sample classes are denoted by l(·) and are divided into defective (denoted by 1) and non-defective (denoted by 0);

constructing an inter-view discriminant analysis loss function L_G：

In the above formula (2), the function h(t) = max(0, t) denotes the hinge loss function, γ is a preset hyperparameter, τ is a preset positive threshold, and the remaining symbols denote the original samples of the static software module view and of the dynamic software module view, respectively.

In a preferred embodiment, constructing, through adversarial learning, the adversarial loss function for distinguishing different software module views in a common subspace comprises the steps of:

constructing a static software module view discriminator and a dynamic software module view discriminator in the common subspace;

discriminator for static software module view

a data distribution representing a dynamic software module view;

taking the view features of the dynamic software module view as real samples and the view features of the static software module view as generated samples, and establishing an adversarial loss function based on the dynamic software module view:

wherein the loss depends on the network parameters of the dynamic software module view discriminator;

In this embodiment, a static software module view discriminator and a dynamic software module view discriminator are constructed in the common subspace. Given an unknown feature projection in the common subspace, the discriminators determine as accurately as possible whether it is a static software module view feature or a dynamic software module view feature;

in the above formula (3), Pdata is the view features of the static software module view in the common subspace, PG is the view features of the dynamic software module view in the common subspace, and the remaining symbols denote the network parameters of the static software module view discriminator and the data distributions of the static and dynamic software module views;

in the above formula (4), the remaining symbol denotes the network parameters of the dynamic software module view discriminator;

formula (3) and formula (4) are combined to obtain the discriminant loss function of the adversarial network:

In a preferred embodiment, constructing the third network model from the first network model and the second network model comprises combining the inter-view discriminant analysis loss function L_G of the first network model with the discriminant loss function L_D(θ_D) of the adversarial network of the second network model, and training with a min-max game strategy, expressed as:

and

In this embodiment, the third network model consists of a generative model and a discriminative model. It combines the corresponding loss functions, namely formula (2) and formula (5), and is trained with the min-max game strategy, expressed as:

and the parameters of the third network model are obtained with a stochastic gradient descent optimization algorithm.

In a preferred embodiment, inputting the multi-view software test data into the third network model to obtain the predicted result comprises the following steps:

inputting sample features into a classifier;

In this embodiment, the multi-view software test data is input into the third network model to compute the sample features; the sample features are then input into a preset classifier, such as a softmax classifier, and the classifier outputs a classification result (defective or non-defective) according to the sample features, which is the prediction result.

The following describes the advantageous effects of the present invention with reference to specific experiments.

The invention was evaluated on AEEEM, a widely used public test dataset for software defect prediction. Table 1 lists the projects contained in the AEEEM dataset, together with the number of samples, the proportion of defective samples, and the number of metric elements for each project.

TABLE 1 AEEEM data set

Project name	Number of samples	Proportion of defective samples (%)	Number of metric elements
EQ	324	39.81	61
JDT	997	20.66	61
LC	691	9.26	61
ML	1862	13.16	61
PDE	1497	13.96	61

First, for the AEEEM dataset, a set of static metric elements such as LOC (lines of code) and FANIN (number of input data) is constructed from the attribute information counted after project development is completed, and a set of dynamic metric elements such as NREV (number of revisions) and DELETELOC (lines deleted) is constructed from the attribute information recorded during development. The full names of the projects in the AEEEM dataset are: EQ for Equinox Framework, JDT for Eclipse JDT Core, LC for Apache Lucene, ML for Mylyn, and PDE for Eclipse PDE UI.

In this experiment, two indicators widely used in software defect prediction, the F-measure and the G-measure, are used to evaluate the performance of the model. They are calculated by the following formulas:

F-measure = (2 * pd * precision) / (pd + precision) (8)

G-measure = (2 * pd * specificity) / (pd + specificity) (9)

where recall (pd) is a statistical measure defined as TP/(TP + FN), in which TP denotes True Positive and FN denotes False Negative; precision is a statistical measure defined as TP/(TP + FP), in which FP denotes False Positive. The G-measure considers both recall and specificity and, as formula (9) shows, is their harmonic mean. Specificity is a statistical measure defined as TN/(TN + FP), in which TN denotes True Negative. The larger the F-measure, the better the performance of the cross-project defect prediction.
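Formulas (8) and (9) can be checked with a short Python sketch. The function name `f_and_g_measure` and the confusion-matrix counts are hypothetical, and the sketch assumes every denominator is non-zero.

```python
def f_and_g_measure(tp, fp, tn, fn):
    """Compute F-measure (8) and G-measure (9) from confusion-matrix counts."""
    pd = tp / (tp + fn)            # recall
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f_measure = 2 * pd * precision / (pd + precision)
    g_measure = 2 * pd * specificity / (pd + specificity)
    return f_measure, g_measure

# Hypothetical counts: 40 true positives, 10 false positives,
# 90 true negatives, 10 false negatives.
f_value, g_value = f_and_g_measure(tp=40, fp=10, tn=90, fn=10)
```

Here pd and precision are both 0.8 and specificity is 0.9, so the F-measure is 0.8 and the G-measure is 1.44/1.7.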

To evaluate the performance of the invention, the following methods were chosen for comparison: (1) the Deep Canonical Correlation Analysis (DCCA) method in the document "Multi-view perceptron: a deep model for learning face identity and view representations" (Zhou Z et al.); (2) the NN-filter method in the document "On the relative value of cross-company and within-company data for defect prediction" (Turhan B et al.); (3) the multi-view deep network (MvDN) method in the document "Multi-view deep network for cross-view classification" (Kan M et al.).

To account for the randomness of sample selection, the experiment was repeated 5 times with random selection. The mean F-measure and G-measure for each test project are reported in Tables 2 and 3. As can be seen from Table 2, the prediction performance of the method is superior to that of the DCCA, NN-filter and MvDN methods, for the following main reasons: the DCCA method pays little attention to mining the discriminative information between views; the NN-filter method concatenates the different views before classifying them and pays little attention to the relations between views; compared with the MvDN method, the present method, while performing feature learning on the software modules, also uses an adversarial network to extract effective discriminative features of the views and can thus deeply mine their high-level semantic features. Therefore, the prediction performance of the method is superior to that of the comparison methods, and the method is an effective software defect feature learning method.

TABLE 2 Average F-measure values of the invention and the comparison methods on each project

TABLE 3 Average G-measure values of the invention and the comparison methods on each project

A schematic structural diagram of the multi-view software defect prediction system based on adversarial learning according to an embodiment of the present invention is shown in FIG. 2. The system comprises:

an input-output device;

a processor;

a memory;

and

Of course, those skilled in the art should recognize that the above embodiments are intended only to illustrate the present invention and not to limit it, and that changes and modifications to the above embodiments fall within the scope of protection of the present invention as long as they remain within its essential spirit.

Claims

1. A multi-view software defect prediction method based on adversarial learning, characterized by comprising the following steps:

2. The method of claim 1, further comprising the steps of, before constructing the first network model from the multi-view software module sample data:

normalizing the metric elements of the multi-view software module;

3. The method of claim 2, wherein the normalizing the metric elements of the multi-view software module comprises the steps of:

4. The method of claim 2, wherein the non-linearly projecting the sample data of the normalized multi-view software module to the common subspace comprises the steps of:

Constructing a parameter-shared two-channel network;

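inputting the initial features of the static software module view into a four-layer FNN;

inputting the initial features of the dynamic software module view into a four-layer FNN;

wherein the two four-layer FNN mapping functions share the network parameters θ_FNN.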

5. The multi-view software defect prediction method based on adversarial learning according to claim 4, wherein constructing, through deep metric learning, the inter-view discriminant analysis loss function for distinguishing homogeneous views from heterogeneous views comprises the steps of:

constructing the inter-view discriminant analysis loss function L_G:

wherein the function h(t) = max(0, t) denotes the hinge loss function, γ is a preset hyperparameter, τ is a preset positive threshold, l(·) denotes the sample class, and the remaining symbols denote the original samples of the static software module view and of the dynamic software module view, respectively.

6. The method of claim 5, wherein constructing, through adversarial learning, the adversarial loss function for distinguishing different software module views in a common subspace comprises the steps of:

constructing a static software module view discriminator and a dynamic software module view discriminator in the common subspace;

discriminator for static software module view

a data distribution representing a dynamic software module view;

wherein

Discriminator for dynamic software module view

The network parameter of (2);

7. The method of claim 6, wherein constructing the third network model from the first network model and the second network model comprises combining the inter-view discriminant analysis loss function L_G of the first network model with the discriminant loss function L_D(θ_D) of the adversarial network of the second network model, and training with a min-max game strategy, expressed as:

and

8. The multi-view software defect prediction method based on adversarial learning according to claim 1, wherein inputting the multi-view software test data into the third network model to obtain the prediction result comprises the steps of:

inputting sample features into a classifier;

9. A computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method according to any one of claims 1-8.

10. A multi-view software defect prediction system based on adversarial learning, comprising:

an input-output device;

a processor;

a memory;

and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the programs causing the computer to perform the method of any of claims 1-8.