CN112711543A - Workload perception defect prediction method based on weighted software network

Info

Publication number: CN112711543A
Application number: CN202110323054.1A
Authority: CN
Inventors: 宫丽娜; 周宇; 宫宜辉
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2021-03-26
Filing date: 2021-03-26
Publication date: 2021-04-27
Anticipated expiration: 2041-03-26
Also published as: CN112711543B

Abstract

The invention belongs to the field of software defect prediction and discloses a workload-aware defect prediction method based on a weighted software network. The method defines a way of calculating the association strength between software modules from two kinds of association, namely dependencies between software modules and collaborative developers, and uses it to construct an effective weighted software network structure. The strong learning capability of graph embedding is then used to autonomously learn the feature representations of the software modules in the weighted software network graph, so that the data dependencies, call dependencies, and collaborative-developer relationships between software modules are better reflected. In addition, the workload of inspecting code to find defects is taken into account when constructing the defect prediction method, which matches the practical needs of software development and makes it easier to find software defects conveniently and accurately.

Description

Workload perception defect prediction method based on weighted software network

Technical Field

The invention belongs to the field of software defect prediction, and relates to a workload perception defect prediction method based on a weighted software network.

Background

Defects inevitably arise during software development and can result in significant economic loss. Quickly and accurately finding software defects therefore plays a crucial role in ensuring the quality of a software system.

Software defect prediction techniques aim to identify high-risk defective modules and narrow the scope of code that developers must review and test, thereby allocating limited resources sensibly. Defect prediction based on metric information is the most common approach; the metric information mainly comprises manually designed metric elements and metric elements learned autonomously from abstract syntax trees.

However, neither the features learned from abstract syntax trees nor the manually designed metric elements sufficiently represent the semantic information of the source code, in particular the rich dependency relationships (such as data dependencies and call dependencies) between software modules. As a result, these metric elements perform less than ideally when used for software defect prediction in real software engineering projects.

Patent document 1 discloses a software defect prediction method based on a graph convolutional neural network, which trains a model with the GCN algorithm to predict the defect type of an input code file. Its principle is as follows:

First, file associations in the source code are obtained by constructing abstract syntax trees; then, relationships between files in the code through which defects may propagate are mined with the Apriori association algorithm; finally, the source file features and associations are input into a GCN model for training.

However, the software network graph constructed in that patent document is based mainly on abstract syntax trees and on relationships between feature vectors mined by the association algorithm; it considers neither the dependencies between modules nor the collaborative-developer relationships.

Patent document 2 discloses a software defect prediction method based on a module dependency graph. Its principle is as follows:

A software module dependency graph is established according to the dependency relationships among the software modules, with developers taken as nodes in the module dependency graph; network representation learning is used to extract the dependency features in the software module dependency graph; and a defect prediction model based on the module dependency graph is established.

However, the software network graph constructed in that patent document does not take into account the influence of inter-module dependency strength on defect identification, nor is workload awareness taken into account when the defect prediction model is evaluated.

In summary, the prior art documents provide a good research basis for dependency-based defect prediction. However, the defect prediction capability of current software network metric elements has not been fully exploited, mainly in the following respects:

1. When the software network is constructed, the influence of the association strength between modules on defect identification is not considered;

2. For lack of a workload-aware component, the resulting defect classification still requires a large amount of time for code inspection, so its practicality is poor.

Software defects seriously affect software quality and can even cause serious economic loss. Software modules contain rich semantic and structural information, and these semantic and structural relationships influence how defects propagate.

The present method fully considers the association strength between modules and mines the influence of the structural and semantic information between software modules on workload-aware defect prediction, which helps to provide a more effective workload-aware defect prediction model and to allocate test resources sensibly.

Patent documents

Patent document 1: Chinese patent application publication No. CN110888798A, publication date: 2020.03.17;

Patent document 2: Chinese patent application publication No. CN111209211A, publication date: 2020.05.29.

Disclosure of Invention

The invention aims to provide a workload-aware defect prediction method based on a weighted software network. The method considers the influence of the association strength between modules on defect identification when constructing the weighted software network graph, and considers the workload of inspecting code to find defects when constructing the defect prediction method. It thus matches the practical needs of software development and makes it easier to find software defects quickly and accurately.

To achieve this purpose, the invention adopts the following technical solution:

A workload-aware defect prediction method based on a weighted software network comprises the following steps:

I. Data collection and fusion;

collecting software system data and defect reports from a defect tracking system, wherein the collected software system data comprises the source code, code submissions, and version information in a version control repository;

establishing connections between the code submissions and the defect reports, thereby fusing the code submission data with the defect report data;

II. Defective module marking;

II.1. extracting text similarity features, personnel consistency features, and timestamp features of the code submissions and defect reports;

based on the three types of extracted features and the connections between code submissions and defect reports established in step I, finding the missing connections between code submissions and defect reports, and thereby determining the code submissions that repair defects;

II.2. identifying the code submissions that introduce defects, based on the code submissions that repair defects;

II.3. based on the defect-repairing code submissions of step II.1 and the defect-introducing code submissions of step II.2, marking each software module in combination with the version information in the version control repository, to obtain defect label data for each software module;

III. Constructing a weighted software network graph;

III.1. extracting and aggregating the comments in the source code of each software module of the software system to obtain document information for each software module, and obtaining developer information for each software module from the submitter and modified-file information of the code submissions;

establishing data dependency and call dependency associations between software modules, and establishing collaborative-developer associations between software modules that share developers;

if any one of a data dependency, a call dependency, or a collaborative-developer association exists between two software modules, the two software modules are associated, and a software network graph with the following structure is constructed:

each software module is a node of the software network graph, and an edge is established between the nodes representing two associated software modules;

III.2. calculating, from the document information and developer information of each software module, the association strength between the two software modules connected by each edge in the software network graph, and assigning the association strength value as the weight of that edge;

normalizing the weights of the edges from a software module to the software modules it depends on, to obtain the weighted software network graph;

IV. Obtaining a software network metric element representation of the weighted software network graph based on a graph embedding technique;

IV.1. converting the weighted software network graph of step III into a probability co-occurrence matrix, converting the probability co-occurrence matrix into a pointwise mutual information matrix, and setting the negative values of that matrix to zero to obtain a positive pointwise mutual information matrix;

IV.2. taking each row of the positive pointwise mutual information matrix as the input for one node, training an autoencoder model, and autonomously learning the rich inter-module structural and semantic information in the weighted software network graph, to obtain the network metric element representation of the weighted software network graph;

V. Constructing the workload-aware defect prediction method;

V.1. extracting the code metric elements and process metric elements of each software module based on the source code and code submission information in the version control repository, and combining them with the network metric elements obtained in step IV to form the feature data of a defect data set;

combining the feature data with the defect label data obtained in step II to form the software defect data set;

V.2. forming separate training sets from each type of metric element and from each combination of at least two types of metric elements in the defect data set, and training a classifier on each training set to form the base learners;

V.3. optimizing the combination of the base learners with an optimization algorithm, with the objective of finding the most defects with the least workload, to form the workload-aware defect prediction method, obtaining the defect probability of each software module, and thereby giving the order in which the software modules are to be inspected for defects.

Preferably, in step I, the code submissions are searched for defect report number keywords, and if a code submission contains the number of a defect report, a connection is established between that code submission and that defect report, thereby fusing the code submission data with the defect report data.

Preferably, in step II.1, the text similarity features are extracted as follows:

based on the code submission and defect report data collected and fused in step I, the text information of the code submissions and defect reports is processed by word segmentation and TF-IDF to obtain the text similarity features between code submissions and defect reports;

in step II.1, the personnel consistency features are extracted as follows:

the personnel consistency features are extracted from the submitter information of the code submissions collected in step I and the repairer information of the defect reports;

in step II.1, the timestamp features are extracted as follows:

the timestamp features are extracted from the submission times of the code submissions collected in step I and the repair times of the defect reports.

Preferably, in step II.3, the software modules modified between the defect-introducing code submission of step II.2 and the defect-repairing code submission of step II.1 are marked as defective, and all other software modules are marked as non-defective.

Preferably, in step III.2, the association strength between software modules is calculated as follows:

$$\mathrm{Degree\_Relation\_Strength}(A,B) = \mathrm{Similarity}(\mathrm{document}(A), \mathrm{document}(B)) + \frac{|\mathrm{develop}(A) \cap \mathrm{develop}(B)|}{|\mathrm{develop}(A)| + |\mathrm{develop}(B)|}$$

wherein $A$ and $B$ represent two associated software modules;

$\mathrm{Degree\_Relation\_Strength}(A,B)$ represents the association strength between the two software modules;

$\mathrm{document}(\cdot)$ represents the document information of a software module, and $\mathrm{develop}(\cdot)$ represents the developer information of a software module;

$\mathrm{Similarity}(\mathrm{document}(A), \mathrm{document}(B))$ represents the text similarity value of software module $A$ and software module $B$;

$|\mathrm{develop}(\cdot)|$ represents the number of developers of a software module;

$|\mathrm{develop}(A) \cap \mathrm{develop}(B)|$ represents the number of developers that software module $A$ and software module $B$ have in common.

Preferably, in step IV.1, the probability co-occurrence matrix $PCO$ of the weighted software network graph is generated as follows:

first, an adjacency matrix $A$ is constructed; it represents the transition probabilities between the nodes of the weighted software network graph and serves as the transition matrix, wherein $A$ is an $n \times n$ matrix and $n$ is the number of nodes, that is, the number of software modules in the software system;

$A_{ij}$ represents an element of the transition matrix $A$, wherein $i$ and $j$ denote different nodes of the weighted software network graph;

if there is an edge between node $i$ and node $j$, $A_{ij}$ is assigned the weight of that edge; otherwise, $A_{ij}$ is assigned 0;

for each node, a row vector $p_k = (p_{k1}, p_{k2}, \ldots, p_{kj}, \ldots, p_{kn})$ is introduced, wherein $p_{kj}$ represents the probability of reaching node $j$ after $k$ transition steps; the row vector $p_k$ is calculated as:

$$p_k = (1-a)\, p_{k-1} A + a\, p_0$$

wherein $a$ represents the probability of returning to the original node, and $p_0$ is initialized as a one-hot encoding; from the resulting row vectors $p_k$, the vector representation of each node in the weighted software network graph is obtained as $r = \sum_{k=1}^{K} p_k$, wherein $K$ is the total number of steps;

the vector representations $r$ of all nodes are combined to generate the probability co-occurrence matrix $PCO$ of the weighted software network graph.
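The recurrence $p_k = (1-a)\,p_{k-1}A + a\,p_0$ and the sum $r = \sum_{k=1}^{K} p_k$ can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `random_surfing` and the example values of `a` and `K` are hypothetical and not taken from the patent, and `A` is assumed to be given as a list of rows whose out-edge weights are already normalized.

```python
def random_surfing(A, a, K):
    """Return the n x n probability co-occurrence matrix (one row r per node).

    A: n x n transition matrix as a list of lists (row i holds the weights of
       the edges leaving node i); a: probability of returning to the start
       node; K: total number of steps. Names and values are illustrative.
    """
    n = len(A)
    pco = []
    for start in range(n):
        p0 = [1.0 if j == start else 0.0 for j in range(n)]  # one-hot start vector
        p = p0[:]
        r = [0.0] * n
        for _ in range(K):
            # Row vector times matrix: (p A)_j = sum_u p_u * A[u][j]
            pA = [sum(p[u] * A[u][j] for u in range(n)) for j in range(n)]
            # p_k = (1 - a) * p_{k-1} A + a * p_0
            p = [(1 - a) * pA[j] + a * p0[j] for j in range(n)]
            r = [r[j] + p[j] for j in range(n)]
        pco.append(r)
    return pco
```

When every row of `A` sums to 1, each $p_k$ is itself a probability distribution, so each row of the resulting matrix sums to `K`; this is a convenient sanity check on an implementation.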

Preferably, in step IV.1, the probability co-occurrence matrix $PCO$ is converted into the pointwise mutual information matrix $PMI$ by the following formula:

$$PMI_{i,j} = \log\left(\frac{PCO_{i,j}}{\#(i) \times \#(j)}\right)$$

wherein $PMI_{i,j}$ is the element in row $i$, column $j$ of the pointwise mutual information matrix $PMI$;

$PCO_{i,j}$ is the element in row $i$, column $j$ of the probability co-occurrence matrix $PCO$;

$\#(\cdot)$ represents the sum of the corresponding column of the probability co-occurrence matrix;

the negative values of the pointwise mutual information matrix are adjusted to obtain the positive pointwise mutual information matrix $PPMI$ by the following formula:

$$PPMI_{i,j} = \max(PMI_{i,j}, 0)$$

wherein $PPMI_{i,j}$ is the element in row $i$, column $j$ of the positive pointwise mutual information matrix $PPMI$;

$\max(PMI_{i,j}, 0)$ means that if $PMI_{i,j}$ is negative, $PPMI_{i,j}$ is assigned 0.
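The two conversions above can be illustrated together. The sketch below follows the formulas as stated, taking $\#(\cdot)$ to be a column sum of $PCO$ for both indices. Two points are assumptions rather than part of the source: the function name `ppmi_matrix` is hypothetical, and entries where $PCO_{i,j}$ is 0 (for which the logarithm is undefined) are simply set to 0.

```python
import math


def ppmi_matrix(pco):
    """Convert a square probability co-occurrence matrix into a PPMI matrix.

    PMI[i][j] = log(PCO[i][j] / (#(i) * #(j))), with #(.) the column sums,
    and PPMI[i][j] = max(PMI[i][j], 0). Zero entries of PCO are mapped to 0
    (an assumption: the source does not say how log(0) is handled).
    """
    n = len(pco)
    col_sum = [sum(pco[i][j] for i in range(n)) for j in range(n)]
    ppmi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = col_sum[i] * col_sum[j]
            if pco[i][j] > 0 and denom > 0:
                pmi = math.log(pco[i][j] / denom)
                ppmi[i][j] = max(pmi, 0.0)  # negative PMI values are clipped to 0
    return ppmi
```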

Preferably, in step V.1, the code metric elements and process metric elements are extracted as follows:

based on the source code in the version control repository collected in step I, the code metric elements of each software module are extracted with the Understand tool, the code metric elements comprising the number of lines of code, the number of functions in a class, and the number of comment lines;

based on the code submission information in the version control repository collected in step I, the process metric elements of each software module are extracted, the process metric elements comprising the number of added lines, the number of deleted lines, the number of modified lines, and the number of developers.

Preferably, in step V.2, the classifier is a random forest classifier;

based on the random forest classifiers, the defect probability of each software module is calculated by the following formula:

$$\mathrm{Probability}(x^i) = \sum_{k=1}^{T} a_k\, \mathrm{Probability}_{c_k}(x^i)$$

wherein $x^i$ is the $i$-th software module, and $\mathrm{Probability}(x^i)$ is the defect probability of software module $i$;

$T$ is the number of random forest classifiers, $T = 7$; $a_k$ is the parameter of the $k$-th random forest classifier;

$\mathrm{Probability}_{c_k}(x^i)$ is the probability that the $k$-th random forest classifier assigns to the $i$-th software module; $c_k$ represents a base learner;

the software modules in the software system are sorted in descending order of their defect probability $\mathrm{Probability}(x^i)$ to obtain the order in which code is to be inspected;

to determine the optimal $a_k$, particle swarm optimization is used to optimize $a_k$ such that the most defects are found when inspecting 20% of the total lines of code of the software project; the optimization function is:

$$cost = \frac{|y_{bug}(x^i)|}{|y(x^i)|}, \quad i \in k, \quad \sum_{i=1}^{k} loc_i < 0.2 \times \sum_{i=1}^{n} loc_i$$

wherein $cost$ represents the optimization function, $|y_{bug}(x^i)|$ represents the number of defective modules found, and $|y(x^i)|$ represents the number of software modules inspected; $loc_i$ represents the number of lines of code of the $i$-th software module;

the values of $a_k$ are adjusted to optimize $cost$, yielding the optimal combined strong classifier, that is, the workload-aware defect prediction method.
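As an illustration of how the weighted combination and the effort-constrained objective fit together, the following sketch computes the combined defect probability of each module and then evaluates the $cost$ ratio over the top-ranked modules whose cumulative lines of code stay strictly below 20% of the total. The function names and the input layout (one list of probabilities per base learner) are hypothetical; an optimizer such as particle swarm optimization would call a function like `effort_aware_cost` repeatedly while varying the weights.

```python
def combined_probability(weights, base_probs):
    """weights: the T values a_k; base_probs: T lists, one probability per module."""
    n = len(base_probs[0])
    return [sum(a * probs[i] for a, probs in zip(weights, base_probs))
            for i in range(n)]


def effort_aware_cost(scores, is_defective, loc, budget=0.2):
    """Fraction of inspected modules that are defective within the LOC budget.

    Modules are inspected in descending score order until adding the next one
    would reach budget * total LOC (the constraint in the text is strict).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    limit = budget * sum(loc)
    spent = inspected = found = 0
    for i in order:
        if spent + loc[i] >= limit:
            break
        spent += loc[i]
        inspected += 1
        found += is_defective[i]
    return found / inspected if inspected else 0.0
```

For example, with four modules of 10, 200, 20, and 270 lines, the budget is 100 lines, so only the two highest-ranked small modules can be inspected before the limit is reached.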

The invention has the following advantages:

As described above, the invention relates to a workload-aware defect prediction method based on a weighted software network. It defines a method for calculating the association strength between software modules from two kinds of association, namely dependencies between software modules and collaborative developers, constructs an effective weighted software network structure, and uses the strong learning capability of graph embedding to autonomously learn the feature representations of the software modules in the weighted software network graph, thereby better reflecting the data dependencies, call dependencies, and collaborative-developer relationships between software modules. In addition, the workload of inspecting code to find defects is taken into account when constructing the defect prediction method, which matches the practical needs of software development and makes it easier to find software defects quickly and accurately, thereby greatly reducing the time and monetary cost of software development.

Drawings

FIG. 1 is a flow diagram of the workload-aware defect prediction method based on a weighted software network according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of the weighted software network graph constructed in an embodiment of the present invention.

Detailed Description

The basic idea of the invention is as follows: a workload-aware defect prediction method based on a weighted software network is provided, in which a way of expressing the association strength between modules is designed according to the types and number of dependencies between modules and the contributions of developers to the modules. On this basis, a more effective workload-aware defect prediction method (model) based on the weighted software network is obtained, which yields the probability that each software module is defective and hence the order in which the software modules should be inspected. This gives a software development company a test resource allocation scheme that detects the most defects with the least workload, improving software quality and guiding the development of software projects.

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

Examples

As shown in fig. 1, the workload-aware defect prediction method based on the weighted software network includes the following steps:

I. Data collection and fusion.

Software system data is collected as well as defect reports from defect tracking systems such as Bugzilla and JIRA. The software system data collected includes source code, code submissions, and version information data in a version control repository (e.g., Git and SVN).

The code submission data includes title, description, discussion, modified file information, submitter, and submission time information. The defect report data includes number, title, description, discussion, repairer, and repair time information.

In this embodiment, the defect report of the defect tracking system is a defect report that has been repaired in the defect tracking system.

Connections are established between the code submissions and the defect reports, thereby fusing the code submission data with the defect report data.

The connections are established as follows: the code submission data (title, description, and discussion) is searched for defect report number keywords; if the code submission data contains the number of a defect report, a connection is established between that code submission and that defect report, thereby fusing the code submission and defect report data.
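A minimal sketch of this keyword-based linking is shown below. It assumes JIRA-style report numbers such as `PROJ-12`; the regular expression, the dictionary field names, and the function name `link_commits_to_reports` are all illustrative assumptions, and a Bugzilla-style numbering scheme would need a different pattern.

```python
import re

# Assumed key format: uppercase project prefix, hyphen, digits (e.g. "PROJ-12").
REPORT_KEY = re.compile(r"\b[A-Z][A-Z0-9]*-\d+\b")


def link_commits_to_reports(commits, report_ids):
    """Return sorted (commit id, report id) pairs for commits that mention a known report.

    commits: list of dicts with 'id', 'title', 'description', 'discussion'.
    report_ids: set of defect report numbers taken from the defect tracker.
    """
    links = []
    for commit in commits:
        text = " ".join([commit["title"], commit["description"], commit["discussion"]])
        for key in set(REPORT_KEY.findall(text)):
            if key in report_ids:  # keep only numbers that really exist in the tracker
                links.append((commit["id"], key))
    return sorted(links)
```

Checking each candidate against the set of known report numbers avoids linking a commit to a string that merely looks like a report number.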

II. Defective module marking.

II.1. Extract the text similarity features, personnel consistency features, and timestamp features of the code submissions and defect reports.

The specific extraction process is as follows:

Based on the code submission and defect report data collected and fused in step I, the text information (title, description, and discussion) of the code submissions and defect reports is processed to obtain the text similarity features between code submissions and defect reports.

This text processing is implemented, for example, by word segmentation and TF-IDF.

The personnel consistency feature is obtained by checking whether the submitter of a code submission collected in step I and the repairer of a defect report are the same person. The timestamp feature is obtained as the difference between the submission time of a code submission collected in step I and the repair time of a defect report.
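The three features for one candidate pair of code submission and defect report might be computed as in the sketch below. Several choices are assumptions made for illustration only: whitespace and punctuation splitting stands in for word segmentation, the IDF is smoothed and computed over just the two texts being compared, the similarity is a cosine over TF-IDF weights, and the field names are hypothetical.

```python
import math
from collections import Counter
from datetime import datetime


def tokenize(text):
    """Lowercase the text and split on non-alphanumeric characters."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (smoothed IDF)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
             for t, c in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def link_features(commit, report):
    """Text similarity, personnel consistency, and time difference for one pair."""
    u, v = tfidf_vectors([tokenize(commit["text"]), tokenize(report["text"])])
    seconds = abs((report["repair_time"] - commit["submit_time"]).total_seconds())
    return {
        "text_similarity": cosine(u, v),
        "same_person": 1 if commit["submitter"] == report["repairer"] else 0,
        "hours_apart": seconds / 3600.0,
    }
```

In practice the IDF would more plausibly be computed over the whole corpus of submissions and reports rather than over a single pair.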

Combining the extracted text similarity, personnel consistency, and timestamp features with the established connections between code submissions and defect reports, the missing connections between code submissions and defect reports are found, thereby determining the code submissions that repair defects.

This embodiment preferably uses the PU (Positive and Unlabeled) Learning algorithm to find the missing connections between code submissions and defect reports. The PU Learning algorithm is a known algorithm and is not described here.

II.2. Based on the code submissions that repair defects, the improved SZZ algorithm is used to identify the code submissions that introduce defects. Identifying defect-introducing code submissions with the improved SZZ algorithm is a known process and is not described in detail here.

II.3. Based on the defect-repairing code submissions of step II.1 and the defect-introducing code submissions of step II.2, each software module is marked in combination with the version information in the version control repository, to obtain the defect label data of each software module.

The specific marking process is as follows:

The software modules modified between the defect-introducing code submission of step II.2 and the defect-repairing code submission of step II.1 are marked as defective, and the other software modules in the same version are marked as non-defective.

This marking process yields the defect labels of the software modules in the corresponding version.
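One way to read this marking rule is sketched below. It rests on an interpretive assumption that the source does not spell out: a module is labeled defective in a version when a defect touching it was introduced at or before that version's release and repaired only afterwards. The function name, the record layout, and the use of plain comparable timestamps are hypothetical.

```python
def label_modules(modules, defects, release_time):
    """Return {module: 1 or 0} for one version.

    modules: iterable of module names present in the version.
    defects: list of dicts with 'introduced' and 'fixed' timestamps (any
             comparable type) and 'modules', the set of modules modified.
    release_time: timestamp of the version being labeled.
    """
    labels = {m: 0 for m in modules}  # default: non-defective
    for defect in defects:
        # The defect is live in this version if the release falls between
        # the defect-introducing and the defect-repairing submission.
        if defect["introduced"] <= release_time < defect["fixed"]:
            for m in defect["modules"]:
                if m in labels:
                    labels[m] = 1
    return labels
```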

III. Constructing a weighted software network graph.

III.1. The comments in the source code of each software module of the software system are extracted and aggregated to obtain the document information of each software module, and the developer information of each software module is obtained from the submitter and modified-file information of the code submissions.

Data dependency and call dependency associations between software modules are established with the Understand tool, and collaborative-developer associations are established between software modules that share developers.

Establishing data dependency and call dependency associations between software modules with the Understand tool is a known process.

If any one of a data dependency, a call dependency, or a collaborative-developer association exists between two software modules, the two software modules are associated, and a software network graph with the following structure is constructed:

Each software module is a node of the software network graph, and an edge is established between the nodes representing two associated software modules.

On this principle, a software network graph based on data dependencies, call dependencies, and collaborative-developer associations is constructed.
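The construction rule (an edge wherever at least one of the three associations holds) can be sketched as follows. The dependency pairs are assumed to have been exported beforehand, for instance from a static analysis tool; the function name and input layout are hypothetical, and representing a shared-developer association as a pair of directed edges is an assumption made so that all edges share one format.

```python
from itertools import combinations


def build_software_network(data_deps, call_deps, developers):
    """Return (nodes, edges) of the unweighted software network.

    data_deps, call_deps: iterables of (source module, target module) pairs.
    developers: dict mapping each module to the set of its developers.
    """
    edges = set()
    for src, dst in list(data_deps) + list(call_deps):
        if src != dst:
            edges.add((src, dst))
    # Two modules with at least one developer in common are associated;
    # the association is symmetric, so both directions are added.
    for a, b in combinations(sorted(developers), 2):
        if developers[a] & developers[b]:
            edges.add((a, b))
            edges.add((b, a))
    nodes = set(developers) | {m for edge in edges for m in edge}
    return nodes, edges
```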

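III.2. From the document information and developer information of each software module, a way of calculating the association strength between software modules based on citation influence is provided. The association strength is calculated as follows:

$$\mathrm{Degree\_Relation\_Strength}(A,B) = \mathrm{Similarity}(\mathrm{document}(A), \mathrm{document}(B)) + \frac{|\mathrm{develop}(A) \cap \mathrm{develop}(B)|}{|\mathrm{develop}(A)| + |\mathrm{develop}(B)|}$$

wherein $A$ and $B$ represent two associated software modules; $\mathrm{document}(\cdot)$ and $\mathrm{develop}(\cdot)$ represent the document information and developer information of a software module; $\mathrm{Similarity}(\mathrm{document}(A), \mathrm{document}(B))$ represents the text similarity value of the two modules; $|\mathrm{develop}(\cdot)|$ represents the number of developers of a software module; and $|\mathrm{develop}(A) \cap \mathrm{develop}(B)|$ represents the number of developers the two modules have in common.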
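The association strength is the sum of two terms: the text similarity of the two modules' document information, and the number of shared developers divided by the sum of the two modules' developer counts. The sketch below computes it with the similarity measure passed in as a function, since the formula does not fix a particular measure; the Jaccard word overlap used in the example, the function names, and the inputs are stand-in assumptions.

```python
def jaccard(doc_a, doc_b):
    """Stand-in text similarity: word-set overlap (an assumption, not the patent's measure)."""
    words_a, words_b = set(doc_a.lower().split()), set(doc_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def association_strength(doc_a, doc_b, devs_a, devs_b, similarity=jaccard):
    """Similarity(document(A), document(B)) + |shared devs| / (|devs(A)| + |devs(B)|)."""
    total = len(devs_a) + len(devs_b)
    developer_term = len(devs_a & devs_b) / total if total else 0.0
    return similarity(doc_a, doc_b) + developer_term
```

For two modules documented as "parse input stream" and "parse output stream", with developers {ann, bob} and {bob}, the stand-in similarity is 2/4 and the developer term is 1/3.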

The association strength value obtained from this calculation is assigned as the weight of the corresponding edge in the software network graph. The weights of the edges from a software module to the software modules it depends on are then normalized to obtain the weighted software network graph.

Taking FIG. 2 as an example, if software module B depends on software modules A, D, and E, the weights of edges BA, BD, and BE are normalized so that they sum to 1; the resulting weighted software network graph is shown in FIG. 2.
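The normalization in this example amounts to dividing each outgoing edge weight of a module by the sum of that module's outgoing weights. A minimal sketch, with a hypothetical function name and made-up raw weights for the edges BA, BD, and BE:

```python
from collections import defaultdict


def normalize_out_weights(weights):
    """weights: dict (source, target) -> raw association strength.

    Returns a new dict in which the weights of the edges leaving each source
    module sum to 1.
    """
    totals = defaultdict(float)
    for (src, _dst), w in weights.items():
        totals[src] += w
    return {(src, dst): (w / totals[src] if totals[src] else 0.0)
            for (src, dst), w in weights.items()}


# Illustrative raw strengths only; the patent gives no numeric values here.
normalized = normalize_out_weights({("B", "A"): 0.5, ("B", "D"): 1.0, ("B", "E"): 0.5})
```

After normalization, the rows of the transition matrix built from these weights each sum to 1, which is what a random-walk style embedding step expects.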

Because the association strength formula of this embodiment correctly represents the association strength between software modules, the constructed weighted software network graph conforms more closely to the relationships between the actual software modules.

IV. Obtaining a software network metric element representation of the weighted software network graph based on a graph embedding technique.

IV.1. The weighted software network graph of step III is converted into a probability co-occurrence matrix using a random surfing model, and the probability co-occurrence matrix is further converted into a positive pointwise mutual information matrix over all nodes of the weighted software network graph.

The probability co-occurrence matrix is generated as follows:

If there is an edge between node $i$ and node $j$, the element $A_{ij}$ of the transition matrix $A$ is assigned the weight of that edge; otherwise, $A_{ij}$ is assigned 0.

For each node, a row vector $p_k = (p_{k1}, p_{k2}, \ldots, p_{kj}, \ldots, p_{kn})$ is introduced, wherein $p_{kj}$ represents the probability of reaching node $j$ after $k$ transition steps. Taking into account that the walk returns to the original starting node with probability $a$, the row vector $p_k$ is calculated as:

$$p_k = (1-a)\, p_{k-1} A + a\, p_0$$

wherein $a$ represents the probability of returning to the original node;

$p_0$ is initialized as a one-hot encoding (that is, a vector whose entry at the node's own index is 1 and whose other entries are 0).

From the resulting row vectors $p_k$, the vector representation of each node in the weighted software network graph is obtained as $r = \sum_{k=1}^{K} p_k$, wherein $K$ is the total number of steps; the vector representations $r$ of all nodes are combined to generate the probability co-occurrence matrix of the weighted software network graph.

The probability co-occurrence matrix $PCO$ obtained in the above steps is converted into the pointwise mutual information matrix $PMI$ by the following formula:

$$PMI_{i,j} = \log\left(\frac{PCO_{i,j}}{\#(i) \times \#(j)}\right)$$

$\#(\cdot)$ represents the sum of the corresponding column of the probability co-occurrence matrix.

Negative values of $PMI$ are then set to 0 to obtain the positive pointwise mutual information matrix $PPMI$:

$$PPMI_{i,j} = \max(PMI_{i,j}, 0)$$

The positive pointwise mutual information matrix can ensure that the subsequent autoencoder model captures higher-order approximations.

IV.2. Each row of the positive pointwise mutual information matrix is taken as the input for one node and fed into an autoencoder for model training; the autoencoder consists of an encoder and a decoder.

The loss function $loss$ of the autoencoder is:

$$loss = \sum_i L\left(x^{(i)}, g_{\theta_2}\left(f_{\theta_1}\left(x^{(i)}\right)\right)\right)$$

wherein $L$ is the sample-wise loss, $x^{(i)}$ represents the $i$-th row vector of $PPMI$, $g_{\theta_2}$ is the activation function of the decoder, and $f_{\theta_1}$ is the activation function of the encoder.
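To make the structure of this loss concrete, the sketch below evaluates it for a single-layer encoder and a single-layer decoder. It is a forward pass only, with no training loop, and it makes assumptions the source leaves open: the sample-wise loss is taken to be squared error, both layers use a sigmoid activation, and the parameter layout (a weight matrix and a bias vector per layer) and function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def reconstruct(x, encoder, decoder):
    """Encode x to a hidden vector, then decode it back to the input size."""
    hidden = [sigmoid(z) for z in affine(encoder[0], encoder[1], x)]   # encoder output
    return [sigmoid(z) for z in affine(decoder[0], decoder[1], hidden)]  # decoder output


def reconstruction_loss(rows, encoder, decoder):
    """Sum over all rows of the squared error between a row and its reconstruction."""
    return sum(
        sum((xi - yi) ** 2 for xi, yi in zip(x, reconstruct(x, encoder, decoder)))
        for x in rows
    )
```

After training, the hidden vector of each row would serve as the network metric element representation of the corresponding software module.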

By autonomously learning the rich inter-module structural and semantic information in the weighted software network graph, network metric elements are obtained that better reflect the data dependencies, call dependencies, and collaborative-developer dependencies between software modules.

In this embodiment, because the network metric elements obtained by graph embedding correctly represent the dependency relationships of the software modules in the weighted software network graph, the associations between modules can be put to good use in finding defects.

And V, constructing a workload perception defect prediction method.

V.1, extracting a code measurement element and a process measurement element of each software module based on source codes and code submission information in a version control warehouse, wherein the specific process is as follows:

Based on the source code in the version control repository collected in step I, the code metric elements of each software module are extracted with the Understand tool; the code metric elements include information such as the number of lines of code, the number of functions in a class, and the number of comment lines.

Based on the code submission information (such as the submitter, the modified files, and the diff) in the version control repository collected in step I, the process metric elements of each software module are extracted; the process metric elements include information such as the number of added lines, the number of deleted lines, the number of modified lines, and the number of developers.

The code metric elements, the process metric elements, and the network metric elements obtained in step IV are combined to form the feature data of the defect data set; the feature data are then combined with the defect label data obtained in step II to form the software defect data set.

V.2. Because each type of metric element has a different predictive effect on defects, this embodiment forms separate training sets from each single type of metric element in the defect data set and from every combination of at least two types (all pairwise combinations and the combination of all three types), and trains a classifier on each training set to form the base learners.
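
The seven training sets (three single types, three pairs, one triple) can be enumerated as in the sketch below. The dictionary layout of a module's features, the toy values, and the function names are hypothetical and serve only to illustrate the combination step.

```python
from itertools import combinations

METRIC_TYPES = ("code", "process", "network")

def metric_combinations(types=METRIC_TYPES):
    """All non-empty combinations of metric types: 3 + 3 + 1 = 7."""
    combos = []
    for size in range(1, len(types) + 1):
        combos.extend(combinations(types, size))
    return combos

def build_training_set(module_features, combo):
    """Concatenate the selected metric groups of every module into one vector."""
    return [[value for kind in combo for value in features[kind]]
            for features in module_features]

# Hypothetical toy data: one dict of metric groups per software module.
modules = [
    {"code": [120, 8, 30], "process": [15, 4, 2], "network": [0.31, 0.77]},
    {"code": [45, 3, 5], "process": [2, 0, 1], "network": [0.12, 0.05]},
]

# One training set per combination; each would be used to fit one base learner.
training_sets = {combo: build_training_set(modules, combo)
                 for combo in metric_combinations()}
```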

In this embodiment, the classifier is preferably a random forest classifier; other classifiers such as Bayes or logistic regression may of course also be used. Taking the random forest classifier as an example, the defect probability of each software module is calculated as follows:

$\mathrm{Probability}(x^{i}) = \sum_{k=1}^{T} a_k \, \mathrm{Probability}_{c_k}(x^{i})$;

where $\mathrm{Probability}_{c_k}(x^{i})$ denotes the defect probability given by the k-th random forest classifier for the i-th software module, and $c_k$ denotes a base learner.

The set of base learners is $C = \{C_{code}, C_{process}, C_{network}, C_{code+process}, C_{process+network}, C_{code+network}, C_{code+process+network}\}$;

where $C_{code}$ denotes the base classifier based on code metric elements; $C_{process}$ denotes the base classifier based on process metric elements; $C_{network}$ denotes the base classifier based on the extracted weighted network metric elements; $C_{code+process}$ denotes the base classifier based on code and process metric elements; $C_{process+network}$ denotes the base classifier based on process and network metric elements; $C_{code+network}$ denotes the base classifier based on code and network metric elements; and $C_{code+process+network}$ denotes the base classifier based on code, process, and network metric elements.
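
The weighted combination of the base learners can be illustrated as follows. The sketch takes the per-learner defect probabilities as given numbers rather than training real classifiers; the function name, the data layout `base_probs[k][i]`, the uniform weights, and the toy values are assumptions.

```python
def combined_probability(base_probs, weights):
    """Probability(x_i) = sum over k of a_k * Probability_ck(x_i).

    base_probs[k][i] is the defect probability that base learner k assigns
    to software module i; weights[k] is the coefficient a_k.
    """
    if len(base_probs) != len(weights):
        raise ValueError("one weight is required per base learner")
    n_modules = len(base_probs[0])
    return [sum(a * probs[i] for a, probs in zip(weights, base_probs))
            for i in range(n_modules)]

# Hypothetical outputs of the seven base learners for two modules.
base_probs = [
    [0.9, 0.1],  # C_code
    [0.8, 0.2],  # C_process
    [0.7, 0.3],  # C_network
    [0.9, 0.2],  # C_code+process
    [0.8, 0.1],  # C_process+network
    [0.7, 0.2],  # C_code+network
    [0.8, 0.3],  # C_code+process+network
]
weights = [1.0 / 7] * 7  # assumed starting point before the weights are tuned
probability = combined_probability(base_probs, weights)
```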

V.3. With the goal of finding the most defects with the least workload, an optimization algorithm is used to optimize the combination of the base learners, forming the workload-aware defect prediction method. The defect probability of each software module is obtained, and from it the defect inspection order of the software modules. Based on this inspection order, a test resource allocation scheme can be provided that lets a software development company detect the most defects with the least workload, thereby improving software quality and guiding the development of software projects.

Based on the defect probability $\mathrm{Probability}(x^{i})$ obtained for each software module, the software modules in the software system are sorted in descending order of probability, which gives the order in which the code is to be inspected.

If software modules have equal defect probabilities, the module with fewer lines of code is ranked first.
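
The ordering rule, including the tie-break on lines of code, can be sketched as a single sort. The module records, field names, and file names below are hypothetical.

```python
def inspection_order(modules):
    """Sort by defect probability (descending); ties go to fewer lines of code."""
    return sorted(modules, key=lambda m: (-m["probability"], m["loc"]))

# Hypothetical modules with predicted defect probability and lines of code.
modules = [
    {"name": "A.java", "probability": 0.80, "loc": 400},
    {"name": "B.java", "probability": 0.80, "loc": 150},
    {"name": "C.java", "probability": 0.35, "loc": 90},
    {"name": "D.java", "probability": 0.95, "loc": 700},
]
order = [m["name"] for m in inspection_order(modules)]
# order == ["D.java", "B.java", "A.java", "C.java"]
```

A.java and B.java share the same probability, so the shorter B.java is inspected first.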

The values of $a_k$ are adjusted to optimize cost, yielding the optimal combined strong classifier, that is, the workload-aware defect prediction method, which can find the most defects with the least workload.
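
As an illustration of tuning $a_k$, the sketch below uses a plain particle swarm search, the optimizer named in claim 9, to maximize the number of defective modules reached within 20% of the total lines of code. The exact cost function is not reproduced in this passage, so the objective `defects_found`, the rule of stopping at the first module that exceeds the budget, the swarm constants (0.7, 1.5, 1.5), and the toy data are all assumptions.

```python
import random

def defects_found(weights, base_probs, loc, defective, budget=0.2):
    """Defective modules reached when inspecting in ranked order until
    `budget` (a fraction) of the total lines of code has been used."""
    n = len(loc)
    score = [sum(a * p[i] for a, p in zip(weights, base_probs)) for i in range(n)]
    ranked = sorted(range(n), key=lambda i: (-score[i], loc[i]))
    limit, used, found = budget * sum(loc), 0, 0
    for i in ranked:
        if used + loc[i] > limit:  # assumed: stop at the first module over budget
            break
        used += loc[i]
        found += defective[i]
    return found

def pso_weights(base_probs, loc, defective, particles=20, iters=50, seed=0):
    """Plain particle swarm search over weights a_k in [0, 1]."""
    rng = random.Random(seed)
    dim = len(base_probs)

    def fitness(w):
        return defects_found(w, base_probs, loc, defective)

    pos = [[rng.random() for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    best_pos = [p[:] for p in pos]
    best_fit = [fitness(p) for p in pos]
    g = max(range(particles), key=lambda i: best_fit[i])
    g_pos, g_fit = best_pos[g][:], best_fit[g]
    for _ in range(iters):
        for i in range(particles):
            for d in range(dim):
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * rng.random() * (best_pos[i][d] - pos[i][d])
                             + 1.5 * rng.random() * (g_pos[d] - pos[i][d]))
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i])
            if fit > best_fit[i]:
                best_pos[i], best_fit[i] = pos[i][:], fit
                if fit > g_fit:
                    g_pos, g_fit = pos[i][:], fit
    return g_pos, g_fit

# Hypothetical data: two base learners, five modules.
base_probs = [[0.9, 0.8, 0.7, 0.2, 0.6],
              [0.1, 0.2, 0.9, 0.8, 0.3]]
loc = [100, 50, 300, 50, 500]
defective = [1, 1, 0, 0, 1]
best_weights, best_found = pso_weights(base_probs, loc, defective)
```

Because the objective is a step function of the weights, a population-based search such as this one is a natural fit; gradient-based tuning would see zero gradient almost everywhere.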

In the embodiment of the invention, the steps work together, and their cooperation is as follows:

First, the data collection and fusion in step I provides the data foundation for the invention. Second, the defect module labeling method of step II provides high-quality label data for constructing the workload-aware defect prediction method based on the weighted software network. Third, the construction of the weighted software network graph in step III and the graph-embedding-based representation of the software network metric elements in step IV provide high-quality network metric representations for that method. Finally, the construction of the workload-aware defect prediction method in step V provides a strong guarantee for defect prediction and its application in practice.

It should be understood, however, that the description herein of specific embodiments is not intended to limit the invention to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.

Claims

1. A workload perception defect prediction method based on a weighted software network is characterized by comprising the following steps:

I. collecting and fusing data;

II. marking defect modules;

III. constructing a weighted software network graph;

establishing data dependence and call dependence associations among software modules, and at the same time establishing collaborative-developer associations among the software modules by using the developers that the software modules have in common;

V. constructing a workload-aware defect prediction method;

2. The weighted software network-based workload-aware defect prediction method of claim 1,

in step I, the code submission data are searched for keywords containing defect report numbers; if a code submission contains the number of a defect report, a link is established between that code submission and the defect report, and the code submission data and the defect report data are fused.

3. The weighted software network-based workload-aware defect prediction method of claim 1,

in step II.1, the text similarity feature extraction process is as follows:

in step II.1, the personnel consistency feature extraction process is as follows:

in step II.1, the timestamp feature extraction process is as follows:

4. The weighted software network-based workload-aware defect prediction method of claim 1,

in step II.3, the software modules modified between the defect-introducing code submission identified in step II.2 and the defect-repairing code submission identified in step II.1 are marked as defective modules, and the other software modules in the same version are marked as non-defective modules.

5. The weighted software network-based workload-aware defect prediction method of claim 1,

in step III.2, the correlation strength between software modules is calculated as follows:

Degree_Relation_Strength(A,B)=

|develop(·)| denotes the number of developers of the software module;

6. The weighted software network-based workload-aware defect prediction method of claim 1,

in step IV.1, the process of generating the probability co-occurrence matrix PCO of the weighted software network graph is as follows:

7. The weighted software network-based workload-aware defect prediction method of claim 6,

in step IV.1, the probability co-occurrence matrix PCO is converted into the pointwise mutual information matrix PMI by the following formula:

$PMI_{i,j} = \log\left(\frac{PCO_{i,j}}{\#(i) \times \#(j)}\right)$;

$PPMI_{i,j} = \max(PMI_{i,j}, 0)$;

8. The weighted software network-based workload-aware defect prediction method of claim 1,

in step V.1, the extraction process of the code metric elements and the process metric elements is as follows:

9. The weighted software network-based workload-aware defect prediction method of claim 1,

in the step V.2, a random forest classifier is adopted as the classifier;

$\mathrm{Probability}(x^{i}) = \sum_{k=1}^{T} a_k \, \mathrm{Probability}_{c_k}(x^{i})$;

T denotes the number of random forest classifiers, T = 7, and $a_k$ is the parameter of the k-th random forest classifier;

to determine the optimal $a_k$, a particle swarm optimization algorithm is used to optimize $a_k$ such that the most defects are found when inspecting 20% of the total code of a software project; the optimization function is expressed as: