CN112711543A - Workload perception defect prediction method based on weighted software network - Google Patents

Publication number
CN112711543A
Authority
CN
China
Prior art keywords
software
defect
code
weighted
software module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110323054.1A
Other languages
Chinese (zh)
Other versions
CN112711543B (en)
Inventor
宫丽娜
周宇
宫宜辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202110323054.1A priority Critical patent/CN112711543B/en
Publication of CN112711543A publication Critical patent/CN112711543A/en
Application granted granted Critical
Publication of CN112711543B publication Critical patent/CN112711543B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3608 Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention belongs to the field of software defect prediction and discloses a workload perception defect prediction method based on a weighted software network. The method designs a calculation of the association strength between software modules from two kinds of association, namely dependencies between software modules and collaborating developers, and uses it to construct an effective weighted software network structure. Graph embedding is then used to learn the feature representations of the software modules in the weighted software network graph autonomously, so that the data dependencies, call dependencies, and collaborative-developer relationships between software modules are better reflected. The construction of the defect prediction method also takes into account the effort of inspecting code to find defects, which matches the practical needs of software development and makes software defects easier to find accurately.

Description

Workload perception defect prediction method based on weighted software network
Technical Field
The invention belongs to the field of software defect prediction, and relates to a workload perception defect prediction method based on a weighted software network.
Background
Defects inevitably arise during software development and can cause significant economic loss. Finding software defects quickly and accurately is therefore crucial to ensuring the quality of a software system.
Software defect prediction techniques aim to identify high-risk defective modules and narrow the scope of the code that developers must review and test, so that limited resources are allocated reasonably. Defect prediction based on metric information is the most common approach; the metric information mainly comprises manually designed metric elements and metric elements learned autonomously from abstract syntax trees.
However, neither the features learned from abstract syntax trees nor the manually designed metric elements sufficiently represent the semantic information of source code, in particular the rich dependency relationships (such as data dependencies and call dependencies) between software modules. As a result, these metric elements perform less than ideally when used for software defect prediction in real software engineering projects.
Patent document 1 discloses a software defect prediction method based on a graph convolutional neural network, which predicts the defect type of an input code file using a model trained with the GCN algorithm. Its principle is as follows:
files in the source code are first associated by constructing abstract syntax trees; the Apriori association algorithm is then used to relate files in the code through which defects may propagate; finally, the source file features and the associations are input into a GCN model for training.
However, the software network graph constructed in that patent document is based mainly on abstract syntax trees and on relationships between feature vectors mined by the association algorithm, and does not consider the dependencies between modules or the collaborative-developer relationships.
Patent document 2 discloses a software defect prediction method based on a module dependency graph. Its principle is as follows:
a software module dependency graph is established from the dependency relationships among software modules, with developers also taken as nodes of the graph; network representation learning is used to extract the dependency features in the software module dependency graph, and a defect prediction model is built on the module dependency graph.
However, the software network graph constructed in that patent document does not take into account the influence of inter-module dependency strength on defect identification, nor does it take workload perception into account when the defect prediction model is evaluated.
In summary, the prior art provides a good research basis for dependency-based defect prediction. However, the defect prediction capability of current software network metric elements has not been fully exploited, mainly in the following respects:
1. when the software network is constructed, the influence of the association strength between modules on defect identification is not considered;
2. because a workload perception component is lacking, the resulting defect classification still requires a great deal of time to be spent inspecting code, so the approach is hard to apply in practice.
Software defects seriously affect software quality and can cause serious economic loss. Software modules contain rich semantic and structural information, and these semantic and structural relationships influence how defects propagate.
The present method fully considers the association strength between modules and explores the influence of the structural and semantic information between software modules on workload perception defect prediction. This helps provide a more effective workload perception defect prediction model and a reasonable allocation of test resources.
Patent documents
Patent document 1: Chinese patent application publication No. CN110888798A, publication date: 2020.03.17;
Patent document 2: Chinese patent application publication No. CN111209211A, publication date: 2020.05.29.
Disclosure of Invention
The invention aims to provide a workload perception defect prediction method based on a weighted software network. The method considers the influence of the association strength between modules on defect identification when constructing the weighted software network graph, and considers the effort of inspecting code to find defects when constructing the defect prediction method. It thereby matches the practical needs of software development and makes it easier to find software defects quickly and accurately.
In order to achieve the purpose, the invention adopts the following technical scheme:
a workload perception defect prediction method based on a weighted software network comprises the following steps:
I. collecting and fusing data;
collecting software system data and the defect reports of a defect tracking system; the collected software system data comprises the source code, code submissions, and version information in a version control repository;
establishing a connection between code submissions and defect reports, so that the code submission data and the defect report data are fused;
II. marking defective modules;
II.1. extracting text similarity features, personnel consistency features, and timestamp features of the code submissions and defect reports;
based on the three types of extracted features and the connections between code submissions and defect reports established in step I, finding the missing connections between code submissions and defect reports, and determining the code submissions that repair defects;
II.2. identifying the code submissions that introduce defects, based on the code submissions that repair defects;
II.3. based on the defect-repairing code submissions of step II.1 and the defect-introducing code submissions of step II.2, marking each software module in combination with the version information in the version control repository, to obtain the defect label data of each software module;
III. constructing a weighted software network graph;
III.1. extracting and aggregating the comments in the source code of each software module of the software system to obtain the document information of each software module, and obtaining the developer information of each software module from the submitter and modified-file information of the code submissions;
establishing data dependency and call dependency associations among the software modules, and establishing collaborative-developer associations among the software modules from the developers they have in common;
if any one of a data dependency, a call dependency, or a collaborative-developer association exists between two software modules, the two software modules are associated, and a software network graph with the following structure is constructed:
each software module is a node of the software network graph, and an edge is established between the nodes representing two associated software modules;
III.2. calculating, from the document information and developer information of each software module, the association strength between the two software modules connected by each edge of the software network graph, and assigning the association strength value as the weight of that edge;
normalizing the weights of the edges that connect one software module to the software modules it depends on, to obtain the weighted software network graph;
IV. obtaining a software network metric element representation of the weighted software network graph based on graph embedding;
IV.1. converting the weighted software network graph of step III into a probability co-occurrence matrix, converting the probability co-occurrence matrix into a pointwise mutual information matrix, and adjusting the negative values of that matrix to obtain a positive pointwise mutual information matrix;
IV.2. taking each row of the positive pointwise mutual information matrix as the input for a node, training an autoencoder model, and autonomously learning the rich inter-module structural and semantic information in the weighted software network graph, to obtain the network metric element representation of the weighted software network graph;
V. constructing the workload perception defect prediction method;
V.1. extracting the code metric elements and process metric elements of each software module from the source code and code submission information in the version control repository, and combining them with the network metric elements obtained in step IV to form the feature data of a defect data set;
combining the feature data with the defect label data obtained in step II to form a software defect data set;
V.2. forming different training sets from each type of metric element in the defect data set and from each combination of at least two types of metric elements, and training a classifier on each training set to form the base learners;
V.3. optimizing the combination of the base learners with an optimization algorithm, with the objective of finding the most defects with the least workload, to form the workload perception defect prediction method, obtaining the defect probability of each software module, and from it the order in which the software modules are inspected for defects.
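As an illustration of step V.2, the following minimal sketch enumerates the training-set feature groups, assuming the three metric element types of step V.1 (code, process, network). The string labels and the function name `metric_combinations` are hypothetical. With three types, the single types plus every combination of two or more types give seven groups, which is consistent with the value T = 7 stated for the number of classifiers.

```python
from itertools import combinations

# The three metric-element types named in step V.1; the string labels are illustrative.
METRIC_TYPES = ("code", "process", "network")


def metric_combinations(types=METRIC_TYPES):
    """Return every single metric type plus every combination of two or more types."""
    combos = []
    for size in range(1, len(types) + 1):
        combos.extend(combinations(types, size))
    return combos


if __name__ == "__main__":
    for combo in metric_combinations():
        print("+".join(combo))
```

Each tuple would select the feature columns of one training set, and one classifier would be trained per tuple.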
Preferably, in step I, the code submissions are searched for the number keywords of defect reports; if a code submission contains the number of a defect report, a connection is established between that code submission and the defect report, so that the code submission data and the defect report data are fused.
Preferably, in step II.1, the text similarity features are extracted as follows:
based on the code submission and defect report data collected and fused in step I, the text information of the code submissions and defect reports is processed by word segmentation and TF-IDF to obtain the text similarity features between code submissions and defect reports;
in step II.1, the personnel consistency features are extracted as follows:
the personnel consistency features are extracted from the submitter information of the code submissions collected in step I and the repairer information of the defect reports;
in step II.1, the timestamp features are extracted as follows:
the timestamp features are extracted from the submission time of the code submissions collected in step I and the repair time of the defect reports.
Preferably, in step II.3, the software modules modified between the defect-introducing code submission of step II.2 and the defect-repairing code submission of step II.1 are marked as defective, and all other software modules are marked as non-defective.
Preferably, in step III.2, the association strength between software modules is calculated as follows:
$$\mathrm{Degree\_Relation\_Strength}(A,B) = \mathrm{Similarity}(\mathrm{document}(A), \mathrm{document}(B)) + \frac{|\mathrm{develop}(A) \cap \mathrm{develop}(B)|}{|\mathrm{develop}(A)| + |\mathrm{develop}(B)|}$$
where $A$ and $B$ denote two associated software modules;
$\mathrm{Degree\_Relation\_Strength}(A,B)$ denotes the association strength between the two software modules;
$\mathrm{document}(\cdot)$ denotes the document information of a software module, and $\mathrm{develop}(\cdot)$ denotes the developer information of a software module;
$\mathrm{Similarity}(\mathrm{document}(A), \mathrm{document}(B))$ denotes the text similarity value of software module $A$ and software module $B$;
$|\mathrm{develop}(\cdot)|$ denotes the number of developers of a software module;
$|\mathrm{develop}(A) \cap \mathrm{develop}(B)|$ denotes the number of developers that software module $A$ and software module $B$ have in common.
Preferably, in step IV.1, the probability co-occurrence matrix $PCO$ of the weighted software network graph is generated as follows:
first, an adjacency matrix $A$ is constructed; it represents the transition probabilities between different nodes of the weighted software network graph and is referred to as the transition matrix $A$. The transition matrix $A$ is an $n \times n$ matrix, where $n$ is the number of nodes, i.e. the number of software modules in the software system;
$A_{ij}$ denotes an element of the transition matrix $A$, where $i$ and $j$ denote different nodes of the weighted software network graph;
if there is an edge between node $i$ and node $j$, $A_{ij}$ is set to the weight of that edge; otherwise $A_{ij}$ is set to 0;
a row vector $p_k = (p_{k1}, p_{k2}, \ldots, p_{kj}, \ldots, p_{kn})$ is introduced for each node, where $p_{kj}$ denotes the probability of reaching node $j$ after $k$ transition steps; the row vector $p_k$ is calculated as:
$$p_k = (1-a)\, p_{k-1} A + p_0$$
where $a$ denotes the probability of returning to the original node, and $p_0$ is initialized as a one-hot encoding; from the row vectors $p_k$, the vector representation of each node in the weighted software network graph is obtained as $r = \sum_{k=1}^{K} p_k$, where $K$ is the total number of steps;
the vector representations $r$ of all nodes are combined to generate the probability co-occurrence matrix $PCO$ of the weighted software network graph.
Preferably, in step IV.1, the probability co-occurrence matrix $PCO$ is converted into the pointwise mutual information matrix $PMI$ as follows:
$$PMI_{i,j} = \log\left(\frac{PCO_{i,j}}{\#(i) \times \#(j)}\right)$$
where $PMI_{i,j}$ is the element in row $i$ and column $j$ of the pointwise mutual information matrix $PMI$;
$PCO_{i,j}$ is the element in row $i$ and column $j$ of the probability co-occurrence matrix $PCO$;
$\#(\cdot)$ denotes the sum of the corresponding column of the probability co-occurrence matrix;
the negative values of the pointwise mutual information matrix are adjusted to obtain the positive pointwise mutual information matrix $PPMI$ as follows:
$$PPMI_{i,j} = \max(PMI_{i,j}, 0)$$
where $PPMI_{i,j}$ is the element in row $i$ and column $j$ of the positive pointwise mutual information matrix $PPMI$;
$\max(PMI_{i,j}, 0)$ means that if $PMI_{i,j}$ is negative, $PPMI_{i,j}$ is set to 0.
Preferably, in step V.1, the code metric elements and process metric elements are extracted as follows:
based on the source code in the version control repository collected in step I, the code metric elements of each software module are extracted with the Understand tool; the code metric elements include the number of lines of code, the number of functions per class, and the number of comment lines;
based on the code submission information in the version control repository collected in step I, the process metric elements of each software module are extracted; the process metric elements include the number of added lines, the number of deleted lines, the number of modified lines, and the number of developers.
Preferably, in step V.2, the classifiers are random forest classifiers;
the defect probability of each software module is calculated from the random forest classifiers as follows:
$$\mathrm{Probability}(x_i) = \sum_{k=1}^{T} a_k\, \mathrm{Probability}_{c_k}(x_i)$$
where $x_i$ is the $i$-th software module and $\mathrm{Probability}(x_i)$ is the defect probability of software module $i$;
$T$ denotes the number of random forest classifiers, $T = 7$; $a_k$ is the parameter of the $k$-th random forest classifier;
$\mathrm{Probability}_{c_k}(x_i)$ is the probability given by the $k$-th random forest classifier for the $i$-th software module; $c_k$ denotes a base learner;
the software modules in the software system are sorted in descending order of their defect probability $\mathrm{Probability}(x_i)$ to obtain the order in which the code is inspected;
to determine the optimal $a_k$, particle swarm optimization is used to optimize $a_k$ so that the most defects are found when inspecting 20% of the total lines of code of a software project; the optimization function is:
$$cost = \frac{|y_{bug}(x_i)|}{|y(x_i)|}, \quad i \le k, \quad \sum_{i=1}^{k} loc_i < 0.2 \times \sum_{i=1}^{n} loc_i$$
where $cost$ denotes the optimization function, $|y_{bug}(x_i)|$ denotes the number of defective modules found, and $|y(x_i)|$ denotes the number of inspected software modules; $loc_i$ denotes the number of lines of code of the $i$-th software module;
the value of $a_k$ is adjusted with respect to $cost$ to obtain the optimal combined strong classifier, i.e. the workload perception defect prediction method.
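The weighted combination and the optimization function can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the per-learner probabilities are taken as given inputs, two base learners are used instead of seven for brevity, the function names are hypothetical, and the particle swarm search over the weights $a_k$ is omitted. Reading $cost$ as a quantity to be made as large as possible is also an assumption drawn from the stated goal of finding the most defects.

```python
def combined_probability(base_probs, weights):
    """Probability(x_i) = sum_k a_k * Probability_ck(x_i), one row of base_probs per module."""
    return [sum(a * p for a, p in zip(weights, probs)) for probs in base_probs]


def inspection_cost(scores, loc, is_buggy, budget=0.2):
    """Share of defective modules among those inspected within the LOC budget.

    Modules are inspected in descending score order while the cumulative
    lines of code stay strictly below budget * total lines of code.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    limit = budget * sum(loc)
    used = inspected = found = 0
    for i in order:
        if used + loc[i] >= limit:
            break  # the next module would reach or exceed the budget
        used += loc[i]
        inspected += 1
        found += is_buggy[i]
    return found / inspected if inspected else 0.0


# Illustrative data: 4 modules, 2 base learners (the text uses T = 7).
base_probs = [[0.9, 0.7], [0.2, 0.4], [0.6, 0.8], [0.1, 0.1]]
weights = [0.5, 0.5]
scores = combined_probability(base_probs, weights)
cost = inspection_cost(scores, loc=[10, 50, 5, 35], is_buggy=[1, 0, 0, 1])
print(scores, cost)
```

An optimizer such as particle swarm optimization would call `inspection_cost` repeatedly with different candidate `weights` and keep the best-scoring vector.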
The invention has the following advantages:
as described above, the invention relates to a workload perception defect prediction method based on a weighted software network. It designs a method for calculating the association strength between software modules from two kinds of association, namely dependencies between software modules and collaborating developers, and constructs an effective weighted software network structure. Graph embedding is used to learn the feature representations of the software modules in the weighted software network graph autonomously, so that the data dependencies, call dependencies, and collaborative-developer relationships between software modules are better reflected. The effort of inspecting code to find defects is also considered in the construction of the defect prediction method, which matches the practical needs of software development and makes it easier to find software defects quickly and accurately, thereby greatly reducing the time and cost of software development.
Drawings
FIG. 1 is a flow diagram of a workload-aware defect prediction method based on a weighted software network according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of the weighted software network graph constructed in an embodiment of the present invention.
Detailed Description
The basic idea of the invention is as follows: a workload perception defect prediction method based on a weighted software network is provided, in which the association strength between modules is expressed according to the dependency types and dependency counts between modules and the contributions of developers to the modules. On this basis a more effective workload perception defect prediction method (model) is built, which yields the defect probability of each software module and, from it, the order in which the software modules are inspected. This gives a software development company a test resource allocation scheme that detects the most defects with the least workload, improving software quality and guiding the development of software projects.
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Examples
As shown in FIG. 1, the workload-aware defect prediction method based on the weighted software network includes the following steps:
I. Collecting and fusing data.
Software system data is collected, together with defect reports from a defect tracking system such as Bugzilla or JIRA. The collected software system data includes the source code, code submissions, and version information in a version control repository (e.g., Git or SVN).
The code submission data includes the title, description, discussion, modified file information, submitter, and submission time. The defect report data includes the number, title, description, discussion, repairer, and repair time.
In this embodiment, the defect reports taken from the defect tracking system are those whose defects have already been repaired.
A connection is established between code submissions and defect reports, so that the code submission data and the defect report data are fused.
The connection between a code submission and a defect report is established as follows: the code submission data (title, description, and discussion) is searched for the number keywords of defect reports; if the code submission data contains the number of a defect report, a connection is established between that code submission and the defect report, which fuses the code submission data with the defect report data.
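The keyword matching described above can be sketched as follows. This is a minimal illustration: the two identifier patterns (JIRA-style keys such as `LANG-42` and Bugzilla-style mentions such as `bug #1007`), the dictionary field names, and the function names are assumptions, since the text does not specify the matching rules.

```python
import re

# Assumed identifier formats; real projects may use other conventions.
JIRA_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")
BUGZILLA_ID = re.compile(r"\bbug\s*#?\s*(\d+)\b", re.IGNORECASE)


def referenced_bug_ids(commit):
    """Collect defect report numbers mentioned in a commit's title, description, or discussion."""
    text = " ".join(commit.get(field, "") for field in ("title", "description", "discussion"))
    ids = set(JIRA_KEY.findall(text))
    ids.update(BUGZILLA_ID.findall(text))
    return ids


def link_commits(commits, bug_reports):
    """Return (commit sha, report id) pairs for commits that cite a known defect report."""
    known = {str(report["id"]) for report in bug_reports}
    return [
        (commit["sha"], bug_id)
        for commit in commits
        for bug_id in sorted(referenced_bug_ids(commit) & known)
    ]


commits = [
    {"sha": "a1", "title": "Fix NPE, see LANG-42"},
    {"sha": "b2", "title": "Refactor", "description": "cleanup"},
    {"sha": "c3", "title": "fixes bug #1007"},
]
reports = [{"id": "LANG-42"}, {"id": 1007}]
links = link_commits(commits, reports)
print(links)
```

Commits that cite no known report number stay unlinked; those are the candidates for the missing-connection recovery of step II.1.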
II. Marking defective modules.
II.1. The text similarity features, personnel consistency features, and timestamp features of the code submissions and defect reports are extracted.
The specific extraction process is as follows:
based on the code submission and defect report data collected and fused in step I, the text information (title, description, and discussion) of the code submissions and defect reports is processed to obtain the text similarity features between code submissions and defect reports;
this text processing is implemented, for example, with word segmentation and TF-IDF.
The submitter of a code submission collected in step I is compared with the repairer of the defect report to check whether they are the same person, which gives the personnel consistency feature. The difference between the submission time of the code submission collected in step I and the repair time of the defect report gives the timestamp feature.
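The three features can be sketched for a single pair of code submission and defect report as follows. This is a minimal sketch using only the standard library: the field names, the simple alphanumeric tokenizer, the smoothed IDF variant, and the use of cosine similarity are assumptions, and in practice the IDF weights would be fitted on the whole corpus rather than on two texts.

```python
import math
import re
from collections import Counter
from datetime import datetime


def tokenize(text):
    """Lower-case word segmentation on alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n = len(docs)
    # Smoothed IDF keeps terms that occur in every document at a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def link_features(commit, report):
    """Text similarity, personnel consistency, and time difference for one pair."""
    commit_vec, report_vec = tfidf_vectors([commit["text"], report["text"]])
    same_person = int(commit["author"].strip().lower() == report["fixer"].strip().lower())
    hours_apart = abs((report["fixed_at"] - commit["committed_at"]).total_seconds()) / 3600.0
    return {
        "text_similarity": cosine(commit_vec, report_vec),
        "same_person": same_person,
        "hours_apart": hours_apart,
    }


commit = {"text": "Fix null pointer in parser", "author": "Alice",
          "committed_at": datetime(2020, 1, 1, 10, 0)}
report = {"text": "Null pointer exception in parser", "fixer": "alice",
          "fixed_at": datetime(2020, 1, 1, 12, 0)}
features = link_features(commit, report)
print(features)
```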
The extracted text similarity, personnel consistency, and timestamp features are combined with the established connections between code submissions and defect reports to find the missing connections, and thus to determine the code submissions that repair defects.
This embodiment preferably uses the PU (positive and unlabeled) learning algorithm to find the missing connections between code submissions and defect reports. The PU learning algorithm is a known algorithm and is not described here.
II.2. Based on the code submissions that repair defects, an improved SZZ algorithm is used to identify the code submissions that introduce defects. Identifying defect-introducing code submissions with the improved SZZ algorithm is a known process and is not described in detail here.
II.3. Based on the defect-repairing code submissions of step II.1 and the defect-introducing code submissions of step II.2, each software module is marked in combination with the version information in the version control repository, to obtain the defect label data of each software module.
The specific marking process is as follows:
the software modules modified between the defect-introducing code submission of step II.2 and the defect-repairing code submission of step II.1 are marked as defective, and the other software modules in the same version are marked as non-defective.
This marking process yields the defect labels of the software modules in the corresponding version.
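One possible reading of this marking rule is sketched below. It assumes, as an interpretation that the text does not spell out, that a module counts as defective in a given version when a defect affecting it was introduced at or before the version's release and repaired after it. The field names, the integer timestamps, and the function name `label_modules` are hypothetical.

```python
def label_modules(modules, defects, release_time):
    """Label each module of one version as defective (1) or non-defective (0).

    Assumption: a defect is present in the version if it was introduced at or
    before the release and repaired only afterwards.
    """
    defective = {
        d["module"]
        for d in defects
        if d["introduced_at"] <= release_time < d["fixed_at"]
    }
    return {module: int(module in defective) for module in modules}


# Illustrative data: times are abstract integer timestamps.
defects = [
    {"module": "A", "introduced_at": 5, "fixed_at": 20},
    {"module": "B", "introduced_at": 12, "fixed_at": 30},
]
labels = label_modules(["A", "B", "C"], defects, release_time=10)
print(labels)
```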
III. Constructing a weighted software network graph.
III.1. The comments in the source code of each software module of the software system are extracted and aggregated to obtain the document information of each software module, and the developer information of each software module is obtained from the submitter and modified-file information of the code submissions.
Data dependency and call dependency associations between software modules are established with the Understand tool, and collaborative-developer associations between software modules are established from the developers they have in common.
Establishing data dependency and call dependency associations between software modules with the Understand tool is a known process.
If any one of a data dependency, a call dependency, or a collaborative-developer association exists between two software modules, the two software modules are associated, and a software network graph with the following structure is constructed:
each software module is a node of the software network graph, and an edge is established between the nodes representing two associated software modules.
On this principle, a software network graph based on data dependencies, call dependencies, and collaborative-developer associations is constructed.
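The graph construction can be sketched as follows, assuming the dependency pairs have already been exported from a static analysis tool. Representing dependencies as directed `(source, target)` edges and collaborative-developer associations as a pair of edges in both directions is an assumption made for this sketch; the function name and the data layout are hypothetical.

```python
from itertools import combinations


def build_software_network(dependencies, developers):
    """Build nodes and directed edges from dependencies and shared developers.

    dependencies: iterable of (source, target) pairs for data or call dependencies.
    developers:   dict mapping module name -> set of developer names.
    """
    nodes = set(developers)
    edges = set()
    for src, dst in dependencies:
        nodes.update((src, dst))
        if src != dst:
            edges.add((src, dst))
    # Two modules with at least one common developer are associated in both directions.
    for a, b in combinations(sorted(developers), 2):
        if developers[a] & developers[b]:
            edges.update({(a, b), (b, a)})
    return nodes, edges


deps = [("B", "A"), ("B", "D")]
devs = {"A": {"li"}, "B": {"wang"}, "D": {"li", "zhou"}, "E": {"zhao"}}
nodes, edges = build_software_network(deps, devs)
print(sorted(nodes), sorted(edges))
```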
III.2. From the document information and developer information of each software module, the association strength between software modules is calculated, based on citation influence, with the following formula:
$$\mathrm{Degree\_Relation\_Strength}(A,B) = \mathrm{Similarity}(\mathrm{document}(A), \mathrm{document}(B)) + \frac{|\mathrm{develop}(A) \cap \mathrm{develop}(B)|}{|\mathrm{develop}(A)| + |\mathrm{develop}(B)|}$$
where $A$ and $B$ denote two associated software modules;
$\mathrm{Degree\_Relation\_Strength}(A,B)$ denotes the association strength between the two software modules;
$\mathrm{document}(\cdot)$ denotes the document information of a software module, and $\mathrm{develop}(\cdot)$ denotes the developer information of a software module;
$\mathrm{Similarity}(\mathrm{document}(A), \mathrm{document}(B))$ denotes the text similarity value of software module $A$ and software module $B$;
$|\mathrm{develop}(\cdot)|$ denotes the number of developers of a software module;
$|\mathrm{develop}(A) \cap \mathrm{develop}(B)|$ denotes the number of developers that software module $A$ and software module $B$ have in common.
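A direct transcription of the association strength formula is sketched below. The text similarity is passed in as a precomputed value because the text does not fix how it is computed; the function name and the example values are illustrative.

```python
def association_strength(doc_similarity, devs_a, devs_b):
    """Similarity(document(A), document(B)) + |shared developers| / (|devs A| + |devs B|)."""
    shared = len(devs_a & devs_b)
    total = len(devs_a) + len(devs_b)
    return doc_similarity + (shared / total if total else 0.0)


# Illustrative values: one shared developer out of 2 + 2 developers.
strength = association_strength(0.4, {"li", "wang"}, {"li", "zhou"})
print(strength)
```

The guard on `total` only avoids a division by zero for modules without any recorded developer; the formula itself does not address that case.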
The association strength value obtained from this calculation is assigned as the weight of the corresponding edge in the software network graph. The weights of the edges that connect one software module to the software modules it depends on are then normalized to obtain the weighted software network graph.
Taking FIG. 2 as an example, if software module B depends on software modules A, D, and E, the weights of edges BA, BD, and BE are normalized so that they sum to 1; the resulting weighted software network graph is shown in FIG. 2.
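The normalization in this example can be sketched as follows, assuming the edge weights are held in a dictionary keyed by `(source, target)` pairs. The raw strength values are made up for illustration; only the requirement that the weights of BA, BD, and BE sum to 1 comes from the text.

```python
def normalize_outgoing(weights):
    """Scale the weights of each source module's outgoing edges so that they sum to 1."""
    totals = {}
    for (src, _dst), w in weights.items():
        totals[src] = totals.get(src, 0.0) + w
    return {
        (src, dst): (w / totals[src] if totals[src] else 0.0)
        for (src, dst), w in weights.items()
    }


# Hypothetical raw association strengths for the edges BA, BD, and BE.
raw = {("B", "A"): 1.0, ("B", "D"): 0.6, ("B", "E"): 0.4}
normalized = normalize_outgoing(raw)
print(normalized)
```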
Because the association strength formula of this embodiment reflects the association strength between software modules, the constructed weighted software network graph conforms more closely to the relationships between the actual software modules.
IV. Obtaining a software network metric element representation of the weighted software network graph based on graph embedding.
IV.1. The weighted software network graph of step III is converted into a probability co-occurrence matrix using a random surfing model, and the probability co-occurrence matrix is then converted into a positive pointwise mutual information matrix over all nodes of the weighted software network graph.
The specific generation process of the probability co-occurrence matrix is as follows:
first construct the adjacency matrixAThe transition probability among different nodes in the weighted software network graph is represented as a transition matrixA(ii) a Wherein the transfer matrixAIs composed ofn×nThe matrix is a matrix of a plurality of matrices,nthe number of the nodes is the number of software modules in the software system;
A ij representing a transition matrixAThe element of (1), wherein,iandjrespectively representing different nodes in the weighted software network graph;
if nodeiAnd nodejThere is an edge in between, then willA ij The weight assigned to the edge, otherwise,A ij the value is assigned to 0.
A row vector p_k = (p_k1, p_k2, …, p_kj, …, p_kn) is introduced for each node, where p_kj denotes the probability of reaching node j after k transition steps. Taking into account that the walk may return to its original starting node, the row vector p_k is calculated as:
$p_k = (1-a) \times p_{k-1} A + p_0$
where a denotes the probability of returning to the original starting node;
p_0 is initialized as a one-hot encoding (i.e. a vector whose entry at the node's own index is 1 and whose other entries are 0).
From the resulting p_k values, the vector representation of each node in the weighted software network graph is obtained as $r = \sum_{k=1}^{K} p_k$, where K is the total number of steps. Stacking the vector representations r of all nodes yields the probability co-occurrence matrix of the weighted software network graph.
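The random surfing procedure above can be sketched in a few lines. This is a minimal illustration that applies the update rule exactly as printed in the text; the function name, the example matrix and the values chosen for a and K are assumptions made for the example.

```python
def random_surfing_pco(A, a, K):
    """Build the probability co-occurrence matrix row by row.

    A : n x n transition matrix (list of lists), A[i][j] = weight of edge i -> j
    a : probability of returning to the starting node
    K : total number of steps
    """
    n = len(A)
    pco = []
    for start in range(n):
        # p_0 is the one-hot vector of the starting node.
        p0 = [1.0 if j == start else 0.0 for j in range(n)]
        p_prev = p0
        r = [0.0] * n
        for _ in range(K):
            # Row vector times matrix: (p_{k-1} A)_j = sum_i p_{k-1,i} * A[i][j]
            stepped = [sum(p_prev[i] * A[i][j] for i in range(n)) for j in range(n)]
            # Update rule as printed: p_k = (1 - a) * p_{k-1} A + p_0
            p_k = [(1.0 - a) * stepped[j] + p0[j] for j in range(n)]
            r = [r[j] + p_k[j] for j in range(n)]  # r = sum of p_1 .. p_K
            p_prev = p_k
        pco.append(r)
    return pco


# Hypothetical 3-module graph whose outgoing weights are already normalized.
A = [
    [0.0, 0.6, 0.4],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
pco = random_surfing_pco(A, a=0.1, K=3)
```

Each row of `pco` is the vector representation r of one node, so the result has the same n×n shape as the transition matrix.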
The probability co-occurrence matrix PCO obtained in the above steps is converted into the pointwise mutual information matrix PMI by the following formula:
$PMI_{i,j} = \log\left(\frac{PCO_{i,j}}{\#(i) \times \#(j)}\right)$
where PMI_{i,j} is the element in row i, column j of the pointwise mutual information matrix PMI;
PCO_{i,j} is the element in row i, column j of the probability co-occurrence matrix PCO;
#(·) denotes the sum of the corresponding column of the probability co-occurrence matrix.
The negative values of the pointwise mutual information matrix are then adjusted to obtain the positive pointwise mutual information matrix PPMI by the following formula:
$PPMI_{i,j} = \max(PMI_{i,j}, 0)$
where PPMI_{i,j} is the element in row i, column j of the positive pointwise mutual information matrix PPMI;
max(PMI_{i,j}, 0) means that if PMI_{i,j} is negative, PPMI_{i,j} is set to 0.
The positive pointwise mutual information matrix ensures that the subsequent autoencoder model can capture higher-order approximations.
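The two conversions above can be illustrated with a short sketch. It follows the formulas as printed, reading #(i) and #(j) as the sums of columns i and j of PCO; the function name and the sample matrix are hypothetical, and treating log(0) as a negative value that is clipped to 0 is an assumption of this example.

```python
import math


def ppmi_from_pco(pco):
    """PPMI[i][j] = max(log(PCO[i][j] / (#(i) * #(j))), 0)."""
    n = len(pco)
    # #(.) is read here as the sum of the corresponding column of PCO.
    col_sum = [sum(pco[i][j] for i in range(n)) for j in range(n)]
    ppmi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = col_sum[i] * col_sum[j]
            # Assumption: zero entries (log undefined) are clipped to 0.
            if pco[i][j] > 0.0 and denom > 0.0:
                pmi = math.log(pco[i][j] / denom)
                ppmi[i][j] = max(pmi, 0.0)
    return ppmi


# Hypothetical 2 x 2 probability co-occurrence matrix.
pco = [[0.5, 0.1], [0.1, 0.5]]
ppmi = ppmi_from_pco(pco)
```

In this example the off-diagonal entries have a negative PMI and are therefore set to 0, while the diagonal entries keep their positive value.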
IV.2. Each row of the positive pointwise mutual information matrix is taken as the input of one node and fed into an autoencoder, and the autoencoder model is trained; the autoencoder consists of an encoder and a decoder.
The loss function of the autoencoder is:
$loss = \sum_i L\left(x^{(i)}, g_{\theta_2}\left(f_{\theta_1}\left(x^{(i)}\right)\right)\right)$
where L is the sample-wise loss, x^{(i)} denotes the i-th row vector of PPMI, g_{θ2} is the activation function of the decoder, and f_{θ1} is the activation function of the encoder.
By autonomously learning the rich structural and semantic information between modules in the weighted software network graph, network metric elements are obtained that better reflect the data dependencies, call dependencies and collaborative-developer dependencies between software modules.
In this embodiment, because the network metric elements obtained by graph embedding encode the dependency relationships between the software modules in the weighted software network graph, the associations between modules can be exploited to find defects.
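To make the loss function concrete, the following is a minimal sketch of an autoencoder trained on PPMI rows. It is not the patent's implementation: a single hidden layer, a sigmoid encoder, a linear decoder, a squared-error sample-wise loss and plain per-sample gradient descent are all assumptions chosen to keep the example small, and the input rows are made up.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_autoencoder(rows, hidden, epochs=300, lr=0.1, seed=0):
    """Train a one-hidden-layer autoencoder; return (embeddings, loss_before, loss_after)."""
    rng = random.Random(seed)
    n = len(rows[0])
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(n)]
    b2 = [0.0] * n

    def encode(x):  # f_theta1: sigmoid encoder (assumed)
        return [sigmoid(sum(W1[k][m] * x[m] for m in range(n)) + b1[k])
                for k in range(hidden)]

    def decode(h):  # g_theta2: linear decoder (assumed)
        return [sum(W2[j][k] * h[k] for k in range(hidden)) + b2[j]
                for j in range(n)]

    def total_loss():  # loss = sum_i L(x_i, g(f(x_i))) with L = half squared error
        return sum(
            0.5 * sum((xj - yj) ** 2 for xj, yj in zip(x, decode(encode(x))))
            for x in rows
        )

    loss_before = total_loss()
    for _ in range(epochs):
        for x in rows:
            h = encode(x)
            y = decode(h)
            dy = [yj - xj for yj, xj in zip(y, x)]  # dL/dy
            dh = [sum(W2[j][k] * dy[j] for j in range(n)) for k in range(hidden)]
            dz = [dh[k] * h[k] * (1.0 - h[k]) for k in range(hidden)]  # through sigmoid
            for j in range(n):
                for k in range(hidden):
                    W2[j][k] -= lr * dy[j] * h[k]
                b2[j] -= lr * dy[j]
            for k in range(hidden):
                for m in range(n):
                    W1[k][m] -= lr * dz[k] * x[m]
                b1[k] -= lr * dz[k]
    embeddings = [encode(x) for x in rows]
    return embeddings, loss_before, total_loss()


# Hypothetical PPMI rows for four software modules.
ppmi_rows = [
    [0.9, 0.0, 0.3, 0.0],
    [0.0, 0.8, 0.0, 0.2],
    [0.3, 0.0, 0.7, 0.0],
    [0.0, 0.2, 0.0, 0.6],
]
embeddings, loss_before, loss_after = train_autoencoder(ppmi_rows, hidden=2)
```

The hidden-layer activations in `embeddings` play the role of the network metric elements: one low-dimensional vector per software module.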
V. Construct the workload-aware defect prediction method.
V.1. Extract the code metric elements and process metric elements of each software module from the source code and code submission information in the version control repository, as follows:
Based on the source code in the version control repository collected in step I, the code metric elements of each software module are extracted with the Understand tool; they include the number of lines of code, the number of functions per class, the number of comment lines, and similar information.
Based on the code submission information in the version control repository collected in step I (submitter, modified files, diff, and similar information), the process metric elements of each software module are extracted; they include the number of added lines, the number of deleted lines, the number of modified lines, the number of developers, and similar information.
The code metric elements, the process metric elements and the network metric elements of step IV are combined to form the feature data of the defect data set; the feature data, together with the defect label data obtained in step II, constitute the software defect data set.
V.2. Because each type of metric element predicts defects with different effectiveness, this embodiment builds a separate training set from each single type of metric element and from each combination of at least two types of metric elements in the defect data set (every pair of types and the combination of all three types), and trains a classifier on each training set to form the base learners.
In this embodiment the classifier is preferably a random forest classifier, although other classifiers such as Bayes or logistic regression may also be used. Taking the random forest classifier as an example, the defect probability of each software module is calculated as:
$\mathrm{Probability}(x_i) = \sum_{k=1}^{T} a_k \, \mathrm{Probability}_{c_k}(x_i)$
where x_i is the i-th software module and Probability(x_i) is the defect probability of software module i;
T is the number of random forest classifiers, T = 7; a_k is the parameter of the k-th random forest classifier;
Probability_{c_k}(x_i) is the defect probability that the k-th random forest classifier predicts for the i-th software module; c_k denotes a base learner.
The set of base learners is C = {C_code, C_process, C_network, C_code+process, C_process+network, C_code+network, C_code+process+network},
where C_code denotes the base classifier based on code metric elements; C_process denotes the base classifier based on process metric elements; C_network denotes the base classifier based on the extracted weighted network metric elements; C_code+process denotes the base classifier based on code and process metric elements; C_process+network denotes the base classifier based on process and network metric elements; C_code+network denotes the base classifier based on code and network metric elements; and C_code+process+network denotes the base classifier based on code, process and network metric elements.
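The seven base learners correspond to the seven non-empty combinations of the three metric types, and the final probability is a weighted sum of their outputs. The sketch below shows only this bookkeeping; classifier training is omitted, and the function names, feature values, base-learner outputs and equal weights are hypothetical.

```python
from itertools import combinations

METRIC_TYPES = ("code", "process", "network")


def metric_subsets(types=METRIC_TYPES):
    """Every non-empty combination of metric types: 3 singles, 3 pairs, 1 triple."""
    return [subset
            for size in range(1, len(types) + 1)
            for subset in combinations(types, size)]


def select_features(module_features, subset):
    """Concatenate the feature vectors of the chosen metric types for one module."""
    selected = []
    for metric_type in subset:
        selected.extend(module_features[metric_type])
    return selected


def combined_probability(weights, base_probabilities):
    """Probability(x_i) = sum over k of a_k * Probability_ck(x_i)."""
    if len(weights) != len(base_probabilities):
        raise ValueError("one weight per base learner is required")
    return sum(a * p for a, p in zip(weights, base_probabilities))


subsets = metric_subsets()

# Hypothetical feature values for a single software module.
module = {"code": [120, 8, 30], "process": [40, 12, 5, 3], "network": [0.12, 0.80]}
training_rows = {subset: select_features(module, subset) for subset in subsets}

# Hypothetical outputs of the seven trained base learners for this module,
# combined here with equal weights a_k = 1/7.
base_outputs = [0.9, 0.4, 0.7, 0.8, 0.6, 0.85, 0.75]
probability = combined_probability([1 / 7] * 7, base_outputs)
```

In practice each entry of `training_rows` would be built for every module and passed to its own classifier; the weights a_k are the quantities tuned in step V.3.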
V.3. With the objective of finding the most defects with the least workload, an optimization algorithm is used to optimize the combination of the base learners, forming the workload-aware defect prediction method; this yields the defect probability of each software module and, from it, the order in which the software modules should be inspected. Based on this inspection order, a software development company can be given a test resource allocation scheme that detects the most defects with the least workload, thereby improving software quality and guiding the development of software projects.
Based on the defect probability Probability(x_i) obtained for each software module, the software modules in the software system are sorted in descending order to obtain the order in which the code is to be inspected.
If two software modules have equal defect probability, the one with fewer lines of code is ranked first.
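The ordering rule above amounts to a two-key sort. A minimal sketch, with hypothetical module names and values:

```python
def inspection_order(modules):
    """Sort by defect probability, highest first; ties go to the module with fewer lines of code."""
    return sorted(modules, key=lambda m: (-m["probability"], m["loc"]))


# Hypothetical modules; Parser and Cache have the same predicted probability.
modules = [
    {"name": "Parser", "probability": 0.80, "loc": 900},
    {"name": "Cache", "probability": 0.80, "loc": 150},
    {"name": "Logger", "probability": 0.35, "loc": 60},
    {"name": "Router", "probability": 0.92, "loc": 400},
]
ordered_names = [m["name"] for m in inspection_order(modules)]
```

Here Cache is inspected before Parser because both have probability 0.80 and Cache has fewer lines of code.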
To determine the optimal a_k, particle swarm optimization is used to optimize a_k so that the most defects are found when inspecting code that amounts to 20% of the total lines of code of a software project; the optimization function is expressed as:
$cost = \frac{|y_{bug}(x_i)|}{|y(x_i)|}, \quad i \le k, \quad \sum_{i=1}^{k} loc_i < 0.2 \times \sum_{i=1}^{n} loc_i$
where cost denotes the optimization function, |y_bug(x_i)| denotes the number of defective modules found, and |y(x_i)| denotes the number of software modules inspected; loc_i denotes the number of lines of code of the i-th software module.
The values of a_k are adjusted so that cost is maximized, which yields the optimal combined strong classifier, i.e. the workload-aware defect prediction method, which can find the most defects with the least workload.
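The optimization step can be sketched as a cost function evaluated on the top of the ranking plus a small particle swarm search over the weights a_k. This is an illustration under stated assumptions, not the patent's implementation: the swarm size, iteration count, inertia and acceleration constants, the restriction of weights to [0, 1], the seeding of one particle with uniform weights, and all data values are hypothetical, and only two base learners are used to keep the example short.

```python
import random


def cost_at_budget(weights, base_probs, loc, is_buggy, budget=0.2):
    """cost = defective modules found / modules inspected within the LOC budget."""
    scores = [sum(a * p for a, p in zip(weights, probs)) for probs in base_probs]
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], loc[i]))
    limit = budget * sum(loc)
    inspected = found = spent = 0
    for i in order:
        if spent + loc[i] >= limit:  # cumulative LOC must stay strictly below the budget
            break
        spent += loc[i]
        inspected += 1
        found += is_buggy[i]
    return found / inspected if inspected else 0.0


def pso_maximize(objective, dim, particles=10, iterations=30, seed=0,
                 inertia=0.7, c1=1.5, c2=1.5):
    """Basic particle swarm search for a weight vector in [0, 1]^dim."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(particles)]
    pos[0] = [1.0 / dim] * dim  # assumption: start one particle at uniform weights
    vel = [[0.0] * dim for _ in range(particles)]
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = max(range(particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]
    for _ in range(iterations):
        for i in range(particles):
            for d in range(dim):
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * rng.random() * (best_pos[i][d] - pos[i][d])
                             + c2 * rng.random() * (g_pos[d] - pos[i][d]))
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val > best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val > g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


# Hypothetical data: 5 modules, outputs of 2 base learners per module.
loc = [100, 50, 300, 80, 470]
is_buggy = [1, 1, 0, 0, 1]
base_probs = [[0.9, 0.2], [0.3, 0.8], [0.8, 0.1], [0.1, 0.9], [0.5, 0.5]]

uniform_cost = cost_at_budget([0.5, 0.5], base_probs, loc, is_buggy)
best_weights, best_cost = pso_maximize(
    lambda w: cost_at_budget(w, base_probs, loc, is_buggy), dim=2)
```

Because one particle starts at the uniform weights, the search can never return a result worse than the equal-weight combination.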
In this embodiment of the invention, the steps work together as follows:
First, the data collection and fusion of step I provides the data foundation for the invention. Second, the defective-module labeling method of step II provides high-quality label data for constructing the workload-aware defect prediction method based on the weighted software network. Third, constructing the weighted software network graph in step III and obtaining the graph-embedding-based software network metric element representation in step IV provide a high-quality network metric element representation for constructing the method. Finally, constructing the workload-aware defect prediction method in step V provides a solid basis for defect prediction and its application in practice.
It should be understood, however, that the description herein of specific embodiments is not intended to limit the invention to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. A workload-aware defect prediction method based on a weighted software network, characterized by comprising the following steps:
I. data collection and fusion;
collecting software system data and defect reports from a defect tracking system; the collected software system data comprise the source code, code submissions and version information in a version control repository;
establishing links between the code submissions and the defect reports to fuse the code submission and defect report data;
II. defective module labeling;
II.1. extracting text similarity features, personnel consistency features and timestamp features of the code submissions and the defect reports;
based on the three types of extracted features and the links between code submissions and defect reports established in step I, finding the missing links between code submissions and defect reports, and determining the defect-fixing code submissions;
II.2. identifying the defect-introducing code submissions based on the defect-fixing code submissions;
II.3. based on the defect-fixing code submissions of step II.1 and the defect-introducing code submissions of step II.2, labeling each software module in combination with the version information in the version control repository to obtain the defect label data of each software module;
III. constructing a weighted software network graph;
III.1. extracting and aggregating the comments in the source code of each software module of the software system to obtain the document information of each software module, and obtaining the developer information of each software module from the submitter information and modified-file information of the code submissions;
establishing data dependency and call dependency associations between software modules, and establishing collaborative-developer associations between software modules from the developers they have in common;
if any one of a data dependency, a call dependency or a collaborative-developer association exists between two software modules, the two software modules are associated, and a software network graph is constructed with the following structure:
each software module is a node of the software network graph, and an edge is established between the nodes representing two associated software modules;
III.2. calculating, from the document information and developer information of each software module, the association strength between the two software modules connected by each edge of the software network graph, and assigning the association strength value as the weight of that edge;
normalizing the weights of the edges that originate from the same software module to obtain the weighted software network graph;
IV. obtaining a software network metric element representation of the weighted software network graph based on graph embedding;
IV.1. converting the weighted software network graph of step III into a probability co-occurrence matrix, converting the probability co-occurrence matrix into a pointwise mutual information matrix, and adjusting the negative values of the resulting pointwise mutual information matrix to obtain a positive pointwise mutual information matrix;
IV.2. taking each row of the positive pointwise mutual information matrix as the input of one node, training an autoencoder model, and autonomously learning the rich structural and semantic information between modules in the weighted software network graph to obtain the network metric element representation of the weighted software network graph;
V. constructing the workload-aware defect prediction method;
V.1. extracting the code metric elements and process metric elements of each software module from the source code and code submission information in the version control repository, and combining them with the network metric elements obtained in step IV to form the feature data of a defect data set;
combining the feature data with the defect label data obtained in step II to form a software defect data set;
V.2. building a separate training set from each single type of metric element and from each combination of at least two types of metric elements in the defect data set, and training a classifier on each training set to form the base learners;
V.3. with the objective of finding the most defects with the least workload, optimizing the combination of the base learners with an optimization algorithm to form the workload-aware defect prediction method, obtaining the defect probability of each software module, and thereby giving the order in which the software modules are to be inspected.
2. The weighted software network-based workload-aware defect prediction method of claim 1,
in step I, the code submission data are searched for defect report number keywords; if a code submission contains the number of a defect report, a link is established between that code submission and the defect report, thereby fusing the code submission and defect report data.
3. The weighted software network-based workload-aware defect prediction method of claim 1,
in step II.1, the text similarity features are extracted as follows:
based on the code submission and defect report data collected and fused in step I, the text of the code submissions and defect reports is processed by word segmentation and TF-IDF to obtain the text similarity features between code submissions and defect reports;
in step II.1, the personnel consistency features are extracted as follows:
the personnel consistency features are extracted from the submitter information of the code submissions collected in step I and the fixer of each defect report;
in step II.1, the timestamp features are extracted as follows:
the timestamp features are extracted from the submission time of the code submissions collected in step I and the fix time of each defect report.
4. The weighted software network-based workload-aware defect prediction method of claim 1,
in step II.3, a software module modified between the defect-introducing code submission of step II.2 and the defect-fixing code submission of step II.1 is labeled as a defective module, and the other software modules in the same version are labeled as non-defective modules.
5. The weighted software network-based workload-aware defect prediction method of claim 1,
in step III.2, the association strength between software modules is calculated as:
$\mathrm{Degree\_Relation\_Strength}(A,B) = \mathrm{Similarity}(\mathrm{document}(A), \mathrm{document}(B)) + \frac{|\mathrm{develop}(A) \cap \mathrm{develop}(B)|}{|\mathrm{develop}(A)| + |\mathrm{develop}(B)|}$;
where A and B denote two associated software modules;
Degree_Relation_Strength(A, B) denotes the association strength between the two software modules;
document(·) denotes the document information of a software module; develop(·) denotes the developer information of a software module;
Similarity(document(A), document(B)) denotes the text similarity between software module A and software module B;
|develop(·)| denotes the number of developers of a software module;
|develop(A) ∩ develop(B)| denotes the number of developers common to software module A and software module B.
6. The weighted software network-based workload-aware defect prediction method of claim 1,
in step IV.1, the probability co-occurrence matrix PCO of the weighted software network graph is generated as follows:
first, the adjacency matrix A is constructed; it expresses the transition probabilities between the nodes of the weighted software network graph and serves as the transition matrix A; the transition matrix A is an n×n matrix, where n is the number of nodes, i.e. the number of software modules in the software system;
A_ij denotes an element of the transition matrix A, where i and j denote different nodes in the weighted software network graph;
if there is an edge between node i and node j, A_ij is set to the weight of that edge; otherwise A_ij is set to 0;
a row vector p_k = (p_k1, p_k2, …, p_kj, …, p_kn) is introduced for each node, where p_kj denotes the probability of reaching node j after k transition steps; the row vector p_k is calculated as: $p_k = (1-a) \times p_{k-1} A + p_0$;
where a denotes the probability of returning to the original starting node and p_0 is initialized as a one-hot encoding; from the resulting p_k values, the vector representation of each node in the weighted software network graph is obtained as $r = \sum_{k=1}^{K} p_k$, where K is the total number of steps;
the vector representations r of all nodes are combined to generate the probability co-occurrence matrix PCO of the weighted software network graph.
7. The weighted software network-based workload-aware defect prediction method of claim 6,
in step IV.1, the probability co-occurrence matrix PCO is converted into the pointwise mutual information matrix PMI by the following formula:
$PMI_{i,j} = \log\left(\frac{PCO_{i,j}}{\#(i) \times \#(j)}\right)$;
where PMI_{i,j} is the element in row i, column j of the pointwise mutual information matrix PMI;
PCO_{i,j} is the element in row i, column j of the probability co-occurrence matrix PCO;
#(·) denotes the sum of the corresponding column of the probability co-occurrence matrix;
the negative values of the pointwise mutual information matrix are adjusted to obtain the positive pointwise mutual information matrix PPMI by the following formula:
$PPMI_{i,j} = \max(PMI_{i,j}, 0)$;
where PPMI_{i,j} is the element in row i, column j of the positive pointwise mutual information matrix PPMI;
max(PMI_{i,j}, 0) means that if PMI_{i,j} is negative, PPMI_{i,j} is set to 0.
8. The weighted software network-based workload-aware defect prediction method of claim 1,
in step V.1, the code metric elements and process metric elements are extracted as follows:
based on the source code in the version control repository collected in step I, the code metric elements of each software module are extracted with the Understand tool, the code metric elements comprising the number of lines of code, the number of functions per class and the number of comment lines;
based on the code submission information in the version control repository collected in step I, the process metric elements of each software module are extracted, the process metric elements comprising the number of added lines, the number of deleted lines, the number of modified lines and the number of developers.
9. The weighted software network-based workload-aware defect prediction method of claim 1,
in step V.2, a random forest classifier is adopted as the classifier;
based on the random forest classifiers, the defect probability of each software module is calculated by the following formula:
$\mathrm{Probability}(x_i) = \sum_{k=1}^{T} a_k \, \mathrm{Probability}_{c_k}(x_i)$;
where x_i is the i-th software module and Probability(x_i) is the defect probability of software module i;
T is the number of random forest classifiers, T = 7, and a_k is the parameter of the k-th random forest classifier;
Probability_{c_k}(x_i) is the defect probability that the k-th random forest classifier predicts for the i-th software module; c_k denotes a base learner;
based on the defect probability Probability(x_i) of each software module, the software modules in the software system are sorted in descending order to obtain the order in which the code is to be inspected;
to determine the optimal a_k, particle swarm optimization is used to optimize a_k so that the most defects are found when inspecting code that amounts to 20% of the total lines of code of a software project; the optimization function is expressed as:
$cost = \frac{|y_{bug}(x_i)|}{|y(x_i)|}, \quad i \le k, \quad \sum_{i=1}^{k} loc_i < 0.2 \times \sum_{i=1}^{n} loc_i$;
where cost denotes the optimization function, |y_bug(x_i)| denotes the number of defective modules found, and |y(x_i)| denotes the number of software modules inspected; loc_i denotes the number of lines of code of the i-th software module;
the values of a_k are adjusted so that cost is maximized, which yields the optimal combined strong classifier, i.e. the workload-aware defect prediction method.
CN202110323054.1A 2021-03-26 2021-03-26 Workload perception defect prediction method based on weighted software network Active CN112711543B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110323054.1A CN112711543B (en) 2021-03-26 2021-03-26 Workload perception defect prediction method based on weighted software network

Publications (2)

Publication Number Publication Date
CN112711543A true CN112711543A (en) 2021-04-27
CN112711543B CN112711543B (en) 2021-06-08

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113986602A (en) * 2021-12-27 2022-01-28 广州锦行网络科技有限公司 Software identification method and device, storage medium and electronic equipment
CN114782967A (en) * 2022-03-21 2022-07-22 南京航空航天大学 Software defect prediction method based on code visualization learning
CN117909313A (en) * 2024-03-19 2024-04-19 成都融见软件科技有限公司 Distributed storage method for design code data, electronic equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103838671A (en) * 2014-01-26 2014-06-04 北京理工大学 Software defect measuring method based on complex network
US20150261649A1 (en) * 2014-03-13 2015-09-17 International Business Machines Corporation Method for performance monitoring and optimization via trend detection and forecasting
CN107665172A (en) * 2017-10-20 2018-02-06 北京理工大学 A kind of Software Defects Predict Methods based on complicated weighting software network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
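CESAR COUTO et al.: "Uncovering Causal Relationships between Software Metrics and Bugs", 《 2012 16TH EUROPEAN CONFERENCE ON SOFTWARE MAINTENANCE AND REENGINEERING》 *
马皖王莹 et al.: "Assessment method for high-risk software defects based on complex network analysis" (基于复杂网络分析的软件高危缺陷评估方法), 《计算机科学与探索》 (Journal of Frontiers of Computer Science and Technology) *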

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant