Background
Software defect prediction is an important research topic in software engineering. Metric-based static defect prediction uses historical data obtained from existing software modules to predict whether a new software module is defective, thereby providing decision support for a software project. Existing research mostly adopts machine learning, and software defect prediction generally comprises the following steps: 1) labeling module categories, where software modules are divided into two categories, defective and non-defective; 2) extracting module attributes, where software modules are measured with methods such as the McCabe and Halstead metrics to obtain their attributes; 3) building a prediction model, where a machine learning method learns a classifier from the category and attribute information of the software modules; and 4) predicting new modules, where the classifier uses the attributes of a new software module to judge whether that module contains defects.
Selecting metrics that are strongly correlated with software defects is the key to building a high-quality defect prediction model. The more complex the dependency relationships between modules, the more likely defects are to occur, so network metrics of the modules can be used for defect prediction.
Nachiappan et al. argue that a module with more dependency relationships is more prone to defects. Their contribution is to be the first to relate network metrics to defects: they extract network metrics using centrality measures from social network analysis and show, by comparison, that these network metrics predict defects better than the complexity metrics internal to a module.
An existing patent designs the metric tuple {LOCODE, LOCOM, INS, OUTS, Cluscoe, BetCen}: the first two indices reflect the complexity inside the nodes of the module dependency graph, and the last four capture the degree of coupling between nodes in the module dependency graph. That patent builds its defect prediction model with a support vector machine algorithm.
Existing research mainly relies on manually defined structural feature metrics (such as degree statistics or centrality metrics) to describe the structural features of nodes, and this lack of flexibility makes network node features difficult to extract. Developers also contribute to the introduction of defects, yet little research takes this into account. To solve these problems, a software defect prediction method based on a module dependency graph is provided.
Disclosure of Invention
In view of this, the embodiments of the present application provide a software defect prediction method based on a module dependency graph in which developers are used as network nodes in the module dependency graph, which can improve the flexibility of constructing network node metrics and improve the effect of software defect prediction.
According to an aspect of the present disclosure, there is provided a defect prediction method based on a software module dependency graph, the method including:
S1: identifying the defect information of the software modules according to the version information of the software to be analyzed;
S2: establishing a software module dependency graph according to the dependency relationships among the software modules, and taking developers as nodes in the module dependency graph;
S3: extracting internal features of the software modules, extracting the dependency features of each node in the software module dependency graph by network representation learning, combining the internal features and the dependency features into a metric tuple, and establishing a historical defect library of the software according to the metric tuple and the defect information of the modules;
S4: training a defect prediction model with the historical defect library for predicting subsequent software defects, wherein the software module defect prediction model uses dynamic classifier selection based on local optimality, the parameters of the model are optimized automatically, and the result of the model is used as the defect prediction result of the software to be analyzed.
Detailed Description
In order to make the aforementioned objects, features, and advantages of the present application more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.
FIG. 1 illustrates a flow diagram of a software defect prediction method based on a module dependency graph according to an embodiment of the present disclosure; FIG. 2 illustrates an overall flow diagram of a software defect prediction method based on a module dependency graph according to an embodiment of the present disclosure.
As shown in FIG. 1, the software defect prediction method of the present disclosure includes:
S1: identifying the defect information of the software modules according to the version information of the software to be analyzed.
As shown in FIG. 2, the defective software files in the C source code version repository to be analyzed are identified from its commit and issue information. The main identification method is as follows: a defect-fixing commit contains a keyword such as 'fixed', 'closed', or 'fix' followed by an issue number; the files changed by these defect-fixing commits are then collected, and the changed files are the files that contain defects.
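The identification rule in S1 can be sketched as follows. This is a minimal illustration, not the method's actual implementation: the `#123` issue-number format, the regular expression, and the `(message, changed_files)` input shape are all assumptions, and the file names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed pattern: a fix keyword followed by an issue number written as "#<digits>".
FIX_PATTERN = re.compile(r"\b(fix|fixed|closed)\b\s*:?\s*#(\d+)", re.IGNORECASE)

def find_defective_files(commits):
    """Map each file changed by a defect-fixing commit to the issue numbers it fixed.

    `commits` is an iterable of (message, changed_files) pairs; this input
    shape is an assumption made for the sketch.
    """
    defective = defaultdict(set)
    for message, changed_files in commits:
        issues = {int(num) for _, num in FIX_PATTERN.findall(message)}
        if not issues:
            continue  # not a defect-fixing commit under this rule
        for path in changed_files:
            defective[path] |= issues
    return dict(defective)

# Hypothetical commit history.
commits = [
    ("Fixed #12: null pointer in parser", ["src/parser.c", "src/lexer.c"]),
    ("Add README", ["README.md"]),
    ("closed #30 buffer overflow", ["src/parser.c"]),
    ("fix typo in comment", ["src/util.c"]),  # keyword but no issue number
]
defective = find_defective_files(commits)
```

Requiring the issue number after the keyword keeps commits such as "fix typo in comment" from labeling a file as defective.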
S2: establishing a software module dependency graph according to the dependency relationships among the software modules, and taking developers as nodes in the module dependency graph.
FIG. 3 and FIG. 4 respectively show software module dependency graphs of a software defect prediction method based on the module dependency graph according to an embodiment of the present disclosure.
For a C source code project, the module dependency network (MDN) is defined as a directed graph MDN = (V, E), where V denotes the set of all nodes, each node v ∈ V denotes a module in the project (a source code file in a C language project), and the edge set E denotes the dependencies between modules. The dependency between two modules falls into one of two categories: data dependency and function call dependency. As shown in FIG. 3, the dependency between C and A, written C -> A, is a data dependency, and the dependency between B and A, written B -> A, is a function call dependency.
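As a rough illustration of this definition, the directed graph can be held in a small adjacency structure whose edges carry the dependency category. The class name, method names, and the "data"/"call" labels are hypothetical choices for this sketch; the modules A, B, and C mirror the example of FIG. 3.

```python
from collections import defaultdict

class ModuleDependencyNetwork:
    """Directed graph MDN = (V, E) whose edges carry a dependency kind."""

    def __init__(self):
        self.nodes = set()
        self.edges = defaultdict(dict)  # source -> {target: kind}

    def add_dependency(self, source, target, kind):
        if kind not in ("data", "call"):
            raise ValueError(f"unknown dependency kind: {kind}")
        self.nodes.update((source, target))
        self.edges[source][target] = kind

    def out_degree(self, node):
        return len(self.edges.get(node, {}))

    def in_degree(self, node):
        return sum(1 for targets in self.edges.values() if node in targets)

mdn = ModuleDependencyNetwork()
mdn.add_dependency("C", "A", "data")  # C -> A: data dependency
mdn.add_dependency("B", "A", "call")  # B -> A: function call dependency
```

In this toy graph module A has in-degree 2 and out-degree 0, the kind of hand-defined degree statistic that the learned node features are meant to replace.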
Developers can also introduce defects during development, so developers can be added to the module dependency graph as nodes: if a developer modifies a software module, the developer and that module have a dependency relationship. As shown in FIG. 4, developer 1 commits to software module A and software module B, and developer 2 commits to module C; developers 1 and 2 are added as nodes on top of the module dependency graph to construct the software module dependency graph, and other developers can be added as nodes in the same way.
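Extending the graph with developer nodes might look like the sketch below, which replays the FIG. 4 example. The edge direction (developer to module), the "modifies" label, and the `dev:` name prefix are assumptions made for illustration; the text only states that a developer and a modified module are connected.

```python
def add_developer_nodes(nodes, edges, commit_log):
    """Return a new (nodes, edges) pair extended with developer nodes.

    `commit_log` maps a developer name to the modules that developer
    modified; each such pair becomes a developer -> module edge.
    """
    new_nodes = set(nodes)
    new_edges = set(edges)
    for developer, modules in commit_log.items():
        dev_node = f"dev:{developer}"  # prefix avoids clashing with module names
        new_nodes.add(dev_node)
        for module in modules:
            if module in nodes:  # ignore files outside the module graph
                new_edges.add((dev_node, module, "modifies"))
    return new_nodes, new_edges

# Module graph of FIG. 3 as (source, target, kind) triples.
module_nodes = {"A", "B", "C"}
module_edges = {("C", "A", "data"), ("B", "A", "call")}

# FIG. 4: developer 1 commits to A and B, developer 2 commits to C.
commit_log = {"developer1": ["A", "B"], "developer2": ["C"]}
graph_nodes, graph_edges = add_developer_nodes(module_nodes, module_edges, commit_log)
```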
S3: extracting internal features of the software module, extracting the dependency features of each node in the software module dependency graph by adopting a network representation learning mode, forming the internal features and the dependency features into a measurement tuple, and establishing a historical defect library of the software according to the measurement tuple and the defect information of the module;
As shown in FIG. 2, the internal features of a software module may include code scale features and code structure features. The code scale features are measured with the LOC metric tuple {blank lines, comment lines, total lines of code, executable lines of code} and the Halstead metric tuple {total number of operators and operands, program volume, program length, difficulty, effort, number of distinct operators, number of distinct operands, total number of operators, total number of operands}. The code structure features are measured with the McCabe metric tuple {cyclomatic complexity, essential complexity}.
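For orientation, several of the Halstead measures can be derived from four basic counts using the standard textbook definitions. The sketch below assumes those definitions; the text does not say which tool or exact formulas are used, and the function name and the example counts are hypothetical.

```python
import math

def halstead_measures(n1, n2, N1, N2):
    """Derive Halstead measures from basic counts.

    n1, n2: number of distinct operators / distinct operands
    N1, N2: total number of operators / total number of operands
    """
    vocabulary = n1 + n2
    length = N1 + N2                       # total operators and operands
    volume = length * math.log2(vocabulary)
    difficulty = (n1 / 2) * (N2 / n2)
    effort = difficulty * volume
    return {
        "vocabulary": vocabulary,
        "length": length,
        "volume": volume,
        "difficulty": difficulty,
        "effort": effort,
    }

# Hypothetical counts for one small module.
measures = halstead_measures(n1=4, n2=4, N1=8, N2=8)
```

With these counts the vocabulary is 8 and the length is 16, so the volume is 16 × log2(8) = 48, the difficulty is 4, and the effort is 192.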
As shown in FIG. 2, the node2vec network representation learning method is used to extract the dependency features of the nodes in the software module dependency graph. node2vec borrows the idea of word2vec from natural language processing. It generates random walk sequences that blend breadth-first and depth-first search strategies, and uses the parameters p and q to control the transition probabilities of the walk. The parameter p controls the breadth of the walk and the parameter q controls its depth, so by choosing different combinations of p and q the walk sequences can capture more homophily information or more structural-equivalence information. In one example, p may be set to 0.25 and q to 2, the walk length is 7, the number of walks per node is 80, and each software module obtains a 128-dimensional network metric feature vector.
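The biased walk at the heart of node2vec can be sketched in plain Python. This is a simplified illustration under stated assumptions: the graph is treated as undirected, the toy adjacency and node names are hypothetical, and the later step of feeding the walks to a skip-gram (word2vec) model to obtain the 128-dimensional vectors is omitted.

```python
import random

def node2vec_walk(adj, start, walk_length, p, q, rng):
    """Generate one second-order biased random walk of up to `walk_length` nodes."""
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))  # first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # return to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move outward, away from it
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

def generate_walks(adj, num_walks, walk_length, p, q, seed=0):
    """Start `num_walks` walks from every node, in shuffled node order."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(node2vec_walk(adj, node, walk_length, p, q, rng))
    return walks

# Toy undirected view of a module dependency graph with two developer nodes.
adj = {
    "A": {"B", "C", "dev1"},
    "B": {"A", "dev1"},
    "C": {"A", "dev2"},
    "dev1": {"A", "B"},
    "dev2": {"C"},
}
walks = generate_walks(adj, num_walks=80, walk_length=7, p=0.25, q=2)
```

With p = 0.25 the weight for stepping back is 4, and with q = 2 the weight for moving outward is 0.5, so these example values keep each walk close to its starting neighborhood.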
Finally, the LOC metric tuple, the Halstead metric tuple, the McCabe metric tuple, and the network metric tuple are combined into a single metric tuple, and the historical defect library of the software is established from this metric tuple and the defect information of the modules.
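One way to picture the historical defect library is as a list of rows, each holding a module's combined feature vector and its defect label. The row layout, function name, file names, and the shortened two-dimensional stand-in for the 128-dimensional network features are all assumptions of this sketch.

```python
def build_defect_library(internal_metrics, embeddings, defective_modules):
    """Return (module, feature_vector, label) rows for modules with both feature groups."""
    library = []
    for module in sorted(internal_metrics):
        if module not in embeddings:
            continue  # skip modules that have no node in the dependency graph
        features = list(internal_metrics[module]) + list(embeddings[module])
        label = 1 if module in defective_modules else 0
        library.append((module, features, label))
    return library

# Hypothetical values: internal metrics such as LOC, cyclomatic complexity,
# and Halstead difficulty, followed by a 2-D stand-in for the network features.
internal_metrics = {"a.c": [120, 4, 9.5], "b.c": [40, 1, 2.0]}
embeddings = {"a.c": [0.1, -0.3], "b.c": [0.7, 0.2]}
library = build_defect_library(internal_metrics, embeddings, {"a.c"})
```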
S4: training a defect prediction model with the historical defect library for predicting subsequent software defects, wherein the software module defect prediction model uses dynamic classifier selection based on local optimality, the parameters of the model are optimized automatically, and the result of the model is used as the defect prediction result of the software to be analyzed.
As shown in FIG. 5, for each sample to be tested (from the software to be analyzed), its k nearest neighbors in the training set are found, and the trained classification algorithm with the best prediction performance on those k neighboring training samples is selected, which implements dynamic classifier selection. In one example, k can be set to 8, and the base classifiers are SVM, naive Bayes, logistic regression, random forest, and the like; the locally optimal one among these models, trained on the software module training set, serves as the software module defect prediction model.
Because data distributions differ across software defect libraries, the hyperparameters of the SVM, naive Bayes, logistic regression, random forest, and other classifier models are optimized with a genetic algorithm. For example, the hyperparameters of each base classifier serve as the genes of the genetic algorithm and together form a chromosome; the population size can be set to 50, the number of generations can be set to 100, the fitness can be the F-measure of defect prediction, and an elitist strategy retains the best offspring. The parameters of the module defect prediction model are thereby optimized automatically, and the result of the software module defect prediction model is taken as the defect prediction result of the software to be analyzed.
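The local-accuracy selection rule can be sketched as follows. Two hand-written threshold rules stand in for the trained base classifiers, Euclidean distance defines the neighborhood, and k is 3 rather than 8 because the toy training set has only six samples; all of these are simplifications assumed for illustration, and the genetic-algorithm hyperparameter tuning is left out.

```python
import math

def select_and_predict(x, train_X, train_y, classifiers, k=8):
    """Pick the classifier most accurate on the k nearest training samples of x, then predict x."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))
    region = order[:k]
    best_name, best_acc = None, -1.0
    for name, clf in classifiers.items():
        correct = sum(1 for i in region if clf(train_X[i]) == train_y[i])
        acc = correct / len(region)
        if acc > best_acc:  # ties keep the earlier classifier
            best_name, best_acc = name, acc
    return best_name, classifiers[best_name](x)

# Hypothetical stand-ins for trained base classifiers (1 = defective).
classifiers = {
    "size": lambda v: int(v[0] > 5),
    "cplx": lambda v: int(v[1] > 3),
}

# Toy training set: the "size" rule fits the first three samples,
# the "cplx" rule fits the last three.
train_X = [(1, 5), (2, 4), (1, 2), (10, 1), (11, 2), (12, 4)]
train_y = [0, 0, 0, 0, 0, 1]

choice_left = select_and_predict((1.5, 4.5), train_X, train_y, classifiers, k=3)
choice_right = select_and_predict((11, 1), train_X, train_y, classifiers, k=3)
```

For the second query the globally reasonable "size" rule would predict a defect, but the neighborhood shows it is unreliable there, so the "cplx" rule is chosen and the module is predicted as non-defective.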
In summary, the method identifies the defect information of the software modules from the version information of the software to be analyzed, builds a software module dependency graph with developers as nodes, combines the internal features of the modules with dependency features learned by network representation learning into a metric tuple, and trains a dynamically selected, automatically tuned defect prediction model on the resulting historical defect library. The method can improve the flexibility of constructing network node metrics and improve the effect of software defect prediction.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.