CN111240993B

CN111240993B - Software defect prediction method based on module dependency graph

Info

Publication number: CN111240993B
Application number: CN202010066087.8A
Authority: CN
Inventors: 原仓周; 柯鑫鑫; 詹盼盼; 齐征
Original assignee: Beihang University
Current assignee: Tianhang Changying (Jiangsu) Technology Co.,Ltd.
Priority date: 2020-01-20
Filing date: 2020-01-20
Publication date: 2021-05-14
Anticipated expiration: 2040-01-20
Also published as: CN111240993A

Abstract

The disclosure provides a software defect prediction method based on a module dependency graph. The method identifies the defect information of software modules according to the version information of the software to be analyzed; establishes a software module dependency graph according to the dependency relationships among the software modules, with developers also taken as nodes in the graph; extracts the internal features of each software module and, by means of network representation learning, the dependency features of each node in the graph; combines the internal features and the inter-module dependency features into metric tuples; and builds a historical defect library of the software from the metric tuples and the defect information of the modules. A defect prediction model is then trained on the historical defect library to predict defects in subsequent software. The model uses dynamic classifier selection based on local optimality, its parameters are optimized automatically, and its output for a software module is taken as the defect prediction result for the software to be analyzed. The method makes the construction of network node metrics more flexible and improves the effectiveness of software defect prediction.

Description

Software defect prediction method based on module dependency graph

Technical Field

The invention belongs to the technical field of software quality assurance, and particularly relates to a software defect prediction method based on a module dependency graph.

Background

Software defect prediction is an important research subject in software engineering. Metric-based static software defect prediction uses historical data obtained from existing software modules to predict whether a new software module is defective, thereby providing decision support for a software project. Existing software defect prediction research mostly adopts machine learning, and the prediction generally comprises the following steps: 1) labeling module categories, where software modules are divided into two categories, defective and non-defective; 2) extracting module attributes, where the software module is measured with methods such as McCabe metrics and Halstead metrics to obtain its attributes; 3) building a prediction model, where a classifier is learned from the category and attribute information of the software modules using a machine learning method; 4) predicting new modules, where the classifier predicts the category of a new software module from its attributes, so as to judge whether the module contains defects.

Designing metrics that are strongly correlated with software defects is the key to constructing a high-quality defect prediction model. The more complex the dependency relationships between modules, the more likely defects are to occur, so the network metrics of the modules can be used for defect prediction.

Nachiappan et al. argued that a module with stronger dependency relationships is more prone to defects. Their contribution was to be the first to propose a relationship between network metrics and defects: they extracted network metrics using centrality measures from social network analysis and showed, by comparison with intra-module complexity metrics, that the network metrics give better prediction results.

In one existing patent, the designed metric tuple is {LOCODE, LOCOM, INS, OUTS, Cluscoe, BetCen}: the first two indicators reflect the internal complexity of the nodes of the module dependency graph, and the last four capture the degree of coupling between nodes in the graph. That patent uses a support vector machine algorithm to construct the defect prediction model.

Existing research relies mainly on hand-crafted structural feature metrics (such as degree statistics or centrality measures) to describe the structural features of nodes, and this lack of flexibility makes network node features difficult to extract. Developers also contribute to the introduction of defects, yet little research has taken this into account. To solve these problems, a software defect prediction method based on a module dependency graph is provided.

Disclosure of Invention

In view of this, the embodiments of the present application provide a software defect prediction method based on a module dependency graph in which developers are used as network nodes of the graph. The method can make the construction of network node metrics more flexible and improve the effectiveness of software defect prediction.

According to an aspect of the present disclosure, there is provided a software defect prediction method based on a module dependency graph, the method including:

S1: identifying the defect information of the software modules according to the version information of the software to be analyzed;

S2: establishing a software module dependency graph according to the dependency relationships among the software modules, and taking developers as nodes in the module dependency graph;

S3: extracting the internal features of the software modules, extracting the dependency features of each node in the software module dependency graph by means of network representation learning, combining the internal features and the dependency features into a metric tuple, and establishing a historical defect library of the software according to the metric tuples and the defect information of the modules;

S4: training a defect prediction model on the historical defect library to predict subsequent software defects, wherein the software module defect prediction model uses dynamic classifier selection based on local optimality, the parameters of the model are optimized automatically, and the result of the software module defect prediction model is taken as the defect prediction result of the software to be analyzed.

In this method, the defect information of the software modules is identified from the version information of the software to be analyzed; a software module dependency graph is established from the dependency relationships among the software modules, with developers taken as nodes in the graph; the internal features of the software modules and the dependency features of each node, learned by network representation learning, are combined into metric tuples, from which a historical defect library of the software is built together with the defect information of the modules; and a defect prediction model is trained on the historical defect library to predict subsequent software defects. The model uses dynamic classifier selection based on local optimality, its parameters are optimized automatically, and its result is taken as the defect prediction result of the software to be analyzed.

Drawings

FIG. 1 illustrates a flow diagram of a software defect prediction method based on a module dependency graph according to an embodiment of the present disclosure;

FIG. 2 illustrates an overall flow diagram of a software defect prediction method based on a module dependency graph according to an embodiment of the present disclosure;

FIG. 3 illustrates a software module dependency graph of a software defect prediction method based on a module dependency graph according to an embodiment of the present disclosure;

FIG. 4 illustrates a software module dependency graph of a software defect prediction method based on a module dependency graph according to another embodiment of the present disclosure;

FIG. 5 illustrates a defect prediction model of a software defect prediction method based on a module dependency graph according to an embodiment of the present disclosure;

FIG. 6 illustrates a flow diagram of the hyperparameter optimization of the classifiers of a software defect prediction method based on a module dependency graph according to an embodiment of the present disclosure.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments accompanying the drawings are described in detail below.

FIG. 1 illustrates a flow diagram of a software defect prediction method based on a module dependency graph according to an embodiment of the present disclosure; FIG. 2 illustrates an overall flow diagram of the same method.

As shown in FIG. 1, the software defect prediction method of the present disclosure includes:

S1: identifying the defect information of the software modules according to the version information of the software to be analyzed.

As shown in FIG. 2, the commit and issue information in the version repository of the C source code software to be analyzed is used to identify which software files contain defects. The main identification method is as follows: a defect-fixing commit contains one of the keywords 'fixed', 'closed' or 'fix', followed by an issue number; the files changed by such commits to repair the defects are then collected, and these changed files are the files that contained defects.
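The identification rule above can be illustrated with a minimal sketch. The `#123` issue syntax, the `(message, changed_files)` input shape, and the function names are assumptions made for illustration; the disclosure only states that the keyword is followed by an issue number.

```python
import re

# Keywords named in the text ('fixed', 'closed', 'fix') followed by an issue number.
# Assumption: issues are written as '#<digits>', optionally after a colon.
FIX_PATTERN = re.compile(r"\b(fix|fixed|closed)\b\s*:?\s*#(\d+)", re.IGNORECASE)


def is_fix_commit(message):
    """Return True if the commit message looks like a defect-fixing commit."""
    return FIX_PATTERN.search(message) is not None


def defective_files(commits):
    """Collect the files changed by defect-fixing commits.

    commits: iterable of (message, changed_files) pairs (hypothetical shape).
    """
    defective = set()
    for message, changed_files in commits:
        if is_fix_commit(message):
            defective.update(changed_files)
    return defective


# Illustrative data only, not taken from any real repository.
commits = [
    ("Fixed #12: null pointer in parser", ["src/parser.c"]),
    ("Add logging", ["src/log.c"]),
    ("closed #7 buffer overflow", ["src/buf.c", "src/parser.c"]),
]
print(sorted(defective_files(commits)))
```

In practice the commit messages and changed file lists would come from the version control history of the project being analyzed.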

FIGS. 3 and 4 each show a software module dependency graph of the software defect prediction method based on a module dependency graph according to an embodiment of the present disclosure.

For a C source code software project, the module dependency network (MDN) is defined as a directed graph MDN = (V, E), where V denotes the set of all nodes, each node v ∈ V denotes a module in the project (a source code file in a C language project), and the edge set E denotes the dependencies between modules. The dependency relationship between two modules falls into one of two categories: data dependency and function call dependency. As shown in FIG. 3, the dependency between C and A, represented by C -> A, is a data dependency, and the dependency between B and A, represented by B -> A, is a function call dependency.
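As a rough illustration of this definition, the sketch below stores the directed graph as an adjacency map whose edges carry a dependency kind. The class name, the `"data"`/`"call"` labels, and the `.c` file names are hypothetical choices for this example, not part of the disclosure.

```python
from collections import defaultdict


class ModuleDependencyNetwork:
    """Directed graph MDN = (V, E): nodes are source files, edges are dependencies."""

    def __init__(self):
        self.nodes = set()
        self.edges = defaultdict(dict)  # source -> {target: dependency kind}

    def add_dependency(self, source, target, kind):
        # The two dependency categories named in the text.
        if kind not in ("data", "call"):
            raise ValueError("kind must be 'data' or 'call'")
        self.nodes.update((source, target))
        self.edges[source][target] = kind

    def out_degree(self, node):
        return len(self.edges.get(node, {}))

    def in_degree(self, node):
        return sum(1 for targets in self.edges.values() if node in targets)


# The FIG. 3 example: C -> A is a data dependency, B -> A a function call dependency.
mdn = ModuleDependencyNetwork()
mdn.add_dependency("C.c", "A.c", "data")
mdn.add_dependency("B.c", "A.c", "call")
print(mdn.in_degree("A.c"), mdn.out_degree("A.c"))
```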

Developers can also introduce defects during development, so developers can be added as nodes of the software module dependency graph: if a developer modifies a software module, a dependency relationship exists between that developer and the module. As shown in FIG. 4, developer 1 committed software module A and software module B, and developer 2 committed module C; the software module dependency graph is constructed by adding developers 1 and 2 as nodes to the module dependency graph. Other developers can be added as nodes in the same way.
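One way to picture this extension is the sketch below, which adds developer nodes to an existing adjacency map. The edge direction (developer to module), the `dev:` prefix, and the function name are assumptions for illustration; the disclosure only states that a dependency relationship exists between a developer and a modified module.

```python
def add_developer_nodes(graph, commits_by_developer):
    """Return a copy of the graph with developer nodes and developer -> module edges.

    graph: dict mapping node -> set of successor nodes.
    commits_by_developer: dict mapping developer name -> modules they modified.
    """
    extended = {node: set(targets) for node, targets in graph.items()}
    for developer, modules in commits_by_developer.items():
        dev_node = "dev:" + developer  # hypothetical naming scheme
        extended.setdefault(dev_node, set())
        for module in modules:
            extended.setdefault(module, set())
            extended[dev_node].add(module)
    return extended


# The FIG. 4 example: developer 1 committed A and B, developer 2 committed C.
module_graph = {"A": set(), "B": {"A"}, "C": {"A"}}
commits_by_developer = {"developer1": ["A", "B"], "developer2": ["C"]}
extended = add_developer_nodes(module_graph, commits_by_developer)
print(sorted(extended))
```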

As shown in FIG. 2, the internal features of a software module may include code scale features and code structure features. The code scale features are extracted and measured with a LOC metric tuple {blank lines, comment lines, total lines of code, executable lines of code} and a Halstead metric tuple {total number of operators and operands, program volume, program length, difficulty, effort, number of distinct operators, number of distinct operands, total number of operators, total number of operands}. The code structure features are extracted and measured with a McCabe metric tuple {cyclomatic complexity, essential complexity}.
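As an illustration of the LOC metric tuple only, the following sketch counts blank, comment, total and executable lines of a C source string. It is a deliberate simplification under stated assumptions: a line with code followed by a trailing comment counts as executable, and comment markers inside string literals are not handled. The function name and dictionary keys are hypothetical.

```python
def loc_metrics(source):
    """Count blank, comment, total and executable lines in C source text."""
    blank = comment = code = 0
    in_block = False  # True while inside a multi-line /* ... */ comment
    lines = source.splitlines()
    for raw in lines:
        line = raw.strip()
        if in_block:
            comment += 1
            if "*/" in line:
                in_block = False
        elif not line:
            blank += 1
        elif line.startswith("//"):
            comment += 1
        elif line.startswith("/*"):
            comment += 1
            if "*/" not in line[2:]:
                in_block = True
        else:
            code += 1
    return {"blank": blank, "comment": comment, "total": len(lines), "executable": code}


sample = """/* add two ints
   and return */
int add(int a, int b) {

    // sum
    return a + b;  /* trailing */
}
"""
print(loc_metrics(sample))
```

The Halstead and McCabe tuples would normally be produced by a static analysis tool rather than by hand-written counting of this kind.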

As shown in FIG. 2, the node2vec network representation learning method is adopted to extract the dependency features of the nodes in the software module dependency graph. node2vec borrows the idea of word2vec from natural language processing. It generates random walk sequences that combine breadth-first and depth-first search strategies, and uses the parameters p and q to control the transition probabilities of the walk. The parameter p controls the breadth of the walk (how readily it returns to the node it just left), and the parameter q controls its depth (how readily it moves outward), so that by choosing different combinations of p and q the walk sequences can capture more homophily information or more structural equivalence information. In one example, p may be set to 0.25, q to 2, the walk length to 7, and the number of walks per node to 80, and the resulting network metric feature of each software module is a 128-dimensional vector.
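The biased walk can be sketched in plain Python as below. This is only the walk-generation half of node2vec on an unweighted graph treated as undirected; the skip-gram training that turns walks into 128-dimensional vectors is not shown. The toy graph and function names are assumptions, while the values of p, q, walk length and walks per node follow the example in the text.

```python
import random


def node2vec_walk(graph, start, walk_length, p, q, rng):
    """Generate one second-order biased random walk.

    graph: dict mapping node -> list of neighbors (assumed symmetric here).
    """
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbors = graph[current]
        if not neighbors:
            break
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # return to the node just left
            elif candidate in graph[previous]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move further outward
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def generate_walks(graph, num_walks, walk_length, p, q, seed=0):
    """Run num_walks walks from every node, in shuffled order."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        nodes = list(graph)
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(node2vec_walk(graph, node, walk_length, p, q, rng))
    return walks


# Toy graph with modules A, B, C and two developer nodes (illustrative only).
graph = {
    "A": ["B", "C", "dev1"],
    "B": ["A", "dev1"],
    "C": ["A", "dev2"],
    "dev1": ["A", "B"],
    "dev2": ["C"],
}
walks = generate_walks(graph, num_walks=80, walk_length=7, p=0.25, q=2)
print(len(walks), walks[0])
```

The resulting walks play the role of sentences: feeding them to a skip-gram model yields one embedding vector per node.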

Finally, the LOC metric tuple, the Halstead metric tuple, the McCabe metric tuple and the network metric tuple are combined into a single metric tuple, and a historical defect library of the software is established according to the metric tuples and the defect information of the modules.
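A minimal sketch of this combination step is given below, assuming the tuple sizes listed earlier (4 LOC values, 9 Halstead values, 2 McCabe values and a 128-dimensional network vector, 143 values in total). The record layout, function names and placeholder values are hypothetical.

```python
def build_metric_tuple(loc, halstead, mccabe, embedding):
    """Concatenate the four metric tuples into one flat feature vector."""
    return list(loc) + list(halstead) + list(mccabe) + list(embedding)


def build_defect_library(modules, defective):
    """Pair each module's metric tuple with its defect label.

    modules: dict mapping module name -> (loc, halstead, mccabe, embedding).
    defective: set of module names identified as defective.
    """
    return [
        (name, build_metric_tuple(*features), name in defective)
        for name, features in sorted(modules.items())
    ]


# Placeholder values only; real values would come from the earlier steps.
modules = {
    "A.c": ([1, 3, 7, 3], [0.0] * 9, [2, 1], [0.0] * 128),
    "B.c": ([0, 1, 5, 4], [0.0] * 9, [1, 1], [0.0] * 128),
}
library = build_defect_library(modules, defective={"A.c"})
print(len(library), len(library[0][1]))
```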

S4: training a defect prediction model on the historical defect library to predict subsequent software defects, wherein the software module defect prediction model uses dynamic classifier selection based on local optimality, the parameters of the model are optimized automatically, and the result of the software module defect prediction model is taken as the defect prediction result of the software to be analyzed.

As shown in FIG. 5, for each sample to be tested (software to be analyzed), its k nearest neighbors in the training set are found, and the trained classification algorithm that predicts those k neighboring training samples best is determined, thereby realizing dynamic classifier selection. In one example, k can be set to 8, and the base classifiers used for defect prediction can be an SVM, naive Bayes, logistic regression, random forest and the like; the locally optimal model among them, trained on the software module training set, serves as the software module defect prediction model. Because the data distribution differs between software defect libraries, the hyperparameters of the SVM, naive Bayes, logistic regression, random forest and other classifier models are optimized with a genetic algorithm. For example, the hyperparameters of each base classifier are used as genes of the genetic algorithm to form a chromosome; the initial population size can be set to 50, the number of generations can be set to 100, the fitness can be the F-measure of the defect prediction, and an elitist strategy retains the best offspring. The parameters of the module defect prediction model are thereby optimized automatically, and the result of the software module defect prediction model is taken as the defect prediction result of the software to be analyzed.
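The neighbor-based selection can be sketched as follows. The base classifiers here are stand-in rule functions rather than trained SVM or random forest models, Euclidean distance is an assumed choice of neighborhood metric, and the toy data uses k = 4 instead of the k = 8 of the example.

```python
import math


def k_nearest(train_X, x, k):
    """Indices of the k training samples closest to x (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return order[:k]


def select_classifier(classifiers, train_X, train_y, x, k):
    """Pick the classifier with the highest accuracy on the k neighbors of x."""
    neighbors = k_nearest(train_X, x, k)

    def local_accuracy(name):
        predict = classifiers[name]
        hits = sum(1 for i in neighbors if predict(train_X[i]) == train_y[i])
        return hits / len(neighbors)

    return max(classifiers, key=local_accuracy)


def predict_dynamic(classifiers, train_X, train_y, x, k=8):
    """Return the selected classifier name and its prediction for x."""
    best = select_classifier(classifiers, train_X, train_y, x, k)
    return best, classifiers[best](x)


# Stand-in "trained" classifiers (hypothetical rules, for illustration only).
classifiers = {
    "rule_a": lambda x: x[1] > 0.5,
    "rule_b": lambda x: True,
}
train_X = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1), (11, 0), (11, 1)]
train_y = [False, True, False, True, True, True, True, True]

print(predict_dynamic(classifiers, train_X, train_y, (0.4, 0.2), k=4))
print(predict_dynamic(classifiers, train_X, train_y, (10.5, 0.2), k=4))
```

Each query is routed to whichever classifier is most accurate in its own neighborhood, so different regions of the feature space can be served by different models.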

In summary, the defect information of the software modules is identified from the version information of the software to be analyzed; a software module dependency graph is established from the dependency relationships among the software modules, with developers taken as nodes in the graph; the internal features of the software modules and the dependency features of each node, learned by network representation learning, are combined into metric tuples, from which a historical defect library of the software is built together with the defect information of the modules; and a defect prediction model is trained on the historical defect library to predict subsequent software defects. The model uses dynamic classifier selection based on local optimality, its parameters are optimized automatically, and its result is taken as the defect prediction result of the software to be analyzed. The method makes the construction of network node metrics more flexible and improves the effectiveness of software defect prediction.
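The genetic hyperparameter optimization described in the embodiment (a population of 50, 100 generations, F-measure as fitness, elitism) might look roughly like the sketch below. The selection scheme, uniform crossover, mutation rate and the stand-in fitness function are all assumptions; in a real setting the fitness would train the base classifiers with the chromosome's hyperparameters and return the F-measure on validation data.

```python
import random


def f_measure(y_true, y_pred):
    """F-measure (harmonic mean of precision and recall) for boolean labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def genetic_search(fitness, bounds, pop_size=50, generations=100,
                   mutation_rate=0.1, rng=None):
    """Maximize fitness over chromosomes whose genes lie within bounds."""
    rng = rng or random.Random(0)

    def random_gene(i):
        low, high = bounds[i]
        return rng.uniform(low, high)

    population = [[random_gene(i) for i in range(len(bounds))]
                  for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]      # truncation selection (assumed)
        children = [scored[0][:]]             # elitism: keep the best unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            for i in range(len(child)):
                if rng.random() < mutation_rate:
                    child[i] = random_gene(i)
            children.append(child)
        population = children
    return max(population, key=fitness)


def toy_fitness(chromosome):
    """Stand-in fitness with a known optimum at (3.0, 0.5); not an F-measure."""
    c, g = chromosome
    return -((c - 3.0) ** 2 + (g - 0.5) ** 2)


best = genetic_search(toy_fitness, bounds=[(0.0, 10.0), (0.0, 1.0)],
                      rng=random.Random(42))
print(best, toy_fitness(best))
```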

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A software defect prediction method based on a module dependency graph is characterized by comprising the following steps:

S3: extracting the internal features of the software modules, extracting the dependency features of each node in the software module dependency graph by means of node2vec network representation learning, and finally obtaining the network metric tuple of each software module; combining the internal features and the dependency features into a metric tuple, and establishing a historical defect library of the software according to the metric tuples and the defect information of the modules;

the internal features comprise code scale features and code structure features; the code scale features are extracted by using a LOC (lines of code) metric tuple and a Halstead metric tuple; the code structure features are extracted by using a McCabe metric tuple;

the LOC metric tuple, the Halstead metric tuple, the McCabe metric tuple and the network metric tuple are combined to form the metric tuple;

S4: training a defect prediction model on the historical defect library to predict subsequent software defects, wherein the software module defect prediction model uses dynamic classifier selection based on local optimality, hyperparameter optimization of the module defect prediction model is carried out with a genetic algorithm, and the result of the software module defect prediction model is taken as the defect prediction result of the software module to be analyzed;

the dynamic classifier selection is implemented by finding, for each sample to be tested (that is, the software to be analyzed), its k nearest neighbors in the training set, and determining which trained classification algorithm predicts those k neighboring training samples best, thereby realizing the dynamic selection of the classifier.