CN108829607B

CN108829607B - Software defect prediction method based on convolutional neural network

Info

Publication number: CN108829607B
Application number: CN201810743379.3A
Authority: CN
Inventors: 陆璐; 邱少健
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2018-07-09
Filing date: 2018-07-09
Publication date: 2021-08-10
Anticipated expiration: 2038-07-09
Also published as: CN108829607A

Abstract

The invention discloses a software defect prediction method based on a convolutional neural network, comprising the following steps: parsing the source code of each file in a software project to form an AST Token vector set; establishing a mapping between integers and Tokens, and converting the AST Token vectors into numerical vectors; handling class imbalance in the numerical vector set using the SMOTE technique; constructing a convolutional neural network on the basis of the numerical vector set, and extracting feature vectors capable of expressing the semantics of the code; combining the features learned by the convolutional neural network with traditional hand-crafted static features; and inputting the data set with the merged features into a support vector machine classifier to train a software defect prediction model. The method can be applied directly to defect prediction tasks for real software and can capture the semantic features of source code, which addresses the lack of semantic feature analysis in traditional methods and further improves the accuracy of the defect prediction model.

Description

Software defect prediction method based on convolutional neural network

Technical Field

The invention relates to the field of software analysis and defect prediction in software engineering, in particular to a software defect prediction method based on a convolutional neural network.

Background

Latent, unknown defects seriously affect software quality, so software analysis and defect prediction techniques play an important role in software quality assurance. If software defects can be found early, the software team can understand the quality state of the current project and allocate test resources sensibly. However, manually reviewing every code unit in a project is impractical. As a result, a growing number of software engineering researchers and practitioners are turning to machine learning-based software defect prediction and are trying a variety of machine learning methods to detect potentially defective modules and files.

A traditional software defect prediction method first extracts code features from the source files of a software project. These mainly include Halstead features based on operators and operands, McCabe features based on dependency, CK features based on object-oriented programs, Change features based on code change history, and the like. They are mainly hand-crafted features refined by software analysts. The method then uses a machine learning algorithm to perform supervised learning on historical defect data to obtain a classification model; that is, after manual feature extraction from the historical data, a base classifier such as a support vector machine, random forest or logistic regression is used to learn the defect prediction model. The prediction results help the software quality assurance team find code regions that may contain defects.

However, the hand-crafted features used by traditional machine learning-based software defect prediction methods cannot capture and exploit the rich semantic features in software code, so prediction performance is often unsatisfactory. The convolutional neural network is one of the powerful techniques for automatic feature generation. It can effectively capture complex nonlinear features in software code and can therefore address the lack of semantic information in the hand-crafted features used for software defect prediction.

However, how to apply a convolutional neural network to software defect detection remains a difficult problem for those skilled in the art.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and to provide a software defect prediction method based on a convolutional neural network. The method is easy to operate: on the basis of a standard software defect prediction process, only the software source code needs to be submitted in order to obtain a set of predictions of the defect proneness of each file in the software project, which gives software teams a reference for project quality evaluation and test resource allocation.

The purpose of the invention is realized by the following technical scheme:

A software defect prediction method based on a convolutional neural network comprises the following steps:

1) parsing the source code of each file in the software project to obtain an abstract syntax tree (AST) Token vector for each file, forming an AST Token vector set;

2) establishing a mapping between integers and Tokens, and converting the vectors in the AST Token vector set obtained in step 1) into the numerical vectors required as input by a convolutional neural network;

3) handling class imbalance in the numerical vector set obtained in step 2) using the synthetic minority oversampling technique (SMOTE), the final result being a balanced data set;

4) constructing a convolutional neural network on the basis of the numerical vector set balanced in step 3), and extracting feature vectors capable of expressing the code semantics;

5) combining the feature vector of convolutional neural network learning in the step 4) with Halstead features based on operators and operands, McCabe features based on dependency and CK features based on object-oriented programs to form a more complete static code feature vector, and further obtaining a data set based on the feature vector;

6) inputting the data set obtained in the step 5) into a Support Vector Machine (SVM) classifier, and training a software defect prediction model based on a convolutional neural network.

The step 3) is specifically as follows:

preprocessing the training data with the synthetic minority oversampling technique (SMOTE); SMOTE is an oversampling-based technique for handling class-imbalanced data. It synthesizes new minority class samples before model training, providing a balanced class distribution for the learning task.

The oversampling technology comprises the following specific processes:

A. for each sample in the minority class, finding its k neighbors;

B. for each minority class sample, randomly selecting some of its k neighbors; the number of neighbors selected is calculated from the class imbalance ratio;

given M as the number of minority class samples and N as the number of majority class samples, the oversampling strategy calculates the imbalance ratio as IR = M/N; the number of randomly selected neighbors is rk = 1/IR - 1, rounded up if the result is not an integer;

C. the synthesized minority class samples are interpolated between the minority class samples and the selected neighbors.
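Steps A to C above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the patented implementation: the function name `smote_oversample`, the default `k=5`, the Euclidean distance, and the choice to sample neighbors with replacement only when rk exceeds k are all assumptions. The count rk = 1/IR - 1 is computed as the integer ceiling of (N - M)/M, which is algebraically the same value and avoids floating-point rounding.

```python
import numpy as np

def smote_oversample(minority, n_majority, k=5, seed=0):
    """Return synthetic minority samples (hypothetical helper, not from the patent)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)      # shape (M, n_features)
    m = len(X)
    # rk = ceil(1/IR - 1) with IR = M/N, written as ceil((N - M) / M) in integers
    rk = max(0, -((m - n_majority) // m))
    k = min(k, m - 1)                          # a sample has at most M - 1 neighbors
    if rk == 0 or k < 1:
        return np.empty((0, X.shape[1]))
    # Step A: the k nearest minority neighbors of every minority sample
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)             # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    synthetic = []
    for i in range(m):
        # Step B: randomly pick rk of the k neighbors
        # (assumption: sample with replacement only if rk > k)
        chosen = rng.choice(neighbors[i], size=rk, replace=rk > k)
        for j in chosen:
            # Step C: interpolate at a random point between sample and neighbor
            gap = rng.random()
            synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

# Hypothetical data: 20 minority samples against 80 majority samples.
minority = np.random.default_rng(1).normal(size=(20, 3))
new_samples = smote_oversample(minority, n_majority=80)
print(new_samples.shape)   # 20 * rk rows, with rk = 3 here
```

With M = 20 and N = 80, IR = 0.25 and rk = 3, so 60 synthetic samples are produced and the minority class grows to 80, matching the majority class.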

The step 4) is specifically as follows:

training a convolutional neural network with the balanced data set obtained in step 3), that is, learning the weights and biases of the convolutional neural network;

specifically, the convolutional neural network consists of an input layer, two convolutional layers, two average pooling layers and a fully-connected hidden layer, and each layer of the convolutional neural network uses a ReLU function as an activation function; the convolutional neural network is realized by using a Keras tool, and the features for expressing static code semantics can be extracted after the convolutional neural network with two convolutional layers is constructed.

Compared with the prior art, the invention has the following advantages and beneficial effects:

the method uses a convolutional neural network to generate features of the software source code automatically and can effectively capture the complex nonlinear features in the source code, which addresses the lack of semantic feature analysis in existing software defect prediction methods and further improves the accuracy of the defect prediction model.

Drawings

FIG. 1 is a flowchart of a software defect prediction method based on a convolutional neural network according to the present invention.

FIG. 2 is a diagram illustrating an example of AST Token vector parsing.

FIG. 3 is a schematic flow chart of obtaining semantic expression features by a convolutional neural network.

FIG. 4 is a detailed schematic diagram of a software defect prediction flow based on a convolutional neural network.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.

Referring to FIGS. 1, 2, 3 and 4, a software defect prediction method based on a convolutional neural network includes the following steps:

Step 1) Parsing the source code of each file in the software project to obtain an abstract syntax tree (AST) Token vector for each file, forming an AST Token vector set. The concrete implementation is as follows: the present invention uses the nodes in the AST of a source code file as the parsing granularity of the vector. An open source Java library named JDT-core is used to parse the source code of each software file into an AST Token vector. We mainly select three types of AST nodes as Tokens: 1) declaration nodes (including method declarations, type declarations, etc.), whose values are extracted as the Token values; 2) method call and class instance creation nodes, which are recorded in the Token vector as their method names or class names; and 3) control flow nodes, including If statements, While statements, For statements, Throw statements, etc., which are simply recorded as their node types. The main AST nodes selected by the present invention are listed in Table 1. In this way, each source file in the software project is converted into a Token vector. For the code example shown in FIG. 2, the code segment is parsed into the AST Token vector [For, myMethod, If, contact, print].
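The three node categories of step 1) can be illustrated with a small analogue. The invention parses Java with JDT-core; the sketch below swaps in Python's standard `ast` module and parses Python source instead, purely to show the selection rule. The function name `ast_tokens`, the mapping of Throw to Python's `Raise`, and the pre-order traversal are assumptions.

```python
import ast

# Control flow nodes are recorded by node type (Raise stands in for Java's Throw).
CONTROL_FLOW = (ast.If, ast.While, ast.For, ast.Raise)

def ast_tokens(source):
    """Return the Token vector of one source file (hypothetical helper)."""
    tokens = []

    def visit(node):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            tokens.append(node.name)              # declaration: the declared name
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):   # obj.method(...)
                tokens.append(func.attr)
            elif isinstance(func, ast.Name):      # function(...) or ClassName(...)
                tokens.append(func.id)
        elif isinstance(node, CONTROL_FLOW):
            tokens.append(type(node).__name__)    # control flow: the node type
        for child in ast.iter_child_nodes(node):  # pre-order walk in source order
            visit(child)

    visit(ast.parse(source))
    return tokens

example = """
def my_method(items):
    for item in items:
        if item:
            print(item)
"""
print(ast_tokens(example))   # ['my_method', 'For', 'If', 'print']
```

All other node types (names, literals, operators) are skipped, so the Token vector keeps only declarations, invocations and control flow in the order they appear.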

Step 2) Converting the AST Token vectors obtained in step 1) into the numerical vectors required as input by the convolutional neural network. Because the input of a convolutional neural network must be a set of numerical vectors, we first create a mapping between integers and Tokens and then convert each Token vector into an integer vector. Each Token is associated with a unique integer identifier ranging from 1 to the total number of Token types. Furthermore, a convolutional neural network requires input vectors of equal length, but the integer vectors obtained for different files usually differ in length. We therefore append 0s to the tail of each integer vector until its length matches that of the longest integer vector. In addition, we only handle Tokens that appear two or more times in the software project and represent Tokens that appear only once as 0, so that rarely used statements do not produce integer vectors of little significance. Table 2 lists an example of a set of three AST Token vectors and their conversion into integer vectors. The Continue statement appears only once in the set, so it is not considered and is marked as 0 in the first integer vector. The third AST Token vector is the longest sequence, with length 7, so the first and second integer vectors are padded with 0s to length 7.

Table 1: Primary AST nodes selected by the present invention

Table 2: Conversion of AST Token vectors into numerical vectors
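The conversion described in step 2) can be sketched as follows. The function name `encode_token_vectors`, the `min_count` parameter and the rule of assigning identifiers in order of first appearance are assumptions, and the three Token vectors are hypothetical data chosen to mirror the description (one Token seen only once, longest vector of length 7); they are not the contents of Table 2.

```python
from collections import Counter

def encode_token_vectors(token_vectors, min_count=2):
    """Map Tokens to integers and zero-pad to equal length (hypothetical helper)."""
    counts = Counter(tok for vec in token_vectors for tok in vec)
    vocab = {}
    for vec in token_vectors:
        for tok in vec:
            # Only Tokens seen at least min_count times get an identifier;
            # identifiers start at 1 because 0 is reserved.
            if counts[tok] >= min_count and tok not in vocab:
                vocab[tok] = len(vocab) + 1
    max_len = max(len(vec) for vec in token_vectors)
    encoded = []
    for vec in token_vectors:
        ids = [vocab.get(tok, 0) for tok in vec]          # rare Tokens become 0
        encoded.append(ids + [0] * (max_len - len(ids)))  # pad the tail with 0s
    return encoded, vocab

# Hypothetical Token vectors for three files.
vectors = [
    ["For", "If", "Continue", "foo"],
    ["If", "foo", "While"],
    ["For", "While", "If", "foo", "bar", "bar", "Throw"],
]
encoded, vocab = encode_token_vectors(vectors)
for row in encoded:
    print(row)
# [1, 2, 0, 3, 0, 0, 0]
# [2, 3, 4, 0, 0, 0, 0]
# [1, 4, 2, 3, 5, 5, 0]
```

In this example Continue and Throw each occur once and are therefore encoded as 0, and the first two vectors are padded to the length of the third.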

Step 3) Handling class imbalance in the numerical vector set obtained in step 2) by oversampling the minority class data with the SMOTE method. In practice, the defect rate of a software project may be very low or very high, so class imbalance is a ubiquitous and challenging problem in machine learning-based software defect prediction. Training a classification model on a highly imbalanced data set is difficult, because the trained classifier tends to favor the majority class and classifies the minority class poorly. To alleviate this problem, class imbalance learning techniques are widely used. In the present invention, we use the synthetic minority oversampling technique (SMOTE) to preprocess the training data before the prediction model is learned. SMOTE is an oversampling technique that synthetically creates new minority class samples, providing a more balanced class distribution for the learning task.

Step 4) Constructing a convolutional neural network on the basis of the numerical vector set balanced in step 3), and extracting features that express the semantics of the static code. We train a convolutional neural network with the balanced data set obtained in step 3), that is, we learn the weights and biases of the network. Specifically, our convolutional neural network consists of one input layer, two convolutional layers, two average pooling layers and a fully connected hidden layer, and each of these layers uses the ReLU function as its activation function. FIG. 3 is a schematic diagram of how the convolutional neural network obtains semantic features in this step. The convolutional neural network in the present invention is implemented with the Keras tool (http://keras.io). Keras allows the convolutional neural network to be constructed quickly and the feature vectors it extracts to be obtained.
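The data flow through the layers of step 4) can be illustrated with a plain NumPy forward pass. The invention builds and trains the network in Keras; the sketch below swaps that for untrained random weights and shows only how an integer vector passes through two convolutional layers, two average pooling layers and a fully connected hidden layer. The filter counts, kernel size, pooling size, hidden width and all function names are assumptions, as is feeding the integer identifiers directly as a single input channel.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1d(x, w, b):
    """Valid 1-D convolution with ReLU. x: (length, in_ch), w: (kernel, in_ch, out_ch)."""
    kernel = w.shape[0]
    out_len = x.shape[0] - kernel + 1
    windows = np.stack([x[i:i + kernel] for i in range(out_len)])  # (out_len, kernel, in_ch)
    return relu(np.tensordot(windows, w, axes=([1, 2], [0, 1])) + b)

def avg_pool1d(x, size):
    """Non-overlapping average pooling along the length axis."""
    out_len = x.shape[0] // size
    return x[:out_len * size].reshape(out_len, size, x.shape[1]).mean(axis=1)

def init_params(length, rng, filters=(4, 8), kernel=3, hidden=16):
    """Random, untrained weights sized for inputs of the given length (assumed sizes)."""
    len1 = (length - kernel + 1) // 2     # length after conv1 and pool1
    len2 = (len1 - kernel + 1) // 2       # length after conv2 and pool2
    return {
        "conv1": (rng.normal(0, 0.1, (kernel, 1, filters[0])), np.zeros(filters[0])),
        "conv2": (rng.normal(0, 0.1, (kernel, filters[0], filters[1])), np.zeros(filters[1])),
        "dense": (rng.normal(0, 0.1, (len2 * filters[1], hidden)), np.zeros(hidden)),
    }

def cnn_features(token_ids, params):
    """Forward pass that returns the hidden-layer feature vector for one file."""
    x = np.asarray(token_ids, dtype=float)[:, None]   # input layer: (length, 1)
    x = avg_pool1d(conv1d(x, *params["conv1"]), 2)
    x = avg_pool1d(conv1d(x, *params["conv2"]), 2)
    w, b = params["dense"]
    return relu(x.reshape(-1) @ w + b)                # fully connected hidden layer

rng = np.random.default_rng(0)
params = init_params(length=20, rng=rng)
features = cnn_features(np.arange(20) % 6, params)    # hypothetical integer vector
print(features.shape)   # (16,)
```

The output of the fully connected hidden layer is the vector that the later steps treat as the learned semantic features; learning the weights and biases by backpropagation is not shown here.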

Step 5) Combining the feature vectors learned by the convolutional neural network in step 4) with Halstead features based on operators and operands, McCabe features based on dependency, and CK features based on object-oriented programs (typical Halstead, McCabe and CK features are listed in Table 3) to form a more complete static code feature vector, and obtaining a data set based on this feature vector. Up to step 4), only the semantic features extracted by the convolutional neural network are considered. However, in conventional defect prediction methods, other features such as complexity metrics and object-oriented program features also carry static code information that can be used to predict defects. These static features can all be obtained by code analysis. To exploit this information, we concatenate the feature vectors learned by the convolutional neural network (represented by the fully connected hidden layer) with the traditional static feature vectors. This concatenation can be implemented with a merge operator in Keras. FIG. 4 is a schematic diagram of how the features extracted by the convolutional neural network are combined with the traditional hand-crafted features. After the feature vectors are merged, we obtain a data set based on the merged feature vector, and Z-Score normalization is then applied to this data set.
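The merge and normalization of step 5) amount to a column-wise concatenation followed by per-feature standardization. The NumPy sketch below shows only that arithmetic; the function name `merge_and_normalize`, the handling of constant columns and the random example data are assumptions, and in the invention the concatenation itself is done with a Keras merge operator.

```python
import numpy as np

def merge_and_normalize(cnn_features, manual_features):
    """Concatenate learned and hand-crafted features, then apply Z-Score normalization."""
    # One row per file: [CNN features | Halstead, McCabe and CK features]
    merged = np.concatenate([cnn_features, manual_features], axis=1)
    mean = merged.mean(axis=0)
    std = merged.std(axis=0)
    std[std == 0] = 1.0            # assumption: leave constant columns at zero
    return (merged - mean) / std   # z = (x - mean) / std for every column

# Hypothetical data: 10 files, 4 learned features and 3 hand-crafted metrics.
rng = np.random.default_rng(0)
learned = rng.normal(size=(10, 4))
metrics = rng.normal(5.0, 2.0, size=(10, 3))
dataset = merge_and_normalize(learned, metrics)
print(dataset.shape)   # (10, 7)
```

After this step every column has zero mean and unit standard deviation, so metrics with large raw ranges, such as lines of code, do not dominate the learned features in the classifier.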

Table 3: representative Halstead, McCabe and CK characteristics

Halstead & McCabe features	CK features
Number of operators and number of operations	Number of methods of class
Number of objects to be operated and number of times of operation	Depth of class in inheritance tree
Essential complexity	Number of child nodes of class in inheritance tree
Cyclomatic complexity	Number of other classes having a coupling relation with a class
Design complexity	Number of external methods that can be called in a class
Number of code lines	Number of methods in a class to access one or more attributes

Step 6) Inputting the data set obtained in step 5) into a support vector machine (SVM) classifier, and training a software defect prediction model based on a convolutional neural network. The SVM classifier is trained using the open source library Libsvm (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). The output layer that follows the last convolutional neural network unit and acts as the support vector machine classifier uses Sigmoid as its activation function. FIG. 4 is a detailed schematic diagram of the software defect prediction flow based on a convolutional neural network.
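Training the classifier of step 6) can be sketched with scikit-learn, whose `SVC` class is built on libsvm. This is an illustration under assumptions rather than the configuration used in the invention: the RBF kernel, the values of `C` and `gamma`, the function name `train_defect_classifier` and the synthetic two-cluster data are all hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def train_defect_classifier(features, labels):
    """Fit an SVM on the merged, normalized feature vectors (assumed hyperparameters)."""
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(features, labels)
    return clf

# Hypothetical data set: label 1 = defective file, label 0 = clean file.
rng = np.random.default_rng(0)
clean = rng.normal(-3.0, 1.0, size=(40, 5))
defective = rng.normal(3.0, 1.0, size=(40, 5))
X = np.vstack([clean, defective])
y = np.array([0] * 40 + [1] * 40)

model = train_defect_classifier(X, y)
print(model.predict(X[:3]))      # predicted labels for the first three files
```

In a real project the rows of `X` would be the merged feature vectors from step 5), and the model would be evaluated on files held out from training.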

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A software defect prediction method based on a convolutional neural network is characterized by comprising the following steps:

1) analyzing source codes of all files in the software project to obtain AST Token vectors of all files and form an AST Token vector set;

3) handling class imbalance in the numerical vector set obtained in step 2) using a minority class oversampling technique, the final result being a balanced data set;

6) inputting the data set obtained in the step 5) into a support vector machine classifier, and training a software defect prediction model based on a convolutional neural network.

2. The convolutional neural network-based software defect prediction method as claimed in claim 1, wherein the step 3) is specifically as follows:

preprocessing the training data with a minority class oversampling technique;

the oversampling technology comprises the following specific processes:

A. for each sample in the minority class, finding its k neighbors;

3. The convolutional neural network-based software defect prediction method as claimed in claim 1, wherein the step 4) is specifically as follows:

the convolutional neural network consists of an input layer, two convolutional layers, two average pooling layers and a fully-connected hidden layer, and each layer of the convolutional neural network uses a ReLU function as an activation function; the convolutional neural network is realized by using a Keras tool, and the features for expressing static code semantics can be extracted after the convolutional neural network with two convolutional layers is constructed.