CN108829607B - Software defect prediction method based on convolutional neural network - Google Patents

Software defect prediction method based on convolutional neural network

Info

Publication number
CN108829607B
Authority
CN
China
Prior art keywords
convolutional neural
neural network
vector
defect prediction
software
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810743379.3A
Other languages
Chinese (zh)
Other versions
CN108829607A (en)
Inventor
陆璐
邱少健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN201810743379.3A
Publication of CN108829607A
Application granted
Publication of CN108829607B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Abstract

The invention discloses a software defect prediction method based on a convolutional neural network, comprising the following steps: parsing the source code of each file in a software project to form a set of AST Token vectors; establishing a mapping between integers and Tokens, and converting the AST Token vectors into numerical vectors; applying the SMOTE technique to handle class imbalance in the set of numerical vectors; constructing a convolutional neural network on the numerical vector set and extracting feature vectors that express the semantics of the code; combining the features learned by the convolutional neural network with traditional hand-crafted static features; and feeding the data set with the merged features into a support vector machine classifier to train a software defect prediction model. The method can be applied directly to defect prediction tasks for real software, captures the semantic features of the source code, addresses the lack of semantic feature analysis in traditional methods, and thereby improves the accuracy of the defect prediction model.

Description

Software defect prediction method based on convolutional neural network
Technical Field
The invention relates to the field of software analysis and defect prediction in software engineering, in particular to a software defect prediction method based on a convolutional neural network.
Background
Potential and unknown defects in software seriously affect software quality, so software analysis and defect prediction techniques play an important role in software quality assurance. If software defects can be found early, the software team can understand the quality status of the current project and allocate testing resources sensibly. However, manually reviewing all code units in a project is impractical. As a result, a growing number of software engineering researchers and practitioners are focusing on machine-learning-based software defect prediction and are trying a variety of machine learning methods to detect potentially defective modules and files in software.
A traditional machine-learning-based defect prediction method first extracts code features from the source files of a software project. These mainly include Halstead features based on operators and operands, McCabe features based on dependency, CK features based on object-oriented programs, and Change features based on code change history. They are essentially hand-crafted features refined by software analysts. The method then applies a machine learning algorithm to perform supervised learning on historical defect data and obtain a classification model; that is, after manual feature extraction from the historical data, a base classifier such as a support vector machine, random forest, or logistic regression is used to learn the defect prediction model. The prediction results help the software quality assurance team find code regions that may contain defects.
However, the hand-crafted features used by traditional machine-learning-based defect prediction methods cannot capture and exploit the rich semantic features in software code, so prediction performance is often unsatisfactory. The convolutional neural network is a powerful technique for automatic feature generation. It can effectively capture complex nonlinear features in software code and can compensate for the semantic features that hand-crafted features lack in software defect prediction.
However, applying convolutional neural networks to software defect detection remains a difficult problem for those skilled in the art.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a software defect prediction method based on a convolutional neural network. The method is easy to operate: following a standard software defect prediction process, submitting the software source code is enough to obtain a set of defect-proneness predictions for each file in the software project, which gives software teams a reference for project quality evaluation and test resource allocation.
The purpose of the invention is achieved by the following technical solution:
A software defect prediction method based on a convolutional neural network comprises the following steps:
1) parsing the source code of each file in the software project to obtain an abstract syntax tree (AST) Token vector for each file, forming an AST Token vector set;
2) establishing a mapping between integers and Tokens, and converting the vectors in the AST Token vector set obtained in step 1) into the numerical vectors required as input to a convolutional neural network;
3) handling the class imbalance in the numerical vector set obtained in step 2) with the minority-class oversampling technique (SMOTE), yielding a balanced data set;
4) constructing a convolutional neural network on the numerical vector set balanced in step 3), and extracting feature vectors that express code semantics;
5) combining the feature vectors learned by the convolutional neural network in step 4) with Halstead features based on operators and operands, McCabe features based on dependency, and CK features based on object-oriented programs to form a more complete static code feature vector, thereby obtaining a data set based on this feature vector;
6) feeding the data set obtained in step 5) into a support vector machine (SVM) classifier and training a software defect prediction model based on the convolutional neural network.
Step 3) is specifically as follows:
The training data are preprocessed with the minority-class oversampling technique (SMOTE). SMOTE is an oversampling-based technique for handling class-imbalanced data. It synthesizes new minority-class samples before model training, providing a balanced class distribution for the learning task.
The oversampling procedure is as follows:
A. for each sample in the minority class, find its k nearest neighbors;
B. for each minority-class sample, randomly select some of its k neighbors according to the sample imbalance ratio; the number of neighbors to select is calculated from the class imbalance ratio;
given M minority-class samples and N majority-class samples, the oversampling strategy computes the imbalance ratio IR = M/N; the number of randomly selected neighbors is rk = 1/IR - 1, rounded up if the result is not an integer;
C. synthesize new minority-class samples by interpolating between each minority-class sample and its selected neighbors.
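The three oversampling steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the function names `neighbors_per_sample` and `smote`, the Euclidean distance, the uniform interpolation factor, and the toy data are all hypothetical choices made for this sketch. The neighbor count uses the identity 1/IR - 1 = (N - M)/M so that the rounding-up is done in exact integer arithmetic.

```python
import math
import random

def neighbors_per_sample(minority_count, majority_count):
    """rk = 1/IR - 1 with IR = M/N, rounded up (computed as ceil((N - M) / M))."""
    return -((minority_count - majority_count) // minority_count)

def smote(minority, majority_count, k=5, seed=0):
    """Return synthetic minority samples (assumes at least two minority samples)."""
    rng = random.Random(seed)
    rk = neighbors_per_sample(len(minority), majority_count)
    synthetic = []
    for i, x in enumerate(minority):
        # A. find the k nearest minority-class neighbors of x
        others = [y for j, y in enumerate(minority) if j != i]
        others.sort(key=lambda y: math.dist(x, y))
        knn = others[:k]
        # B. randomly select rk of them (with replacement only if rk > k)
        if rk > len(knn):
            chosen = rng.choices(knn, k=rk)
        else:
            chosen = rng.sample(knn, rk)
        # C. interpolate between x and each selected neighbor
        for y in chosen:
            gap = rng.random()
            synthetic.append([a + gap * (b - a) for a, b in zip(x, y)])
    return synthetic

# Hypothetical data: 4 minority samples against 12 majority samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, majority_count=12, k=2)
print(len(new_samples))  # 8: rk = 2 synthetic samples per minority sample
```

Because rk is rounded up, the minority class can end up slightly larger than the majority class when N is not a multiple of M.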
Step 4) is specifically as follows:
A convolutional neural network is trained on the balanced data set obtained in step 3), that is, the weights and biases of the network are learned;
specifically, the convolutional neural network consists of an input layer, two convolutional layers, two average pooling layers, and a fully connected hidden layer, and every layer uses the ReLU function as its activation function; the network is implemented with the Keras tool, and once the network with two convolutional layers has been constructed, features expressing static code semantics can be extracted.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the method uses a convolutional neural network to generate features from software source code automatically, effectively captures complex nonlinear features in the source code, addresses the lack of semantic feature analysis in software defect prediction methods, and thereby improves the accuracy of the defect prediction model.
Drawings
FIG. 1 is a flowchart of a software defect prediction method based on a convolutional neural network according to the present invention.
FIG. 2 is an example of parsing source code into an AST Token vector.
FIG. 3 is a schematic flow chart of obtaining semantic expression features by a convolutional neural network.
FIG. 4 is a detailed schematic diagram of a software defect prediction flow based on a convolutional neural network.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Referring to FIGS. 1, 2, 3 and 4, a software defect prediction method based on a convolutional neural network includes the following steps:
Step 1) Parse the source code of each file in the software project to obtain an abstract syntax tree (AST) Token vector for each file, forming an AST Token vector set. This is implemented as follows. The invention takes the nodes of a source file's AST as the granularity at which vectors are built. An open-source Java library named JDT-core is used to parse the source code of each file into an AST Token vector. Three types of AST nodes are selected as tokens: 1) declaration nodes (including method declarations, type declarations, etc.), whose values are extracted as Token values; 2) method invocation and class instance creation nodes, which are recorded in the Token vector by their method names or class names; and 3) control flow nodes, including If, While, For, and Throw statements, which are recorded simply by their node types. The main AST nodes selected by the invention are listed in Table 1. In this way, each source file in the software project is converted into a Token vector. For the code example in FIG. 2, the code segment is parsed into the AST Token vector [For, myMethod, If, contact, print].
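The node selection in step 1 can be illustrated with a small analogue. The patent parses Java with JDT-core; the sketch below instead uses Python's standard `ast` module on Python source, purely as a stand-in, so the node classes (`FunctionDef`, `ClassDef`, `Call`, `Raise`) and the helper name `ast_tokens` are assumptions of this example rather than part of the described method. It walks the tree in pre-order and keeps the three node categories named above.

```python
import ast

# Control-flow node types recorded by their type name (Python analogues of
# the If / While / For / Throw statements named in the text).
CONTROL_FLOW = (ast.If, ast.While, ast.For, ast.Raise)

def ast_tokens(source: str) -> list[str]:
    """Return the token vector of one source file, in pre-order."""
    tokens: list[str] = []

    def visit(node: ast.AST) -> None:
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            tokens.append(node.name)              # declaration: its name
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                tokens.append(func.id)            # plain call: f(...)
            elif isinstance(func, ast.Attribute):
                tokens.append(func.attr)          # method call: obj.m(...)
        elif isinstance(node, CONTROL_FLOW):
            tokens.append(type(node).__name__)    # control flow: node type
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens

example = """
def my_method(items):
    for item in items:
        if item:
            print(item.strip())
"""
print(ast_tokens(example))  # ['my_method', 'For', 'If', 'print', 'strip']
```

The pre-order walk records an outer call before the calls nested in its arguments; a different traversal order would give a different but equally valid token sequence.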
Step 2) Convert the AST Token vectors obtained in step 1) into the numerical vectors required as input to the convolutional neural network. Because the network's input must be a set of numerical vectors, we first create a mapping between integers and Tokens and then convert each Token vector into an integer vector. Each Token is associated with a unique integer identifier that ranges from 1 to the total number of Token node types. In addition, a convolutional neural network requires input vectors of equal length, but the integer vectors produced for different files usually differ in length. We therefore append 0s to the tail of each integer vector until its length matches that of the longest integer vector. Furthermore, we only handle Tokens that appear two or more times in the software project and represent Tokens that appear only once as 0, so that rarely used statements do not introduce integers of little significance. Table 2 lists an example of three AST Token vectors and their conversion into integer vectors. The Continue statement appears only once in the set, so it is disregarded and marked as 0 in the first integer vector. The third AST Token vector is the longest, with length 7, so the first and second integer vectors are padded with 0s to length 7.
Table 1: Main AST nodes selected by the invention
[Table 1 is available only as an image in the source: Figure GDA0003108071490000051]
Table 2: Conversion of AST Token vectors into numerical vectors
[Table 2 is available only as an image in the source: Figure GDA0003108071490000061]
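The conversion described in step 2 can be sketched as below. The three token vectors are hypothetical and are not the contents of Table 2; they are only chosen to match the situation described in the text (Continue occurs once, and the third vector is the longest with length 7). The helper names `build_vocab` and `encode` are likewise assumptions, as is the choice to number only the tokens that occur at least twice, in order of first appearance.

```python
from collections import Counter

def build_vocab(token_vectors):
    """Map each token seen at least twice in the project to an integer >= 1."""
    counts = Counter(tok for vec in token_vectors for tok in vec)
    vocab = {}
    for vec in token_vectors:
        for tok in vec:
            if counts[tok] >= 2 and tok not in vocab:
                vocab[tok] = len(vocab) + 1  # ids start at 1; 0 is reserved
    return vocab

def encode(token_vectors, vocab):
    """Replace tokens by ids (tokens seen once become 0) and right-pad with 0."""
    max_len = max(len(vec) for vec in token_vectors)
    encoded = []
    for vec in token_vectors:
        ids = [vocab.get(tok, 0) for tok in vec]
        encoded.append(ids + [0] * (max_len - len(ids)))
    return encoded

# Hypothetical token vectors for three files.
files = [
    ["For", "If", "Continue", "print"],
    ["myMethod", "If", "print"],
    ["For", "myMethod", "If", "print", "While", "If", "While"],
]
vocab = build_vocab(files)
vectors = encode(files, vocab)
print(vocab)    # {'For': 1, 'If': 2, 'print': 3, 'myMethod': 4, 'While': 5}
print(vectors)  # [[1, 2, 0, 3, 0, 0, 0], [4, 2, 3, 0, 0, 0, 0], [1, 4, 2, 3, 5, 2, 5]]
```

Note that 0 plays two roles here, as in the text: it marks a token that occurs only once and it pads short vectors.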
Step 3) Handle the class imbalance in the numerical vector set obtained in step 2) by oversampling the minority class with the SMOTE method. In practice, the defect rate of some software projects is very low or very high, so class imbalance is a ubiquitous and challenging problem in machine-learning-based software defect prediction. Training a classification model on a highly imbalanced data set is difficult because the trained classifier tends to predict the majority class and classifies the minority class poorly. To alleviate this problem, class-imbalance learning techniques are widely used. In the invention, the minority-class oversampling technique (SMOTE) is used to preprocess the training data before the prediction model is learned. SMOTE is an oversampling technique that synthetically creates new minority-class samples, providing a more balanced class distribution for the learning task.
Step 4) Construct a convolutional neural network on the numerical vector set balanced in step 3) and extract features that express static code semantics. A convolutional neural network is trained on the balanced data set obtained in step 3), that is, its weights and biases are learned. Specifically, the network consists of one input layer, two convolutional layers, two average pooling layers, and a fully connected hidden layer; all seven layers mentioned here use the ReLU function as the activation function. FIG. 3 is a schematic diagram of how the convolutional neural network obtains semantic features in this step. The network is implemented with the Keras tool (http://keras.io). Keras makes it possible to construct the network quickly and obtain the feature vectors it extracts.
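To make the layer sequence in step 4 concrete, the following sketch runs a forward pass through two convolutional layers, two average pooling layers, and a fully connected hidden layer, each followed by ReLU where applicable. It deliberately does not use Keras and does no training: it is a single-channel, single-filter toy written with the standard library only, and the kernel width of 3, pool size of 2, four hidden units, random untrained weights, and all function names are assumptions made for illustration.

```python
import random

def relu(values):
    return [max(0.0, v) for v in values]

def conv1d(seq, kernel, bias):
    """Valid 1-D convolution (cross-correlation) of one channel, then ReLU."""
    width = len(kernel)
    return relu([
        sum(kernel[j] * seq[i + j] for j in range(width)) + bias
        for i in range(len(seq) - width + 1)
    ])

def avg_pool(seq, size):
    """Non-overlapping average pooling; a trailing partial window is dropped."""
    return [sum(seq[i:i + size]) / size
            for i in range(0, len(seq) - size + 1, size)]

def dense(vec, weights, biases):
    """Fully connected layer with ReLU; weights holds one row per hidden unit."""
    return relu([sum(w * v for w, v in zip(row, vec)) + b
                 for row, b in zip(weights, biases)])

def extract_features(int_vector, params):
    """conv -> pool -> conv -> pool -> fully connected hidden layer."""
    x = [float(v) for v in int_vector]
    x = avg_pool(conv1d(x, params["k1"], params["b1"]), 2)
    x = avg_pool(conv1d(x, params["k2"], params["b2"]), 2)
    return dense(x, params["W"], params["b"])

# Untrained, randomly initialised parameters for a length-16 input:
# 16 -> conv(3) 14 -> pool(2) 7 -> conv(3) 5 -> pool(2) 2 -> dense 4.
rng = random.Random(0)
params = {
    "k1": [rng.uniform(-1, 1) for _ in range(3)], "b1": 0.0,
    "k2": [rng.uniform(-1, 1) for _ in range(3)], "b2": 0.0,
    "W": [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(4)],
    "b": [0.0] * 4,
}
features = extract_features([1, 4, 2, 3, 5, 2, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0], params)
print(len(features))  # 4 features, one per hidden unit
```

In a trained network the output of the fully connected hidden layer would serve as the learned feature vector for a file; here the values are meaningless because the weights are random.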
Step 5) Combine the feature vectors learned by the convolutional neural network in step 4) with Halstead features based on operators and operands, McCabe features based on dependency, and CK features based on object-oriented programs (typical Halstead, McCabe, and CK features are listed in Table 3) to form a more complete static code feature vector and obtain a data set based on it. Up to step 4), only the semantic features extracted by the convolutional neural network are considered. However, in traditional defect prediction methods, other features such as complexity metrics and object-oriented program features also carry static code information that can be used to predict defects. These static features can all be obtained by code analysis. To exploit this information, we concatenate the feature vector learned by the convolutional neural network (represented by the fully connected hidden layer) with the traditional static feature vector. This concatenation can be implemented with a merge operator in Keras. FIG. 4 is a schematic diagram of combining the features extracted by the convolutional neural network with the traditional hand-crafted features. After the feature vectors are merged, we obtain a data set based on the merged feature vector. Finally, this step applies Z-Score normalization to the data set.
Table 3: Representative Halstead, McCabe and CK features
Halstead & McCabe features | CK features
Number of operators and number of operations | Number of methods of a class
Number of objects to be operated and number of times of operation | Depth of a class in the inheritance tree
Essential complexity | Number of children of a class in the inheritance tree
Cyclomatic complexity | Number of other classes coupled to a class
Design complexity | Number of external methods that can be called in a class
Number of lines of code | Number of methods in a class that access one or more attributes
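The merge and normalization in step 5 amount to a per-file concatenation followed by column-wise standardization. The sketch below shows this with the standard library only; it does not use the Keras merge operator mentioned in the text, and the function names, the toy feature values, the use of the population standard deviation, and the decision to map constant columns to 0 are all assumptions of this example.

```python
import statistics

def merge_features(cnn_features, static_features):
    """Concatenate, per file, the CNN feature vector and the static metrics."""
    return [c + s for c, s in zip(cnn_features, static_features)]

def z_score(rows):
    """Standardize every column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical values: 2 CNN features and 2 static metrics for 3 files.
cnn = [[0.2, 0.0], [0.4, 1.0], [0.6, 2.0]]
static = [[10.0, 3.0], [20.0, 3.0], [30.0, 3.0]]
dataset = z_score(merge_features(cnn, static))
print(len(dataset[0]))  # 4 columns per file after merging
```

Standardizing after the merge puts the learned features and the hand-crafted metrics on a comparable scale before they reach the classifier.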
Step 6) Feed the data set obtained in step 5) into a support vector machine (SVM) classifier and train the software defect prediction model based on the convolutional neural network. The SVM classifier is trained with the open-source library Libsvm (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). The output layer of the final convolutional neural network unit, which acts as the support vector machine classifier, uses Sigmoid as its activation function. FIG. 4 is a detailed schematic diagram of the software defect prediction flow based on a convolutional neural network.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (3)

1. A software defect prediction method based on a convolutional neural network, characterized by comprising the following steps:
1) parsing the source code of each file in the software project to obtain an AST Token vector for each file, forming an AST Token vector set;
2) establishing a mapping between integers and Tokens, and converting the vectors in the AST Token vector set obtained in step 1) into the numerical vectors required as input to a convolutional neural network;
3) handling the class imbalance in the numerical vector set obtained in step 2) with a minority-class oversampling technique, yielding a balanced data set;
4) constructing a convolutional neural network on the numerical vector set balanced in step 3), and extracting feature vectors that express code semantics;
5) combining the feature vectors learned by the convolutional neural network in step 4) with Halstead features based on operators and operands, McCabe features based on dependency, and CK features based on object-oriented programs to form a more complete static code feature vector, thereby obtaining a data set based on this feature vector;
6) feeding the data set obtained in step 5) into a support vector machine classifier and training a software defect prediction model based on the convolutional neural network.
2. The software defect prediction method based on a convolutional neural network according to claim 1, wherein step 3) is specifically as follows:
the training data are preprocessed with a minority-class oversampling technique;
the oversampling procedure is as follows:
A. for each sample in the minority class, find its k nearest neighbors;
B. for each minority-class sample, randomly select some of its k neighbors according to the sample imbalance ratio; the number of neighbors to select is calculated from the class imbalance ratio;
given M minority-class samples and N majority-class samples, the oversampling strategy computes the imbalance ratio IR = M/N; the number of randomly selected neighbors is rk = 1/IR - 1, rounded up if the result is not an integer;
C. synthesize new minority-class samples by interpolating between each minority-class sample and its selected neighbors.
3. The software defect prediction method based on a convolutional neural network according to claim 1, wherein step 4) is specifically as follows:
a convolutional neural network is trained on the balanced data set obtained in step 3), that is, the weights and biases of the network are learned;
the convolutional neural network consists of an input layer, two convolutional layers, two average pooling layers, and a fully connected hidden layer, and every layer uses the ReLU function as its activation function; the network is implemented with the Keras tool, and once the network with two convolutional layers has been constructed, features expressing static code semantics can be extracted.
CN201810743379.3A 2018-07-09 2018-07-09 Software defect prediction method based on convolutional neural network Active CN108829607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810743379.3A CN108829607B (en) 2018-07-09 2018-07-09 Software defect prediction method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN108829607A CN108829607A (en) 2018-11-16
CN108829607B CN108829607B (en) 2021-08-10

Family

ID=64136032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810743379.3A Active CN108829607B (en) 2018-07-09 2018-07-09 Software defect prediction method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN108829607B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant