CN117215935A

CN117215935A - Software defect prediction method based on multidimensional code joint graph representation

Info

Publication number: CN117215935A
Application number: CN202311174665.XA
Authority: CN
Inventors: 王易天; 刘望舒; 刘学军
Original assignee: Nanjing Tech University
Current assignee: Nanjing Tech University
Priority date: 2023-09-12
Filing date: 2023-09-12
Publication date: 2023-12-12

Abstract

The invention relates to a software defect prediction method based on a multi-dimensional code joint graph representation. Source code carrying defect labels is first submitted to a repository and decomposed into code content information and code annotation information. Abstract syntax trees are then extracted from the code content and traversed to obtain dependency relationship characterization data and call relationship characterization data. The word sequences extracted from the code annotations and the node sequences of the abstract syntax trees are input together into a Word2Vec model to obtain the corresponding word vectors, which are spliced to form the final representation of the graph nodes. Edges are constructed from the obtained dependency relationships and call relationships to yield a complete graph structure representation, which is finally input into a graph convolutional network (GCN) prediction model to complete the defect prediction task. By drawing on the strengths of graph neural networks, the method represents code information more fully through a joint graph of annotations and code, and thereby improves the accuracy of software defect prediction.

Description

Software defect prediction method based on multidimensional code joint graph representation

Technical Field

The invention relates to a software defect prediction method based on multi-dimensional code joint graph representation, and belongs to the technical field of software analysis and software defect prediction in software engineering.

Background

As software grows in size and complexity, software quality has become a focus of attention. Software defects directly threaten software quality, and how to uncover defective modules at an early stage of development has become a problem to be solved. Software defect prediction mines the software history repository to design metrics intrinsically related to defects, and then uses machine learning and similar methods to discover and locate defective modules in advance, so that limited resources can be allocated reasonably. Software defect prediction has long been a topic of interest to researchers and industry practitioners, because knowing the predicted likelihood of defects helps to improve software quality and reduce cost. For example, developers can use defect predictions to prioritize maintenance tasks, plan activities that reduce technical debt, estimate the cost and resources required for project quality assurance, and improve the overall development process by understanding the system factors that lead to defects. Software defect prediction is therefore one of the important means of software quality assurance. Patents related to software defect prediction mainly include: a software defect prediction model based on a deep neural network and a probabilistic decision forest (publication number CN 109446090); a software defect prediction method based on a long short-term memory network and the LASSO algorithm (CN 113778862A); and the like.

In recent years, many machine-learning-based methods and various software metrics have been proposed and applied to software defect prediction. Software defect prediction aims to predict the likelihood of defects in a software artifact, typically in source code elements of different granularity (e.g., methods, classes, files, components). Most existing work extracts metric features of a program as the program representation in order to capture program behavior and semantics, and then builds a predictive classification model on those metric features. However, existing defect prediction methods based on conventional deep learning overlook the rich semantic and structural information in source code, and conventional feature extraction may not capture well the semantic features that reflect defect patterns. To address the structural semantics of source code, the graph-based semi-supervised learning model proposed by Kipf et al. in 2017 has been widely used for code representation in software defect prediction. The structural semantics and properties of source code differ from those of natural language, so traditional defect prediction features may fail to capture defect patterns; moreover, some defects in a module are caused not by errors in code structure but by errors in code function. In 2018, Xuan Huo et al. embedded code annotations into the semantic features used for software defect prediction, which addressed the representation of code function to some extent.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a software defect prediction method based on a multi-dimensional code joint graph representation which, by means of abstract syntax trees and program dependency relationships, addresses the lack of association between program syntax and semantics and code annotations in software defect prediction, and can effectively improve the prediction efficiency of software defect prediction.

To solve the above technical problem, the invention adopts the following technical scheme: a software defect prediction method based on a multi-dimensional code joint graph representation is designed, comprising a method for constructing a code defect prediction model from the content and annotations of each piece of software source code in a corresponding code library, and the use of that model for code defect prediction. The construction and training of the software defect prediction model comprise the following steps:

Step A, submitting source code information containing class labels to a repository to serve as code samples, and decomposing each code sample into two parts, namely the source code without comments and the code comments, wherein the source code without comments generally comprises several sub-parts, such as definitions of functions and classes, definitions of variables and constants, control flow statements, error handling and exception handling, and then entering Step B;

Step B, for the two parts of the code sample: extracting the abstract syntax tree (AST) structure from the source code without comments to obtain the abstract syntax tree nodes, traversing the abstract syntax tree to obtain dependency relationship characterization data and call relationship characterization data among the nodes, and thereby obtaining feature sequences of the syntax tree nodes, the dependency relationships and the call relationships; extracting word sequences from the code comment text and encoding them to obtain the code comment sequence; and then entering Step C;

Step C, converting the syntax tree node sequence and the code annotation sequence obtained in Step B into corresponding vectors, namely syntax tree node vectors and code annotation vectors, by means of a Word2Vec model, and then entering Step D;

Step D, splicing the syntax tree node vectors and the code annotation vectors obtained from the code information into new node vectors that serve as the final node vector representation of the graph structure, constructing the complete graph structure representation by using the dependency relationships and call relationships analyzed in Steps B1.3 and B1.4 as edges, and then entering Step E;

Step E, inputting the graph structure representation obtained in Step D into a graph convolutional network (GCN), constructing a graph set representation by a pooling operation after several convolution layers, integrating all the information learned in the network through two fully connected layers, outputting the defect tendency with a Softmax classifier, and then entering Step F;

Step F, constructing the graph structure prediction model based on the AST word vectors corresponding to the code content and the word vectors corresponding to the code annotations, executing the training task on the constructed GCN model, and optimizing the GCN model parameters to obtain the trained GCN prediction model, namely the software defect prediction model.

When the software defect prediction model is constructed, the following Steps I to III are further executed:

Step I, collecting source code and decomposing it into modules, performing defect labeling on each module to mark the defective modules, and then entering Step II;

Step II, dividing the labeled data set into a training set and a test set, with 70% and 30% of the source code data set used as the training set and the test set respectively, and then entering Step III;

Step III, inputting the divided data set into the constructed defect prediction model so that the model learns the relationship between features and labels, and optimizing and adjusting the model to obtain a better prediction effect.
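The 70%/30% division in Step II can be sketched as follows. This is a minimal illustration only: the function name `split_dataset`, the fixed random seed and the toy module list are assumptions made for the example, and the text does not state whether the data are shuffled before splitting.

```python
import random


def split_dataset(samples, train_fraction=0.7, seed=0):
    """Shuffle labeled samples and split them into a training and a test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # assumption: shuffle before splitting
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical labeled modules: (module name, defect label), 1 = defective.
modules = [(f"module_{i}", i % 2) for i in range(10)]
train_set, test_set = split_dataset(modules)
```

With ten labeled modules, seven land in the training set and three in the test set, and every module appears in exactly one of the two.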

As a preferred technical scheme of the invention: in Step B, for the code content information, the abstract syntax tree is extracted and the semantic structure relationships are analyzed according to the following Step B1 to obtain the control dependency relationships and data dependency relationships; for the code annotation information, encoding is carried out according to the following Step B2.

As a preferred technical scheme of the invention: in Step B1, the abstract syntax tree structure is extracted from the source code content submitted to the code library, and the structural relationships are analyzed to obtain the control dependency relationships and call relationships, according to the following Steps B1.1 to B1.4:

Step B1.1, during lexical analysis, merging the code content of the source code into token identifiers according to preset rules, removing redundant information such as whitespace and comments from the code statements, and finally collecting all resulting token identifiers into a token list;

Step B1.2, during syntax analysis, converting the token list obtained in Step B1.1 into a tree structure while checking for syntax errors, and raising an error for any part that contains a syntax error;

Step B1.3, during dependency analysis, traversing the constructed syntax tree, computing the control and data dependencies in the code statements, and obtaining the characterization data of the dependency relationships;

Step B1.4, during call relationship analysis, traversing the constructed syntax tree, extracting the function call nodes and function definition nodes, establishing the relationships between the function call nodes and the definition nodes, and obtaining the characterization data of the call relationships.
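The call relationship analysis of Step B1.4 can be illustrated with a short sketch. It assumes the analyzed code is Python and uses the standard `ast` module as a stand-in for whatever parser the method actually employs; the sample source and the function name `call_edges` are hypothetical.

```python
import ast

# Hypothetical sample module used only for illustration.
SOURCE = '''
def load(path):
    return open(path).read()

def main():
    text = load("a.txt")
    print(text)
'''


def call_edges(source):
    """Return sorted (caller, callee) pairs for calls to functions defined in `source`."""
    tree = ast.parse(source)
    # Definition nodes: names of all functions defined in the module.
    defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    edges = set()
    for func in ast.walk(tree):
        if not isinstance(func, ast.FunctionDef):
            continue
        # Call nodes inside this function that target a defined function.
        for node in ast.walk(func):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                if node.func.id in defined:
                    edges.add((func.name, node.func.id))
    return sorted(edges)


edges = call_edges(SOURCE)
```

Calls to names that are not defined in the module (here `open` and `print`) are ignored, so only the edge from `main` to `load` is produced.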

As a preferred technical scheme of the invention: in Step B2, for the code annotations in the source code, word sequences are extracted with the Natural Language Toolkit (NLTK); the extraction process generally comprises three parts, namely word segmentation, stemming and word sequence construction, and is carried out according to the following Steps B2.1 to B2.2:

Step B2.1, performing word segmentation on the annotation text of the source code: special characters, punctuation marks, redundant spaces and other content that may interfere with annotation conversion are removed from the annotation text, which is then segmented into a sequence of words or sub-words, each of which is regarded as an element of the sequence;

Step B2.2, combining the segmented annotation words into a complete word sequence, in which separators such as spaces or special marks delimit the individual word elements.
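Steps B2.1 and B2.2 can be sketched with the standard library alone. The text names NLTK for this stage; the regular-expression cleaning below is a simplified stand-in, and the lowercasing, the ASCII-only character class, the omission of stemming and the function name `comment_to_word_sequence` are all assumptions of this example.

```python
import re


def comment_to_word_sequence(comment, sep=" "):
    """Clean a code comment and return (word list, joined word sequence)."""
    # B2.1: replace comment markers, punctuation and special characters with spaces,
    # then split on whitespace (which also collapses redundant spaces).
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", comment)
    words = [w.lower() for w in cleaned.split()]  # lowercasing is an assumption
    # B2.2: join the words into one sequence using a separator.
    return words, sep.join(words)


words, sequence = comment_to_word_sequence(
    "/* Return the cached profile, or null if missing. */"
)
```

A stemming pass, which the text lists as part of the extraction process, would sit between the two steps.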

As a preferred technical scheme of the invention: the step E comprises the following steps E1 to E2:

Step E1, obtaining the adjacency matrix $\tilde{A}$ from the complete graph structure representation of the source code, according to the following formula:

$$\tilde{A} = A + I_N$$

wherein $A$ represents the adjacency matrix of the nodes in the graph structure; it should be noted that $A$ alone ignores each node's own features, so the identity matrix $I_N$ needs to be added; $N$ is the number of nodes in the graph structure; $\tilde{D}$ represents the degree matrix of $\tilde{A}$, with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$; then entering Step E2;

Step E2, for the adjacency matrix obtained from the graph structure, the propagation model $H^{(l+1)}$ of the graph convolutional network (GCN) from layer $l$ to layer $l+1$ can be expressed as:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

wherein $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ represents the transfer function of the GCN network; $\sigma(\cdot)$ represents the activation function, typically used to enhance the expressive capability of the model; $W^{(l)}$ represents the weight matrix of layer $l$; $H^{(l)}$ represents the state information of layer $l$ in the GCN network, $l \in \{0, 1, 2, \ldots, L\}$, where $L$ is the total number of network layers; $H^{(L)} = Z$ ($Z$ is the output of the network model); when $l = 0$, $H^{(0)} = X$ ($X$ is the information representation of each node in the graph structure); then entering Step F.
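The propagation rule of Steps E1 and E2 can be sketched in plain Python as follows. This is a minimal illustration that assumes the standard GCN normalization of Kipf et al. (self-loops added, symmetric degree normalization), an undirected graph and ReLU as the activation function; the toy matrices `A`, `X` and `W0` and all function names are hypothetical, and a practical implementation would use a tensor library instead of nested lists.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Compute D~^{-1/2} (A + I_N) D~^{-1/2} for a symmetric adjacency matrix."""
    n = len(adj)
    # E1: add self-loops so that each node keeps its own features.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]  # diagonal of the degree matrix D~
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(a_hat, h, w):
    """E2: one propagation step H^(l+1) = ReLU(a_hat @ H^(l) @ W^(l))."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]


# Toy graph: three nodes connected as a path 0 - 1 - 2.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # H^(0): one feature row per node
W0 = [[0.5, -0.5], [0.25, 0.75]]           # hypothetical layer-0 weights
a_hat = normalized_adjacency(A)
H1 = gcn_layer(a_hat, X, W0)
```

Because every node receives a self-loop, no degree is zero and the normalization never divides by zero.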

As a preferred technical scheme of the invention: Step I is carried out according to the following Steps I1 to I3:

Step I1, collecting a data set containing software defects from sources such as corporate and open source software projects, bug databases and software defect reports;

Step I2, defining specific and clear labeling rules for the different types of defects in the data set, such as bugs, errors and security vulnerabilities;

Step I3, having annotators label the prepared data set according to the formulated labeling rules, marking defective modules as 1 and non-defective modules as 0.

As a preferred technical scheme of the invention: Step III is carried out according to the following Steps III1 to III2:

Step III1, inputting the training set with class labels into the prediction model, and training the model parameters by letting the network model learn the relationship between features and class labels, thereby obtaining a trained prediction model;

Step III2, inputting the test set with class labels into the trained prediction model, predicting whether each input module is defective, and optimizing and adjusting the model according to the test results to obtain the final prediction model.

Compared with the prior art, the technical scheme of the invention has the following technical effects:

By means of a software defect prediction method based on a multi-dimensional code joint graph representation, and drawing on abstract syntax trees and the graph convolutional network model, the invention addresses the lack of association between program syntax and semantics and code annotations in software defect prediction. The advantage of the graph convolutional network is that it can better capture the global information of code fragments, so that a GCN prediction model can be built to perform the defect prediction classification task. The abstract syntax tree (AST) is used to obtain the rich semantic and structural information in program statements, while code annotations enhance the interpretability of the code and let developers communicate code details more efficiently. The accuracy of software defect prediction is thereby improved, providing an important reference for software development teams when evaluating project quality and allocating test resources.

Drawings

FIG. 1 is a flow chart of a software defect prediction method based on a multi-dimensional code joint graph representation according to the present invention.

FIG. 2 is a schematic diagram of the multi-dimensional code joint graph representation designed in the present invention.

FIG. 3 is a flow chart of the processing of source code by the prediction model designed in the present invention.

Detailed Description

The following describes the embodiments of the present invention in further detail with reference to the drawings.

The invention discloses a software defect prediction method based on a multi-dimensional code joint graph representation, which realizes software defect prediction for each piece of source code in a code library in combination with the corresponding code annotation information. The flow of the software defect prediction model based on the multi-dimensional code joint graph representation is shown in FIG. 1 and FIG. 2, and the following Steps A to F are specifically executed.

Step A, submitting source code information containing class labels to a repository to serve as code samples, and decomposing each code sample into source code without annotations and code annotations, wherein the former generally comprises several sub-parts, such as definitions of functions and classes, definitions of variables and constants, control flow statements, error handling and exception handling, and then entering Step B.

It should be noted that the source code submitted to the repository is divided into $n$ modules, and $X = \{x_1, x_2, \ldots, x_n\}$ denotes the set of modules in the repository, where each module $x_i$, $i \in \{1, \ldots, n\}$, comprises its source code information and its code annotation. Further, the source code may be written in various languages, such as C++, Java, PHP and Python.
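The decomposition in Step A, which separates a code sample into code without comments and the comment text, can be sketched as follows. The example assumes the sample is Python source and relies on the standard `tokenize` module, so that a `#` inside a string literal is not mistaken for a comment; the function name `split_code_and_comments` and the sample are hypothetical, and other languages would need their own lexer.

```python
import io
import tokenize


def split_code_and_comments(source):
    """Return (code without comments, list of comment texts) for Python source."""
    lines = source.splitlines()
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start  # 1-based row, 0-based column
            comments.append(tok.string.lstrip("#").strip())
            # Cut the comment off its line and drop trailing whitespace.
            lines[row - 1] = lines[row - 1][:col].rstrip()
    return "\n".join(lines), comments


SAMPLE = "# compute total\ndef total(xs):\n    return sum(xs)  # built-in sum\n"
code, comments = split_code_and_comments(SAMPLE)
```

Here `code` keeps the function definition unchanged while `comments` holds the two comment strings, giving the two parts that the later steps process separately.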

Step B, for the two parts of the code sample: extracting the abstract syntax tree (AST) structure from the source code without comments to obtain the abstract syntax tree nodes, traversing the abstract syntax tree to obtain dependency relationship characterization data and call relationship characterization data, and finally obtaining feature sequences of the syntax tree nodes, the dependency relationships and the call relationships; extracting word sequences from the code comment text and encoding them to obtain the code comment sequence; and then entering Step C.

In practical application, for the code content information, the abstract syntax tree is extracted and the semantic structure relationships are analyzed according to the following Step B1 to obtain the control dependency relationships and data dependency relationships; for the code annotation information, encoding is carried out according to the following Step B2.

Step B1, extracting the abstract syntax tree structure from the source code content submitted to the code library, and analyzing the structural relationships to obtain the control dependency relationships and call relationships, according to Steps B1.1 to B1.4.

Step B1.1, during lexical analysis, merging the code content of the source code into token identifiers according to preset rules, removing redundant information such as whitespace and comments from the code statements, and finally collecting all resulting token identifiers into a token list.

Step B1.2, during syntax analysis, converting the token list obtained in Step B1.1 into a tree structure while checking for syntax errors, and raising an error for any part that contains a syntax error.

Step B1.3, during dependency analysis, traversing the constructed syntax tree, computing the control and data dependencies in the code statements, and obtaining the characterization data of the dependency relationships.
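Steps B1.2 and B1.3 can be illustrated with a simplified sketch, again assuming Python source and the standard `ast` module. The data dependency pass below is a line-based approximation that links each variable read to the most recent assignment above it and is only meaningful for straight-line code with single-line statements; the control dependency pass links a branching statement to the statements nested directly in it. All function names and the sample source are hypothetical.

```python
import ast


def parse_or_none(source):
    """B1.2: build the syntax tree; return None when the code has a syntax error."""
    try:
        return ast.parse(source)
    except SyntaxError:
        return None


def data_dependencies(tree):
    """B1.3 (data): (definition line, use line, name) for each variable read."""
    last_def, edges = {}, []
    # Order names by line; on the same line, reads are handled before writes.
    names = sorted(
        (n for n in ast.walk(tree) if isinstance(n, ast.Name)),
        key=lambda n: (n.lineno, isinstance(n.ctx, ast.Store), n.col_offset),
    )
    for n in names:
        if isinstance(n.ctx, ast.Load) and n.id in last_def:
            edges.append((last_def[n.id], n.lineno, n.id))
        elif isinstance(n.ctx, ast.Store):
            last_def[n.id] = n.lineno
    return edges


def control_dependencies(tree):
    """B1.3 (control): (condition line, statement line) for nested statements."""
    edges = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.If, ast.While, ast.For)):
            for stmt in node.body + node.orelse:
                edges.append((node.lineno, stmt.lineno))
    return sorted(edges)


SOURCE = "a = 1\nb = a + 2\nif b > 2:\n    c = a * b\n"
tree = parse_or_none(SOURCE)
data_edges = data_dependencies(tree)
control_edges = control_dependencies(tree)
```

For the sample, line 2 depends on the definition of `a` in line 1, lines 3 and 4 depend on the definition of `b` in line 2, and line 4 is control dependent on the `if` in line 3.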

In practical applications, the annotations in the source code are descriptive text that explains the function, implementation, algorithm, logic, design choices or other relevant information of the code; to obtain their sequence, the following Step B2 is performed.

Step B2, extracting word sequences from the code annotations in the source code with the Natural Language Toolkit (NLTK), wherein the extraction process generally comprises three parts, namely word segmentation, stemming and word sequence construction, and is carried out according to the following Steps B2.1 to B2.2.

Specifically, when the annotation text is segmented, it can be split at spaces, punctuation, paragraph boundaries and the like; words in code annotations are often closely related to words in the code content. For example, get_support_action is split into [get, support, action]; through word segmentation, the code annotation text is split into a set of words.

NLTK is a widely used Python library for natural language processing. It provides many functions for processing natural language and is applied to tasks such as text processing, language analysis, feature extraction and language modeling.
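The identifier splitting mentioned above can be sketched with the standard `re` module. This is an illustrative stand-in, not the NLTK-based procedure itself; the function name `split_identifier`, the lowercasing and the handling of camelCase are assumptions of the example.

```python
import re


def split_identifier(identifier):
    """Split a snake_case or camelCase identifier into lowercase words."""
    words = []
    # First split on underscores and other non-word characters.
    for part in re.split(r"[_\W]+", identifier):
        # Then split camelCase, keeping acronym runs such as "HTTP" together.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


snake = split_identifier("get_support_action")
camel = split_identifier("getSupportAction")
```

Both spellings yield the same word list, which lets words in identifiers be matched against words in the annotation text.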

Step D, splicing the syntax tree node vectors and the code annotation vectors obtained from the code information into new node vectors that serve as the final node vector representation of the graph structure, constructing the complete graph structure representation by using the dependency relationships and call relationships analyzed in Steps B1.3 and B1.4 as edges, and then entering Step E.
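The graph construction of Step D can be sketched as follows. The text does not specify how annotation vectors are aligned with individual nodes, so this example assumes a single annotation vector per module that is appended to every node vector, and it treats edges as undirected; the vectors are made-up stand-ins for Word2Vec output, and the function name `build_graph` and the node identifiers are hypothetical.

```python
def build_graph(node_ids, ast_vectors, comment_vector, edges):
    """Concatenate node and annotation vectors and build an adjacency matrix."""
    index = {node: i for i, node in enumerate(node_ids)}
    # Splice: final node vector = AST node vector followed by the annotation vector.
    features = [list(ast_vectors[node]) + list(comment_vector) for node in node_ids]
    n = len(node_ids)
    adjacency = [[0] * n for _ in range(n)]
    for src, dst in edges:
        i, j = index[src], index[dst]
        adjacency[i][j] = adjacency[j][i] = 1  # assumption: undirected edges
    return features, adjacency


node_ids = ["def:main", "call:load", "def:load"]
ast_vectors = {
    "def:main": [0.1, 0.2],
    "call:load": [0.3, 0.4],
    "def:load": [0.5, 0.6],
}
comment_vector = [0.9, 0.8]
edges = [
    ("def:main", "call:load"),  # dependency edge (Step B1.3)
    ("call:load", "def:load"),  # call edge (Step B1.4)
]
features, adjacency = build_graph(node_ids, ast_vectors, comment_vector, edges)
```

The resulting feature rows and adjacency matrix are the two inputs a graph convolutional network needs.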

Step E, inputting the graph structure representation obtained in Step D into a graph convolutional network (GCN), constructing a graph set representation by a pooling operation after several convolution layers, integrating all the information learned in the network through two fully connected layers, and finally outputting the defect tendency with a Softmax classifier.
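The classification head described in Step E (pooling, two fully connected layers, Softmax) can be sketched as follows. The text does not name the pooling operation, so mean pooling over nodes is an assumption here, as are the ReLU between the two layers, the toy weights and all function names.

```python
import math


def mean_pool(node_states):
    """Average node states column-wise into one graph-level vector."""
    n = len(node_states)
    return [sum(col) / n for col in zip(*node_states)]


def dense(x, weights, bias):
    """Fully connected layer: y_j = sum_i x_i * weights[i][j] + bias[j]."""
    return [sum(xi * w for xi, w in zip(x, col)) + b
            for col, b in zip(zip(*weights), bias)]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(node_states, w1, b1, w2, b2):
    """Pool node states, apply two fully connected layers, return class probabilities."""
    g = mean_pool(node_states)
    hidden = [max(0.0, v) for v in dense(g, w1, b1)]  # ReLU is an assumption
    return softmax(dense(hidden, w2, b2))


# Hypothetical node states after the convolution layers and toy weights.
node_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
probs = classify(node_states, w1, b1, w2, b2)
```

The two outputs sum to one and can be read as the probabilities of the non-defective and defective classes; which index corresponds to which class is a labeling convention, not something fixed by the sketch.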

It should be noted that the graph convolutional network used for graph classification learns end to end and can learn higher-level features and patterns; Step E comprises the following Steps E1 to E2.

It should be noted that the Softmax function applied at the end of the GCN network performs the classification task, which completes the construction of the GCN graph neural network model; the Softmax function is as follows:
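For reference, the standard definition of the Softmax function over $K$ classes is given below. The symbols are assumptions introduced here: $z_k$ denotes the score of class $k$ produced by the last fully connected layer, and $K = 2$ for the defective and non-defective classes.

```latex
\mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \ldots, K
```

Each output lies between 0 and 1 and the outputs sum to 1, so they can be read as class probabilities.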

In actual operation, as shown in FIG. 3, software defect prediction further performs Steps I to III.

Step I, collecting source code and decomposing it into modules, performing defect labeling on each module to mark the defective modules, and then entering Step II, wherein Step I is carried out according to Steps I1 to I3.

Step II, dividing the labeled data set into a training set and a test set, with 70% and 30% of the source code data set used as the training set and the test set respectively, and then entering Step III.

Step III, inputting the divided data set into the constructed defect prediction model so that the model learns the relationship between features and labels, and optimizing and adjusting the model to obtain a better prediction effect, wherein Step III is carried out according to Steps III1 and III2.

The present invention has been described in detail with reference to the drawings and the specific examples, but the present invention is not limited to the above-described embodiments, and various changes can be made on the basis of the present invention within the knowledge of those skilled in the art.

Claims

1. A software defect prediction method based on a multi-dimensional code joint graph representation, characterized in that a software defect prediction model is constructed from the code content information and code annotation information in a corresponding code library and is applied to a software defect prediction task to realize defect prediction, the process comprising the following steps:

Step A, submitting source code information containing class labels to a repository to serve as code samples, and decomposing each code sample into source code without annotations and code annotations, wherein the source code without annotations comprises the sub-parts: definitions of functions and classes, definitions of variables and constants, control flow statements, error handling and exception handling;

Step B, for the two parts of the code sample: extracting the abstract syntax tree (AST) structure from the source code without comments to obtain the abstract syntax tree nodes, traversing the abstract syntax tree to obtain dependency relationship characterization data and call relationship characterization data among the nodes, and finally obtaining feature sequences of the syntax tree nodes, the dependency relationships and the call relationships; extracting word sequences from the code comment text and encoding them to obtain the code comment sequence;

Step C, converting the syntax tree node sequence and the code annotation sequence obtained in Step B into corresponding vectors, namely syntax tree node vectors and code annotation vectors, by means of a Word2Vec model;

Step D, splicing the syntax tree node vectors and the code annotation vectors obtained from the code information into new node vectors that serve as the final node vector representation of the graph structure, and constructing the complete graph structure representation by using the dependency relationships and call relationships obtained in Step B as edges;

Step E, inputting the graph structure representation obtained in Step D into a graph convolutional network, constructing a graph set representation by a pooling operation after several convolution layers, integrating all the information learned in the network through two fully connected layers, and finally outputting the defect tendency with a Softmax classifier;

Step F, constructing the graph structure prediction model based on the AST word vectors corresponding to the code content and the word vectors corresponding to the code annotations, executing the training task on the constructed graph convolutional network (GCN) model, and optimizing the GCN model parameters to obtain the trained GCN prediction model, namely the software defect prediction model;

on the basis of the constructed software defect prediction model, the model is applied to the software defect detection task on code content and code annotations according to the following Steps I to III:

Step I, collecting source code and decomposing it into modules, and performing defect labeling on each module to mark the defective modules;

Step II, dividing the labeled data set into a training set and a test set, with 70% and 30% of the source code data set used as the training set and the test set respectively;

Step III, inputting the divided data set into the constructed defect prediction model so that the model learns the relationship between features and labels, and optimizing and adjusting the model to obtain a better prediction effect and the final prediction model.

2. The method for predicting the software defect based on the multi-dimensional code joint graph representation according to claim 1, wherein the method comprises the following steps of:

Step B1, extracting the abstract syntax tree structure from the source code content submitted to the code library, and analyzing the structural relationships to obtain the control dependency relationships and call relationships, according to Steps B1.1 to B1.4:

Step B1.4, during call relationship analysis, traversing the constructed syntax tree, extracting the function call nodes and function definition nodes, establishing the relationships between the function call nodes and the definition nodes, and obtaining the characterization data of the call relationships;

Step B2, extracting word sequences from the code annotations in the source code with a natural language toolkit, wherein the extraction process generally comprises three parts, namely word segmentation, stemming and word sequence construction, according to Steps B2.1 to B2.2:

3. The method for predicting software defects based on multi-dimensional code joint graph representation according to claim 1, wherein the step E comprises steps E1 to E2:

wherein $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ represents the transfer function of the GCN network; $\sigma(\cdot)$ represents the activation function, typically used to enhance the expressive capability of the model; $W^{(l)}$ represents the weight matrix; $H^{(l)}$ represents the state information of each layer in the GCN network, $l \in \{0, 1, 2, \ldots, L\}$, where $L$ is the total number of network layers; $H^{(L)} = Z$, $Z$ being the output of the network model; when $l = 0$, $H^{(0)} = X$, $X$ being the information representation of each node in the graph structure.

4. The method for predicting software defects based on multi-dimensional code joint graph representation according to claim 1, wherein Step I comprises Steps I1 to I3:

Step I1, collecting a data set containing software defects from open source software projects, bug databases and software defect reports;

5. The method for predicting software defects based on multi-dimensional code joint graph representation according to claim 1, wherein Step III comprises Steps III1 to III2: