CN115935372A - Vulnerability detection method based on graph embedding and bidirectional gated graph neural network - Google Patents

Vulnerability detection method based on graph embedding and bidirectional gated graph neural network Download PDF

Info

Publication number
CN115935372A
CN115935372A
Authority
CN
China
Prior art keywords
graph
vector
text
neural network
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211470625.5A
Other languages
Chinese (zh)
Inventor
俞东进
黄琛
王思轩
金宝清
程淑涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202211470625.5A priority Critical patent/CN115935372A/en
Publication of CN115935372A publication Critical patent/CN115935372A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a vulnerability detection method based on graph embedding and a bidirectional gated graph neural network. The method first obtains function-level source code with vulnerabilities and source code without vulnerabilities extracted from a code base, and converts all of the source code into program dependency graphs using a source code analysis tool. The program dependency graphs are then converted into a graph-embedded representation of the code using an improved node2vec method; this representation contains both the graph structure information and the text structure information of the source code, which improves, to a certain extent, the ability of the features to represent nonlinear information. Finally, the preprocessed code embeddings are used to train a bidirectional gated graph neural network model through deep learning. The training result is applied to a target program to detect and evaluate its code vulnerabilities.

Description

Vulnerability detection method based on graph embedding and bidirectional gated graph neural network
Technical Field
The invention relates to the field of preprocessing of source codes and vulnerability detection in software programs, in particular to a code vulnerability detection method based on graph embedding and a bidirectional gated graph neural network.
Background
Software vulnerabilities are responsible for many system attacks and data leakage events. Machine learning is a viable means of identifying common software vulnerabilities by building tools and models. Since different vulnerabilities may exhibit similar underlying patterns, machine learning can first learn the underlying patterns expressed by vulnerable programs from training samples and then apply these patterns to new software projects to identify potentially vulnerable code.
Recently, researchers have used deep learning to learn the program structure of source code and identify potential software vulnerabilities in it. Compared with classic machine learning techniques, deep learning has the advantage that structural features can be learned automatically from training samples, without requiring experts to manually engineer the program structure. Existing deep-learning-based program modeling approaches typically use a recurrent neural network (RNN), such as long short-term memory (LSTM) or its variants. However, LSTMs are designed for sequential data and are not suitable for modeling the control flow and data flow of a program structure. Therefore, previous LSTM-based methods can only capture shallow, superficial structural or grammatical information of the source code text and cannot adequately learn the deeper semantic features of the program structure.
In order to better perform feature learning on complex code structures, the invention provides a method that can operate directly on the program structure graph and learn semantic information from it. Doing so allows the model to retain a large amount of control-dependence and data-dependence information and thereby capture the underlying code structure of many software vulnerabilities. In view of these problems and their practical significance, the invention improves the data preprocessing capability, fully learns the graph structure information of the code, and trains a bidirectional gated graph neural network (BGGNN) with optimized parameter settings so as to achieve better detection performance.
Disclosure of Invention
In order to solve the problem that existing static vulnerability mining methods cannot effectively represent the nonlinear semantic information in the code graph structure, and to effectively improve the performance of the neural network model, the invention provides a vulnerability detection method based on graph embedding and a bidirectional gated graph neural network. The technical scheme adopted by the invention is as follows:
a vulnerability detection method based on graph embedding and a bidirectional gated graph neural network comprises the following steps:
s1, acquiring and labeling a data set, and specifically comprising the following substeps:
S11, acquiring a source code data set, and extracting from it function-level source code containing vulnerabilities and source code without vulnerabilities, comprising k functions in total;
S12, marking whether each function contains a vulnerability, and obtaining a label Y_i ∈ {0,1}, i ∈ [1,k], for each function file, where 0 indicates that no vulnerability exists and 1 indicates that a vulnerability exists.
S2, generating program dependency graphs, and obtaining the program dependency graph set G = {V, E} corresponding to all source codes in the whole project, wherein V represents the set of nodes and E represents the set of edges. The method specifically comprises the following substeps:
S21, after the source code is imported into a source code analysis tool, using a query statement as input according to a function name in the source code, generating the program dependency graph (PDG) corresponding to the function name, and outputting it as a dot-type graph description file;
S22, mapping the user-defined variable names and function names one-to-one onto symbolic names in the PDG graph description file by using a uniform variable name mapping scheme, obtaining the preprocessed PDG; an illustrative sketch of this mapping is given below.
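As an illustration of the uniform variable name mapping in S22, a minimal Python sketch is given below; the VAR{i}/FUN{i} symbol names and the whole-word replacement rule are assumptions made for illustration and are not fixed by this disclosure.

```python
import re

def normalize_identifiers(dot_text, user_vars, user_funcs):
    """Map user-defined variable and function names in a PDG dot description
    onto uniform symbolic names (S22). The VAR{i}/FUN{i} convention and the
    whole-word replacement rule are illustrative assumptions."""
    mapping = {}
    for i, name in enumerate(sorted(set(user_vars)), start=1):
        mapping[name] = f"VAR{i}"
    for i, name in enumerate(sorted(set(user_funcs)), start=1):
        mapping[name] = f"FUN{i}"
    for old, new in mapping.items():
        # Replace whole-word occurrences only, so substrings stay untouched.
        dot_text = re.sub(rf"\b{re.escape(old)}\b", new, dot_text)
    return dot_text, mapping
```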
S3, extracting the required edge information and node text information from all the program dependency graphs. The method specifically comprises the following substeps:
S31, extracting the directed edge relations E_ij = V_i → V_j between the nodes in the dot file by means of regular-expression matching, acquiring the set of all directed edges, and storing it as a text file;
S32, extracting the code text V_i = [Text_1, Text_2, ..., Text_n] corresponding to each node ID in the dot file by means of regular-expression matching, acquiring the set of all node texts, and storing it as a dictionary file; a sketch of this extraction is given below.
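A minimal sketch of the regular-expression extraction in S31 and S32 follows; the exact dot syntax emitted by the source code analysis tool is an assumption (edges of the form `"12" -> "15"` and node labels of the form `"12" [label = "..."]` are assumed), so the patterns may need to be adapted to the actual tool.

```python
import json
import re

EDGE_RE = re.compile(r'"(\d+)"\s*->\s*"(\d+)"')             # e.g.  "12" -> "15"
NODE_RE = re.compile(r'"(\d+)"\s*\[label\s*=\s*"([^"]*)"')  # e.g.  "12" [label = "..."]

def extract_pdg(dot_path, edge_out_path, node_out_path):
    """S31/S32: pull the directed edge set and the node-ID -> code-text map
    out of one dot file and save them as a text file and a dictionary file."""
    with open(dot_path, encoding="utf-8") as f:
        dot = f.read()
    # S31: directed edge relations E_ij = V_i -> V_j, one pair per line.
    with open(edge_out_path, "w", encoding="utf-8") as f:
        for src, dst in EDGE_RE.findall(dot):
            f.write(f"{src} {dst}\n")
    # S32: node ID -> code text, stored as a JSON dictionary file.
    nodes = dict(NODE_RE.findall(dot))
    with open(node_out_path, "w", encoding="utf-8") as f:
        json.dump(nodes, f, ensure_ascii=False, indent=2)
    return nodes
```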
S4, using node2vec to perform feature training to obtain a feature vector dictionary, and specifically comprising the following substeps:
S41, taking the preprocessed text file of directed edges stored in S31 as input, reasonably setting the sampling strategy parameters of the node2vec model, training the text features, and outputting, for each minimum text unit Text_i, the corresponding vector_t_i, i ∈ [1,n];
S42, storing all output text feature vectors in a dictionary Dict_t = ∪_{i∈[1,n]} {key: Text_i, value: vector_t_i};
S43, taking the preprocessed text file of directed edges stored in S31 as input, reasonably setting the sampling strategy parameters of the node2vec model, identifying each node by its unique node ID_i instead of the text attributes described above, training the node dependency features, and outputting the dependency feature vector vector_n_i between graph nodes, i ∈ [1,m];
S44, storing all output node dependency feature vectors in a dictionary Dict_n = ∪_{i∈[1,m]} {key: ID_i, value: vector_n_i}; a sketch of this step is given below.
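The following sketch shows how the node-dependency dictionary Dict_n of S43/S44 could be built with the open-source node2vec and networkx packages; the package choice, the 16-dimensional embedding size (inferred from the N × 16 and M × 16 matrices in S54) and the sampling parameters are assumptions. Dict_t of S41/S42 is obtained in the same way with nodes represented by their minimum text units instead of their IDs.

```python
import networkx as nx
from node2vec import Node2Vec  # pip install node2vec (assumed implementation)

def build_dict_n(edge_file_path, dim=16):
    """S43/S44: train node-dependency embeddings on the directed-edge file.
    The 16-dimensional size is inferred from the N x 16 / M x 16 matrices in
    S54; the sampling parameters below are placeholders, not fixed values."""
    graph = nx.read_edgelist(edge_file_path, create_using=nx.DiGraph(), nodetype=str)
    n2v = Node2Vec(graph, dimensions=dim, walk_length=10, num_walks=10,
                   p=0.1, q=0.8, workers=1, quiet=True)
    model = n2v.fit(window=5, min_count=1)
    # Dict_n = { node ID_i : vector_n_i }
    return {node: model.wv[node].tolist() for node in graph.nodes()}

# Dict_t (S41/S42) is built analogously, with each node represented by its
# minimum text units instead of its numeric ID, so that the learned vectors
# are keyed by Text_i rather than ID_i.
```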
S5, based on the text feature vector and the edge feature vector obtained by training in S4, converting all PDGs into matrix representation of feature vectors at function level, and specifically comprising the following substeps:
S51, merging the text description of each node into one line, splitting the character string into several text units, and converting the text attributes of the node into the corresponding embedded vectors nodeTextvec_i = [vector_t_i1, vector_t_i2, ..., vector_t_in] based on the text vector dictionary Dict_t obtained in S42, so as to obtain a text vector for each node;
S52, for a directed edge, which has a pair of head node and tail node, querying the node ID dictionary obtained in S44 with the two node IDs ID_s, ID_e as keys to obtain the head node vector vector_n_s and the tail node vector vector_n_e;
S53, subtracting the head node vector from the tail node vector to obtain the embedded vector v_s→e = vector_n_e - vector_n_s corresponding to the directed edge. The above processing is carried out for each directed edge in the list of each program dependency graph to obtain the edge vectors of all PDGs.
S54, encapsulating the node text vectors and the edge vectors into a JSON file corresponding to the program dependency graph, as the input of the subsequent neural network model; an illustrative sketch of this conversion is given below. The JSON file can be viewed as the combination of an N × 16 two-dimensional vector matrix and an M × 16 two-dimensional vector matrix, where N represents the number of nodes in the program dependency graph and M represents the number of edges.
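A sketch of the conversion in S51–S54 follows. Averaging the token vectors of each node into a single 16-dimensional vector (so that the node matrix is N × 16) and the whitespace token split are assumptions made for illustration, while the edge vector is the tail-node vector minus the head-node vector exactly as in S53.

```python
import json

def pdg_to_feature_json(node_texts, edges, dict_t, dict_n, out_path, dim=16):
    """S51-S54: package one PDG as the JSON input of the neural network.
    node_texts: {node_id: code text}, edges: list of (head_id, tail_id),
    dict_t / dict_n: the dictionaries from S42 / S44."""
    zero = [0.0] * dim
    # S51: look up each text unit of a node in Dict_t; here the token vectors
    # are averaged into a single 16-d vector (aggregation choice assumed).
    node_matrix = []
    for node_id, text in node_texts.items():
        vecs = [dict_t.get(tok, zero) for tok in text.split()] or [zero]
        node_matrix.append([sum(col) / len(vecs) for col in zip(*vecs)])
    # S52/S53: v_{s->e} = vector_n_e - vector_n_s for every directed edge.
    edge_matrix = []
    for head_id, tail_id in edges:
        vn_s, vn_e = dict_n.get(head_id, zero), dict_n.get(tail_id, zero)
        edge_matrix.append([e - s for s, e in zip(vn_s, vn_e)])
    # S54: an N x 16 node matrix plus an M x 16 edge matrix per function.
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump({"nodes": node_matrix, "edges": edge_matrix}, f)
```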
S6, taking the JSON files output by S5 as input, and training the bidirectional gated graph neural network model, specifically comprising the following substeps:
s61, segmenting a training set and a testing set: selecting d% of data samples in the JSON file data set generated in S53 as a training set, and the rest as a test set;
S62, learning the feature data contained in the data set by applying a bidirectional gated graph neural network (BGGNN). A BGGNN consists of two directional gated graph neural networks (GGNN): one is a forward GGNN_1 with L_1 layers that accepts the input in the forward direction; the other is a reverse GGNN_2 with L_2 layers that learns the input in reverse. The model output y_t is obtained by combining the forward output of GGNN_1 with the reverse output of GGNN_2;
S63, carrying out l iterations of training based on this network, and saving the neural network Model after training is finished so that it can be loaded quickly at a later stage; an illustrative sketch of the network follows.
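A minimal PyTorch Geometric sketch of the bidirectional gated graph neural network of S62 is shown below. GatedGraphConv is used as the GGNN layer, the two directions are combined by concatenation, the graph-level readout is mean pooling, and the M × 16 edge-feature matrix is omitted; these combination and readout choices are assumptions, since the disclosure only fixes the forward/reverse GGNN structure with L_1 = L_2 = 3.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GatedGraphConv, global_mean_pool

class BGGNN(nn.Module):
    """Bidirectional gated graph neural network: a forward GGNN_1 with L_1
    layers and a reverse GGNN_2 with L_2 layers whose outputs are combined
    for a binary vulnerable / non-vulnerable prediction."""

    def __init__(self, in_dim=16, hidden=64, l1=3, l2=3):
        super().__init__()
        self.lin_in = nn.Linear(in_dim, hidden)
        self.ggnn_fwd = GatedGraphConv(out_channels=hidden, num_layers=l1)
        self.ggnn_bwd = GatedGraphConv(out_channels=hidden, num_layers=l2)
        self.classifier = nn.Linear(2 * hidden, 2)

    def forward(self, x, edge_index, batch):
        h = self.lin_in(x)
        h_fwd = self.ggnn_fwd(h, edge_index)          # forward direction
        h_bwd = self.ggnn_bwd(h, edge_index.flip(0))  # reversed edge direction
        h_cat = torch.cat([h_fwd, h_bwd], dim=-1)     # combine both directions
        return self.classifier(global_mean_pool(h_cat, batch))
```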
S7, carrying out code vulnerability detection on the target program, specifically comprising the following substeps:
S71, firstly preprocessing the target program source code as in steps S2 and S3 to obtain the preprocessed PDGs;
S72, performing the PDG-to-feature-vector-matrix conversion of S5 on the basis of the pre-trained dictionaries Dict_t and Dict_n from S4, and storing the result as the JSON files of the target program;
S73, reusing the neural network Model generated in S6 and taking the function-level feature vector matrices generated from the target program as input to perform function-level code vulnerability detection;
S74, outputting a list of the function names in the target program that contain potential code vulnerabilities, so that relevant personnel can inspect and improve the program; a sketch of this detection step is given below.
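The detection of S73/S74 can be sketched as follows, reusing the saved Model; the dictionary of per-function graph tensors, the softmax over two classes and the 0.5 decision threshold are assumptions made for illustration.

```python
import torch

def detect_vulnerable_functions(model, function_graphs, threshold=0.5):
    """S73/S74: apply the trained Model to the target program.
    function_graphs: {function_name: (x, edge_index, batch)} tensors built
    from the target program's JSON files as in S5."""
    model.eval()
    flagged = []
    with torch.no_grad():
        for name, (x, edge_index, batch) in function_graphs.items():
            prob = torch.softmax(model(x, edge_index, batch), dim=-1)[0, 1]
            if prob.item() >= threshold:
                flagged.append((name, prob.item()))
    # Return the function names most likely to contain a vulnerability first.
    return [name for name, _ in sorted(flagged, key=lambda item: -item[1])]
```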
Preferably, the forward GGNN_1 in step S62 has L_1 layers, with L_1 = 3.
Preferably, the reverse GGNN_2 in step S62 has L_2 layers, with L_2 = 3.
Preferably, the number of training iterations l in step S63 is 150.
The invention has the following beneficial effects:
the code vulnerability detection method based on graph embedding learning uses a real vulnerability data set to extract control flow and dependency relationship from source codes, so that the information expressed by the codes is more specific and comprehensive. After the code is subjected to node2vec training, the graph embedding information and the text embedding information of the code are learned. The two-way gated graph neural network BGGNN model architecture is used as a classifier, wherein a cycle structure can effectively learn neighborhood information of graph nodes, good performance is achieved, an objective function is optimized by using random gradient rise, and the feature resolution capability is effectively improved. By properly adopting some training skills, ideal network parameters, an optimization algorithm and the setting of the learning rate are selected, the network is more stable, the result is more reliable, and the accuracy of code vulnerability detection is improved.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the bidirectional gated graph neural network trained for vulnerability detection according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
On the contrary, the invention is intended to cover alternatives, modifications and equivalents which may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present invention, certain specific details are set forth in order to provide a better understanding of the present invention. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details.
The embodiment provides a code vulnerability detection method based on graph embedding and a bidirectional gated graph neural network, as shown in FIG. 1, comprising the following steps:
s1, acquiring and labeling a data set, and specifically comprising the following substeps:
S11, acquiring a source code data set, and extracting from it function-level source code containing vulnerabilities and source code without vulnerabilities, comprising k functions in total;
S12, marking whether each function contains a vulnerability, and obtaining a label Y_i ∈ {0,1}, i ∈ [1,k], for each function file, where 0 indicates that no vulnerability exists and 1 indicates that a vulnerability exists.
S2, generating program dependency graphs, and obtaining the program dependency graph set G = {V, E} corresponding to all source codes in the whole project, wherein V represents the set of nodes and E represents the set of edges. The method specifically comprises the following substeps:
S21, after the source code is imported into a source code analysis tool, using a query statement as input according to a function name in the source code, generating the program dependency graph (PDG) corresponding to the function name, and outputting it as a dot-type graph description file;
S22, mapping the user-defined variable names and function names one-to-one onto symbolic names in the PDG graph description file by using a uniform variable name mapping scheme, to obtain the preprocessed PDG.
S3, extracting the required edge information and node text information from all the program dependency graphs. The method specifically comprises the following substeps:
S31, extracting the directed edge relations E_ij = V_i → V_j between the nodes in the dot file by means of regular-expression matching, acquiring the set of all directed edges, and storing it as a text file;
S32, extracting the code text V_i = [Text_1, Text_2, ..., Text_n] corresponding to each node ID in the dot file by means of regular-expression matching, acquiring the set of all node texts, and storing it as a dictionary file.
S4, using node2vec to perform feature training to obtain a feature vector dictionary, and specifically comprising the following substeps:
S41, taking the preprocessed text file of directed edges stored in S31 as input, reasonably setting the sampling strategy parameters of the node2vec model, training the text features, and outputting, for each minimum text unit Text_i, the corresponding vector_t_i, i ∈ [1,n];
S42, storing all output text feature vectors in a dictionary Dict_t = ∪_{i∈[1,n]} {key: Text_i, value: vector_t_i};
S43, taking the preprocessed text file of directed edges stored in S31 as input, reasonably setting the sampling strategy parameters of the node2vec model, identifying each node by its unique node ID_i instead of the text attributes described above, training the node dependency features, and outputting the dependency feature vector vector_n_i between graph nodes, i ∈ [1,m];
S44, storing all output node dependency feature vectors in a dictionary Dict_n = ∪_{i∈[1,m]} {key: ID_i, value: vector_n_i}.
In this embodiment, the parameters of the node2vec model in S41 and S43 are set reasonably; the specific parameters include walk_length, num_walks, p, q and window_size, which respectively represent the walk length, the number of walks, the probability of returning to the previous node, whether the walk is biased toward depth-first or breadth-first search (q < 1 biases toward depth-first, q > 1 toward breadth-first), and the window size, where walk_length is 10, num_walks is 10, p is 0.1, q is 0.8 and window_size is 5.
S5, based on the text feature vector and the edge feature vector obtained by training in S4, converting all PDGs into matrix representation of feature vectors at function level, and specifically comprising the following substeps:
S51, merging the text description of each node into one line, splitting the character string into several text units, and converting the text attributes of the node into the corresponding embedded vectors nodeTextvec_i = [vector_t_i1, vector_t_i2, ..., vector_t_in] based on the text vector dictionary Dict_t obtained in S42, so as to obtain a text vector for each node;
S52, for a directed edge, which has a pair of head node and tail node, querying the node ID dictionary obtained in S44 with the two node IDs ID_s, ID_e as keys to obtain the head node vector vector_n_s and the tail node vector vector_n_e;
S53, subtracting the head node vector from the tail node vector to obtain the embedded vector v_s→e = vector_n_e - vector_n_s corresponding to the directed edge. The above processing is carried out for each directed edge in the list of each program dependency graph to obtain the edge vectors of all PDGs.
S54, encapsulating the node text vectors and the edge vectors into a JSON file corresponding to the program dependency graph, as the input of the subsequent neural network model. The JSON file can be viewed as the combination of an N × 16 two-dimensional vector matrix and an M × 16 two-dimensional vector matrix, where N represents the number of nodes in the program dependency graph and M represents the number of edges.
S6, taking the JSON files output by S5 as input, and training the bidirectional gated graph neural network model, specifically comprising the following substeps:
s61, segmenting a training set and a testing set: selecting d% of data samples in the JSON file data set generated in S53 as a training set, and the rest as a test set; wherein d is 70.
S62, learning the feature data contained in the data set by applying a bidirectional gated graph neural network (BGGNN). A BGGNN consists of two directional gated graph neural networks (GGNN): one is a forward GGNN_1 with L_1 layers that receives the input in the forward direction; the other is a reverse GGNN_2 with L_2 layers that learns the input in reverse. The model output y_t is obtained by combining the forward output of GGNN_1 with the reverse output of GGNN_2, wherein L_1 is 3 and L_2 is 3;
S63, carrying out l iterations of training based on this network, and saving the neural network Model after training is finished so that it can be loaded quickly at a later stage, wherein l is 150; a sketch of this training procedure is given below.
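A sketch of the training procedure of S61–S63 in this embodiment (70% training split, 150 iterations) follows; the Adam optimizer, the learning rate and the cross-entropy loss are assumptions, since the embodiment only fixes d = 70 and l = 150.

```python
import random

import torch
import torch.nn.functional as F

def train_bggnn(model, dataset, d=70, iterations=150, lr=1e-3):
    """S61-S63 of this embodiment: d% / (100 - d)% split and l training iterations.
    dataset: list of (x, edge_index, batch, label) tuples built from the JSON
    files of S5; label is a LongTensor of shape [1] (0 = no vulnerability)."""
    random.shuffle(dataset)
    split = len(dataset) * d // 100
    train_set, test_set = dataset[:split], dataset[split:]    # S61: d = 70
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # optimizer and lr assumed
    model.train()
    for _ in range(iterations):                               # S63: l = 150
        for x, edge_index, batch, label in train_set:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x, edge_index, batch), label)
            loss.backward()
            optimizer.step()
    torch.save(model.state_dict(), "Model.pt")                # save for fast reloading
    return test_set
```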
S7, carrying out code vulnerability detection on the target program, specifically comprising the following substeps:
S71, firstly preprocessing the target program source code as in steps S2 and S3 to obtain the preprocessed PDGs;
S72, performing the PDG-to-feature-vector-matrix conversion of S5 on the basis of the pre-trained dictionaries Dict_t and Dict_n from S4, and storing the result as the JSON files of the target program;
S73, reusing the neural network Model generated in S6 and taking the function-level feature vector matrices generated from the target program as input to perform function-level code vulnerability detection;
S74, outputting a list of the function names in the target program that contain potential code vulnerabilities, so that relevant personnel can inspect and improve the program.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made in these embodiments, including the components, without departing from the principles and spirit of the invention, and still fall within the scope of the invention.

Claims (10)

1. A vulnerability detection method based on graph embedding and a bidirectional gated graph neural network is characterized by comprising the following steps:
s1, acquiring a data set of a source code and marking whether a vulnerability exists or not;
s2, generating a program dependency graph, and obtaining a program dependency graph set G = { V, E } corresponding to all source codes in the whole project, wherein V represents a set of nodes, and E represents a set of edges;
s3, extracting a required directed edge set and a required node set for all the program dependency graphs, and respectively storing the directed edge set and the node set as a text file and a dictionary file;
s4, carrying out feature training by using the node2vec to obtain a feature vector dictionary;
s5, obtaining text characteristic vectors and edge characteristic vectors based on the training in the S4, and converting all PDGs into matrix representation of the characteristic vectors at the function level;
S6, taking the JSON files output in S5 as input, and training the bidirectional gated graph neural network model;
S7, carrying out code vulnerability detection on the target program by applying steps S1-S6 to the target program code to complete the vulnerability detection.
2. The vulnerability detection method based on graph embedding and a bidirectional gated graph neural network according to claim 1, characterized in that the labeling method for the source code in the data set in S1 is: extracting function-level source code containing vulnerabilities and source code without vulnerabilities from the source code data set, comprising k functions in total; and marking whether each function contains a vulnerability to obtain a label Y_i ∈ {0,1}, i ∈ [1,k], for each function file, where 0 indicates that no vulnerability exists and 1 indicates that a vulnerability exists.
3. The vulnerability detection method based on graph embedding and bidirectional gated graph neural network according to claim 1, wherein in S2, the method for generating the program dependency graph is as follows:
S21, after the source code is imported into a source code analysis tool, using a query statement as input according to a function name in the source code, generating the program dependency graph corresponding to the function name, and outputting it as a dot-type graph description file;
S22, mapping the user-defined variable names and function names one-to-one onto symbolic names in the program dependency graph description file by using a uniform variable name mapping scheme, to obtain the preprocessed program dependency graph.
4. The vulnerability detection method based on graph embedding and bidirectional gated graph neural network according to claim 1, characterized in that in S3, a regular matching manner is used to extract a required directed edge set and a required node set.
5. The vulnerability detection method based on graph embedding and bidirectional gated graph neural network according to claim 4, wherein the feature vector dictionary in S4 comprises a text vector dictionary and a node ID dictionary, the text vector dictionary takes a text file as input to obtain a text feature vector and uses one dictionary for storage, and the node ID dictionary takes a dictionary file as input to obtain a node dependent feature vector and uses one dictionary for storage.
6. The vulnerability detection method based on graph embedding and a bidirectional gated graph neural network according to claim 5, wherein the text vector dictionary is obtained as follows: taking the preprocessed text file of directed edges stored in S31 as input, setting the sampling strategy parameters of the node2vec model, training the text features, and outputting, for each minimum text unit Text_i, the corresponding vector_t_i, i ∈ [1,n];
storing all output text feature vectors in a dictionary Dict_t = ∪_{i∈[1,n]} {key: Text_i, value: vector_t_i}.
7. The vulnerability detection method based on graph embedding and bidirectional gated graph neural network according to claim 6, wherein the node ID dictionary obtaining method is as follows:
taking the preprocessed text file of directed edges stored in S31 as input, reasonably setting the sampling strategy parameters of the node2vec model, identifying each node by its unique node ID_i instead of the text attributes described above, training the node dependency features, and outputting the dependency feature vector vector_n_i between graph nodes, i ∈ [1,m];
storing all output node dependency feature vectors in a dictionary Dict_n = ∪_{i∈[1,m]} {key: ID_i, value: vector_n_i}.
8. The vulnerability detection method based on graph embedding and bidirectional gated graph neural network according to claim 7, wherein the S5 specifically comprises the following sub-steps:
S51, merging the text description of each node into one line, splitting the character string into several text units, and converting the text attributes of the node, based on the text vector dictionary Dict_t, into the corresponding embedded vectors nodeTextvec_i = [vector_t_i1, vector_t_i2, ..., vector_t_in], so as to obtain a text vector for each node;
S52, for a directed edge, which has a pair of head node and tail node, querying the node ID dictionary with the two node IDs ID_s, ID_e as keys to obtain the head node vector vector_n_s and the tail node vector vector_n_e;
S53, subtracting the head node vector from the tail node vector to obtain the embedded vector v_s→e = vector_n_e - vector_n_s corresponding to the directed edge, and carrying out the above processing for each directed edge in the list of each program dependency graph to obtain the edge vectors of all PDGs;
S54, encapsulating the node text vectors and the edge vectors together into a JSON file corresponding to the program dependency graph as the input of the subsequent neural network model, wherein the JSON file is regarded as the combination of an N × 16 two-dimensional vector matrix and an M × 16 two-dimensional vector matrix, where N represents the number of nodes in the program dependency graph and M represents the number of edges.
9. The vulnerability detection method based on graph embedding and bidirectional gated graph neural network according to claim 8, wherein the S6 specifically comprises the following sub-steps:
s61, segmenting a training set and a testing set: selecting d% of data samples in the JSON file data set generated in S53 as a training set, and the rest as a test set;
S62, learning the feature data contained in the data set by using a bidirectional gated graph neural network, wherein the BGGNN consists of two directional gated graph neural networks: one is a forward GGNN_1 with L_1 layers that accepts the input in the forward direction; the other is a reverse GGNN_2 with L_2 layers that learns the input in reverse, and the model output y_t is obtained by combining the forward output of GGNN_1 with the reverse output of GGNN_2;
S63, carrying out l iterations of training based on this network, and saving the neural network Model after training is finished so that it can be loaded quickly at a later stage.
10. The vulnerability detection method based on graph embedding and bidirectional gated graph neural network according to claim 8, wherein the S7 specifically comprises the following sub-steps:
S71, firstly preprocessing the target program source code as in steps S2 and S3 to obtain the preprocessed program dependency graphs;
S72, performing the PDG-to-feature-vector-matrix conversion of S5 on the basis of the pre-trained dictionaries Dict_t and Dict_n from S4, and storing the result as the JSON files of the target program;
S73, reusing the neural network Model generated in S6 and taking the function-level feature vector matrices generated from the target program as input to perform function-level code vulnerability detection;
S74, outputting a list of the function names in the target program that contain potential code vulnerabilities, for relevant personnel to inspect and improve the program.
CN202211470625.5A 2022-11-23 2022-11-23 Vulnerability detection method based on graph embedding and bidirectional gated graph neural network Pending CN115935372A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211470625.5A CN115935372A (en) 2022-11-23 2022-11-23 Vulnerability detection method based on graph embedding and bidirectional gated graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211470625.5A CN115935372A (en) 2022-11-23 2022-11-23 Vulnerability detection method based on graph embedding and bidirectional gated graph neural network

Publications (1)

Publication Number Publication Date
CN115935372A true CN115935372A (en) 2023-04-07

Family

ID=86699820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211470625.5A Pending CN115935372A (en) 2022-11-23 2022-11-23 Vulnerability detection method based on graph embedding and bidirectional gated graph neural network

Country Status (1)

Country Link
CN (1) CN115935372A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116661852A (en) * 2023-04-06 2023-08-29 华中师范大学 Code searching method based on program dependency graph
CN117290238A (en) * 2023-10-10 2023-12-26 湖北大学 Software defect prediction method and system based on heterogeneous relational graph neural network
CN117290238B (en) * 2023-10-10 2024-04-09 湖北大学 Software defect prediction method and system based on heterogeneous relational graph neural network

Similar Documents

Publication Publication Date Title
CN108446540B (en) Program code plagiarism type detection method and system based on source code multi-label graph neural network
CN108647520B (en) Intelligent fuzzy test method and system based on vulnerability learning
CN115935372A (en) Vulnerability detection method based on graph embedding and bidirectional gated graph neural network
CN111506714A (en) Knowledge graph embedding based question answering
CN106295338B (en) SQL vulnerability detection method based on artificial neuron network
CN115048316B (en) Semi-supervised software code defect detection method and device
CN112633002A (en) Sample labeling method, model training method, named entity recognition method and device
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
Janz et al. Learning a generative model for validity in complex discrete structures
CN115277587B (en) Network traffic identification method, device, equipment and medium
CN112131888A (en) Method, device and equipment for analyzing semantic emotion and storage medium
CN112905188A (en) Code translation method and system based on generation type countermeasure GAN network
CN117236677A (en) RPA process mining method and device based on event extraction
CN113904844B (en) Intelligent contract vulnerability detection method based on cross-mode teacher-student network
CN110162972B (en) UAF vulnerability detection method based on statement joint coding deep neural network
CN116484024A (en) Multi-level knowledge base construction method based on knowledge graph
CN116340952A (en) Intelligent contract vulnerability detection method based on operation code program dependency graph
CN115328782A (en) Semi-supervised software defect prediction method based on graph representation learning and knowledge distillation
CN117094325B (en) Named entity identification method in rice pest field
CN113378178A (en) Deep learning-based graph confidence learning software vulnerability detection method
CN112148879B (en) Computer readable storage medium for automatically labeling code with data structure
CN110554952B (en) Search-based hierarchical regression test data generation method
CN115840884A (en) Sample selection method, device, equipment and medium
CN114648029A (en) Electric power field named entity identification method based on BiLSTM-CRF model
Zhang et al. Research on Defect Location Method of C Language Code Based on Deep Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination