CN115935367A - Static source code vulnerability detection and positioning method based on graph neural network - Google Patents

Static source code vulnerability detection and positioning method based on graph neural network

Info

Publication number
CN115935367A
Authority
CN
China
Prior art keywords
node
graph
function
calling
vulnerability detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211357260.5A
Other languages
Chinese (zh)
Inventor
李玉军
张浩杰
刘艺玮
周楠馨
李宗讯
侯孟书
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202211357260.5A
Publication of CN115935367A
Legal status: Pending

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a static source code vulnerability detection and positioning method based on a graph neural network, belonging to the technical field of network information security. The method performs lexical and syntactic analysis on the source code of a target program in sequence to generate an abstract syntax tree; adds a control flow graph, a control dependency graph and a data dependency graph on the basis of the abstract syntax tree, combining them into a code attribute graph; searches the code attribute graph for call-type nodes and builds a function call graph from the information of the nodes found; adds the function call graph to the code attribute graph and uses it to associate the control flow graphs and the data dependency graphs, forming an associated code attribute graph; and constructs a vulnerability detection and positioning model based on a graph neural network, into which the associated code attribute graph is input to complete vulnerability detection and positioning. Compared with existing vulnerability detection methods, the method performs vulnerability detection and positioning at the same time and achieves high accuracy.

Description

Static source code vulnerability detection and positioning method based on graph neural network
Technical Field
The invention belongs to the technical field of network information security, and particularly relates to a static source code vulnerability detection and positioning method based on a graph neural network.
Background
At present, vulnerability detection techniques are divided into dynamic detection and static detection, according to whether the code needs to be executed.
Dynamic detection finds vulnerabilities by running the target program directly and can be carried out without source code. Apart from open source software, the source code of most software is difficult to obtain, and in that case dynamic detection is the better option. This approach requires a large amount of test data to be constructed manually and fed as input to the software under test. The state of the software during execution and its results are observed, the actual output is compared with the expected result, and it is judged whether the software under test contains a vulnerability. The method is highly accurate but inefficient, because data must be input continuously for judgment. Since it relies on differences in output, dynamic detection can only determine whether a vulnerability exists. Determining the position of the vulnerability and the specific cause that triggers it requires further data tracking and analysis, which consumes a great deal of time and manpower and requires analysts with specialized domain knowledge. The method is therefore unsuitable for large-scale software systems and can only be applied to small-scale systems.
Static detection does not execute the program but analyzes its source code directly. It obtains the syntactic and semantic information of the source code through techniques such as lexical analysis, syntax analysis and semantic analysis, and discovers potential vulnerabilities in the program by methods such as similarity analysis and vulnerability rule matching. Static detection generally requires the source code to be converted into an intermediate representation such as tokens, trees or graphs, and a suitable detection algorithm and model are selected according to the specific scenario.
According to the intermediate representation of the code, static detection methods can be divided into token-based, tree-based and graph-based source code vulnerability detection methods. Token-based methods perform lexical analysis on the source program to obtain a token sequence. The token sequence captures the lexical information of the program and works well for detecting vulnerabilities introduced by copied code, but it carries little syntactic information, so token-based methods have a narrow range of application and can detect only limited vulnerability types. Tree-based methods use a tree structure such as an abstract syntax tree as the intermediate representation of the code. Compared with a token sequence, the tree adds part of the syntactic information and improves detection to some extent, but it still cannot reflect information such as control flow and data flow, so tree-based methods are difficult to apply to large-scale software systems. Graph-based methods use nodes to represent the syntactic units in the source code and edges to represent the relations between them. The graph structure takes both the syntactic and semantic features of the program into account, performs well and applies widely, but constructing the graph is complex and requires more syntactic, semantic and word-order analysis of the source code.
According to the detection principle, static detection methods are divided into methods based on code similarity, methods based on rules and methods based on deep learning. The core idea of similarity-based detection is that code fragments similar to vulnerable code are highly likely to contain the same type of vulnerability. The method extracts code features from the code fragments, compares them with those of known vulnerable fragments, and judges whether the same vulnerability exists. Although it can detect the same vulnerability caused by similar code, it needs a large number of known vulnerable code fragments, which are troublesome to process. In addition, the same vulnerability can be caused by dissimilar code, so the method is quite limited and has a high false negative rate. The core idea of rule-based detection is pattern matching: vulnerability characteristics are summarized and generalized into corresponding rules, and a vulnerability is reported if the source code contains a pattern that matches a rule. The method has a low false alarm rate and high precision. However, a large amount of vulnerable code must first be analyzed manually to formulate the rules, which is time-consuming and laborious, and the detection result depends heavily on the quality and coverage of the rules. Deep-learning-based detection dispenses with manual feature engineering, uses a deep model to extract code features and generate vulnerability patterns automatically, and detects better than traditional machine learning methods.
Although these methods are effective to some degree, most of them detect at function granularity and cannot locate the vulnerability within the code; they suffer from coarse detection granularity and cannot perform cross-function detection.
Studying a new vulnerability detection method from the perspectives of detection granularity, vulnerability positioning and model interpretation is therefore of positive significance for improving source code vulnerability detection.
Disclosure of Invention
The purpose of the invention is to solve the problems of coarse detection granularity and the inability to locate vulnerabilities in current deep-learning-based source code vulnerability detection.
In order to achieve the purpose, the invention adopts the following technical scheme:
A static source code vulnerability detection and positioning method based on a graph neural network comprises the following steps:
S1, acquiring the source code of a target program;
S2, performing lexical analysis and syntax analysis on the source code of the target program in sequence, and generating an abstract syntax tree from the final analysis result;
S3, adding a control flow graph, a control dependency graph and a data dependency graph on the basis of the abstract syntax tree to form a code attribute graph;
S4, searching for call-type nodes in the code attribute graph obtained in S3, and establishing a function call graph from the information of the call-type nodes found;
S5, adding the function call graph to the code attribute graph, and using the function call graph to associate the control flow graphs and the data dependency graphs, forming an associated code attribute graph;
S6, constructing a vulnerability detection and positioning model based on a graph neural network, and inputting the associated code attribute graph obtained in S5 into the model to complete vulnerability detection and positioning.
Further, step S4 establishes the function call graph by the following steps:
S4.1, searching for a function call node a in the code attribute graph of function f0, and obtaining the function name f1, the parameter list and the parameter types of the call;
S4.2, traversing all nodes with an in-degree of 0 in the code attribute graph and matching them against the function name, parameter list and parameter types of the call at node a, to obtain the calling function node and the called function node corresponding to node a;
S4.3, adding a function call edge between the calling function node and the called function node obtained in S4.2 to associate them, thereby establishing the function call graph, wherein the function call edge points from the calling function node to the called function node.
Further, S4.2 operates as follows:
S4.2.1, denoting the current node as v, and comparing whether the function name f1 of node a is the same as the function name f2 of node v; if they are the same, going to S4.2.2; otherwise, going to S4.2.4;
S4.2.2, comparing whether the parameter lists of nodes a and v contain the same number of parameters; if so, going to S4.2.3; otherwise, going to S4.2.4;
S4.2.3, comparing the parameter types in the parameter lists one by one in order; if all parameter types are the same, going to S4.2.5; if one or more differ, going to S4.2.4;
S4.2.4, continuing the traversal to obtain the next function node and its node information, denoting it as v, and going to S4.2.1;
S4.2.5, outputting the calling function node and the called function node.
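The matching loop of S4.2.1 to S4.2.5 can be sketched in Python as follows. This is a minimal illustration and not the patented implementation: the CallSite and MethodNode classes, the function names match_callee and build_call_edges, and the representation of parameter types as strings are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CallSite:
    """Hypothetical call-type node a: the callee name f1 and the argument types."""
    caller: str
    callee_name: str
    arg_types: Tuple[str, ...]


@dataclass(frozen=True)
class MethodNode:
    """Hypothetical function root node v (in-degree 0) with its parameter types."""
    name: str
    param_types: Tuple[str, ...]


def match_callee(call: CallSite, methods: List[MethodNode]) -> Optional[MethodNode]:
    """Return the first function node whose signature matches the call, else None."""
    for v in methods:                                   # S4.2.4: next candidate v
        if v.name != call.callee_name:                  # S4.2.1: compare names
            continue
        if len(v.param_types) != len(call.arg_types):   # S4.2.2: compare counts
            continue
        if all(p == a for p, a in zip(v.param_types, call.arg_types)):  # S4.2.3
            return v                                    # S4.2.5: match found
    return None


def build_call_edges(calls: List[CallSite],
                     methods: List[MethodNode]) -> List[Tuple[str, str]]:
    """S4.3: one directed edge (calling function, called function) per matched call."""
    edges = []
    for call in calls:
        callee = match_callee(call, methods)
        if callee is not None:
            edges.append((call.caller, callee.name))
    return edges
```

Calls to functions that have no definition in the analyzed code (library calls, for example) simply produce no edge in this sketch.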
Further, in order to improve the accuracy of vulnerability identification and positioning, step S5 adjusts the associated code attribute graph using a control flow graph adjustment algorithm and a data dependency graph association algorithm.
S5.1, the control flow graph adjustment algorithm adjusts the associated code attribute graph as follows:
(a1) obtaining the calling function node a and the called function node b from a function call graph edge;
(a2) searching for the edges that are associated with the calling function node a and have a control flow graph relation, and obtaining, according to the direction of the control flow graph edges, the parent node set V1, the child node set V2 and the edge set E1 of node a;
(a3) searching for the edges that are associated with the called function node b and have a control flow graph relation, and obtaining the child node set V3 of node b according to the direction of the control flow graph edges;
(a4) traversing the control flow graph edges in the code attribute graph of the called function node b to obtain the leaf node c of the control flow graph, namely a node with an in-degree of 1 and an out-degree of 0;
(a5) connecting the parent node set V1 of the calling function node a to the child node set V3 of the called function node b in order;
(a6) connecting the leaf node c of the called function to the child node set V2 of the calling function node a in order;
(a7) deleting the edge set E1 of the calling function node a.
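Steps (a1) to (a7) amount to splicing the control flow of the called function into the caller at the call node. The sketch below illustrates this on a control flow graph stored as a set of (source, target) pairs. The edge representation, the helper reachable and the function name splice_callee_cfg are assumptions for illustration; recursion and calls with several call sites are not handled.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]


def reachable(start: str, edges: Set[Edge]) -> Set[str]:
    """Nodes reachable from start along CFG edges (used to delimit the callee)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def splice_callee_cfg(cfg: Set[Edge], a: str, b: str) -> Set[Edge]:
    """Rewire CFG edges around call node a so that control passes through callee b."""
    v1 = {src for src, dst in cfg if dst == a}           # (a2) parents of a
    v2 = {dst for src, dst in cfg if src == a}           # (a2) children of a
    e1 = {(src, dst) for src, dst in cfg if a in (src, dst)}  # (a2) edges at a
    v3 = {dst for src, dst in cfg if src == b}           # (a3) children of b
    callee_nodes = reachable(b, cfg)
    leaves = {                                           # (a4) in-degree 1, out-degree 0
        n for n in callee_nodes
        if sum(1 for _, dst in cfg if dst == n) == 1
        and not any(src == n for src, _ in cfg)
    }
    new_edges = {(p, q) for p in v1 for q in v3}         # (a5) V1 -> V3
    new_edges |= {(c, q) for c in leaves for q in v2}    # (a6) leaf c -> V2
    return (cfg - e1) | new_edges                        # (a7) drop E1
```

For a caller `x=1 -> call f -> y=2` and a callee `f -> s1 -> ret`, the result routes `x=1` to `s1` and `ret` to `y=2`, and the call node no longer carries control flow edges.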
The data dependency graph association algorithm adjusts the associated code attribute graph as follows:
(b1) obtaining the calling function node a and the called function node b from a function call graph edge;
(b2) searching for the child node set V2 of the calling function node a, namely the set of passed argument nodes; if the parent node of the calling function node is an operator node such as an assignment node, recording the left sibling node of the calling function node a as node c;
(b3) searching for the child node set V3 of the called function node b, which consists of the parameter nodes and the root nodes of the other statements of the called function; if the called function has a return value, obtaining the node d corresponding to the return value;
(b5) connecting the child node set V2 of the calling function node a to the child node set V3 of the called function node b in order, according to the corresponding parameter types and parameter order;
(b6) comparing, according to the type of the return value, whether the return value node d of the called function is consistent with the sibling node c of the calling function node; if so, adding a data dependency graph edge between node c and node d to associate them, the data dependency edge pointing from node c to node d.
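The data dependency association above can be illustrated with a short Python sketch. It assumes, purely for the example, that each argument, parameter, assignment target and return value is given as a (node id, type) pair; the function name link_data_dependencies is hypothetical. The edge direction for the return value follows the text (from node c to node d).

```python
from typing import List, Optional, Tuple

TypedNode = Tuple[str, str]   # (node id, type name), an assumed representation


def link_data_dependencies(args: List[TypedNode],
                           params: List[TypedNode],
                           assign_target: Optional[TypedNode] = None,
                           return_node: Optional[TypedNode] = None
                           ) -> List[Tuple[str, str]]:
    """Return the DDG edges that tie a call site to the called function."""
    edges = []
    # (b5) pair arguments with parameters by position, keeping type-compatible pairs
    for (arg_id, arg_type), (param_id, param_type) in zip(args, params):
        if arg_type == param_type:
            edges.append((arg_id, param_id))
    # (b6) link the assignment target c with the return value node d if types agree
    if assign_target is not None and return_node is not None:
        if assign_target[1] == return_node[1]:
            edges.append((assign_target[0], return_node[0]))
    return edges
```

A call such as `n = length(buf)` would yield one edge from the argument `buf` to the callee's parameter and one edge between `n` and the returned value, provided the types match.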
Further, the vulnerability detection and positioning model based on the graph neural network constructed in S6 comprises a vectorization model VecCPG, a vulnerability detection model DMGGAT based on a graph attention network, and a vulnerability positioning model LDMGGAT based on an attention mechanism;
the vectorization model VecCPG takes the associated code attribute graph as input, vectorizes the nodes and edges in it, and produces a feature matrix and adjacency matrices from the vectorization result, the adjacency matrices comprising an AST adjacency matrix, a CFG adjacency matrix and a DDG adjacency matrix;
the vulnerability detection model DMGGAT takes the feature matrix and the adjacency matrices as input and identifies vulnerabilities in the source code from them;
the vulnerability positioning model LDMGGAT takes the feature matrix and the adjacency matrices as input and locates vulnerabilities in the source code from them.
Furthermore, the vulnerability detection model based on the graph attention network has the following structure: three multi-head graph attention layers, a linear layer, a Softmax function layer and an output layer connected in sequence; the three graph attention layers receive the AST adjacency matrix, the CFG adjacency matrix and the DDG adjacency matrix in turn.
With the above technical scheme, the invention has the following advantages:
1. The invention provides an improved code attribute graph, namely the associated code attribute graph. On the basis of the code attribute graph, a function call graph (FCG) is added according to the call relations among functions, and the control flow graphs (CFG) and data dependency graphs (DDG) of the calling function and the called function are associated. The associated code attribute graph reflects function call relations, parameter passing and execution order, so cross-function vulnerability detection becomes possible and the detection granularity is refined.
2. The vulnerability detection model DMGGAT based on the graph attention network uses three multi-head graph attention layers, into which the AST, CFG and DDG adjacency matrices are input in turn to update the feature vectors. This yields a latent feature representation of the source code attribute graph, and the graph is classified according to the code features, thereby achieving vulnerability detection. To verify the effectiveness of the DMGGAT model, five vulnerability types, CWE121, CWE122, CWE369, CWE416 and CWE476, were selected from the Juliet data set for vulnerability detection experiments. The experimental results show that the DMGGAT model predicts well on all five CWE vulnerability types, with F1 scores of 95.99%, 90.36%, 95.86%, 96% and 96.62% respectively. The DMGGAT model makes full use of the different adjacency relations among nodes in the associated code attribute graph, so both precision and recall are high.
3. On the basis of vulnerability detection, the invention provides a vulnerability positioning model LDMGGAT based on an attention mechanism. The average attention value of each node is calculated from the intermediate output of the multi-head graph attention network, namely the attention value matrix corresponding to each associated code attribute graph. The top k nodes with the largest attention values are then obtained by the IQR algorithm, and the line number information corresponding to those nodes is looked up, thereby locating the vulnerability. The positioning accuracy on the five CWE vulnerability types exceeds 80%.
Drawings
FIG. 1 is a block diagram of the static source code vulnerability detection and positioning method based on a graph neural network according to the invention;
FIG. 2 shows the construction process of the code attribute graph according to the invention;
FIG. 3 shows the structure of the vulnerability detection model based on the graph attention network according to the invention;
FIG. 4 shows the structure of the vulnerability positioning model based on an attention mechanism according to the invention.
Detailed Description
As shown in FIG. 1, the static source code vulnerability detection and positioning method based on a graph neural network provided in this embodiment comprises the following steps:
S1, obtaining the source code of a target program.
S2, performing lexical analysis and syntax analysis on the source code of the target program in sequence, and generating an abstract syntax tree from the final analysis result. In this embodiment, lexical analysis scans the source code character by character from left to right. Syntax analysis uses either a top-down or a bottom-up parsing method.
S3, adding a control flow graph, a control dependency graph and a data dependency graph on the basis of the abstract syntax tree generated in S2, and combining them into a code attribute graph. The code attribute graph is shown in FIG. 2. When the code contains a function call, the actual execution of the program jumps from the call node into the called function, and after the called function finishes, the program returns to the original position and continues with the code after the call node. Therefore, to capture complete control flow and data flow information at run time, the control flow graphs (CFG) and data dependency graphs (DDG) in the code attribute graphs of the calling function and the called function need to be associated.
S4, searching for call-type nodes in the code attribute graph obtained in S3, and establishing a function call graph from the information of the call-type nodes found. The detailed process comprises the following steps:
S4.1, searching for a function call node a in the code attribute graph of function f0, and obtaining the function name f1, the parameter list and the parameter types of the call.
S4.2, traversing all nodes with an in-degree of 0 in the code attribute graph and matching them against the function name, parameter list and parameter types of the call at node a, to obtain the calling function node and the called function node corresponding to node a.
S4.2.1, denoting the current node as v, and comparing whether the function name f1 of node a is the same as the function name f2 of node v; if they are the same, going to S4.2.2; otherwise, going to S4.2.4;
S4.2.2, comparing whether the parameter lists of nodes a and v contain the same number of parameters; if so, going to S4.2.3; otherwise, going to S4.2.4;
S4.2.3, comparing the parameter types in the parameter lists one by one in order; if all parameter types are the same, going to S4.2.5; if one or more differ, going to S4.2.4;
S4.2.4, continuing the traversal to obtain the next function node and its node information, denoting it as v, and going to S4.2.1;
S4.2.5, outputting the calling function node and the called function node.
S4.3, adding a function call edge between the calling function node and the called function node obtained in S4.2 to associate them, thereby establishing the function call graph, wherein the function call edge points from the calling function node to the called function node.
S5, adding the function call graph to the code attribute graph, and using the function call graph to associate the control flow graphs and the data dependency graphs, forming an associated code attribute graph.
The control flow graph adjustment algorithm adjusts the associated code attribute graph as follows:
(a1) obtaining the calling function node a and the called function node b from a function call graph edge;
(a2) searching for the edges that are associated with the calling function node a and have a control flow graph relation, and obtaining, according to the direction of the control flow graph edges, the parent node set V1, the child node set V2 and the edge set E1 of node a;
(a3) searching for the edges that are associated with the called function node b and have a control flow graph relation, and obtaining the child node set V3 of node b according to the direction of the control flow graph edges;
(a4) obtaining, from the control flow graph edges in the code attribute graph of the called function node b, the leaf node c of the control flow graph, namely a node with an in-degree of 1 and an out-degree of 0;
(a5) connecting the parent node set V1 of the calling function node a to the child node set V3 of the called function node b in order;
(a6) connecting the leaf node c of the called function to the child node set V2 of the calling function node a in order;
(a7) deleting the edge set E1 of the calling function node a.
The data dependency graph association algorithm adjusts the associated code attribute graph as follows:
(b1) obtaining the calling function node a and the called function node b from a function call graph edge;
(b2) searching for the child node set V2 of the calling function node a, namely the set of passed argument nodes; if the parent node of the calling function node is an operator node such as an assignment node, recording the left sibling node of the calling function node a as node c;
(b3) searching for the child node set V3 of the called function node b, which consists of the parameter nodes and the root nodes of the other statements of the called function; if the called function has a return value, obtaining the node d corresponding to the return value;
(b5) connecting the child node set V2 of the calling function node a to the child node set V3 of the called function node b in order, according to the corresponding parameter types and parameter order;
(b6) comparing, according to the type of the return value, whether the return value node d of the called function is consistent with the sibling node c of the calling function node; if so, adding a data dependency graph edge between node c and node d to associate them, the data dependency edge pointing from node c to node d.
S6, constructing a vulnerability detection and positioning model based on a graph neural network, and inputting the associated code attribute graph obtained in S5 into the model to complete vulnerability detection and positioning.
The vulnerability detection and positioning model based on the graph neural network constructed in this embodiment comprises a VecCPG vectorization model, a vulnerability detection model DMGGAT based on a graph attention network, and a vulnerability positioning model LDMGGAT based on an attention mechanism. The vulnerability detection model DMGGAT takes the feature matrix and the adjacency matrices as input and identifies vulnerabilities in the source code from them; the vulnerability positioning model LDMGGAT takes the feature matrix and the adjacency matrices as input and locates vulnerabilities in the source code from them.
The VecCPG vectorization model defines vectorization rules along four aspects, namely label, function, constant and type, and describes node features according to these rules. In this embodiment, each node is represented as a 143-dimensional feature vector, and the node vectors of an associated code attribute graph together form the feature matrix of the corresponding function.
The vectorization rules used by the VecCPG vectorization model in this embodiment are as follows:
Label: the associated code attribute graph of this embodiment has 23 label types, some of which have little effect on vulnerability detection. This embodiment therefore selects 12 core types, such as BLOCK, CALL, CONTROL_STRUCTURE, LOCAL and METHOD. The label of each node is vectorized with one-hot encoding, using 12 bits in total.
Function: functions in the broad sense are divided into three kinds, namely operators built into the language, system functions and user-defined functions. This embodiment selects 59 operators and 39 system functions, which are also one-hot encoded. In addition, each of the three kinds has 1 flag bit, giving 101 bits in total.
Constant: the constants in the code attribute graph of this embodiment are divided into three types: character, string and numeric. Each type has 1 flag bit and 3 content bits, giving 12 bits in total.
Variable type: there are 10 basic types, including char, int, short, float, double, long, string, void and struct, and complex types, including signed, unsigned, star, array, map and vector. With 1 unknown bit and 1 reserved bit added, the total is 18 bits.
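The bit counts of the four rules add up to the stated 143 dimensions (12 + 101 + 12 + 18). The following sketch checks this arithmetic and shows a one-hot label encoding. The order of the segments, the order of the labels and the all-zero vector for a label outside the core set are assumptions; only the five label names quoted in the text are used, and the remaining seven core types are left unnamed.

```python
from typing import Dict, List, Tuple

LABEL_BITS = 12                 # one-hot over the 12 core label types
FUNCTION_BITS = 59 + 39 + 3     # operators + system functions + one flag per kind
CONSTANT_BITS = 3 * (1 + 3)     # three constant types, 1 flag + 3 content bits each
TYPE_BITS = 18                  # stated total for variable types

# Segment order is an assumption made for this illustration.
SEGMENTS = [("label", LABEL_BITS), ("function", FUNCTION_BITS),
            ("constant", CONSTANT_BITS), ("type", TYPE_BITS)]

# Only five of the twelve core labels are named in the text; their order is assumed.
KNOWN_LABELS = ["BLOCK", "CALL", "CONTROL_STRUCTURE", "LOCAL", "METHOD"]


def segment_offsets(segments: List[Tuple[str, int]]) -> Tuple[Dict[str, Tuple[int, int]], int]:
    """Return the [start, end) slice of every segment and the total vector width."""
    offsets, start = {}, 0
    for name, width in segments:
        offsets[name] = (start, start + width)
        start += width
    return offsets, start


def encode_label(label: str) -> List[int]:
    """One-hot encode a node label; labels outside the known set map to all zeros."""
    vec = [0] * LABEL_BITS
    if label in KNOWN_LABELS:
        vec[KNOWN_LABELS.index(label)] = 1
    return vec
```

Under this layout the label occupies positions 0 to 11, the function segment 12 to 112, the constant segment 113 to 124 and the type segment 125 to 142.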
After vectorizing the information of each node according to these rules, the VecCPG vectorization model vectorizes the relations between nodes according to the abstract syntax tree (AST), the control flow graph (CFG) and the data dependency graph (DDG), forming an AST adjacency matrix, a CFG adjacency matrix and a DDG adjacency matrix respectively. An adjacency matrix is constructed by setting the corresponding element to 1 whenever two nodes are joined by an edge of that kind in the associated code attribute graph.
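A minimal sketch of this construction is given below. It assumes that nodes are numbered from 0 and that each edge is a (source, target, kind) triple; whether the matrices are directed or symmetric is not stated in the text, so the directed form used here is an assumption, and the function name build_adjacency is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

EDGE_KINDS = ("AST", "CFG", "DDG")


def build_adjacency(num_nodes: int,
                    edges: Iterable[Tuple[int, int, str]]) -> Dict[str, List[List[int]]]:
    """Build one 0/1 adjacency matrix per edge kind from typed edges."""
    mats = {kind: [[0] * num_nodes for _ in range(num_nodes)] for kind in EDGE_KINDS}
    for src, dst, kind in edges:
        if kind in mats:              # edges of other kinds are ignored in this sketch
            mats[kind][src][dst] = 1  # directed entry: row = source, column = target
    return mats
```

For example, `build_adjacency(3, [(0, 1, "AST"), (1, 2, "CFG")])` gives an AST matrix with a single 1 at row 0, column 1 and a CFG matrix with a single 1 at row 1, column 2.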
In addition, the VecCPG vectorization model adds a vulnerability label to each function in the training samples. A function with no associated vulnerability is labeled '0' as a negative sample, indicating that it contains no vulnerability; a function associated with a real vulnerability is labeled '1' as a positive sample, indicating that it contains a vulnerability. This forms the final vulnerability detection and positioning data set.
As shown in FIG. 3, the vulnerability detection model based on the graph attention network comprises three multi-head graph attention layers, a linear layer, a Softmax function layer and an output layer connected in sequence; the three graph attention layers receive the AST adjacency matrix, the CFG adjacency matrix and the DDG adjacency matrix respectively. Each multi-head graph attention layer uses its adjacency matrix to update the feature vectors and outputs the updated feature vectors. The updated feature vectors pass through the linear layer and the Softmax layer to give the final classification prediction, which is then output; this classifies the associated code attribute graph and thereby identifies vulnerabilities in the source code.
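The data flow through this structure can be sketched in plain Python as follows. This is a simplified illustration under several assumptions, not the DMGGAT model itself: each layer uses a single attention head instead of multiple heads, a self-loop is added for every node, the layer nonlinearity is omitted, the graph readout is a mean over nodes, and the weights are random. The names gat_layer, dmggat_forward and random_params are hypothetical.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def gat_layer(h: Matrix, adj, w: Matrix, a_src: List[float], a_dst: List[float]):
    """Single-head graph attention: returns (updated features, attention matrix)."""
    n = len(h)
    wh = matmul(h, w)                                   # linear projection W h
    s_src = [sum(x * y for x, y in zip(row, a_src)) for row in wh]
    s_dst = [sum(x * y for x, y in zip(row, a_dst)) for row in wh]
    attn = [[0.0] * n for _ in range(n)]
    out = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j] or j == i]   # self-loop assumed
        alphas = softmax([leaky_relu(s_src[i] + s_dst[j]) for j in nbrs])
        for j, alpha in zip(nbrs, alphas):
            attn[i][j] = alpha                          # attention of node i on node j
        out.append([sum(attn[i][j] * wh[j][k] for j in nbrs)
                    for k in range(len(w[0]))])
    return out, attn


def dmggat_forward(x: Matrix, adjs, params):
    """Three attention layers fed with the AST, CFG and DDG adjacency matrices in turn."""
    h, attns = x, []
    for adj, (w, a_src, a_dst) in zip(adjs, params["layers"]):
        h, attn = gat_layer(h, adj, w, a_src, a_dst)
        attns.append(attn)
    pooled = [sum(col) / len(h) for col in zip(*h)]     # mean readout (assumption)
    logits = matmul([pooled], params["linear"])[0]      # linear layer, two classes
    return softmax(logits), attns


def random_params(in_dim: int, hidden: int, seed: int = 0):
    """Untrained random weights, for shape checking only."""
    rng = random.Random(seed)

    def mat(rows: int, cols: int) -> Matrix:
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    dims = [in_dim, hidden, hidden, hidden]
    layers = [(mat(dims[i], dims[i + 1]), mat(1, hidden)[0], mat(1, hidden)[0])
              for i in range(3)]
    return {"layers": layers, "linear": mat(hidden, 2)}
```

The returned attention matrices are the intermediate outputs that a positioning step could reuse; the two output probabilities correspond to the labels '0' (no vulnerability) and '1' (vulnerability).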
As shown in FIG. 4, the vulnerability positioning model LDMGGAT based on the attention mechanism has the same structure as the vulnerability detection model DMGGAT based on the graph attention network, and the positioning process is as follows:
S6.1, inputting the feature matrix and the adjacency matrices into LDMGGAT, the three multi-head graph attention layers receiving the AST, CFG and DDG adjacency matrices respectively, and computing the node attention value matrix output by each layer, namely the AST attention value matrix, the CFG attention value matrix and the DDG attention value matrix. Each attention value matrix is normalized. On the basis of the normalized matrices, for each node the attention values that its neighbor nodes assign to it are summed and averaged, giving the node's AST average attention weight, CFG average attention weight and DDG average attention weight. The three average attention weights are then added in a ratio of 1:1:1 to obtain the attention weight that the model assigns to the node.
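The per-node weight computation of S6.1 can be sketched as follows. The text does not say which normalization is applied, so min-max normalization is used here as an assumption; the convention that entry [i][j] holds the attention node i pays to node j, and the function names, are also assumptions for illustration.

```python
from typing import List

Matrix = List[List[float]]


def min_max_normalise(m: Matrix) -> Matrix:
    """Scale all entries to [0, 1]; min-max is an assumed choice of normalization."""
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in m]
    return [[(v - lo) / (hi - lo) for v in row] for row in m]


def average_incoming_attention(attn: Matrix, adj) -> List[float]:
    """For each node j, average the attention its neighbours i assign to it."""
    n = len(attn)
    weights = []
    for j in range(n):
        nbrs = [i for i in range(n) if adj[i][j]]
        weights.append(sum(attn[i][j] for i in nbrs) / len(nbrs) if nbrs else 0.0)
    return weights


def node_attention_weights(attns: List[Matrix], adjs) -> List[float]:
    """Add the AST, CFG and DDG average attention weights in a 1:1:1 ratio."""
    per_graph = [average_incoming_attention(min_max_normalise(a), g)
                 for a, g in zip(attns, adjs)]
    return [sum(ws) for ws in zip(*per_graph)]
```

The result is one scalar per node, which is the quantity ranked in the subsequent IQR step.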
S6.2, classifying the nodes with the IQR (interquartile range) algorithm to localize vulnerabilities in the source code; specifically:
S6.2.1, sorting the attention weights of all nodes, selecting the first-quartile value Q1 and the third-quartile value Q3, and calculating the difference between Q3 and Q1 to obtain the interquartile range IQR.
S6.2.2, setting a dynamic parameter k that indicates the tolerance for outliers. According to statistical principles, the normal range of node attention values in the associated code attribute graph is (Q1 - k × IQR, Q3 + k × IQR); the larger the value of k, the fewer of the nodes that contribute more to the classification of the associated code attribute graph are filtered out, and vice versa. Therefore, by adjusting the IQR outlier threshold, the normal attention value interval is obtained, and the nodes corresponding to this interval, together with their line number information in the associated code attribute graph, are looked up for vulnerability localization.
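A minimal sketch of S6.2.1 and S6.2.2 follows. The quartile convention (Python's "inclusive" method) and the default k = 1.5 are assumptions, since the text fixes neither; the helper returns both the nodes inside and the nodes outside the interval so that the caller decides which group to report. The names `iqr_interval` and `split_by_interval` are hypothetical.

```python
import statistics


def iqr_interval(weights, k=1.5):
    """Return the open interval (Q1 - k*IQR, Q3 + k*IQR) for the given weights."""
    q1, _, q3 = statistics.quantiles(weights, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_by_interval(weights, line_numbers, k=1.5):
    """Partition nodes by whether their attention weight lies in the interval.

    Returns (inside, outside); each entry is (node_index, line_number, weight).
    """
    lo, hi = iqr_interval(weights, k)
    inside, outside = [], []
    for node, (w, line) in enumerate(zip(weights, line_numbers)):
        (inside if lo < w < hi else outside).append((node, line, w))
    return inside, outside
```

Raising k widens the interval, so fewer nodes fall outside it; lowering k has the opposite effect, which matches the tolerance behavior described for the dynamic parameter.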
The above-described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments without any inventive step, are within the scope of protection of the invention.

Claims (6)

1. A static source code vulnerability detection and positioning method based on a graph neural network is characterized by comprising the following steps:
S1, acquiring the source code of a target program;
S2, performing lexical analysis and syntax analysis on the source code of the target program in sequence, and then generating an abstract syntax tree according to the final analysis result;
S3, adding a control flow graph, a control dependency graph and a data dependency graph on the basis of the abstract syntax tree to form a code attribute graph;
S4, searching for nodes of the call type in the code attribute graph obtained in S3, and establishing a function call graph according to the information of the call-type nodes found;
S5, adding the function call graph into the code attribute graph, and associating the control flow graph with the data dependency graph by using the function call graph to form an associated code attribute graph;
and S6, constructing a vulnerability detection and positioning model based on the graph neural network, and inputting the associated code attribute graph obtained in the step S5 into the model to complete vulnerability detection and positioning.
2. The static source code vulnerability detection and location method based on the graph neural network as claimed in claim 1, wherein the S4 adopts the following steps to establish the function call graph:
S4.1, searching for a function-call node a in the code attribute graph of the function f0 to obtain the function name f1, the parameter list and the parameter types of the call at that node;
S4.2, traversing all nodes with an in-degree of 0 in the code attribute graph, and matching them against the function name, the parameter list and the parameter types called by the node a to obtain the calling function node and the called function node corresponding to the node a;
and S4.3, adding a function call edge between the calling function node and the called function node obtained in S4.2 to associate them, thereby establishing the function call graph, wherein the function call edge points from the calling function node to the called function node.
3. The static source code vulnerability detection and location method based on graph neural network as claimed in claim 2, wherein the operation process of S4.2 is:
S4.2.1, recording the current node as v, and comparing whether the function name f1 of the node a is the same as the function name f2 of the node v; if so, going to S4.2.2; otherwise, going to S4.2.4;
S4.2.2, comparing whether the parameter lists of the nodes a and v contain the same number of parameters; if so, going to S4.2.3; otherwise, going to S4.2.4;
S4.2.3, comparing in order whether each parameter type in the parameter lists is the same; if all parameter types are the same, going to S4.2.5; if one or more parameter types differ, going to S4.2.4;
S4.2.4, continuing the traversal to obtain the next function node and its node information, recording it as v, and going to S4.2.1;
and S4.2.5, outputting the calling function node and the called function node.
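The matching loop of S4.2.1 to S4.2.5, together with the edge direction of S4.3, can be sketched as below. The `FunctionNode` record and both function names are hypothetical; the sketch assumes each node already carries its function name and an ordered tuple of parameter types.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FunctionNode:
    """Hypothetical record for a call site or a function definition node."""
    node_id: int
    name: str
    param_types: Tuple[str, ...]


def find_callee(call: FunctionNode,
                candidates: List[FunctionNode]) -> Optional[FunctionNode]:
    """Return the first candidate matching the call, or None if none matches."""
    for v in candidates:
        if v.name != call.name:                           # S4.2.1: function name
            continue
        if len(v.param_types) != len(call.param_types):   # S4.2.2: parameter count
            continue
        if all(a == b for a, b in zip(call.param_types, v.param_types)):
            return v                                      # S4.2.3 -> S4.2.5
    return None                                           # traversal exhausted


def build_call_edges(calls: List[FunctionNode],
                     definitions: List[FunctionNode]) -> List[Tuple[int, int]]:
    """S4.3: one directed edge per resolved call, from caller to callee."""
    edges = []
    for call in calls:
        callee = find_callee(call, definitions)
        if callee is not None:
            edges.append((call.node_id, callee.node_id))
    return edges
```

Checking the name first, then the parameter count, then the types keeps the cheap comparisons ahead of the per-parameter loop, and distinguishes overloads that share a name.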
4. The static source code vulnerability detection and positioning method based on the graph neural network as claimed in claim 1, wherein in S5, a control flow graph adjustment algorithm and a data dependency graph association algorithm are further adopted to adjust an associated code attribute graph:
S5.1, the adjustment process of the control flow graph adjustment algorithm on the associated code attribute graph is as follows:
(a1) Obtaining a calling function node a and a called function node b according to the function call graph edge;
(a2) Searching for the edges that are associated with the calling function node a and have a control flow graph relation, and obtaining, according to the direction of the control flow graph edges, the parent node set V1, the child node set V2 and the edge set E1 of the node a;
(a3) Searching for the edges that are associated with the called function node b and have a control flow graph relation, and obtaining the child node set V3 of the node b according to the direction of the control flow graph edges;
(a4) Traversing the control flow graph edges in the code attribute graph of the called function node b to obtain a leaf node c in the control flow graph, namely a node with an in-degree of 1 and an out-degree of 0;
(a5) Sequentially connecting a father node set V1 of the calling function node a with a child node set V3 of the called function node b;
(a6) Sequentially connecting the node c of the called function node b with the child node set V2 of the calling function node a;
(a7) Deleting the edge set E1 of the calling function node a;
the adjustment process of the data dependency graph association algorithm on the association code attribute graph is as follows:
(b1) Obtaining a calling function node a and a called function node b according to the function calling graph edge;
(b2) Searching for the child node set V1 of the calling function node a, namely the set of passed parameter nodes; if the parent node of the calling function node is an operator node such as an assignment node, recording the left sibling node of the calling function node a as a node c;
(b3) Searching a child node set V3 of the called function node b, wherein the child node set is a parameter node set and other statement root nodes of the called function; if the called function has a return value, acquiring a node d corresponding to the return value;
(b5) Sequentially connecting a child node set V2 of a calling function node a with a child node set V3 of a called function node b according to the corresponding parameter type and parameter sequence;
(b6) According to the return value type, checking whether the return value node d of the called function is consistent with the sibling node c of the calling function; if so, adding a data dependency graph edge between the node c and the node d to associate them, wherein the data dependency edge points from the node c to the node d.
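Steps (a1) to (a7) of the control flow graph adjustment can be sketched on a plain edge set, as below. The representation (integer node identifiers, a set of directed `(source, target)` pairs, and an explicit set of the callee's nodes) is an assumption made for illustration, as is the handling of several leaf nodes; `adjust_cfg` is a hypothetical name.

```python
def adjust_cfg(cfg_edges, a, b, callee_nodes):
    """Splice the callee's control flow into the caller's around call node a.

    cfg_edges: set of directed (source, target) control flow edges.
    a: calling function node, b: called function node.
    callee_nodes: nodes belonging to the called function's graph.
    """
    edges = set(cfg_edges)
    # (a2) parents V1, children V2 and incident edge set E1 of node a
    V1 = {u for (u, v) in edges if v == a}
    V2 = {v for (u, v) in edges if u == a}
    E1 = {(u, v) for (u, v) in edges if a in (u, v)}
    # (a3) children V3 of the called function node b
    V3 = {v for (u, v) in edges if u == b}
    # (a4) leaf nodes of the callee: in-degree 1 and out-degree 0
    callee_edges = {(u, v) for (u, v) in edges
                    if u in callee_nodes and v in callee_nodes}
    leaves = [
        n for n in callee_nodes
        if sum(1 for (_, v) in callee_edges if v == n) == 1
        and not any(u == n for (u, _) in callee_edges)
    ]
    # (a5) connect the caller's parents to the callee's first statements
    edges |= {(p, s) for p in V1 for s in V3}
    # (a6) connect the callee's leaves back to the caller's children
    edges |= {(c, s) for c in leaves for s in V2}
    # (a7) remove the original edges around the call node
    edges -= E1
    return edges
```

For a caller path 1 → 2 → 3 with call node 2 and a callee path 10 → 11 → 12, the result routes control from node 1 into the callee body at node 11 and from the callee's leaf 12 back to node 3.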
5. The static source code vulnerability detection and location method based on the graph neural network according to claim 1, wherein the vulnerability detection and location model based on the graph neural network constructed in S6 comprises: a VecCPG vectorization model, a vulnerability detection model based on a graph attention network, and a vulnerability localization model based on an attention mechanism;
the VecCPG vectorization model takes the associated code attribute graph as input, and is used for vectorizing the nodes and edges in the associated code attribute graph and obtaining a feature matrix and adjacency matrices from the vectorization result, wherein the adjacency matrices comprise an AST adjacency matrix, a CFG adjacency matrix and a DDG adjacency matrix;
the vulnerability detection model based on the graph attention network takes the feature matrix and the adjacency matrices as input, and is used for identifying vulnerabilities in the source code according to the input feature matrix and adjacency matrices;
the vulnerability localization model based on the attention mechanism takes the feature matrix and the adjacency matrices as input, and is used for localizing vulnerabilities in the source code according to the input feature matrix and adjacency matrices.
6. The static source code vulnerability detection and location method based on graph neural network as claimed in claim 5, wherein the vulnerability detection model based on the graph attention network comprises: three multi-head graph attention layers, a linear layer, a Softmax function layer and an output layer which are connected in sequence; the three graph attention layers take the AST adjacency matrix, the CFG adjacency matrix, and the DDG adjacency matrix as input in sequence.
CN202211357260.5A 2022-11-01 2022-11-01 Static source code vulnerability detection and positioning method based on graph neural network Pending CN115935367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211357260.5A CN115935367A (en) 2022-11-01 2022-11-01 Static source code vulnerability detection and positioning method based on graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211357260.5A CN115935367A (en) 2022-11-01 2022-11-01 Static source code vulnerability detection and positioning method based on graph neural network

Publications (1)

Publication Number Publication Date
CN115935367A true CN115935367A (en) 2023-04-07

Family

ID=86549652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211357260.5A Pending CN115935367A (en) 2022-11-01 2022-11-01 Static source code vulnerability detection and positioning method based on graph neural network

Country Status (1)

Country Link
CN (1) CN115935367A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117609005A (en) * 2023-10-19 2024-02-27 广东工业大学 Code similarity detection method

Similar Documents

Publication Publication Date Title
CN109426722B (en) SQL injection defect detection method, system, equipment and storage medium
CN113641586B (en) Software source code defect detection method, system, electronic equipment and storage medium
CN112541180B (en) Software security vulnerability detection method based on grammatical features and semantic features
Xiaomeng et al. CPGVA: Code property graph based vulnerability analysis by deep learning
Bui et al. Autofocus: interpreting attention-based neural networks by code perturbation
CN112364352B (en) Method and system for detecting and recommending interpretable software loopholes
Racz et al. Correlated stochastic block models: Exact graph matching with applications to recovering communities
CN110704846B (en) Intelligent human-in-loop security vulnerability discovery method
CN114528221B (en) Software defect prediction method based on heterogeneous graph neural network
CN114238100A (en) Java vulnerability detection and positioning method based on GGNN and layered attention network
CN114491529A (en) Android malicious application program identification method based on multi-modal neural network
CN115935367A (en) Static source code vulnerability detection and positioning method based on graph neural network
Gezici et al. Explainable AI for software defect prediction with gradient boosting classifier
CN114897085A (en) Clustering method based on closed subgraph link prediction and computer equipment
CN114385512A (en) Software source code defect detection method and device
Zhang et al. CPVD: Cross Project Vulnerability Detection Based On Graph Attention Network And Domain Adaptation
CN114064487A (en) Code defect detection method
Do Xuan et al. A novel approach for software vulnerability detection based on intelligent cognitive computing
CN116702157B (en) Intelligent contract vulnerability detection method based on neural network
CN117272312A (en) Interpretive intelligent contract vulnerability detection and positioning method based on reinforcement learning
CN115859307A (en) Similar vulnerability detection method based on tree attention and weighted graph matching
CN106844218A (en) A kind of evolution influence collection Forecasting Methodology based on section of developing
KR20210142443A (en) Method and system for providing continuous adaptive learning over time for real time attack detection in cyberspace
CN116361816B (en) Intelligent contract vulnerability detection method, system, storage medium and equipment
CN117435246B (en) Code clone detection method based on Markov chain model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination