CN116628695A - Vulnerability discovery method and device based on multitask learning - Google Patents


Info

Publication number
CN116628695A
CN116628695A (application CN202210125058.3A)
Authority
CN
China
Prior art keywords
vulnerability
graph
code
vector representation
node
Prior art date
Legal status
Pending
Application number
CN202210125058.3A
Other languages
Chinese (zh)
Inventor
吴敬征
武延军
段旭
杜梦男
罗天悦
杨牧天
Current Assignee
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202210125058.3A
Publication of CN116628695A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/57: Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F 21/577: Assessing vulnerabilities and evaluating computer system security
    • G06F 21/55: Detecting local intrusion or implementing counter-measures
    • G06F 21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F 21/562: Static detection
    • G06F 21/563: Static detection by source code analysis
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a vulnerability discovery method and device based on multi-task learning. The method comprises the following steps: constructing a code attribute graph based on the abstract syntax tree, control flow graph, and program dependency graph of the source code; separating the tree structure and the graph structure in the code attribute graph; encoding the nodes in the tree structure to obtain initial node vector representations of the abstract syntax tree; and inputting the initial node vector representations, the tree structure, and the graph structure into a vulnerability identification neural network model and a vulnerability localization neural network model, respectively, to obtain a vulnerability identification result and a vulnerability localization result. The invention addresses two shortcomings of existing learning-based source code vulnerability discovery methods: accuracy that needs improvement and coarse detection granularity. The method and device can localize vulnerabilities while accurately identifying them, improving the security vulnerability mining capability of vulnerability analysts.

Description

Vulnerability discovery method and device based on multitask learning
Technical Field
The invention belongs to the technical field of computers, and relates to a vulnerability discovery method and device based on multi-task learning.
Background
In recent years, the software security situation has grown increasingly severe, and ensuring the security of software is becoming ever more important. Vulnerabilities, one of the key factors affecting software security, deserve sufficient attention. According to data released by the National Vulnerability Database, fewer than 8,000 vulnerabilities were published annually before 2017, while more than 14,000 were published annually between 2017 and 2020, and the annual count continues to rise. At the same time, the scale and complexity of software keep increasing, with codebases reaching millions of lines and vulnerability patterns that are complex and diverse, which makes manually identifying vulnerabilities in software impractical. Under these circumstances, many studies have attempted to explore automated, efficient vulnerability discovery methods.
Source code vulnerability discovery is a common automated vulnerability discovery method based on static analysis, which discovers vulnerabilities directly from information in the source code. Existing source code vulnerability discovery methods can be mainly divided into code-similarity-based methods and pattern-based methods. The former mines vulnerabilities by comparison against known vulnerable code and can therefore only discover recurring vulnerabilities caused by code cloning. The latter characterizes vulnerabilities according to vulnerability patterns defined by experts or learned from data. Pattern-based methods can be further subdivided into rule-based methods and learning-based methods. Rule-based methods rely on expert-defined vulnerability rules and then perform rule matching in various ways to mine vulnerabilities. Learning-based methods learn vulnerability patterns from data using machine learning techniques and predict whether unknown code is vulnerable according to those patterns. Because learning-based methods effectively address the reliance on expert knowledge and the difficulty of detecting 0-day vulnerabilities, they have been widely studied. However, existing learning-based source code vulnerability mining methods have limitations in code semantic modeling and detection granularity, so they cannot simultaneously meet the requirements of high accuracy and high practicality. On the one hand, existing methods model the semantic information of code incompletely, making it difficult to cover the semantic features of different types of vulnerabilities, which reduces accuracy when facing diverse vulnerability types.
On the other hand, the detection granularity of existing methods is coarse, and no precise vulnerability location is provided, which makes subsequent manual verification difficult; the practicality therefore remains to be improved. These problems prevent existing learning-based source code vulnerability discovery methods from mining vulnerabilities efficiently.
Disclosure of Invention
The invention aims to provide a vulnerability mining method and device based on multi-task learning. The method models and learns multiple kinds of semantic information of source code by constructing an encoder, computes a semantic vector representation of the source code, and jointly trains the two tasks of vulnerability identification and vulnerability localization with shared encoder parameters, so that vulnerabilities can be localized while being accurately identified, improving the security vulnerability mining capability of vulnerability analysts.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a vulnerability discovery method based on multitasking learning comprises the following steps:
constructing a code attribute graph based on an abstract syntax tree, a control flow graph and a program dependency graph of source codes;
separating the tree structure from the graph structure in the code attribute graph;
encoding nodes in the tree structure to obtain an initial node vector representation of the abstract syntax tree;
inputting the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree into a vulnerability identification neural network model to obtain a vulnerability identification result; wherein, the vulnerability recognition neural network model comprises: a first code semantic encoder and a first classification module;
the first code semantic encoder is used for calculating a source code vector representation according to an initial node vector representation, a tree structure and a graph structure of the abstract syntax tree;
the first classification module is used for calculating a vulnerability identification result according to the source code vector representation;
inputting the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree into a vulnerability positioning neural network model to obtain a vulnerability positioning result; wherein, the vulnerability localization neural network model comprises: the second code semantic encoder, the attention layer and the second classification module;
the second code semantic encoder is used for calculating the final vector representation of the nodes in the graph structure according to the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree;
the attention layer is used for giving attention weight to the final vector representation of each node in the graph structure and carrying out attention calculation;
the second classification module is used for calculating a vulnerability positioning result according to the attention calculation result.
Further, a code attribute graph is constructed by:
1) Generating an attribute graph G_A = (V_A, E_A, λ_A, μ_A) of the abstract syntax tree, wherein the nodes v_A in the node set V_A are the nodes of the abstract syntax tree, the edges e_A in the edge set E_A are the edges of the abstract syntax tree, the function λ_A marks each edge e_A as an abstract syntax tree edge, and the function μ_A assigns each node v_A a code statement type attribute and a node code attribute;
2) Generating an attribute graph G_C = (V_C, E_C, λ_C) of the control flow graph, wherein the nodes v_C in the node set V_C are the nodes representing statements and predicates in the abstract syntax tree, the edges e_C in the edge set E_C correspond to the jump edges of the control flow graph, and the function λ_C assigns each edge e_C a jump condition label;
3) Generating an attribute graph G_P = (V_P, E_P, λ_P, μ_P) of the program dependency graph, wherein the nodes v_P in the node set V_P are nodes of the abstract syntax tree, the edges e_P in the edge set E_P correspond to the dependency edges of the program dependency graph, the function λ_P assigns each edge e_P a dependency label (control dependency or data dependency), and the function μ_P indicates, for each edge e_P marked as data-dependent, the symbol depended upon, or, for each edge e_P marked as control-dependent, the predicate state of the control dependency;
4) Combining the attribute graphs G_A, G_C, and G_P to obtain the code attribute graph G = (V, E, λ, μ), where V = V_A, E = E_A ∪ E_C ∪ E_P, λ = λ_A ∪ λ_C ∪ λ_P, and μ = μ_A ∪ μ_P.
Further, the separated code attribute graph is G' = (V_T, E_T, V_G, E_G, λ, μ), where V_T = V_T^1 ∪ ... ∪ V_T^n and E_T = E_T^1 ∪ ... ∪ E_T^n, with V_T^i and E_T^i denoting the node set and edge set of the abstract syntax tree of the i-th statement; V_G = {r_1, ..., r_n}, where r_i is the root node of the abstract syntax tree of the i-th statement; and |V_G| is equal to the number of statements in the source code.
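As a rough illustration of the construction and separation steps above, the following sketch (hypothetical, using plain Python dictionaries in place of a real CPG generator; node ids and attributes are invented) merges AST, CFG, and PDG edge sets over a shared node set and then splits off the statement-level graph:

```python
# Hypothetical sketch of code attribute graph (CPG) construction and
# separation. Edge labels: "AST" for syntax edges, "CFG" for control flow,
# "C"/"D" for control/data dependencies from the PDG.

def build_cpg(ast_edges, cfg_edges, pdg_edges, node_attrs):
    """Combine the three edge sets; the node set is the AST node set (V = V_A)."""
    edges = (
        [(u, v, {"label": "AST"}) for u, v in ast_edges]
        + [(u, v, {"label": "CFG", "cond": c}) for u, v, c in cfg_edges]
        + [(u, v, {"label": d}) for u, v, d in pdg_edges]  # d in {"C", "D"}
    )
    return {"V": set(node_attrs), "E": edges, "mu": node_attrs}

def separate(cpg, stmt_roots):
    """Split the CPG into per-statement AST edges (tree part) and a
    statement-level graph whose nodes are the AST root nodes."""
    tree_edges = [e for e in cpg["E"] if e[2]["label"] == "AST"]
    graph_edges = [e for e in cpg["E"] if e[2]["label"] != "AST"]
    return {"V_T": cpg["V"], "E_T": tree_edges,
            "V_G": set(stmt_roots), "E_G": graph_edges}

# Toy example: two statements whose AST roots are nodes 0 and 3.
node_attrs = {0: {"type": "ExpressionStatement"}, 1: {}, 2: {},
              3: {"type": "ReturnStatement"}, 4: {}}
cpg = build_cpg(ast_edges=[(0, 1), (0, 2), (3, 4)],
                cfg_edges=[(0, 3, "true")],
                pdg_edges=[(0, 3, "D")],
                node_attrs=node_attrs)
sep = separate(cpg, stmt_roots=[0, 3])
print(len(sep["V_G"]))  # 2: |V_G| equals the number of statements
```

Note that after separation each statement contributes exactly one node to the graph part, matching |V_G| = number of statements.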
Further, the encoding the nodes in the tree structure includes:
1) Vectorizing the code statement type attribute by using a PACE algorithm to obtain a code statement type representation;
2) Performing vectorization coding on the node code attribute by using a word2vec algorithm to obtain node code representation;
3) And splicing the code statement type representation and the node code representation to obtain an initial node vector representation of the abstract syntax tree.
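A minimal sketch of the three encoding steps (hypothetical vector sizes; a character-frequency stand-in replaces the PACE encoding and fixed random vectors replace trained word2vec embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for PACE: character-level one-hot accumulation over a toy alphabet.
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
def encode_type(type_str):
    vec = np.zeros(len(ALPHABET))
    for ch in type_str:
        if ch in ALPHABET:
            vec[ALPHABET.index(ch)] += 1.0
    return vec

# Stand-in for word2vec: a fixed random embedding per token.
EMB_DIM = 16
token_table = {}
def token_vec(tok):
    if tok not in token_table:
        token_table[tok] = rng.normal(size=EMB_DIM)
    return token_table[tok]

def encode_node(type_str, code_tokens):
    """Concatenate the type encoding with the mean of the token vectors."""
    code_vec = np.mean([token_vec(t) for t in code_tokens], axis=0)
    return np.concatenate([encode_type(type_str), code_vec])

x = encode_node("CallExpression",
                ["memcpy", "(", "buf", ",", "str", ",", "len", ")"])
print(x.shape)  # (68,): 52 type dimensions + 16 code dimensions
```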
Further, a source code vector representation is calculated by:
1) Updating the vector representations of all nodes in the tree structure according to the initial vector representation of the abstract syntax tree;
2) Aggregating vector representations of all nodes in the updated tree structure to obtain sentence vector representations;
3) Initializing the vector representation of the nodes in the graph structure according to the statement vector representation, and updating the vector representation of the nodes in the graph structure by using the convolution based on the graph to obtain the final vector representation of the nodes in the graph structure;
4) The final vector representations of the nodes in the graph structure are aggregated to obtain a vector representation of the source code.
Further, the first classification module comprises a fully connected layer and a softmax layer, wherein the fully connected layer performs a linear transformation and nonlinear mapping on the vector representation of the program source code, and the softmax layer performs binary classification on the result of the fully connected layer to obtain the vulnerability identification result.
Further, training the vulnerability identification neural network model and the vulnerability localization neural network model by:
1) Sharing parameters of a tree embedding module, a tree pooling module and a graph embedding module in the first code semantic encoder and the second code semantic encoder;
2) The cross entropy loss is used as a vulnerability identification task loss and a vulnerability positioning task loss respectively, and an average value of the vulnerability identification task loss and the vulnerability positioning task loss is defined as a joint loss;
3) And optimizing the joint loss by using an Adam optimizer, and training a vulnerability identification neural network model and a vulnerability positioning neural network model.
A vulnerability discovery apparatus based on multitasking learning, comprising:
the construction module is used for constructing a code attribute graph based on the abstract syntax tree, the control flow graph and the program dependency graph of the source code;
the separation module is used for separating the tree structure from the graph structure in the code attribute graph;
the coding module is used for coding the nodes in the tree structure to obtain an initial node vector representation of the abstract syntax tree;
the identifying module is used for inputting the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree into the vulnerability identifying neural network model to obtain a vulnerability identifying result; wherein, the vulnerability recognition neural network model comprises: a first code semantic encoder and a first classification module;
the first code semantic encoder is used for calculating a source code vector representation according to an initial node vector representation, a tree structure and a graph structure of the abstract syntax tree;
the first classification module is used for calculating a vulnerability identification result according to the source code vector representation;
the positioning module is used for inputting the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree into the vulnerability positioning neural network model to obtain a vulnerability positioning result; wherein, the vulnerability localization neural network model comprises: the second code semantic encoder, the attention layer and the second classification module;
the second code semantic encoder is used for calculating the final vector representation of the nodes in the graph structure according to the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree;
the attention layer is used for giving attention weight to the final vector representation of each node in the graph structure and carrying out attention calculation;
the second classification module is used for calculating a vulnerability positioning result according to the attention calculation result.
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above method when run.
An electronic device comprising a memory and a processor, wherein the memory stores a program for performing the above-described method.
Compared with the prior art, the invention provides a vulnerability mining method based on multi-task learning, aiming to solve the problems that the accuracy of existing learning-based source code vulnerability mining methods needs improvement and their detection granularity is coarse. The method models and learns multiple kinds of semantic information of the source code by constructing an encoder, computes a semantic vector representation of the source code, and jointly trains the two tasks of vulnerability identification and vulnerability localization with shared encoder parameters, so that vulnerabilities can be localized while being accurately identified, improving the security vulnerability mining capability of vulnerability analysts.
Drawings
FIG. 1 is a flow chart of a vulnerability discovery method based on multitasking learning.
FIG. 2 is a schematic diagram of a final generated code attribute map.
Fig. 3 is a block diagram of a vulnerability recognition neural network model.
FIG. 4 is a block diagram of a vulnerability localization neural network model.
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are merely some, not all, embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
The general flow of the vulnerability discovery method of the invention is shown in figure 1, and mainly comprises the following steps:
1) Code attribute graphs (CPGs) are generated for the source code and preprocessed to produce code attribute graphs with separated tree and graph structures, as shown in FIG. 2. The CPG is a combined data structure formed by fusing the Abstract Syntax Tree (AST), the Control Flow Graph (CFG), and the Program Dependency Graph (PDG). The AST reflects the syntactic structure within statements, the CFG reflects the control flow between statements, and the PDG reflects the data dependencies and control dependencies between statements. The CPG can effectively cover the semantic information required for vulnerability detection.
2) The AST node features are encoded to obtain initial vector representations of the AST nodes. An initial vector representation is a vector representing the semantic information of a node and comprises two fields: a type field and a code field. The type field is vectorized using PACE and the code field is vectorized using word2vec. Finally, the vectorized encodings of the two fields are concatenated to obtain the final initial vector representation.
3) A vulnerability identification neural network model is constructed, as shown in FIG. 3. The vulnerability identification neural network model consists of a code semantic encoder, a fully connected layer, and a softmax layer. The code semantic encoder consists of a tree embedding module, a tree pooling module, a graph embedding module, and a graph pooling module. The tree embedding module and the tree pooling module are responsible for learning vector representations of the ASTs in the CPG to obtain statement-level vector representations, which are used to initialize the vector representations of the CFG and PDG. The graph embedding module and the graph pooling module are responsible for learning vector representations of the CFG and PDG in the CPG to obtain the vector representation of the source code. The fully connected layer and the softmax layer then perform binary classification of whether the source code contains a vulnerability based on its vector representation.
4) A vulnerability localization neural network model is constructed, as shown in FIG. 4. The vulnerability localization neural network model consists of a code semantic encoder, an attention layer, a fully connected layer, and a softmax layer. Its code semantic encoder differs from the one in step 3) in that it has no graph pooling module. The attention layer is used to learn different attention weights for different nodes. The fully connected layer and the softmax layer then perform binary classification based on the vector representation of each graph node to determine whether the corresponding code line is a vulnerability location.
5) The AST initial vector representations are input into the vulnerability identification neural network model and the vulnerability localization neural network model, and the two models are jointly trained to realize multi-task learning.
6) The trained vulnerability identification neural network model is used to predict whether given source code contains a vulnerability, and the trained vulnerability localization neural network model is used to predict the specific line numbers of vulnerabilities in the given source code.
In an example, step 1) includes the steps of:
a) An AST is generated for the program source code and converted into an attribute graph, denoted G_A. Specifically, let the attribute graph be G_A = (V_A, E_A, λ_A, μ_A), where the nodes in the node set V_A are given by the nodes of the original abstract syntax tree and the edges in the edge set E_A are given by the edges of the original abstract syntax tree. Furthermore, the function λ_A labels each edge as an AST edge, and the function μ_A assigns each node a type attribute and a code attribute. The value of the type attribute is a string corresponding to the statement type of the code represented by the node; for example, "CallExpression" denotes a function call and "ConditionalExpression" denotes a conditional statement. The value of the code attribute is also a string, corresponding to the code represented by the node.
b) A CFG is generated for the program source code and converted into an attribute graph, denoted G_C. Specifically, let the attribute graph be G_C = (V_C, E_C, λ_C), where the node set V_C ⊆ V_A corresponds to the nodes representing statements and predicates in the AST. Furthermore, the edge labeling function λ_C: E_C → Σ_C assigns each edge a label from the label set Σ_C = {true, false, ε} to indicate the jump condition of the control flow graph.
c) A PDG is generated for the program source code and converted into an attribute graph, denoted G_P. Specifically, let the attribute graph be G_P = (V_P, E_P, λ_P, μ_P), where the node set V_P = V_C and the edges in the edge set E_P correspond to the edges of the original program dependency graph. Furthermore, the edge labeling function λ_P: E_P → Σ_P assigns each edge a label from the label set Σ_P = {C, D} to indicate a control dependency or a data dependency. The function μ_P assigns a symbol attribute to each data-dependency edge to indicate the symbol depended upon, and a condition attribute to each control-dependency edge to indicate the predicate state of the control dependency, e.g., true or false.
d) G_A, G_C, and G_P are combined into the CPG, denoted G. Specifically, let the code attribute graph be G = (V, E, λ, μ), where V = V_A, E = E_A ∪ E_C ∪ E_P, λ = λ_A ∪ λ_C ∪ λ_P, and μ = μ_A ∪ μ_P.
e) The tree structure (the AST) and the graph structure (the CFG and PDG) in the code attribute graph are separated. The separated CPG is CPG' = (V_T, E_T, V_G, E_G, λ, μ), where V_T = V_T^1 ∪ ... ∪ V_T^n, with V_T^i the node set of the AST of the i-th statement, and E_T = E_T^1 ∪ ... ∪ E_T^n, with E_T^i the edge set of the AST of the i-th statement. V_G = {r_1, ..., r_n}, where r_i is the root node of the AST of the i-th statement, and |V_G| is equal to the number of statements in the program source code. The meaning of this expression is that each statement in the program source code has exactly one node in the graph structure; the edges E_G of the graph structure represent the control flow, data dependencies, and control dependencies between different statements; and each node of the graph structure corresponds to a tree structure representing the syntactic structure within the statement.
In an example, step 2) includes the steps of:
a) The type field is vector-encoded using the PACE algorithm. Let the string to be encoded be S with constituent characters s_1, s_2, ..., s_k ∈ S. The encoding result of the PACE algorithm is computed as PACE(S) = Σ_{i=1}^{k} onehot(s_i), where onehot is the one-hot encoding algorithm well known in the art.
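Reading the formula as a sum of one-hot vectors over the constituent characters of S (i.e., a character-frequency vector), a tiny sketch with a toy alphabet is:

```python
import numpy as np

def onehot(ch, alphabet):
    # One-hot encoding of a single character over the given alphabet.
    v = np.zeros(len(alphabet))
    v[alphabet.index(ch)] = 1.0
    return v

def pace(s, alphabet):
    # PACE(S) = sum over constituent characters s_i of onehot(s_i),
    # i.e. a character-frequency vector of the string S (assumed reading).
    return sum(onehot(c, alphabet) for c in s)

alphabet = sorted(set("CallExpression"))  # toy alphabet for the example
v = pace("CallExpression", alphabet)
print(v.sum())                 # 14.0: one count per character
print(v[alphabet.index("l")])  # 2.0: 'l' occurs twice
```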
b) The code field is vector-encoded using the word2vec algorithm. A word2vec model is trained on token sequences extracted from program source code, and the vector output by the model is used as the encoding of each token. Since the code represented by a node may consist of multiple tokens, the average of the token encoding vectors is taken as the encoding vector of the code field. For example, the feature vector of memcpy(buf, str, len) is the average of the feature vectors of its 8 tokens, i.e., {'memcpy', '(', 'buf', ',', 'str', ',', 'len', ')'}.
c) The vector encodings of the two fields are concatenated to obtain the final AST initial vector representation. The initial feature vector of node n can be formally expressed as x_n = x_type || x_code, where || denotes vector concatenation and x_type and x_code are the feature vectors of the type field and the code field, respectively, with x_code = (1/|K|) Σ_{t∈K} x_t, where K is the set of tokens contained in the code field and x_t is the feature vector of token t.
In an example, in step 3), the specific technical scheme of the vulnerability recognition neural network model is as follows:
a) The vulnerability identification neural network model consists of a code semantic encoder, a fully connected layer, and a softmax layer.
b) The code semantic encoder consists of a tree embedding module, a tree pooling module, a graph embedding module and a graph pooling module. The tree embedding module and the tree pooling module are responsible for learning vector representations of ASTs in the CPG to obtain the sentence-level vector representations. The graph embedding module and the graph pooling module are responsible for learning the vector representations of the CFG and the PDG in the CPG to obtain the vector representation of the source code.
c) The tree embedding module in the code semantic encoder updates the vector representations of all nodes in the tree structure using tree-based convolution. In each convolution, the vector representation of a parent node is updated from the vector representations of the child nodes in its subtree. Let the vector representation of the parent node be h_0 and the vector representations of the child nodes be h_1, ..., h_n. The new vector representation of the parent node is calculated as h_0' = σ(Σ_i W_i · h_i + b), where W_i is the weight matrix for node i, b is the bias term, and σ is the activation function.
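Step c) can be sketched numerically as follows (hypothetical dimensions; tanh as the activation function and one weight matrix per node position in the window are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # hypothetical embedding dimension

def tree_conv(h_parent, h_children, W, b):
    """One tree-convolution update: h0' = sigma(sum_i W_i . h_i + b),
    summing over the parent and its child nodes."""
    hs = [h_parent] + h_children
    z = sum(W[i] @ hs[i] for i in range(len(hs))) + b
    return np.tanh(z)  # sigma: activation function (assumed tanh)

h_parent = rng.normal(size=D)
h_children = [rng.normal(size=D) for _ in range(2)]
W = rng.normal(size=(3, D, D)) * 0.1  # one matrix per node in the window
b = np.zeros(D)
h_new = tree_conv(h_parent, h_children, W, b)
print(h_new.shape)  # (8,)
```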
d) The tree pooling module in the code semantic encoder aggregates the vector representations of all nodes in the tree structure into a vector representation of the entire tree structure and takes it as the vector representation of the statement represented by the current tree structure. Specifically, the tree pooling module extracts the maximum value of each dimension over all nodes of the tree structure and concatenates them, resulting in an aggregated vector representation that is used to initialize the vector representation of the graph structure.
e) The graph embedding module in the code semantic encoder updates the vector representations of all nodes in the graph structure using graph-based convolution. In each convolution, the vector representation of a node is updated from the vector representations of its neighboring nodes. Let the vector representation of a node be h_0 and the vector representations of its one-hop neighbors be h_1, ..., h_n. The new vector representation of the node is calculated as h_0' = σ(Σ_i α_0i · W · h_i), where W is a weight matrix, α_0i is the attention weight between the center node and node i, and σ is the activation function.
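Step e) can be sketched as follows (hypothetical dimensions; softmax-normalized dot-product attention for α_0i is an assumption, since the patent does not fix how the weights are computed):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8  # hypothetical embedding dimension

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def graph_attn_conv(h_center, h_neighbors, W):
    """One graph-convolution update: h0' = sigma(sum_i alpha_0i * W h_i)
    over the center node and its one-hop neighbors, with alpha_0i from
    dot-product attention (an assumption for this sketch)."""
    hs = [h_center] + h_neighbors
    scores = np.array([h_center @ h for h in hs])
    alpha = softmax(scores)                    # attention weights alpha_0i
    z = sum(a * (W @ h) for a, h in zip(alpha, hs))
    return np.tanh(z)                          # sigma: assumed tanh

h_center = rng.normal(size=D)
h_neighbors = [rng.normal(size=D) for _ in range(3)]
W = rng.normal(size=(D, D)) * 0.1
h_new = graph_attn_conv(h_center, h_neighbors, W)
print(h_new.shape)  # (8,)
```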
f) The graph pooling module in the code semantic encoder aggregates the vector representations of all nodes in the graph structure into a vector representation of the entire graph structure and takes it as the vector representation of the program source code represented by the current graph structure. Specifically, the graph pooling module extracts the maximum value of each dimension over all nodes in the graph structure and concatenates them, thereby obtaining the aggregated vector representation.
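The dimension-wise max pooling used by both the tree pooling module (step d) and the graph pooling module (step f) can be sketched as:

```python
import numpy as np

def max_pool(node_vectors):
    """Aggregate node vectors into one vector by taking, for each
    dimension, the maximum value over all nodes."""
    H = np.stack(node_vectors)  # shape (num_nodes, dim)
    return H.max(axis=0)        # shape (dim,)

nodes = [np.array([1.0, -2.0, 0.5]),
         np.array([0.0,  3.0, -1.0]),
         np.array([-1.0, 0.0, 2.0])]
print(max_pool(nodes))  # [1. 3. 2.]
```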
g) The fully connected layer performs a linear transformation and nonlinear mapping on the vector representation of the program source code. Let the vector representation of the program source code output by the code semantic encoder be x; the fully connected layer output is logits = σ(W·x + b), where W is the weight matrix, b is the bias term, and σ is the activation function.
h) The softmax layer performs binary classification on the result of the fully connected layer. Let the fully connected layer output be logits; the softmax layer output is softmax(logits)_i = exp(logits_i) / Σ_j exp(logits_j), and the final classification result is the class with the larger softmax value. The classification result indicates whether a vulnerability exists in the program source code.
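Steps g) and h) together can be sketched numerically as follows (toy weights and input are hypothetical; tanh is assumed as the activation function):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(x, W, b):
    logits = np.tanh(W @ x + b)  # fully connected layer, sigma = tanh
    probs = softmax(logits)      # softmax over the two classes
    return int(np.argmax(probs)), probs

x = np.array([0.2, -1.0, 0.7])       # encoder output (toy)
W = np.array([[0.5, 0.1, -0.3],
              [-0.4, 0.8, 0.6]])     # 2 x 3: one row per class
b = np.zeros(2)
label, probs = classify(x, W, b)
print(label)  # 0: class 0 wins for these toy weights
```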
In an example, in step 4), the specific technical scheme of the vulnerability localization neural network model is as follows:
a) The vulnerability localization neural network model consists of a code semantic encoder, an attention layer, a fully connected layer and a softmax layer.
b) The code semantic encoder consists of a tree embedding module, a tree pooling module and a graph embedding module, and the structures of the three modules are the same as those in the vulnerability identification neural network model.
c) The attention layer assigns different attention weights to different statements. When classifying statement x, statement x is used as the query term to compute the attention weights between statement x and all other statements.
d) The fully connected layer performs linear transformation and nonlinear mapping on the vector representation of the program source code. Assuming the vector representation output by the code semantic encoder is x, the fully connected layer output is logits = σ(w·x + b), where w is the weight matrix, b is the bias term, and σ is the activation function.
e) The softmax layer performs binary classification on the output of the fully connected layer. Assuming the fully connected layer output is logits, the softmax layer output is softmax(logits)_i = exp(logits_i) / Σ_j exp(logits_j). The final classification result is the class with the larger softmax value, and it indicates whether the current statement is the vulnerability location.
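The attention step in c) can be sketched as scaled dot-product attention with statement x as the query. The exact scoring function is not fixed by the text, so this is one plausible instantiation with a hypothetical `attend` helper:

```python
import numpy as np

def attend(query_idx, stmt_vecs):
    # Use statement `query_idx` as the query, compute attention weights
    # against all statement vectors, and return the attention-weighted
    # combination passed on to the classification head.
    S = np.asarray(stmt_vecs)                 # (n, d) statement vectors
    q = S[query_idx]
    scores = S @ q / np.sqrt(S.shape[1])      # similarity of x to each statement
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                       # attention weights, sum to 1
    return alpha @ S                          # attended representation of x

stmts = np.eye(3)                             # three toy statement vectors
ctx = attend(0, stmts)                        # vector used to classify statement 0
```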
In an example, in step 5), the specific technical scheme of the joint training is as follows:
a) The parameters of the tree embedding module, the tree pooling module and the graph embedding module are shared between the code semantic encoders of the two models.
b) The cross entropy loss is used as the loss of the vulnerability identification task and the vulnerability localization task respectively, and the average value of the two losses is defined as the joint loss.
c) The two neural network models are trained by optimizing the joint loss with the Adam optimizer.
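The joint loss defined in b) reduces to averaging the two cross-entropy losses. A sketch with made-up predicted distributions (the Adam optimization step itself is omitted):

```python
import numpy as np

def cross_entropy(probs, label):
    # Negative log-likelihood of the true class.
    return -np.log(probs[label] + 1e-12)

def joint_loss(recog_probs, recog_label, loc_probs, loc_label):
    # Average of the recognition-task and localization-task losses.
    return 0.5 * (cross_entropy(recog_probs, recog_label)
                  + cross_entropy(loc_probs, loc_label))

loss = joint_loss(np.array([0.9, 0.1]), 0,   # recognition: correct, confident
                  np.array([0.2, 0.8]), 1)   # localization: correct, confident
```

Confident, correct predictions on both tasks give a small joint loss; the shared encoder parameters receive gradients from both terms.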
In an example, a trained vulnerability recognition neural network model is used to predict whether a given source code has a vulnerability, and a trained vulnerability localization neural network model is used to predict a specific line number where the vulnerability in the given source code is located.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall be defined by the claims.

Claims (10)

1. A vulnerability discovery method based on multitask learning, comprising the following steps:
constructing a code attribute graph based on an abstract syntax tree, a control flow graph and a program dependency graph of source codes;
separating the tree structure from the graph structure in the code attribute graph;
encoding nodes in the tree structure to obtain an initial node vector representation of the abstract syntax tree;
inputting the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree into a vulnerability identification neural network model to obtain a vulnerability identification result; wherein, the vulnerability recognition neural network model comprises: a first code semantic encoder and a first classification module;
the first code semantic encoder is used for calculating a source code vector representation according to an initial node vector representation, a tree structure and a graph structure of the abstract syntax tree;
the first classification module is used for calculating a vulnerability identification result according to the source code vector representation;
inputting the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree into a vulnerability positioning neural network model to obtain a vulnerability positioning result; wherein, the vulnerability localization neural network model comprises: the second code semantic encoder, the attention layer and the second classification module;
the second code semantic encoder is used for calculating the final vector representation of the nodes in the graph structure according to the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree;
the attention layer is used for giving attention weight to the final vector representation of each node in the graph structure and carrying out attention calculation;
the second classification module is used for calculating a vulnerability positioning result according to the attention calculation result.
2. The method of claim 1, wherein the code attribute graph is constructed by:
1) Generating an attribute graph G_A = (V_A, E_A, λ_A, μ_A) of the abstract syntax tree, wherein each node v_A in the node set V_A is a node of the abstract syntax tree, each edge e_A in the edge set E_A is an edge of the abstract syntax tree, the function λ_A marks each edge e_A as an abstract syntax tree edge, and the function μ_A assigns each node v_A a code statement type attribute and a node code attribute;
2) Generating an attribute graph G_C = (V_C, E_C, λ_C) of the control flow graph, wherein each node v_C in the node set V_C represents a statement or predicate in the abstract syntax tree, each edge e_C in the edge set E_C corresponds to a jump edge of the control flow graph, and the function λ_C assigns a jump condition label to each edge e_C;
3) Generating an attribute graph G_P = (V_P, E_P, λ_P, μ_P) of the program dependency graph, wherein each node v_P in the node set V_P is a node of the abstract syntax tree, each edge e_P in the edge set E_P corresponds to a dependency edge of the program dependency graph, the function λ_P assigns each edge e_P a dependency label, the dependency label comprising control dependence or data dependence, and the function μ_P indicates, for each edge e_P marked as data-dependent, the respective symbol on which it depends, or indicates, for each edge e_P marked as control-dependent, the predicate state of the control dependency;
4) Combining the attribute graph G_A, the attribute graph G_C and the attribute graph G_P to obtain the code attribute graph G = (V, E, λ, μ), where V = V_A, E = E_A ∪ E_C ∪ E_P, λ = λ_A ∪ λ_C ∪ λ_P, and μ = μ_A ∪ μ_P.
3. The method of claim 1, wherein the code attribute graph after separation is G' = (V_T, E_T, V_G, E_G, λ, μ), wherein V_T = {V_T^1, …, V_T^n}, V_T^i being the node set of the abstract syntax tree representing the i-th statement, E_T = {E_T^1, …, E_T^n}, E_T^i being the edge set of the abstract syntax tree representing the i-th statement, V_G = {v_G^1, …, v_G^n}, v_G^i being the root node of the abstract syntax tree of the i-th statement, and |V_G| being equal to the number of statements in the source code.
4. The method of claim 1, wherein the encoding the nodes in the tree structure comprises:
1) Vectorizing the code statement type attribute by using a PACE algorithm to obtain a code statement type representation;
2) Performing vectorization coding on the node code attribute by using a word2vec algorithm to obtain node code representation;
3) And splicing the code statement type representation and the node code representation to obtain an initial node vector representation of the abstract syntax tree.
5. The method of claim 1, wherein the source code vector representation is calculated by:
1) Updating the vector representations of all nodes in the tree structure according to the initial vector representation of the abstract syntax tree;
2) Aggregating vector representations of all nodes in the updated tree structure to obtain sentence vector representations;
3) Initializing the vector representation of the nodes in the graph structure according to the statement vector representation, and updating the vector representation of the nodes in the graph structure by using the convolution based on the graph to obtain the final vector representation of the nodes in the graph structure;
4) The final vector representations of the nodes in the graph structure are aggregated to obtain a vector representation of the source code.
6. The method of claim 1, wherein the first classification module comprises: a fully connected layer and a softmax layer, wherein the fully connected layer is used for performing linear transformation and nonlinear mapping on the vector representation of the program source code, and the softmax layer is used for performing binary classification according to the calculation result of the fully connected layer to obtain the vulnerability identification result.
7. The method of any of claims 1 to 6, wherein the vulnerability identification neural network model and the vulnerability localization neural network model are trained by:
1) Sharing parameters of a tree embedding module, a tree pooling module and a graph embedding module in the first code semantic encoder and the second code semantic encoder;
2) The cross entropy loss is used as a vulnerability identification task loss and a vulnerability positioning task loss respectively, and an average value of the vulnerability identification task loss and the vulnerability positioning task loss is defined as a joint loss;
3) And optimizing the joint loss by using an Adam optimizer, and training a vulnerability identification neural network model and a vulnerability positioning neural network model.
8. A vulnerability discovery apparatus based on multitask learning, comprising:
the construction module is used for constructing a code attribute graph based on the abstract syntax tree, the control flow graph and the program dependency graph of the source code;
the separation module is used for separating the tree structure from the graph structure in the code attribute graph;
the coding module is used for coding the nodes in the tree structure to obtain an initial node vector representation of the abstract syntax tree;
the identifying module is used for inputting the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree into the vulnerability identifying neural network model to obtain a vulnerability identifying result; wherein, the vulnerability recognition neural network model comprises: a first code semantic encoder and a first classification module;
the first code semantic encoder is used for calculating a source code vector representation according to an initial node vector representation, a tree structure and a graph structure of the abstract syntax tree;
the first classification module is used for calculating a vulnerability identification result according to the source code vector representation;
the positioning module is used for inputting the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree into the vulnerability positioning neural network model to obtain a vulnerability positioning result; wherein, the vulnerability localization neural network model comprises: the second code semantic encoder, the attention layer and the second classification module;
the second code semantic encoder is used for calculating the final vector representation of the nodes in the graph structure according to the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree;
the attention layer is used for giving attention weight to the final vector representation of each node in the graph structure and carrying out attention calculation;
the second classification module is used for calculating a vulnerability positioning result according to the attention calculation result.
9. A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method of any of claims 1-7 when run.
10. A computer device comprising a memory, in which a computer program is stored, and a processor arranged to run the computer program to perform the method of any of claims 1-7.
CN202210125058.3A 2022-02-10 2022-02-10 Vulnerability discovery method and device based on multitask learning Pending CN116628695A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210125058.3A CN116628695A (en) 2022-02-10 2022-02-10 Vulnerability discovery method and device based on multitask learning


Publications (1)

Publication Number Publication Date
CN116628695A true CN116628695A (en) 2023-08-22

Family

ID=87619883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210125058.3A Pending CN116628695A (en) 2022-02-10 2022-02-10 Vulnerability discovery method and device based on multitask learning

Country Status (1)

Country Link
CN (1) CN116628695A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117216771A (en) * 2023-11-09 2023-12-12 中机寰宇认证检验股份有限公司 Binary program vulnerability intelligent mining method and system
CN117216771B (en) * 2023-11-09 2024-01-30 中机寰宇认证检验股份有限公司 Binary program vulnerability intelligent mining method and system

Similar Documents

Publication Publication Date Title
CN111639344B (en) Vulnerability detection method and device based on neural network
CN111428044B (en) Method, device, equipment and storage medium for acquiring supervision and identification results in multiple modes
CN108446540B (en) Program code plagiarism type detection method and system based on source code multi-label graph neural network
CN108491228B (en) Binary vulnerability code clone detection method and system
KR20210023452A (en) Apparatus and method for review analysis per attribute
CN112364352B (en) Method and system for detecting and recommending interpretable software loopholes
CN115357904B (en) Multi-class vulnerability detection method based on program slicing and graph neural network
CN111460472A (en) Encryption algorithm identification method based on deep learning graph network
CN113609488B (en) Vulnerability detection method and system based on self-supervised learning and multichannel hypergraph neural network
CN113158676A (en) Professional entity and relationship combined extraction method and system and electronic equipment
CN115033890A (en) Comparison learning-based source code vulnerability detection method and system
CN113904844A (en) Intelligent contract vulnerability detection method based on cross-modal teacher-student network
CN115344863A (en) Malicious software rapid detection method based on graph neural network
CN112861131A (en) Library function identification detection method and system based on convolution self-encoder
CN113392929B (en) Biological sequence feature extraction method based on word embedding and self-encoder fusion
CN108875024B (en) Text classification method and system, readable storage medium and electronic equipment
CN116628695A (en) Vulnerability discovery method and device based on multitask learning
CN114115894A (en) Cross-platform binary code similarity detection method based on semantic space alignment
CN117725592A (en) Intelligent contract vulnerability detection method based on directed graph annotation network
CN117633811A (en) Code vulnerability detection method based on multi-view feature fusion
CN116595537A (en) Vulnerability detection method of generated intelligent contract based on multi-mode features
CN114780403A (en) Software defect prediction method and device based on enhanced code attribute graph
CN111860662B (en) Training method and device, application method and device of similarity detection model
CN117556425B (en) Intelligent contract vulnerability detection method, system and equipment based on graph neural network
CN113537372B (en) Address recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination