CN116628695A - Vulnerability discovery method and device based on multitask learning - Google Patents


Info

Publication number
CN116628695A
CN116628695A (application CN202210125058.3A)
Authority
CN
China
Prior art keywords
vulnerability
graph
code
vector representation
node
Prior art date
Legal status
Pending
Application number
CN202210125058.3A
Other languages
Chinese (zh)
Inventor
吴敬征
武延军
段旭
杜梦男
罗天悦
杨牧天
Current Assignee
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202210125058.3A
Publication of CN116628695A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/57: Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F 21/577: Assessing vulnerabilities and evaluating computer system security
    • G06F 21/55: Detecting local intrusion or implementing counter-measures
    • G06F 21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F 21/562: Static detection
    • G06F 21/563: Static detection by source code analysis
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a vulnerability discovery method and device based on multi-task learning. The method comprises the following steps: constructing a code attribute graph based on the abstract syntax tree, control flow graph, and program dependency graph of the source code; separating the tree structure and the graph structure in the code attribute graph; encoding the nodes in the tree structure to obtain initial node vector representations of the abstract syntax tree; and inputting the initial node vector representations, the tree structure, and the graph structure into a vulnerability identification neural network model and a vulnerability localization neural network model, respectively, to obtain a vulnerability identification result and a vulnerability localization result. The invention addresses two shortcomings of existing learning-based source code vulnerability discovery methods: accuracy that needs improvement and coarse detection granularity. The method and device can localize vulnerabilities while accurately identifying them, improving the security vulnerability mining capability of vulnerability analysts.

Description

Vulnerability discovery method and device based on multitask learning
Technical Field
The invention belongs to the technical field of computers, and relates to a vulnerability discovery method and device based on multi-task learning.
Background
In recent years, the software security situation has grown increasingly severe, and ensuring the security of software is becoming ever more important. Vulnerabilities, one of the key factors affecting software security, deserve sufficient attention. According to data released by the National Vulnerability Database, fewer than 8,000 vulnerabilities were published annually before 2017, while more than 14,000 were published annually between 2017 and 2020, and the annual count continues to rise. At the same time, the scale and complexity of software keep increasing, with codebases reaching millions of lines and vulnerability patterns that are complex and diverse, which makes manually identifying vulnerabilities in software impractical. Under these circumstances, many studies have attempted to explore automated, efficient vulnerability discovery methods.
Source code vulnerability discovery is a common automated vulnerability discovery method based on static analysis, which discovers vulnerabilities directly from information in the source code. Existing source code vulnerability discovery methods can be mainly divided into code-similarity-based methods and pattern-based methods. The former mines vulnerabilities by comparison against known vulnerable code and can therefore only discover recurring vulnerabilities caused by code cloning. The latter characterizes vulnerabilities according to vulnerability patterns defined by experts or learned from data. Pattern-based methods can be further subdivided into rule-based methods and learning-based methods. Rule-based methods rely on expert-defined vulnerability rules and then perform rule matching in various ways to mine vulnerabilities. Learning-based methods learn vulnerability patterns from data using machine learning techniques and predict whether unknown code is vulnerable according to those patterns. Because learning-based methods effectively address the reliance on expert knowledge and the difficulty of detecting 0-day vulnerabilities, they have been widely studied. However, existing learning-based source code vulnerability mining methods have limitations in code semantic modeling and detection granularity, so they cannot simultaneously meet the requirements of high accuracy and high practicality. On the one hand, existing methods model the semantic information of code incompletely, making it difficult to cover the semantic features of different types of vulnerabilities, which reduces accuracy when facing diverse vulnerability types.
On the other hand, the detection granularity of existing methods is coarse, and no precise vulnerability location is provided, which makes subsequent manual verification difficult; the practicality therefore remains to be improved. These problems prevent existing learning-based source code vulnerability discovery methods from mining vulnerabilities efficiently.
Disclosure of Invention
The invention aims to provide a vulnerability mining method and device based on multi-task learning. The method models and learns multiple kinds of semantic information of source code by constructing an encoder, computes a semantic vector representation of the source code, and jointly trains the two tasks of vulnerability identification and vulnerability localization with shared encoder parameters, so that vulnerabilities can be localized while being accurately identified, improving the security vulnerability mining capability of vulnerability analysts.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a vulnerability discovery method based on multitasking learning comprises the following steps:
constructing a code attribute graph based on an abstract syntax tree, a control flow graph and a program dependency graph of source codes;
separating the tree structure from the graph structure in the code attribute graph;
encoding nodes in the tree structure to obtain an initial node vector representation of the abstract syntax tree;
inputting the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree into a vulnerability identification neural network model to obtain a vulnerability identification result; wherein, the vulnerability recognition neural network model comprises: a first code semantic encoder and a first classification module;
the first code semantic encoder is used for calculating a source code vector representation according to an initial node vector representation, a tree structure and a graph structure of the abstract syntax tree;
the first classification module is used for calculating a vulnerability identification result according to the source code vector representation;
inputting the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree into a vulnerability positioning neural network model to obtain a vulnerability positioning result; wherein, the vulnerability localization neural network model comprises: the second code semantic encoder, the attention layer and the second classification module;
the second code semantic encoder is used for calculating the final vector representation of the nodes in the graph structure according to the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree;
the attention layer is used for giving attention weight to the final vector representation of each node in the graph structure and carrying out attention calculation;
the second classification module is used for calculating a vulnerability positioning result according to the attention calculation result.
Further, a code attribute graph is constructed by:
1) Generating an attribute graph G_A = (V_A, E_A, λ_A, μ_A) of the abstract syntax tree, wherein the nodes v_A in the node set V_A are the nodes of the abstract syntax tree, the edges e_A in the edge set E_A are the edges of the abstract syntax tree, the function λ_A marks each edge e_A as an abstract syntax tree edge, and the function μ_A assigns each node v_A a code statement type attribute and a node code attribute;
2) Generating an attribute graph G_C = (V_C, E_C, λ_C) of the control flow graph, wherein the nodes v_C in the node set V_C are the nodes representing statements and predicates in the abstract syntax tree, the edges e_C in the edge set E_C correspond to the jump edges of the control flow graph, and the function λ_C assigns each edge e_C a jump condition label;
3) Generating an attribute graph G_P = (V_P, E_P, λ_P, μ_P) of the program dependency graph, wherein the nodes v_P in the node set V_P are nodes of the abstract syntax tree, the edges e_P in the edge set E_P correspond to the dependency edges of the program dependency graph, the function λ_P assigns each edge e_P a dependency label (control dependency or data dependency), and the function μ_P indicates, for each edge e_P marked as data-dependent, the symbol depended upon, or, for each edge e_P marked as control-dependent, the predicate state of the control dependency;
4) Combining the attribute graphs G_A, G_C, and G_P to obtain the code attribute graph G = (V, E, λ, μ), where V = V_A, E = E_A ∪ E_C ∪ E_P, λ = λ_A ∪ λ_C ∪ λ_P, and μ = μ_A ∪ μ_P.
Further, the separated code attribute graph is G' = (V_T, E_T, V_G, E_G, λ, μ), where V_T = V_T^1 ∪ ... ∪ V_T^n and E_T = E_T^1 ∪ ... ∪ E_T^n, with V_T^i and E_T^i denoting the node set and edge set of the abstract syntax tree of the i-th statement; V_G = {r_1, ..., r_n}, where r_i is the root node of the abstract syntax tree of the i-th statement; and |V_G| is equal to the number of statements in the source code.
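As a rough illustration of the construction and separation steps above, the following sketch (hypothetical, using plain Python dictionaries in place of a real CPG generator; node ids and attributes are invented) merges AST, CFG, and PDG edge sets over a shared node set and then splits off the statement-level graph:

```python
# Hypothetical sketch of code attribute graph (CPG) construction and
# separation. Edge labels: "AST" for syntax edges, "CFG" for control flow,
# "C"/"D" for control/data dependencies from the PDG.

def build_cpg(ast_edges, cfg_edges, pdg_edges, node_attrs):
    """Combine the three edge sets; the node set is the AST node set (V = V_A)."""
    edges = (
        [(u, v, {"label": "AST"}) for u, v in ast_edges]
        + [(u, v, {"label": "CFG", "cond": c}) for u, v, c in cfg_edges]
        + [(u, v, {"label": d}) for u, v, d in pdg_edges]  # d in {"C", "D"}
    )
    return {"V": set(node_attrs), "E": edges, "mu": node_attrs}

def separate(cpg, stmt_roots):
    """Split the CPG into per-statement AST edges (tree part) and a
    statement-level graph whose nodes are the AST root nodes."""
    tree_edges = [e for e in cpg["E"] if e[2]["label"] == "AST"]
    graph_edges = [e for e in cpg["E"] if e[2]["label"] != "AST"]
    return {"V_T": cpg["V"], "E_T": tree_edges,
            "V_G": set(stmt_roots), "E_G": graph_edges}

# Toy example: two statements whose AST roots are nodes 0 and 3.
node_attrs = {0: {"type": "ExpressionStatement"}, 1: {}, 2: {},
              3: {"type": "ReturnStatement"}, 4: {}}
cpg = build_cpg(ast_edges=[(0, 1), (0, 2), (3, 4)],
                cfg_edges=[(0, 3, "true")],
                pdg_edges=[(0, 3, "D")],
                node_attrs=node_attrs)
sep = separate(cpg, stmt_roots=[0, 3])
print(len(sep["V_G"]))  # 2: |V_G| equals the number of statements
```

Note that after separation each statement contributes exactly one node to the graph part, matching |V_G| = number of statements.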
Further, the encoding the nodes in the tree structure includes:
1) Vectorizing the code statement type attribute by using a PACE algorithm to obtain a code statement type representation;
2) Performing vectorization coding on the node code attribute by using a word2vec algorithm to obtain node code representation;
3) And splicing the code statement type representation and the node code representation to obtain an initial node vector representation of the abstract syntax tree.
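A minimal sketch of the three encoding steps (hypothetical vector sizes; a character-frequency stand-in replaces the PACE encoding and fixed random vectors replace trained word2vec embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for PACE: character-level one-hot accumulation over a toy alphabet.
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
def encode_type(type_str):
    vec = np.zeros(len(ALPHABET))
    for ch in type_str:
        if ch in ALPHABET:
            vec[ALPHABET.index(ch)] += 1.0
    return vec

# Stand-in for word2vec: a fixed random embedding per token.
EMB_DIM = 16
token_table = {}
def token_vec(tok):
    if tok not in token_table:
        token_table[tok] = rng.normal(size=EMB_DIM)
    return token_table[tok]

def encode_node(type_str, code_tokens):
    """Concatenate the type encoding with the mean of the token vectors."""
    code_vec = np.mean([token_vec(t) for t in code_tokens], axis=0)
    return np.concatenate([encode_type(type_str), code_vec])

x = encode_node("CallExpression",
                ["memcpy", "(", "buf", ",", "str", ",", "len", ")"])
print(x.shape)  # (68,): 52 type dimensions + 16 code dimensions
```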
Further, a source code vector representation is calculated by:
1) Updating the vector representations of all nodes in the tree structure according to the initial vector representation of the abstract syntax tree;
2) Aggregating vector representations of all nodes in the updated tree structure to obtain sentence vector representations;
3) Initializing the vector representation of the nodes in the graph structure according to the statement vector representation, and updating the vector representation of the nodes in the graph structure by using the convolution based on the graph to obtain the final vector representation of the nodes in the graph structure;
4) The final vector representations of the nodes in the graph structure are aggregated to obtain a vector representation of the source code.
Further, the first classification module comprises a fully connected layer and a softmax layer, wherein the fully connected layer performs a linear transformation and nonlinear mapping on the vector representation of the program source code, and the softmax layer performs binary classification on the result of the fully connected layer to obtain the vulnerability identification result.
Further, training the vulnerability identification neural network model and the vulnerability localization neural network model by:
1) Sharing parameters of a tree embedding module, a tree pooling module and a graph embedding module in the first code semantic encoder and the second code semantic encoder;
2) The cross entropy loss is used as a vulnerability identification task loss and a vulnerability positioning task loss respectively, and an average value of the vulnerability identification task loss and the vulnerability positioning task loss is defined as a joint loss;
3) And optimizing the joint loss by using an Adam optimizer, and training a vulnerability identification neural network model and a vulnerability positioning neural network model.
A vulnerability discovery apparatus based on multitasking learning, comprising:
the construction module is used for constructing a code attribute graph based on the abstract syntax tree, the control flow graph and the program dependency graph of the source code;
the separation module is used for separating the tree structure from the graph structure in the code attribute graph;
the coding module is used for coding the nodes in the tree structure to obtain an initial node vector representation of the abstract syntax tree;
the identifying module is used for inputting the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree into the vulnerability identifying neural network model to obtain a vulnerability identifying result; wherein, the vulnerability recognition neural network model comprises: a first code semantic encoder and a first classification module;
the first code semantic encoder is used for calculating a source code vector representation according to an initial node vector representation, a tree structure and a graph structure of the abstract syntax tree;
the first classification module is used for calculating a vulnerability identification result according to the source code vector representation;
the positioning module is used for inputting the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree into the vulnerability positioning neural network model to obtain a vulnerability positioning result; wherein, the vulnerability localization neural network model comprises: the second code semantic encoder, the attention layer and the second classification module;
the second code semantic encoder is used for calculating the final vector representation of the nodes in the graph structure according to the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree;
the attention layer is used for giving attention weight to the final vector representation of each node in the graph structure and carrying out attention calculation;
the second classification module is used for calculating a vulnerability positioning result according to the attention calculation result.
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above method when run.
An electronic device comprising a memory and a processor, wherein the memory stores a program for performing the above-described method.
Compared with the prior art, the invention provides a vulnerability mining method based on multi-task learning, aiming to solve the problems that the accuracy of existing learning-based source code vulnerability mining methods needs improvement and their detection granularity is coarse. The method models and learns multiple kinds of semantic information of the source code by constructing an encoder, computes a semantic vector representation of the source code, and jointly trains the two tasks of vulnerability identification and vulnerability localization with shared encoder parameters, so that vulnerabilities can be localized while being accurately identified, improving the security vulnerability mining capability of vulnerability analysts.
Drawings
FIG. 1 is a flow chart of a vulnerability discovery method based on multitasking learning.
FIG. 2 is a schematic diagram of a final generated code attribute map.
Fig. 3 is a block diagram of a vulnerability recognition neural network model.
FIG. 4 is a block diagram of a vulnerability localization neural network model.
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are merely some, not all, embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
The general flow of the vulnerability discovery method of the invention is shown in figure 1, and mainly comprises the following steps:
1) Code attribute graphs (CPGs) are generated for the source code and preprocessed to produce code attribute graphs with separated tree and graph structures, as shown in FIG. 2. The CPG is a combined data structure formed by fusing the Abstract Syntax Tree (AST), the Control Flow Graph (CFG), and the Program Dependency Graph (PDG). The AST reflects the syntactic structure within statements, the CFG reflects the control flow between statements, and the PDG reflects the data dependencies and control dependencies between statements. The CPG can effectively cover the semantic information required for vulnerability detection.
2) The AST node features are encoded to obtain initial vector representations of the AST nodes. An initial vector representation is a vector representing the semantic information of a node and comprises two fields: a type field and a code field. The type field is vectorized using PACE and the code field is vectorized using word2vec. Finally, the vectorized encodings of the two fields are concatenated to obtain the final initial vector representation.
3) A vulnerability identification neural network model is constructed, as shown in FIG. 3. The vulnerability identification neural network model consists of a code semantic encoder, a fully connected layer, and a softmax layer. The code semantic encoder consists of a tree embedding module, a tree pooling module, a graph embedding module, and a graph pooling module. The tree embedding module and the tree pooling module are responsible for learning vector representations of the ASTs in the CPG to obtain statement-level vector representations, which are used to initialize the vector representations of the CFG and PDG. The graph embedding module and the graph pooling module are responsible for learning vector representations of the CFG and PDG in the CPG to obtain the vector representation of the source code. The fully connected layer and the softmax layer then perform binary classification of whether the source code contains a vulnerability based on its vector representation.
4) A vulnerability localization neural network model is constructed, as shown in FIG. 4. The vulnerability localization neural network model consists of a code semantic encoder, an attention layer, a fully connected layer, and a softmax layer. Its code semantic encoder differs from the one in step 3) in that it has no graph pooling module. The attention layer is used to learn different attention weights for different nodes. The fully connected layer and the softmax layer then perform binary classification based on the vector representation of each graph node to determine whether the corresponding code line is a vulnerability location.
5) The AST initial vector representations are input into the vulnerability identification neural network model and the vulnerability localization neural network model, and the two models are jointly trained to realize multi-task learning.
6) The trained vulnerability identification neural network model is used to predict whether given source code contains a vulnerability, and the trained vulnerability localization neural network model is used to predict the specific line numbers of vulnerabilities in the given source code.
In an example, step 1) includes the steps of:
a) An AST is generated for the program source code and converted into an attribute graph, denoted G_A. Specifically, let the attribute graph be G_A = (V_A, E_A, λ_A, μ_A), where the nodes in the node set V_A are given by the nodes of the original abstract syntax tree and the edges in the edge set E_A are given by the edges of the original abstract syntax tree. Furthermore, the function λ_A labels each edge as an AST edge, and the function μ_A assigns each node a type attribute and a code attribute. The value of the type attribute is a string corresponding to the statement type of the code represented by the node; for example, "CallExpression" denotes a function call and "ConditionalExpression" denotes a conditional statement. The value of the code attribute is also a string, corresponding to the code represented by the node.
b) A CFG is generated for the program source code and converted into an attribute graph, denoted G_C. Specifically, let the attribute graph be G_C = (V_C, E_C, λ_C), where the node set V_C ⊆ V_A corresponds to the nodes representing statements and predicates in the AST. Furthermore, the edge labeling function λ_C: E_C → Σ_C assigns each edge a label from the label set Σ_C = {true, false, ε} to indicate the jump condition of the control flow graph.
c) A PDG is generated for the program source code and converted into an attribute graph, denoted G_P. Specifically, let the attribute graph be G_P = (V_P, E_P, λ_P, μ_P), where the node set V_P = V_C and the edges in the edge set E_P correspond to the edges of the original program dependency graph. Furthermore, the edge labeling function λ_P: E_P → Σ_P assigns each edge a label from the label set Σ_P = {C, D} to indicate a control dependency or a data dependency. The function μ_P assigns a symbol attribute to each data-dependency edge to indicate the symbol depended upon, and a condition attribute to each control-dependency edge to indicate the predicate state of the control dependency, e.g., true or false.
d) G_A, G_C, and G_P are combined into the CPG, denoted G. Specifically, let the code attribute graph be G = (V, E, λ, μ), where V = V_A, E = E_A ∪ E_C ∪ E_P, λ = λ_A ∪ λ_C ∪ λ_P, and μ = μ_A ∪ μ_P.
e) The tree structure (the AST) and the graph structure (the CFG and PDG) in the code attribute graph are separated. The separated CPG is CPG' = (V_T, E_T, V_G, E_G, λ, μ), where V_T = V_T^1 ∪ ... ∪ V_T^n, with V_T^i the node set of the AST of the i-th statement, and E_T = E_T^1 ∪ ... ∪ E_T^n, with E_T^i the edge set of the AST of the i-th statement. V_G = {r_1, ..., r_n}, where r_i is the root node of the AST of the i-th statement, and |V_G| is equal to the number of statements in the program source code. The meaning of this expression is that each statement in the program source code has exactly one node in the graph structure; the edges E_G of the graph structure represent the control flow, data dependencies, and control dependencies between different statements; and each node of the graph structure corresponds to a tree structure representing the syntactic structure within the statement.
In an example, step 2) includes the steps of:
a) The type field is vector-encoded using the PACE algorithm. Let the string to be encoded be S with constituent characters s_1, s_2, ..., s_k ∈ S. The encoding result of the PACE algorithm is computed as PACE(S) = Σ_{i=1}^{k} onehot(s_i), where onehot is the one-hot encoding algorithm well known in the art.
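Reading the formula as a sum of one-hot vectors over the constituent characters of S (i.e., a character-frequency vector), a tiny sketch with a toy alphabet is:

```python
import numpy as np

def onehot(ch, alphabet):
    # One-hot encoding of a single character over the given alphabet.
    v = np.zeros(len(alphabet))
    v[alphabet.index(ch)] = 1.0
    return v

def pace(s, alphabet):
    # PACE(S) = sum over constituent characters s_i of onehot(s_i),
    # i.e. a character-frequency vector of the string S (assumed reading).
    return sum(onehot(c, alphabet) for c in s)

alphabet = sorted(set("CallExpression"))  # toy alphabet for the example
v = pace("CallExpression", alphabet)
print(v.sum())                 # 14.0: one count per character
print(v[alphabet.index("l")])  # 2.0: 'l' occurs twice
```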
b) The code field is vector-encoded using the word2vec algorithm. A word2vec model is trained on token sequences extracted from program source code, and the vector output by the model is used as the encoding of each token. Since the code represented by a node may consist of multiple tokens, the average of the token encoding vectors is taken as the encoding vector of the code field. For example, the feature vector of memcpy(buf, str, len) is the average of the feature vectors of its 8 tokens, i.e., {'memcpy', '(', 'buf', ',', 'str', ',', 'len', ')'}.
c) The vector encodings of the two fields are concatenated to obtain the final AST initial vector representation. The initial feature vector of node n can be formally expressed as x_n = x_type || x_code, where || denotes vector concatenation and x_type and x_code are the feature vectors of the type field and the code field, respectively, with x_code = (1/|K|) Σ_{t∈K} x_t, where K is the set of tokens contained in the code field and x_t is the feature vector of token t.
In an example, in step 3), the specific technical scheme of the vulnerability recognition neural network model is as follows:
a) The vulnerability identification neural network model consists of a code semantic encoder, a fully connected layer, and a softmax layer.
b) The code semantic encoder consists of a tree embedding module, a tree pooling module, a graph embedding module and a graph pooling module. The tree embedding module and the tree pooling module are responsible for learning vector representations of ASTs in the CPG to obtain the sentence-level vector representations. The graph embedding module and the graph pooling module are responsible for learning the vector representations of the CFG and the PDG in the CPG to obtain the vector representation of the source code.
c) The tree embedding module in the code semantic encoder updates the vector representations of all nodes in the tree structure using tree-based convolution. In each convolution, the vector representation of a parent node is updated from the vector representations of the child nodes in its subtree. Let the vector representation of the parent node be h_0 and the vector representations of the child nodes be h_1, ..., h_n. The new vector representation of the parent node is calculated as h_0' = σ(Σ_i W_i · h_i + b), where W_i is the weight matrix for node i, b is the bias term, and σ is the activation function.
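Step c) can be sketched numerically as follows (hypothetical dimensions; tanh as the activation function and one weight matrix per node position in the window are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # hypothetical embedding dimension

def tree_conv(h_parent, h_children, W, b):
    """One tree-convolution update: h0' = sigma(sum_i W_i . h_i + b),
    summing over the parent and its child nodes."""
    hs = [h_parent] + h_children
    z = sum(W[i] @ hs[i] for i in range(len(hs))) + b
    return np.tanh(z)  # sigma: activation function (assumed tanh)

h_parent = rng.normal(size=D)
h_children = [rng.normal(size=D) for _ in range(2)]
W = rng.normal(size=(3, D, D)) * 0.1  # one matrix per node in the window
b = np.zeros(D)
h_new = tree_conv(h_parent, h_children, W, b)
print(h_new.shape)  # (8,)
```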
d) The tree pooling module in the code semantic encoder aggregates the vector representations of all nodes in the tree structure into a vector representation of the entire tree structure and takes it as the vector representation of the statement represented by the current tree structure. Specifically, the tree pooling module extracts the maximum value of each dimension over all nodes of the tree structure and concatenates them, resulting in an aggregated vector representation that is used to initialize the vector representation of the graph structure.
e) The graph embedding module in the code semantic encoder updates the vector representations of all nodes in the graph structure using graph-based convolution. In each convolution, the vector representation of a node is updated from the vector representations of its neighboring nodes. Let the vector representation of a node be h_0 and the vector representations of its one-hop neighbors be h_1, ..., h_n. The new vector representation of the node is calculated as h_0' = σ(Σ_i α_0i · W · h_i), where W is a weight matrix, α_0i is the attention weight between the center node and node i, and σ is the activation function.
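Step e) can be sketched as follows (hypothetical dimensions; softmax-normalized dot-product attention for α_0i is an assumption, since the patent does not fix how the weights are computed):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8  # hypothetical embedding dimension

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def graph_attn_conv(h_center, h_neighbors, W):
    """One graph-convolution update: h0' = sigma(sum_i alpha_0i * W h_i)
    over the center node and its one-hop neighbors, with alpha_0i from
    dot-product attention (an assumption for this sketch)."""
    hs = [h_center] + h_neighbors
    scores = np.array([h_center @ h for h in hs])
    alpha = softmax(scores)                    # attention weights alpha_0i
    z = sum(a * (W @ h) for a, h in zip(alpha, hs))
    return np.tanh(z)                          # sigma: assumed tanh

h_center = rng.normal(size=D)
h_neighbors = [rng.normal(size=D) for _ in range(3)]
W = rng.normal(size=(D, D)) * 0.1
h_new = graph_attn_conv(h_center, h_neighbors, W)
print(h_new.shape)  # (8,)
```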
f) The graph pooling module in the code semantic encoder aggregates the vector representations of all nodes in the graph structure into a vector representation of the entire graph structure and takes it as the vector representation of the program source code represented by the current graph structure. Specifically, the graph pooling module extracts the maximum value of each dimension over all nodes in the graph structure and concatenates them, thereby obtaining the aggregated vector representation.
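The dimension-wise max pooling used by both the tree pooling module (step d) and the graph pooling module (step f) can be sketched as:

```python
import numpy as np

def max_pool(node_vectors):
    """Aggregate node vectors into one vector by taking, for each
    dimension, the maximum value over all nodes."""
    H = np.stack(node_vectors)  # shape (num_nodes, dim)
    return H.max(axis=0)        # shape (dim,)

nodes = [np.array([1.0, -2.0, 0.5]),
         np.array([0.0,  3.0, -1.0]),
         np.array([-1.0, 0.0, 2.0])]
print(max_pool(nodes))  # [1. 3. 2.]
```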
g) The fully connected layer performs a linear transformation and nonlinear mapping on the vector representation of the program source code. Let the vector representation of the program source code output by the code semantic encoder be x; the fully connected layer output is logits = σ(W·x + b), where W is the weight matrix, b is the bias term, and σ is the activation function.
h) The softmax layer performs binary classification on the result of the fully connected layer. Let the fully connected layer output be logits; the softmax layer output is softmax(logits)_i = exp(logits_i) / Σ_j exp(logits_j), and the final classification result is the class with the larger softmax value. The classification result indicates whether a vulnerability exists in the program source code.
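Steps g) and h) together can be sketched numerically as follows (toy weights and input are hypothetical; tanh is assumed as the activation function):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(x, W, b):
    logits = np.tanh(W @ x + b)  # fully connected layer, sigma = tanh
    probs = softmax(logits)      # softmax over the two classes
    return int(np.argmax(probs)), probs

x = np.array([0.2, -1.0, 0.7])       # encoder output (toy)
W = np.array([[0.5, 0.1, -0.3],
              [-0.4, 0.8, 0.6]])     # 2 x 3: one row per class
b = np.zeros(2)
label, probs = classify(x, W, b)
print(label)  # 0: class 0 wins for these toy weights
```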
In an example, in step 4), the specific technical scheme of the vulnerability localization neural network model is as follows:
a) The vulnerability localization neural network model consists of a code semantic encoder, an attention layer, a fully connected layer and a softmax layer.
b) The code semantic encoder consists of a tree embedding module, a tree pooling module and a graph embedding module, and the structures of the three modules are the same as those in the vulnerability identification neural network model.
c) The attention layer assigns different attention weights to different statements. When classifying statement x, statement x is used as the query term to compute the attention weights between statement x and all other statements.
d) The fully connected layer performs linear transformation and nonlinear mapping on the vector representation of the program source code. Assuming the vector representation output by the code semantic encoder is x, the fully connected layer output is logits = σ(w·x + b), where w is the weight matrix, b is the bias term, and σ is the activation function.
e) The softmax layer performs binary classification on the output of the fully connected layer. Assuming the fully connected layer output is logits, the softmax layer output is softmax(logits)_i = exp(logits_i) / Σ_j exp(logits_j). The final classification result is the class with the larger softmax value, and it indicates whether the current statement is the vulnerability location.
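The attention step in c) can be sketched as scaled dot-product attention with statement x as the query. The exact scoring function is not fixed by the text, so this is one plausible instantiation with a hypothetical `attend` helper:

```python
import numpy as np

def attend(query_idx, stmt_vecs):
    # Use statement `query_idx` as the query, compute attention weights
    # against all statement vectors, and return the attention-weighted
    # combination passed on to the classification head.
    S = np.asarray(stmt_vecs)                 # (n, d) statement vectors
    q = S[query_idx]
    scores = S @ q / np.sqrt(S.shape[1])      # similarity of x to each statement
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                       # attention weights, sum to 1
    return alpha @ S                          # attended representation of x

stmts = np.eye(3)                             # three toy statement vectors
ctx = attend(0, stmts)                        # vector used to classify statement 0
```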
In an example, in step 5), the specific technical scheme of the joint training is as follows:
a) The parameters of the tree embedding module, the tree pooling module and the graph embedding module are shared between the code semantic encoders of the two models.
b) The cross entropy loss is used as the loss of the vulnerability identification task and the vulnerability localization task respectively, and the average value of the two losses is defined as the joint loss.
c) The two neural network models are trained by optimizing the joint loss with the Adam optimizer.
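The joint loss defined in b) reduces to averaging the two cross-entropy losses. A sketch with made-up predicted distributions (the Adam optimization step itself is omitted):

```python
import numpy as np

def cross_entropy(probs, label):
    # Negative log-likelihood of the true class.
    return -np.log(probs[label] + 1e-12)

def joint_loss(recog_probs, recog_label, loc_probs, loc_label):
    # Average of the recognition-task and localization-task losses.
    return 0.5 * (cross_entropy(recog_probs, recog_label)
                  + cross_entropy(loc_probs, loc_label))

loss = joint_loss(np.array([0.9, 0.1]), 0,   # recognition: correct, confident
                  np.array([0.2, 0.8]), 1)   # localization: correct, confident
```

Confident, correct predictions on both tasks give a small joint loss; the shared encoder parameters receive gradients from both terms.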
In an example, a trained vulnerability recognition neural network model is used to predict whether a given source code has a vulnerability, and a trained vulnerability localization neural network model is used to predict a specific line number where the vulnerability in the given source code is located.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall be defined by the claims.

Claims (10)

1. A vulnerability discovery method based on multitask learning, comprising the following steps:
constructing a code attribute graph based on an abstract syntax tree, a control flow graph and a program dependency graph of source codes;
separating the tree structure from the graph structure in the code attribute graph;
encoding nodes in the tree structure to obtain an initial node vector representation of the abstract syntax tree;
inputting the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree into a vulnerability identification neural network model to obtain a vulnerability identification result; wherein, the vulnerability recognition neural network model comprises: a first code semantic encoder and a first classification module;
the first code semantic encoder is used for calculating a source code vector representation according to an initial node vector representation, a tree structure and a graph structure of the abstract syntax tree;
the first classification module is used for calculating a vulnerability identification result according to the source code vector representation;
inputting the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree into a vulnerability positioning neural network model to obtain a vulnerability positioning result; wherein, the vulnerability localization neural network model comprises: the second code semantic encoder, the attention layer and the second classification module;
the second code semantic encoder is used for calculating the final vector representation of the nodes in the graph structure according to the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree;
the attention layer is used for giving attention weight to the final vector representation of each node in the graph structure and carrying out attention calculation;
the second classification module is used for calculating a vulnerability positioning result according to the attention calculation result.
2. The method of claim 1, wherein the code attribute graph is constructed by:
1) Generating an attribute graph G_A = (V_A, E_A, λ_A, μ_A) of the abstract syntax tree, wherein each node v_A in the node set V_A is a node of the abstract syntax tree, each edge e_A in the edge set E_A is an edge of the abstract syntax tree, the function λ_A marks each edge e_A as an abstract syntax tree edge, and the function μ_A assigns each node v_A a code statement type attribute and a node code attribute;
2) Generating an attribute graph G_C = (V_C, E_C, λ_C) of the control flow graph, wherein each node v_C in the node set V_C represents a statement or predicate in the abstract syntax tree, each edge e_C in the edge set E_C corresponds to a jump edge of the control flow graph, and the function λ_C assigns a jump condition label to each edge e_C;
3) Generating an attribute graph G_P = (V_P, E_P, λ_P, μ_P) of the program dependency graph, wherein each node v_P in the node set V_P is a node of the abstract syntax tree, each edge e_P in the edge set E_P corresponds to a dependency edge of the program dependency graph, the function λ_P assigns each edge e_P a dependency label, the dependency label comprising control dependence or data dependence, and the function μ_P indicates, for each edge e_P marked as data-dependent, the respective symbol on which it depends, or indicates, for each edge e_P marked as control-dependent, the predicate state of the control dependency;
4) Combining the attribute graph G_A, the attribute graph G_C and the attribute graph G_P to obtain the code attribute graph G = (V, E, λ, μ), where V = V_A, E = E_A ∪ E_C ∪ E_P, λ = λ_A ∪ λ_C ∪ λ_P, and μ = μ_A ∪ μ_P.
3. The method of claim 1, wherein the code attribute graph after separation is G' = (V_T, E_T, V_G, E_G, λ, μ), wherein V_T = {V_T^1, …, V_T^n}, V_T^i being the node set of the abstract syntax tree representing the i-th statement, E_T = {E_T^1, …, E_T^n}, E_T^i being the edge set of the abstract syntax tree representing the i-th statement, V_G = {v_G^1, …, v_G^n}, v_G^i being the root node of the abstract syntax tree of the i-th statement, and |V_G| being equal to the number of statements in the source code.
4. The method of claim 1, wherein the encoding the nodes in the tree structure comprises:
1) Vectorizing the code statement type attribute by using a PACE algorithm to obtain a code statement type representation;
2) Performing vectorization coding on the node code attribute by using a word2vec algorithm to obtain node code representation;
3) And splicing the code statement type representation and the node code representation to obtain an initial node vector representation of the abstract syntax tree.
5. The method of claim 1, wherein the source code vector representation is calculated by:
1) Updating the vector representations of all nodes in the tree structure according to the initial vector representation of the abstract syntax tree;
2) Aggregating vector representations of all nodes in the updated tree structure to obtain sentence vector representations;
3) Initializing the vector representation of the nodes in the graph structure according to the statement vector representation, and updating the vector representation of the nodes in the graph structure by using the convolution based on the graph to obtain the final vector representation of the nodes in the graph structure;
4) The final vector representations of the nodes in the graph structure are aggregated to obtain a vector representation of the source code.
6. The method of claim 1, wherein the first classification module comprises: a fully connected layer and a softmax layer, wherein the fully connected layer is used for performing linear transformation and nonlinear mapping on the vector representation of the program source code, and the softmax layer is used for performing binary classification according to the calculation result of the fully connected layer to obtain the vulnerability identification result.
7. The method of any of claims 1 to 6, wherein the vulnerability identification neural network model and the vulnerability localization neural network model are trained by:
1) Sharing parameters of a tree embedding module, a tree pooling module and a graph embedding module in the first code semantic encoder and the second code semantic encoder;
2) The cross entropy loss is used as a vulnerability identification task loss and a vulnerability positioning task loss respectively, and an average value of the vulnerability identification task loss and the vulnerability positioning task loss is defined as a joint loss;
3) And optimizing the joint loss by using an Adam optimizer, and training a vulnerability identification neural network model and a vulnerability positioning neural network model.
8. A vulnerability discovery apparatus based on multitask learning, comprising:
the construction module is used for constructing a code attribute graph based on the abstract syntax tree, the control flow graph and the program dependency graph of the source code;
the separation module is used for separating the tree structure from the graph structure in the code attribute graph;
the coding module is used for coding the nodes in the tree structure to obtain an initial node vector representation of the abstract syntax tree;
the identifying module is used for inputting the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree into the vulnerability identifying neural network model to obtain a vulnerability identifying result; wherein, the vulnerability recognition neural network model comprises: a first code semantic encoder and a first classification module;
the first code semantic encoder is used for calculating a source code vector representation according to an initial node vector representation, a tree structure and a graph structure of the abstract syntax tree;
the first classification module is used for calculating a vulnerability identification result according to the source code vector representation;
the positioning module is used for inputting the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree into the vulnerability positioning neural network model to obtain a vulnerability positioning result; wherein, the vulnerability localization neural network model comprises: the second code semantic encoder, the attention layer and the second classification module;
the second code semantic encoder is used for calculating the final vector representation of the nodes in the graph structure according to the initial node vector representation, the tree structure and the graph structure of the abstract syntax tree;
the attention layer is used for giving attention weight to the final vector representation of each node in the graph structure and carrying out attention calculation;
the second classification module is used for calculating a vulnerability positioning result according to the attention calculation result.
9. A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method of any of claims 1-7 when run.
10. A computer device comprising a memory, in which a computer program is stored, and a processor arranged to run the computer program to perform the method of any of claims 1-7.
CN202210125058.3A 2022-02-10 2022-02-10 Vulnerability discovery method and device based on multitask learning Pending CN116628695A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210125058.3A CN116628695A (en) 2022-02-10 2022-02-10 Vulnerability discovery method and device based on multitask learning


Publications (1)

Publication Number Publication Date
CN116628695A true CN116628695A (en) 2023-08-22

Family

ID=87619883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210125058.3A Pending CN116628695A (en) 2022-02-10 2022-02-10 Vulnerability discovery method and device based on multitask learning

Country Status (1)

Country Link
CN (1) CN116628695A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117216771A (en) * 2023-11-09 2023-12-12 中机寰宇认证检验股份有限公司 Binary program vulnerability intelligent mining method and system
CN117216771B (en) * 2023-11-09 2024-01-30 中机寰宇认证检验股份有限公司 Binary program vulnerability intelligent mining method and system

Similar Documents

Publication Publication Date Title
CN111639344B (en) Vulnerability detection method and device based on neural network
CN111428044B (en) Method, device, equipment and storage medium for acquiring supervision and identification results in multiple modes
CN108446540B (en) Program code plagiarism type detection method and system based on source code multi-label graph neural network
CN108491228B (en) Binary vulnerability code clone detection method and system
KR20210023452A (en) Apparatus and method for review analysis per attribute
CN112364352B (en) Method and system for detecting and recommending interpretable software loopholes
CN115357904B (en) Multi-class vulnerability detection method based on program slicing and graph neural network
CN111460472A (en) Encryption algorithm identification method based on deep learning graph network
CN113609488B (en) Vulnerability detection method and system based on self-supervised learning and multichannel hypergraph neural network
CN113158676A (en) Professional entity and relationship combined extraction method and system and electronic equipment
CN115033890A (en) Comparison learning-based source code vulnerability detection method and system
CN113904844A (en) Intelligent contract vulnerability detection method based on cross-modal teacher-student network
CN115344863A (en) Malicious software rapid detection method based on graph neural network
CN112861131A (en) Library function identification detection method and system based on convolution self-encoder
CN113392929B (en) Biological sequence feature extraction method based on word embedding and self-encoder fusion
CN108875024B (en) Text classification method and system, readable storage medium and electronic equipment
CN116628695A (en) Vulnerability discovery method and device based on multitask learning
CN114115894A (en) Cross-platform binary code similarity detection method based on semantic space alignment
CN117725592A (en) Intelligent contract vulnerability detection method based on directed graph annotation network
CN117633811A (en) Code vulnerability detection method based on multi-view feature fusion
CN116595537A (en) Vulnerability detection method of generated intelligent contract based on multi-mode features
CN114780403A (en) Software defect prediction method and device based on enhanced code attribute graph
CN111860662B (en) Training method and device, application method and device of similarity detection model
CN117556425B (en) Intelligent contract vulnerability detection method, system and equipment based on graph neural network
CN113537372B (en) Address recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination