CN115033890A - Comparison learning-based source code vulnerability detection method and system - Google Patents
- Publication number
- CN115033890A (application CN202210748624.6A)
- Authority
- CN
- China
- Prior art keywords
- graph
- vulnerability
- code
- unit
- positive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/577—Assessing vulnerabilities and evaluating computer system security
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/42—Syntactic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/43—Checking; Contextual analysis
- G06F8/436—Semantic checking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Stored Programmes (AREA)
Abstract
The invention discloses a source code vulnerability detection method based on contrastive learning. The required sample pairs are generated automatically by customized semantics-preserving heuristics, and vulnerability code is represented at increasing granularity, from code tokens to statements to graphs, so that the code property graph can better express the deep semantic information of vulnerabilities and the relation between vulnerability code and its context is fully mined. By training on small samples with contrastive learning, the high-accuracy requirement under small-sample data is met. A source code vulnerability detection system based on contrastive learning is provided correspondingly.
Description
Technical Field
The invention relates to the field of software security, and in particular to a source code vulnerability detection method and system based on contrastive learning.
Background
Vulnerability detection is an important component of the software maintenance process. Existing deep learning models depend on large numbers of labeled vulnerability samples, which are difficult to obtain in actual development; manual labeling wastes considerable resources, and the correctness of the labels remains questionable. The sample sizes of existing labeled data sets are not large enough to train a detection model with a high detection rate. A detection model trained on existing labeled data sets therefore has low detection accuracy and is difficult to use for vulnerability detection tasks in real environments.
Some works use contrastive learning to detect software vulnerabilities and their types; for example, the literature "Contrastive Learning for Source Code with Structural and Functional Properties" constructs vulnerable samples by manually injecting buggy code, which still relies on prior expert experience and manual work. Other works use deep learning methods to detect software vulnerabilities, such as the document "Deep Learning based Vulnerability Detection: Are We There Yet?". These works do not comprehensively consider the rich semantic information in source code and can hardly cover the semantic features of different vulnerability types, so they have certain limitations in capturing the deep semantic information of code.
Disclosure of Invention
Purpose of the invention: the invention aims to provide a source code vulnerability detection method and system based on contrastive learning that can meet the high-accuracy requirement under small-sample data, automatically generate the required sample pairs through customized semantics-preserving heuristics, and fully mine the relation between vulnerability code and its context by exploiting the deep semantic information of vulnerabilities.
The technical scheme is as follows: the invention provides a source code vulnerability detection method based on contrastive learning, comprising the following steps:
1) acquiring vulnerability data from the vulnerability database NVD and the open source code repository GitHub to construct a vulnerability data set, and performing data enhancement on the vulnerability data set to form positive and negative sample pairs;
2) performing code feature representation learning on the positive and negative sample pairs and converting them into a code property graph, the control flow graph CFG and program dependence graph PDG contained in the code property graph together forming a joint graph;
3) performing graph embedding on the joint graph and applying contrastive learning to obtain the feature joint-graph vectors of the positive and negative sample pairs, adding a classifier, and taking the feature joint-graph vectors of the positive and negative sample pairs as the input of the classifier to obtain prediction labels, thereby constructing a vulnerability detection model;
4) extracting vulnerability code from the vulnerability database NVD and the open source code repository GitHub as the code to be detected, and analyzing with the vulnerability detection model whether the code to be detected contains vulnerabilities.
Further, constructing the vulnerability data set and performing data enhancement in step 1) comprises the following steps:
1.1) collecting vulnerability data from the vulnerability database NVD and the open source code repository GitHub, and extracting vulnerability files and the corresponding vulnerability patch files;
1.2) preprocessing the extracted vulnerability files and corresponding vulnerability patch files, extracting the vulnerability code in each vulnerability file and the patch code in the corresponding patch file at the function level, and removing redundant information, the redundant information comprising header files, comments and declared global parameters, to obtain the vulnerability data set (an illustrative sketch of this preprocessing follows this list);
1.3) selecting the vulnerability code from the vulnerability data set as the hard negative sample and the corresponding patch code as the original sample, applying semantics-preserving heuristics to the original sample so that its syntactic structure changes while its semantics remain unchanged, thereby generating the positive sample, and combining the corresponding positive sample, original sample and hard negative sample as a positive and negative sample pair.
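A minimal sketch of the function-level preprocessing of step 1.2), assuming C/C++ source text; the regular expressions and the helper name strip_redundant_info are illustrative, not part of the invention:

```python
import re

def strip_redundant_info(func_src: str) -> str:
    """Remove comments and header-file includes from one extracted
    function (illustrative approximation of step 1.2)."""
    src = re.sub(r"/\*.*?\*/", "", func_src, flags=re.S)          # block comments
    src = re.sub(r"//[^\n]*", "", src)                            # line comments
    src = re.sub(r"^\s*#\s*include[^\n]*$", "", src, flags=re.M)  # header files
    return "\n".join(line for line in src.splitlines() if line.strip())
```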
Further, step 2) specifically comprises the following steps:
2.1) performing code characterization on all samples in each positive and negative sample pair, and generating a code property graph CPG for the positive and negative sample pairs with the tool Joern, the code property graph CPG comprising an abstract syntax tree AST, a control flow graph CFG and a program dependence graph PDG;
2.2) segmenting the leaf nodes of the abstract syntax tree AST with byte pair encoding BPE, performing word embedding on the obtained code tokens with Word2vec, and generating statement embedding vectors according to the hierarchical structure of the abstract syntax tree AST with a tree-based convolutional neural network TBCNN;
2.3) performing max pooling through a tree aggregation layer to aggregate the vectors of all nodes of a tree into a single vector containing the semantics of the corresponding statement, used as the initial feature vector of the corresponding statement node in the control flow graph CFG and the program dependence graph PDG;
2.4) combining the control flow graph CFG and the program dependence graph PDG into a joint graph, and inputting the statement vectors in the control flow graph CFG, the program dependence graph PDG and the joint graph into a graph-level code representation layer.
Further, step 3) specifically comprises the following steps:
3.1) calculating the attention weight coefficients corresponding to the nodes in the joint graph according to the three edge types (control flow, control dependence and data dependence) in the control flow graph CFG and the program dependence graph PDG;
3.2) the graph-level code representation layer being composed of a plurality of graph attention layers, each graph attention layer corresponding to one edge type, the graph attention layers updating the node vector representations in the joint graph according to the attention weight coefficients;
3.3) averaging the plurality of graph attention layers to form a multi-head attention layer and obtain the final vector representations of the nodes in the graph;
3.4) obtaining the feature joint-graph vector of the positive and negative sample pairs according to the different importance of the nodes in the joint graph;
3.5) taking the original sample and the positive sample as a similar pair, and the original sample and the hard negative sample as a dissimilar pair;
3.6) minimizing the contrastive loss computed with cosine similarity;
3.7) adding a classifier, taking the feature joint-graph vector of the positive and negative sample pairs as the input of the classifier, and training the classifier with the softmax activation function to obtain the prediction label: a prediction label of 1 indicates vulnerable code and a label of 0 indicates non-vulnerable code.
Further, step 4) specifically comprises the following steps:
4.1) extracting vulnerability code from the vulnerability database NVD and the open source code repository GitHub as the code to be detected, and performing step 2) on the code files to be detected to generate the corresponding code property graphs;
4.2) taking the code property graph obtained through code feature representation learning as the input of the vulnerability detection model to obtain the prediction label of the code to be detected;
4.3) outputting the prediction label: a prediction label of 1 indicates vulnerable code and a prediction label of 0 indicates non-vulnerable code.
The invention correspondingly provides a source code vulnerability detection system based on contrastive learning, comprising a data set construction module, a feature learning module, a model construction module and a vulnerability detection module;
the data set construction module is used for collecting vulnerability data from the vulnerability database NVD and the open source code repository GitHub to construct a vulnerability data set, and performing data enhancement on the vulnerability data set to form positive and negative sample pairs;
the feature learning module is used for performing code feature representation learning on the positive and negative sample pairs and converting them into a code property graph, the control flow graph CFG and program dependence graph PDG contained in the code property graph together forming a joint graph;
the model construction module is used for performing graph embedding on the joint graph and applying contrastive learning to obtain the feature joint-graph vectors of the positive and negative sample pairs, adding a classifier, and taking the feature joint-graph vectors of the positive and negative sample pairs as the input of the classifier to obtain prediction labels, thereby constructing a vulnerability detection model;
the vulnerability detection module is used for extracting vulnerability code from the vulnerability database NVD and the open source code repository GitHub as the code to be detected, and analyzing with the vulnerability detection model whether the code to be detected contains vulnerabilities.
Furthermore, the data set construction module comprises an acquisition unit, an extraction unit and a generation unit;
the acquisition unit is used for collecting vulnerability data from the vulnerability database NVD and the open source code repository GitHub, and extracting vulnerability files and the corresponding vulnerability patch files;
the extraction unit is used for preprocessing the extracted vulnerability files and corresponding vulnerability patch files, extracting the vulnerability code in each vulnerability file and the patch code in the corresponding patch file at the function level, and removing redundant information, the redundant information comprising header files, comments and declared global parameters, to obtain the vulnerability data set;
the generation unit is used for selecting the vulnerability code from the vulnerability data set as the hard negative sample and the corresponding patch code as the original sample, applying semantics-preserving heuristics to the original sample so that its syntactic structure changes while its semantics remain unchanged, thereby generating the positive sample, and combining the corresponding positive sample, original sample and hard negative sample as a positive and negative sample pair.
Furthermore, the feature learning module comprises a characterization unit, a statement embedding unit, a node aggregation unit and a graph embedding preparation unit;
the characterization unit is used for performing code characterization on all samples in each positive and negative sample pair, and generating a code property graph CPG for the positive and negative sample pairs with the tool Joern, the code property graph CPG comprising an abstract syntax tree AST, a control flow graph CFG and a program dependence graph PDG;
the statement embedding unit is used for segmenting the leaf nodes of the abstract syntax tree AST with byte pair encoding BPE, performing word embedding on the obtained code tokens with Word2vec, and generating statement embedding vectors according to the hierarchical structure of the abstract syntax tree AST with a tree-based convolutional neural network TBCNN;
the node aggregation unit is used for performing max pooling through a tree aggregation layer to aggregate the vectors of all nodes of a tree into a single vector containing the semantics of the corresponding statement, used as the initial feature vector of the corresponding statement node in the control flow graph CFG and the program dependence graph PDG;
the graph embedding preparation unit is used for combining the control flow graph CFG and the program dependence graph PDG into a joint graph, and inputting the statement vectors in the control flow graph CFG, the program dependence graph PDG and the joint graph into a graph-level code representation layer.
Furthermore, the model construction module comprises an attention weight calculation unit, a graph attention layer unit, a multi-head attention layer unit, a graph node aggregation unit, a comparison rule unit, a loss minimization unit and a classifier training unit;
the attention weight calculation unit is used for calculating the attention weight coefficient corresponding to each node in the joint graph according to the three edge types (control flow, control dependence and data dependence) in the control flow graph CFG and the program dependence graph PDG;
the graph attention layer unit is used for forming a plurality of graph attention layers in the graph-level code representation layer, each graph attention layer corresponding to one edge type, the graph attention layers updating the node vector representations in the joint graph according to the attention weight coefficients;
the multi-head attention layer unit is used for averaging the plurality of graph attention layers to form a multi-head attention layer and obtain the final vector representations of the nodes in the graph;
the graph node aggregation unit is used for obtaining the feature joint-graph vector of the positive and negative sample pairs according to the different importance of the nodes in the joint graph;
the comparison rule unit is used for taking the original sample and the positive sample as a similar pair and the original sample and the hard negative sample as a dissimilar pair;
the loss minimization unit is used for minimizing the contrastive loss computed with cosine similarity;
the classifier training unit is used for adding a classifier, taking the feature joint-graph vector of the positive and negative sample pairs as the input of the classifier, and training the classifier with the softmax activation function to obtain the prediction label: a prediction label of 1 indicates vulnerable code and a prediction label of 0 indicates non-vulnerable code.
Furthermore, the vulnerability detection module comprises a graph generation unit, a detection unit and an output unit;
the graph generation unit is used for extracting vulnerability code from the vulnerability database NVD and the open source code repository GitHub as the code to be detected, and running the feature learning module on the code files to be detected to generate the corresponding code property graphs;
the detection unit is used for taking the code property graph obtained through code feature representation learning as the input of the vulnerability detection model to obtain the prediction label of the code to be detected;
the output unit is used for outputting the prediction label: a prediction label of 1 indicates vulnerable code and a prediction label of 0 indicates non-vulnerable code.
Beneficial effects: compared with the prior art, the invention is notable in that the required sample pairs are generated by customized semantics-preserving heuristics; vulnerability code is represented from code tokens to statements to graphs, so the code property graph can better express the deep semantic information of vulnerabilities and the relation between vulnerability code and its context is fully mined; and training on small samples with contrastive learning meets the high-accuracy requirement under small-sample data.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of the architecture of a tree-based convolutional neural network of the present invention;
FIG. 3 is a schematic diagram of the multi-head graph attention layer of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Embodiment 1
The invention provides a source code vulnerability detection method based on contrastive learning, as shown in FIG. 1, comprising the following steps:
1) acquiring vulnerability data from the vulnerability database NVD and the open source code repository GitHub to construct a vulnerability data set, and performing data enhancement on the vulnerability data set to form positive and negative sample pairs;
1.1) collecting vulnerability data from the vulnerability database NVD and the open source code repository GitHub, and extracting vulnerability files and the corresponding vulnerability patch files;
1.2) preprocessing the extracted vulnerability files and corresponding vulnerability patch files, extracting the vulnerability code in each vulnerability file and the patch code in the corresponding patch file at the function level, and removing redundant information, the redundant information comprising header files, comments and declared global parameters, to obtain the vulnerability data set;
1.3) selecting the vulnerability code from the vulnerability data set as the hard negative sample and the corresponding patch code as the original sample, applying semantics-preserving heuristics to the original sample so that its syntactic structure changes while its semantics remain unchanged, thereby generating the positive sample, and combining the corresponding positive sample, original sample and hard negative sample as a positive and negative sample pair;
The semantics-preserving heuristic rules are as follows (rule 2 is illustrated by the sketch after this list):
Rule 1: introduce a new else branch; this rule introduces a new else branch statement into an existing if conditional statement, or negates the condition of the if statement;
Rule 2: split an AND-ed conditional expression; an if statement whose condition is the conjunction of conditions C1 and C2 is rewritten as a nested if structure;
Rule 3: introduce a guard around the for loop; by adding an if (true) guard statement around the loop, a program containing the for loop statement is changed into a new program structure;
Rule 4: introduce a guard around the while loop; by adding a guarding while statement around the loop, a program containing the while loop statement is changed into a new program structure;
Rule 5: add a conditional break statement inside the while loop; a while loop whose condition expression is C1 can be changed into an infinite loop, with a conditional break statement on C1 added to the loop body (the loop breaks once C1 no longer holds).
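As an illustration of rule 2, a minimal sketch, assuming the condition is a top-level conjunction in C-like syntax; the helper names are ours, and a real implementation would operate on the AST rather than on text:

```python
def split_top_level_and(cond: str) -> list[str]:
    """Split a condition string on top-level `&&`, ignoring parenthesised groups."""
    parts, depth, start, i = [], 0, 0, 0
    while i < len(cond):
        if cond[i] == "(":
            depth += 1
        elif cond[i] == ")":
            depth -= 1
        elif depth == 0 and cond.startswith("&&", i):
            parts.append(cond[start:i].strip())
            start = i + 2
            i += 1
        i += 1
    parts.append(cond[start:].strip())
    return parts

def nest_if(cond: str, body: str, indent: str = "    ") -> str:
    """Rule 2: render `if (C1 && C2) body` as the equivalent nested-if structure."""
    conjuncts, out, pad = split_top_level_and(cond), [], ""
    for c in conjuncts:
        out.append(f"{pad}if ({c}) {{")
        pad += indent
    out.append(pad + body.strip())
    for _ in conjuncts:
        pad = pad[: -len(indent)]
        out.append(pad + "}")
    return "\n".join(out)

# nest_if("C1 && C2", "handle();") produces:
# if (C1) {
#     if (C2) {
#         handle();
#     }
# }
```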
2) Code feature representation learning is performed on the positive and negative sample pairs, which are converted into a code property graph; the control flow graph CFG and the program dependence graph PDG contained in the code property graph together form a joint graph;
2.1) code characterization is performed on all samples in each positive and negative sample pair, and a code property graph CPG is generated for the positive and negative sample pairs with the tool Joern, the code property graph CPG comprising an abstract syntax tree AST, a control flow graph CFG and a program dependence graph PDG;
2.2) referring to FIG. 2, the leaf nodes of the abstract syntax tree AST are segmented with byte pair encoding BPE, word embedding is performed on the obtained code tokens with Word2vec, and a tree-based convolutional neural network TBCNN generates statement embedding vectors according to the hierarchical structure of the abstract syntax tree AST; a sketch of the token-embedding stage follows;
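Before the TBCNN itself, a minimal sketch of the token-embedding stage of step 2.2), assuming the BPE segmentation has already been applied; gensim's Word2Vec and the 128-dimensional size are illustrative choices:

```python
from gensim.models import Word2Vec

# Each training "sentence" is the BPE token sequence of one AST leaf statement.
bpe_token_lists = [
    ["buf", "=", "mal", "loc", "(", "len", ")", ";"],
    ["if", "(", "buf", "==", "NULL", ")", "return", ";"],
]

w2v = Word2Vec(sentences=bpe_token_lists, vector_size=128, window=5,
               min_count=1, sg=1, epochs=50)
token_vec = w2v.wv["buf"]  # 128-dimensional embedding of the token "buf"
```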
the convolutional neural network TBCNN based on the tree executes the following process:
the parent node representation is updated from the child node representations by sliding the convolution kernel of a triangle over the tree. Wherein the updated calculation formula is:
in the formula (I), the compound is shown in the specification,is the updated parent node vector representation, H is the set of nodes in the triangle convolution kernel that contain the parent node itself,is a vector representation of nodes in the H set, sigma represents an activation function, W q And b represents a weight matrix and a bias term, respectively; the main components of TBCNN include: representation and coding of vectors, tree-based convolution, dynamic pooling, a fully-connected concealment layer and an output layer;
2.3) through a tree aggregation layer, max pooling is performed to aggregate the vectors of all nodes of a tree into a single vector containing the semantics of the corresponding statement, used as the initial feature vector of the corresponding statement node in the control flow graph CFG and the program dependence graph PDG:

$$h_G = \{\operatorname{MaxPool}(\mathrm{TE}(\mathrm{AST}_n)) \mid n \in V_G\}$$

where $h_G$ is the set of vector representations of the CFG and PDG nodes after the tree aggregation layer, $\mathrm{TE}(\cdot)$ denotes the tree embedding process, $V_G$ is the node set of the CFG and PDG, and $\mathrm{AST}_n$ is the AST whose root node is $n$;
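A minimal numeric sketch of the two formulas above, assuming tanh as the activation and one weight matrix per kernel position; all names are illustrative:

```python
import numpy as np

def tbcnn_window(parent_vec, child_vecs, weights, bias):
    """Triangular-kernel update: combine the parent and its children (the set H)
    into the updated parent representation h'_p = tanh(sum_q W_q h_q + b)."""
    H = [parent_vec] + child_vecs
    z = sum(W @ h for W, h in zip(weights, H)) + bias
    return np.tanh(z)

def tree_aggregate(node_vecs):
    """Tree aggregation layer: max-pool every node vector of one statement's AST
    into the single statement vector that initialises its CFG/PDG node."""
    return np.max(np.stack(node_vecs), axis=0)
```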
2.4) the control flow graph CFG and the program dependence graph PDG are combined into a joint graph, and the statement vectors in the control flow graph CFG, the program dependence graph PDG and the joint graph are input into the graph-level code representation layer; the joint graph can be kept as one shared node set with three typed edge sets, as sketched below.
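A minimal sketch of that data structure; the edge lists here are illustrative, and in practice they come from the Joern CPG:

```python
import numpy as np

stmt_vectors = {0: np.zeros(128), 1: np.zeros(128), 2: np.zeros(128)}  # placeholders
joint_graph = {
    "nodes": stmt_vectors,                 # node id -> initial statement vector
    "edges": {
        "control_flow": [(0, 1), (1, 2)],  # from the CFG
        "control_dependence": [(0, 2)],    # from the PDG
        "data_dependence": [(0, 1)],       # from the PDG
    },
}
```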
3) Graph embedding is performed on the joint graph and contrastive learning is applied to obtain the feature joint-graph vectors of the positive and negative sample pairs; a classifier is added, and the feature joint-graph vectors of the positive and negative sample pairs are taken as its input to obtain prediction labels, thereby constructing the vulnerability detection model;
3.1) the attention weight coefficients corresponding to the nodes in the joint graph are calculated for the three edge types (control flow, control dependence and data dependence) of the control flow graph CFG and the program dependence graph PDG:

$$\alpha_{ij}^{e} = \frac{\exp\big(\mathbf{a}_e^{\top}[W_e h_i \,\|\, W_e h_j]\big)}{\sum_{k \in N_e} \exp\big(\mathbf{a}_e^{\top}[W_e h_i \,\|\, W_e h_k]\big)}$$

where $N_e$ is the set of neighbors of node $i$ under edge type $e$ (including node $i$ itself), $\alpha_{ij}^{e}$ is the attention weight between node $i$ and node $j$ ($j \in N_e$) under edge type $e$, $\|$ denotes the concatenation operation, $\exp(\cdot)$ is the exponential function with the natural constant $e$ as base, and $\mathbf{a}_e^{\top}$ is the transpose of a learnable weight vector; $\mathbf{a}_e^{\top}[W_e h_i \,\|\, W_e h_j]$ can be regarded as the degree of association between nodes $i$ and $j$ under edge type $e$: the larger its value, the stronger the association between the two nodes and the larger the attention weight coefficient;
3.2) the graph-level code representation layer is composed of a plurality of graph attention layers, each corresponding to one edge type; a graph attention layer updates the node vector representations in the joint graph according to the attention weight coefficients obtained in step 3.1):

$$h_i^{e} = \sigma\Big(\sum_{j \in N_e} \alpha_{ij}^{e} W_e h_j\Big)$$

where $h_i^{e}$ is the updated vector representation of node $i$ under edge type $e$, $\sigma$ is the activation function, and $W_e$ is the weight matrix of edge type $e$;
3.3) referring to FIG. 3, the plurality of graph attention layers are averaged to form a multi-head attention layer, and the final vector representation of each node in the graph is obtained:

$$h_i' = \sigma\Big(\frac{1}{|E|} \sum_{e \in E} \sum_{j \in N_e} \alpha_{ij}^{e} W_e h_j\Big)$$

where $E$ is the set of edge types; the multi-head attention mechanism concatenates or averages the features of the heads to obtain the new feature representation;
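A minimal sketch of steps 3.1) to 3.3) on the joint-graph structure above, assuming tanh as the activation; W and a hold one weight matrix and one attention vector per edge type, and all names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_head_gat_layer(h, edges_by_type, W, a):
    """One attention head per edge type; the heads are averaged over edge types."""
    new_h = {}
    for i in h:
        heads = []
        for e, edges in edges_by_type.items():
            nbrs = [j for (u, j) in edges if u == i] + [i]  # N_e includes i itself
            scores = np.array([a[e] @ np.concatenate([W[e] @ h[i], W[e] @ h[j]])
                               for j in nbrs])
            alpha = softmax(scores)                          # attention weights
            heads.append(sum(al * (W[e] @ h[j]) for al, j in zip(alpha, nbrs)))
        new_h[i] = np.tanh(np.mean(heads, axis=0))           # average the heads
    return new_h
```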
3.4) the feature joint-graph vector of the positive and negative sample pairs is obtained according to the different importance of the nodes in the joint graph:

$$h_{\mathrm{graph}} = \mathrm{AGG}\big(\{\alpha_n h_n \mid n \in V_G\}\big)$$

where $\mathrm{AGG}$ is the aggregation function, $V_G$ is the node set of the graph structure, and $\alpha_n$, the attention weight of node $n$, is expressed as

$$\alpha_n = \operatorname{softmax}\big(\mathbf{w}^{\top} h_n\big)$$

where $\mathbf{w}^{\top}$ is the transpose of a learnable weight vector;
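A minimal sketch of this readout, taking summation as the aggregation function AGG (an assumption; other poolings would fit the formula equally well):

```python
import numpy as np

def graph_readout(h, w):
    """Weight each node vector by its learned importance and sum into the
    single feature joint-graph vector of step 3.4)."""
    nodes = list(h)
    scores = np.array([w @ h[n] for n in nodes])  # w^T h_n per node
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                          # softmax over the nodes
    return sum(a * h[n] for a, n in zip(alpha, nodes))
```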
3.5) taking the original sample and the positive sample as a similar pair, and taking the original sample and the hard negative sample as a dissimilar pair;
3.6) cosine similarity is used to minimize the loss; the contrastive learning loss is

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\operatorname{sim}(h_i, h_i^{+})/\tau}}{\sum_{j=1}^{N} \big(e^{\operatorname{sim}(h_i, h_j^{+})/\tau} + e^{\operatorname{sim}(h_i, h_j^{-})/\tau}\big)}$$

where $\operatorname{sim}(\cdot)$ is the cosine similarity function, $N$ is the number of positive and negative sample pairs, $h$ is the vector representation of the original sample graph, $h^{+}$ is the vector representation of the positive sample graph, $h^{-}$ is the vector representation of the negative sample graph, and $\tau$ is a temperature parameter;
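A minimal PyTorch sketch of this loss under the reconstruction above (an InfoNCE-style loss with in-batch positives and hard negatives; the exact batch arrangement is an assumption):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h, h_pos, h_neg, tau=0.1):
    """h, h_pos, h_neg: (N, d) graph vectors of the original samples, their
    positives, and their hard negatives; cosine similarities scaled by tau."""
    sim_pos = F.cosine_similarity(h.unsqueeze(1), h_pos.unsqueeze(0), dim=-1) / tau
    sim_neg = F.cosine_similarity(h.unsqueeze(1), h_neg.unsqueeze(0), dim=-1) / tau
    logits = torch.cat([sim_pos, sim_neg], dim=1)      # (N, 2N)
    labels = torch.arange(h.size(0), device=h.device)  # positive sits on the diagonal
    return F.cross_entropy(logits, labels)
```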
3.7) a classifier is added, the feature joint-graph vector of the positive and negative sample pairs is taken as the input of the classifier, and the classifier is trained with the softmax activation function to obtain the prediction label: a prediction label of 1 indicates vulnerable code and a prediction label of 0 indicates non-vulnerable code.
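A minimal sketch of such a classifier head; the 128-dimensional input and the single linear layer are assumptions:

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Linear(128, 2), nn.Softmax(dim=-1))

graph_vec = torch.randn(1, 128)     # placeholder feature joint-graph vector
probs = classifier(graph_vec)       # class probabilities
pred_label = probs.argmax(dim=-1)   # 1 = vulnerable code, 0 = non-vulnerable
```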
4) Vulnerability code is extracted from the vulnerability database NVD and the open source code repository GitHub as the code to be detected, and the vulnerability detection model is used to analyze whether the code to be detected contains vulnerabilities.
4.1) vulnerability code is extracted from the vulnerability database NVD and the open source code repository GitHub as the code to be detected, and step 2) is performed on the code files to be detected to generate the corresponding code property graphs;
4.2) the code property graph obtained through code feature representation learning is taken as the input of the vulnerability detection model to obtain the prediction label of the code to be detected;
4.3) the prediction label is output: a prediction label of 1 indicates vulnerable code and a prediction label of 0 indicates non-vulnerable code.
Embodiment 2
Corresponding to the contrastive-learning-based source code vulnerability detection method of embodiment 1, embodiment 2 provides a source code vulnerability detection system based on contrastive learning, comprising a data set construction module, a feature learning module, a model construction module and a vulnerability detection module, as shown in FIG. 1;
the data set construction module is used for collecting vulnerability data from the vulnerability database NVD and the open source code repository GitHub to construct a vulnerability data set, and performing data enhancement on the vulnerability data set to form positive and negative sample pairs;
the data set construction module comprises an acquisition unit, an extraction unit and a generation unit;
the acquisition unit is used for collecting vulnerability data from the vulnerability database NVD and the open source code repository GitHub, and extracting vulnerability files and the corresponding vulnerability patch files;
the extraction unit is used for preprocessing the extracted vulnerability files and corresponding vulnerability patch files, extracting the vulnerability code in each vulnerability file and the patch code in the corresponding patch file at the function level, and removing redundant information, the redundant information comprising header files, comments and declared global parameters, to obtain the vulnerability data set;
the generation unit is used for selecting the vulnerability code from the vulnerability data set as the hard negative sample and the corresponding patch code as the original sample, applying semantics-preserving heuristics to the original sample so that its syntactic structure changes while its semantics remain unchanged, thereby generating the positive sample, and combining the corresponding positive sample, original sample and hard negative sample as a positive and negative sample pair;
wherein the semantics-preserving heuristic rules are as follows:
Rule 1: introduce a new else branch; this rule introduces a new else branch statement into an existing if conditional statement, or negates the condition of the if statement;
Rule 2: split an AND-ed conditional expression; an if statement whose condition is the conjunction of conditions C1 and C2 is rewritten as a nested if structure;
Rule 3: introduce a guard around the for loop; by adding an if (true) guard statement around the loop, a program containing the for loop statement is changed into a new program structure;
Rule 4: introduce a guard around the while loop; by adding a guarding while statement around the loop, a program containing the while loop statement is changed into a new program structure;
Rule 5: add a conditional break statement inside the while loop; a while loop whose condition expression is C1 can be changed into an infinite loop, with a conditional break statement on C1 added to the loop body (the loop breaks once C1 no longer holds).
The feature learning module is used for performing code feature representation learning on the positive and negative sample pairs and converting them into a code property graph, the control flow graph CFG and program dependence graph PDG contained in the code property graph together forming a joint graph;
the feature learning module comprises a characterization unit, a statement embedding unit, a node aggregation unit and a graph embedding preparation unit;
the characterization unit is used for performing code characterization on all samples in each positive and negative sample pair, and generating a code property graph CPG for the positive and negative sample pairs with the tool Joern, the code property graph CPG comprising an abstract syntax tree AST, a control flow graph CFG and a program dependence graph PDG;
referring to FIG. 2, the statement embedding unit is used for segmenting the leaf nodes of the abstract syntax tree AST with byte pair encoding BPE, performing word embedding on the obtained code tokens with Word2vec, and generating statement embedding vectors according to the hierarchical structure of the abstract syntax tree AST with a tree-based convolutional neural network TBCNN;
the convolutional neural network TBCNN based on the tree executes the following process:
the parent node representation is updated from the child node representations by sliding the convolution kernels of one triangle over the tree. Wherein the updated calculation formula is:
in the formula (I), the compound is shown in the specification,is the updated parent node vector representation, H is the set of nodes in the triangular convolution kernel that contain the parent node itself,is a vector representation of nodes in the H set, sigma represents an activation function, W q And b represents a weight matrix and a bias term, respectively; the main components of TBCNN include: representation and coding of vectors, tree-based convolution, dynamic pooling, a fully-connected concealment layer and an output layer;
the node aggregation unit is used for performing max pooling through a tree aggregation layer to aggregate the vectors of all nodes of a tree into a single vector containing the semantics of the corresponding statement, used as the initial feature vector of the corresponding statement node in the control flow graph CFG and the program dependence graph PDG:

$$h_G = \{\operatorname{MaxPool}(\mathrm{TE}(\mathrm{AST}_n)) \mid n \in V_G\}$$

where $h_G$ is the set of vector representations of the CFG and PDG nodes after the tree aggregation layer, $\mathrm{TE}(\cdot)$ denotes the tree embedding process, $V_G$ is the node set of the CFG and PDG, and $\mathrm{AST}_n$ is the AST whose root node is $n$;
the graph embedding preparation unit is used for combining the control flow graph CFG and the program dependence graph PDG into a joint graph, and inputting the statement vectors in the control flow graph CFG, the program dependence graph PDG and the joint graph into the graph-level code representation layer.
The model construction module is used for performing graph embedding on the joint graph and applying contrastive learning to obtain the feature joint-graph vectors of the positive and negative sample pairs, adding a classifier, and taking the feature joint-graph vectors of the positive and negative sample pairs as the input of the classifier to obtain prediction labels, thereby constructing a vulnerability detection model;
the model construction module comprises an attention weight calculation unit, a graph attention layer unit, a multi-head attention layer unit, a graph node aggregation unit, a comparison rule unit, a loss minimization unit and a classifier training unit;
the attention weight calculation unit is used for calculating the attention weight coefficient corresponding to each node in the joint graph according to the three edge types (control flow, control dependence and data dependence) in the control flow graph CFG and the program dependence graph PDG:

$$\alpha_{ij}^{e} = \frac{\exp\big(\mathbf{a}_e^{\top}[W_e h_i \,\|\, W_e h_j]\big)}{\sum_{k \in N_e} \exp\big(\mathbf{a}_e^{\top}[W_e h_i \,\|\, W_e h_k]\big)}$$

where $N_e$ is the set of neighbors of node $i$ under edge type $e$ (including node $i$ itself), $\alpha_{ij}^{e}$ is the attention weight between node $i$ and node $j$ ($j \in N_e$) under edge type $e$, $\|$ denotes the concatenation operation, $\exp(\cdot)$ is the exponential function with the natural constant $e$ as base, and $\mathbf{a}_e^{\top}$ is the transpose of a learnable weight vector; $\mathbf{a}_e^{\top}[W_e h_i \,\|\, W_e h_j]$ can be regarded as the degree of association between nodes $i$ and $j$ under edge type $e$: the larger its value, the stronger the association between the two nodes and the larger the attention weight coefficient;
the graph attention layer unit is used for forming a plurality of graph attention layers in the graph-level code representation layer, each corresponding to one edge type; a graph attention layer updates the node vector representations in the joint graph according to the attention weight coefficients:

$$h_i^{e} = \sigma\Big(\sum_{j \in N_e} \alpha_{ij}^{e} W_e h_j\Big)$$

where $h_i^{e}$ is the updated vector representation of node $i$ under edge type $e$, $\sigma$ is the activation function, and $W_e$ is the weight matrix of edge type $e$;
the multi-head attention layer unit is used for averaging the plurality of graph attention layers to form a multi-head attention layer, obtaining the final vector representation of each node in the graph:

$$h_i' = \sigma\Big(\frac{1}{|E|} \sum_{e \in E} \sum_{j \in N_e} \alpha_{ij}^{e} W_e h_j\Big)$$

where $E$ is the set of edge types; the multi-head attention mechanism concatenates or averages the features of the heads to obtain the new feature representation;
the graph node aggregation unit is used for obtaining the feature joint-graph vector of the positive and negative sample pairs according to the different importance of the nodes in the joint graph:

$$h_{\mathrm{graph}} = \mathrm{AGG}\big(\{\alpha_n h_n \mid n \in V_G\}\big)$$

where $\mathrm{AGG}$ is the aggregation function, $V_G$ is the node set of the graph structure, and $\alpha_n$, the attention weight of node $n$, is expressed as

$$\alpha_n = \operatorname{softmax}\big(\mathbf{w}^{\top} h_n\big)$$

where $\mathbf{w}^{\top}$ is the transpose of a learnable weight vector;
the comparison rule unit is used for taking the original sample and the positive sample as a similar pair and taking the original sample and the hard negative sample as a dissimilar pair;
the loss minimization unit is used for minimizing loss by using cosine similarity, and the specific comparison learning loss is as follows:
in the formula, sim () is a cosine similarity function, N is the number of positive and negative sample pairs, h is a vector representation of an original sample graph, and h is + For vector characterization of the positive sample plot, h - For vector characterization of the negative sample graph, τ is a temperature parameter;
the classifier training unit is used for adding a classifier, taking the feature joint-graph vector of the positive and negative sample pairs as the input of the classifier, and training the classifier with the softmax activation function to obtain the prediction label: a prediction label of 1 indicates vulnerable code and a prediction label of 0 indicates non-vulnerable code.
The vulnerability detection module extracts vulnerability code from the vulnerability database NVD and the open source code repository GitHub as the code to be detected, and uses the vulnerability detection model to analyze whether the code to be detected contains vulnerabilities;
the vulnerability detection module comprises a graph generation unit, a detection unit and an output unit;
the graph generation unit is used for extracting vulnerability code from the vulnerability database NVD and the open source code repository GitHub as the code to be detected, and running the feature learning module on the code files to be detected to generate the corresponding code property graphs;
the detection unit is used for taking the code property graph obtained through code feature representation learning as the input of the vulnerability detection model to obtain the prediction label of the code to be detected;
the output unit is used for outputting the prediction label: a prediction label of 1 indicates vulnerable code and a prediction label of 0 indicates non-vulnerable code.
Claims (10)
1. A source code vulnerability detection method based on contrastive learning, characterized by comprising the following steps:
1) acquiring vulnerability data from the vulnerability database NVD and the open source code repository GitHub to construct a vulnerability data set, and performing data enhancement on the vulnerability data set to form positive and negative sample pairs;
2) performing code feature representation learning on the positive and negative sample pairs and converting them into a code property graph, the control flow graph CFG and program dependence graph PDG contained in the code property graph together forming a joint graph;
3) performing graph embedding on the joint graph and applying contrastive learning to obtain the feature joint-graph vectors of the positive and negative sample pairs, adding a classifier, and taking the feature joint-graph vectors of the positive and negative sample pairs as the input of the classifier to obtain prediction labels, thereby constructing a vulnerability detection model;
4) extracting vulnerability code from the vulnerability database NVD and the open source code repository GitHub as the code to be detected, and analyzing with the vulnerability detection model whether the code to be detected contains vulnerabilities.
2. The source code vulnerability detection method based on contrastive learning according to claim 1, wherein step 1) specifically comprises the following steps:
1.1) collecting vulnerability data from the vulnerability database NVD and the open source code repository GitHub, and extracting vulnerability files and the corresponding vulnerability patch files;
1.2) preprocessing the extracted vulnerability files and corresponding vulnerability patch files, extracting the vulnerability code in each vulnerability file and the patch code in the corresponding patch file at the function level, and removing redundant information, the redundant information comprising header files, comments and declared global parameters, to obtain the vulnerability data set;
1.3) selecting the vulnerability code from the vulnerability data set as the hard negative sample and the corresponding patch code as the original sample, applying semantics-preserving heuristics to the original sample so that its syntactic structure changes while its semantics remain unchanged, thereby generating the positive sample, and combining the corresponding positive sample, original sample and hard negative sample as a positive and negative sample pair.
3. The source code vulnerability detection method based on contrastive learning according to claim 2, wherein step 2) specifically comprises the following steps:
2.1) performing code characterization on all samples in each positive and negative sample pair, and generating a code property graph CPG for the positive and negative sample pairs with the tool Joern, the code property graph CPG comprising an abstract syntax tree AST, a control flow graph CFG and a program dependence graph PDG;
2.2) segmenting the leaf nodes of the abstract syntax tree AST with byte pair encoding BPE, performing word embedding on the obtained code tokens with Word2vec, and generating statement embedding vectors according to the hierarchical structure of the abstract syntax tree AST with a tree-based convolutional neural network TBCNN;
2.3) performing max pooling through a tree aggregation layer to aggregate the vectors of all nodes of a tree into a single vector containing the semantics of the corresponding statement, used as the initial feature vector of the corresponding statement node in the control flow graph CFG and the program dependence graph PDG;
2.4) combining the control flow graph CFG and the program dependence graph PDG into a joint graph, and inputting the statement vectors in the control flow graph CFG, the program dependence graph PDG and the joint graph into a graph-level code representation layer.
4. The source code vulnerability detection method based on contrastive learning according to claim 3, wherein step 3) specifically comprises the following steps:
3.1) calculating the attention weight coefficients corresponding to the nodes in the joint graph according to the three edge types (control flow, control dependence and data dependence) in the control flow graph CFG and the program dependence graph PDG;
3.2) the graph-level code representation layer being composed of a plurality of graph attention layers, each graph attention layer corresponding to one edge type, the graph attention layers updating the node vector representations in the joint graph according to the attention weight coefficients;
3.3) averaging the plurality of graph attention layers to form a multi-head attention layer and obtain the final vector representations of the nodes in the graph;
3.4) obtaining the feature joint-graph vector of the positive and negative sample pairs according to the different importance of the nodes in the joint graph;
3.5) taking the original sample and the positive sample as a similar pair, and the original sample and the hard negative sample as a dissimilar pair;
3.6) minimizing the contrastive loss computed with cosine similarity;
3.7) adding a classifier, taking the feature joint-graph vector of the positive and negative sample pairs as the input of the classifier, and training the classifier with the softmax activation function to obtain the prediction label, a prediction label of 1 indicating vulnerable code and a prediction label of 0 indicating non-vulnerable code.
5. The source code vulnerability detection method based on contrastive learning according to claim 1, wherein step 4) specifically comprises the following steps:
4.1) extracting vulnerability code from the vulnerability database NVD and the open source code repository GitHub as the code to be detected, and performing step 2) on the code files to be detected to generate the corresponding code property graphs;
4.2) taking the code property graph obtained through code feature representation learning as the input of the vulnerability detection model to obtain the prediction label of the code to be detected;
4.3) outputting the prediction label, a prediction label of 1 indicating vulnerable code and a prediction label of 0 indicating non-vulnerable code.
6. A source code vulnerability detection system based on contrastive learning, characterized by comprising a data set construction module, a feature learning module, a model construction module and a vulnerability detection module;
the data set construction module is used for collecting vulnerability data from the vulnerability database NVD and the open source code repository GitHub to construct a vulnerability data set, and performing data enhancement on the vulnerability data set to form positive and negative sample pairs;
the feature learning module is used for performing code feature representation learning on the positive and negative sample pairs and converting them into a code property graph, the control flow graph CFG and program dependence graph PDG contained in the code property graph together forming a joint graph;
the model construction module is used for performing graph embedding on the joint graph and applying contrastive learning to obtain the feature joint-graph vectors of the positive and negative sample pairs, adding a classifier, and taking the feature joint-graph vectors of the positive and negative sample pairs as the input of the classifier to obtain prediction labels, thereby constructing a vulnerability detection model;
the vulnerability detection module is used for extracting vulnerability code from the vulnerability database NVD and the open source code repository GitHub as the code to be detected, and analyzing with the vulnerability detection model whether the code to be detected contains vulnerabilities.
7. The source code vulnerability detection system based on contrastive learning according to claim 6, wherein the data set construction module further comprises an acquisition unit, an extraction unit and a generation unit;
the acquisition unit is used for collecting vulnerability data from the vulnerability database NVD and the open source code repository GitHub, and extracting vulnerability files and the corresponding vulnerability patch files;
the extraction unit is used for preprocessing the extracted vulnerability files and corresponding vulnerability patch files, extracting the vulnerability code in each vulnerability file and the patch code in the corresponding patch file at the function level, and removing redundant information, the redundant information comprising header files, comments and declared global parameters, to obtain the vulnerability data set;
the generation unit is used for selecting the vulnerability code from the vulnerability data set as the hard negative sample and the corresponding patch code as the original sample, applying semantics-preserving heuristics to the original sample so that its syntactic structure changes while its semantics remain unchanged, thereby generating the positive sample, and combining the corresponding positive sample, original sample and hard negative sample as a positive and negative sample pair.
8. The source code vulnerability detection system based on contrastive learning according to claim 7, wherein the feature learning module further comprises a characterization unit, a statement embedding unit, a node aggregation unit and a graph embedding preparation unit;
the characterization unit is used for performing code characterization on all samples in each positive and negative sample pair, and generating a code property graph CPG for the positive and negative sample pairs with the tool Joern, the code property graph CPG comprising an abstract syntax tree AST, a control flow graph CFG and a program dependence graph PDG;
the statement embedding unit is used for segmenting the leaf nodes of the abstract syntax tree AST with byte pair encoding BPE, performing word embedding on the obtained code tokens with Word2vec, and generating statement embedding vectors according to the hierarchical structure of the abstract syntax tree AST with a tree-based convolutional neural network TBCNN;
the node aggregation unit is used for performing max pooling through a tree aggregation layer to aggregate the vectors of all nodes of a tree into a single vector containing the semantics of the corresponding statement, used as the initial feature vector of the corresponding statement node in the control flow graph CFG and the program dependence graph PDG;
the graph embedding preparation unit is used for combining the control flow graph CFG and the program dependence graph PDG into a joint graph, and inputting the statement vectors in the control flow graph CFG, the program dependence graph PDG and the joint graph into a graph-level code representation layer.
9. The source code vulnerability detection system based on contrastive learning according to claim 8, wherein the model construction module further comprises an attention weight calculation unit, a graph attention layer unit, a multi-head attention layer unit, a graph node aggregation unit, a comparison rule unit, a loss minimization unit and a classifier training unit;
the attention weight calculation unit is used for calculating the attention weight coefficient corresponding to each node in the joint graph according to the three edge types (control flow, control dependence and data dependence) in the control flow graph CFG and the program dependence graph PDG;
the graph attention layer unit is used for forming a plurality of graph attention layers in the graph-level code representation layer, each graph attention layer corresponding to one edge type, the graph attention layers updating the node vector representations in the joint graph according to the attention weight coefficients;
the multi-head attention layer unit is used for averaging the plurality of graph attention layers to form a multi-head attention layer and obtain the final vector representations of the nodes in the graph;
the graph node aggregation unit is used for obtaining the feature joint-graph vector of the positive and negative sample pairs according to the different importance of the nodes in the joint graph;
the comparison rule unit is used for taking the original sample and the positive sample as a similar pair and the original sample and the hard negative sample as a dissimilar pair;
the loss minimization unit is used for minimizing the contrastive loss computed with cosine similarity;
the classifier training unit is used for adding a classifier, taking the feature joint-graph vector of the positive and negative sample pairs as the input of the classifier, and training the classifier with the softmax activation function to obtain the prediction label, a prediction label of 1 indicating vulnerable code and a prediction label of 0 indicating non-vulnerable code.
10. The source code vulnerability detection system based on contrastive learning according to claim 6, wherein the vulnerability detection module further comprises a graph generation unit, a detection unit and an output unit;
the graph generation unit is used for extracting vulnerability code from the vulnerability database NVD and the open source code repository GitHub as the code to be detected, and running the feature learning module on the code files to be detected to generate the corresponding code property graphs;
the detection unit is used for taking the code property graph obtained through code feature representation learning as the input of the vulnerability detection model to obtain the prediction label of the code to be detected;
the output unit is used for outputting the prediction label, a prediction label of 1 indicating vulnerable code and a prediction label of 0 indicating non-vulnerable code.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210748624.6A CN115033890A (en) | 2022-06-29 | 2022-06-29 | Comparison learning-based source code vulnerability detection method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210748624.6A CN115033890A (en) | 2022-06-29 | 2022-06-29 | Comparison learning-based source code vulnerability detection method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115033890A (en) | 2022-09-09
Family
ID=83126921
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210748624.6A Pending CN115033890A (en) | 2022-06-29 | 2022-06-29 | Comparison learning-based source code vulnerability detection method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115033890A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115658514A (en) * | 2022-10-28 | 2023-01-31 | 苏州棱镜七彩信息科技有限公司 | Vulnerability patch positioning method |
CN115859307A (en) * | 2022-12-26 | 2023-03-28 | 哈尔滨工业大学 | Similar vulnerability detection method based on tree attention and weighted graph matching |
CN116048454A (en) * | 2023-03-06 | 2023-05-02 | 山东师范大学 | Code rearrangement method and system based on iterative comparison learning |
CN116048454B (en) * | 2023-03-06 | 2023-06-16 | 山东师范大学 | Code rearrangement method and system based on iterative comparison learning |
CN117056940A (en) * | 2023-10-12 | 2023-11-14 | 中关村科学城城市大脑股份有限公司 | Method, device, electronic equipment and medium for repairing loopholes of server system |
CN117056940B (en) * | 2023-10-12 | 2024-01-16 | 中关村科学城城市大脑股份有限公司 | Method, device, electronic equipment and medium for repairing loopholes of server system |
CN117473510A (en) * | 2023-12-26 | 2024-01-30 | 南京邮电大学 | Automatic vulnerability discovery technology based on relationship between graph neural network and vulnerability patch |
CN117473510B (en) * | 2023-12-26 | 2024-03-26 | 南京邮电大学 | Automatic vulnerability discovery technology based on relationship between graph neural network and vulnerability patch |
CN118672594A (en) * | 2024-08-26 | 2024-09-20 | 山东大学 | Software defect prediction method and system |
CN118672594B (en) * | 2024-08-26 | 2024-10-29 | 山东大学 | Software defect prediction method and system |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |