CN115033890A - Contrastive learning-based source code vulnerability detection method and system - Google Patents

Contrastive learning-based source code vulnerability detection method and system

Info

Publication number
CN115033890A
CN115033890A CN202210748624.6A
Authority
CN
China
Prior art keywords
graph
vulnerability
code
unit
positive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210748624.6A
Other languages
Chinese (zh)
Inventor
孙小兵
闻身威
吴潇雪
薄莉莉
李斌
曹思聪
杨超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University filed Critical Yangzhou University
Priority to CN202210748624.6A priority Critical patent/CN115033890A/en
Publication of CN115033890A publication Critical patent/CN115033890A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F 21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/42 Syntactic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/43 Checking; Contextual analysis
    • G06F 8/436 Semantic checking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods


Abstract

The invention discloses a source code vulnerability detection method based on contrastive learning. The required sample pairs are generated by customized semantics-preserving heuristics; meanwhile, vulnerability code is represented from the level of code tokens to statements to graphs, so that the code attribute graph can better express the deep semantic information of vulnerabilities and the relation between vulnerability code and its context is fully mined. By training on small samples with the contrastive learning technique, the high-accuracy requirement under small-sample data is met. A source code vulnerability detection system based on contrastive learning is correspondingly provided.

Description

Contrastive learning-based source code vulnerability detection method and system
Technical Field
The invention relates to the field of software security, in particular to a source code vulnerability detection method and system based on contrastive learning.
Background
Vulnerability detection is an important part of the software maintenance process. Existing deep learning models depend on a large number of labeled vulnerability samples, which are difficult to obtain in actual development; manual labeling wastes considerable resources, and the correctness of manual labels is itself open to question. The sample size of existing labeled data sets is not large enough to train a detection model with a high detection rate. Therefore, detection models trained on existing labeled data sets have low detection accuracy and are difficult to use for vulnerability detection tasks in real environments.
Some works use contrastive learning to detect software bugs and their types; for example, the literature "Contrastive Learning for Software Code with Structured and Functional Properties" constructs buggy samples by manually injecting bug code, which still relies on prior expert experience and still requires manual work. Other works use deep learning methods to detect software bugs, such as the document "Deep Learning based Vulnerability Detection: Are We There Yet?"; these do not comprehensively consider the rich semantic information in source code and can hardly cover the semantic features of different vulnerability types, so they have certain limitations in capturing the deep semantic information of code.
Disclosure of Invention
The invention aims to: provide a source code vulnerability detection method and system based on contrastive learning which can meet the high-accuracy requirement under small-sample data, automatically generate the required sample pairs through customized semantics-preserving heuristics, and fully mine the relation between vulnerability code and its context by utilizing the deep-level semantic information of vulnerabilities.
The technical scheme is as follows: the invention provides a source code vulnerability detection method based on contrastive learning, comprising the following steps:
1) acquiring vulnerability data in a vulnerability database NVD and an open source code library GitHub to construct a vulnerability data set, and performing data enhancement on the vulnerability data set to form positive and negative sample pairs;
2) performing code feature representation learning on the positive and negative sample pairs and converting them into a code attribute graph, where the control flow graph CFG and the program dependence graph PDG contained in the code attribute graph together form a joint graph;
3) performing graph embedding on the joint graph and applying contrast learning to obtain feature joint graph vectors of positive and negative sample pairs, adding a classifier, and taking the feature joint graph vectors of the positive and negative sample pairs as the input of the classifier to obtain a prediction label so as to construct a vulnerability detection model;
4) and extracting vulnerability codes from the vulnerability database NVD and the open source code library GitHub to serve as codes to be detected, and analyzing whether the codes to be detected contain vulnerabilities or not by using a vulnerability detection model.
Further, in step 1), constructing the vulnerability data set and performing data enhancement comprises the following steps:
1.1) collecting vulnerability data in a vulnerability database NVD and an open source code library Github, and extracting vulnerability files and corresponding vulnerability patch files;
1.2) preprocessing the extracted vulnerability files and corresponding vulnerability patch files, extracting the vulnerability code in each vulnerability file and the patch code in the corresponding vulnerability patch file at the function level, and removing redundant information (header files, comments and declared global parameters) to obtain a vulnerability data set;
1.3) selecting the vulnerability code from the vulnerability data set as a hard negative sample and the corresponding patch code as an original sample, applying semantics-preserving heuristics to the original sample so that its syntactic structure changes while its semantics do not, thereby generating a positive sample, and combining the corresponding positive sample, original sample and hard negative sample into a positive and negative sample pair.
Further, the step 2) specifically comprises the following steps:
2.1) performing code characterization on all samples in each positive and negative sample pair, and generating a code attribute graph CPG for the positive and negative sample pairs through a tool Joern, wherein the code attribute graph CPG comprises an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG;
2.2) tokenizing the leaf nodes of the abstract syntax tree AST using byte pair encoding BPE, embedding the obtained code tokens with Word2vec, and generating statement embedding vectors with a tree-based convolutional neural network TBCNN according to the hierarchical structure of the abstract syntax tree AST;
2.3) executing maximum pooling to aggregate the vectors of all nodes of the tree into a single vector containing corresponding statement semantics through a tree aggregation layer, and using the single vector as an initial feature vector of a corresponding statement node in a control flow graph CFG and a program dependency graph PDG;
2.4) combining the control flow graph CFG and the program dependence graph PDG into a joint graph, and inputting the vectors of the statements in the control flow graph CFG, the program dependence graph PDG and the joint graph into a code representation layer at the graph level.
Further, the step 3) specifically comprises the following steps:
3.1) calculating attention weight coefficients corresponding to all nodes in the joint graph according to three edges of control flow, control dependence and data dependence in a control flow graph CFG and a program dependence graph PDG;
3.2) the graph level code representation layer is composed of a plurality of graph attention layers, one graph attention layer corresponds to the type of one edge, and the graph attention layer updates the node vector representation in the joint graph according to the attention weight coefficient;
3.3) averaging the plurality of graph attention layers to form a multi-head attention layer to obtain the final vector representation of the nodes in the graph;
3.4) obtaining a characteristic joint graph vector of the positive and negative sample pairs according to different node importance in the joint graph;
3.5) taking the original sample and the positive sample as a similar pair, and taking the original sample and the hard negative sample as a dissimilar pair;
3.6) minimizing the contrastive loss computed with cosine similarity;
3.7) adding a classifier, taking the feature joint graph vectors of the positive and negative sample pairs as the input of the classifier, and training the classifier with the softmax function to obtain a prediction label: if the prediction label is 1, the code is vulnerability code; if it is 0, non-vulnerability code.
Further, the step 4) specifically comprises the following steps:
4.1) extracting vulnerability codes from a vulnerability database NVD and an open source code library GitHub to serve as codes to be detected, and executing the step 2) on the code files to be detected to generate corresponding code attribute graphs;
4.2) using a code attribute graph obtained through code characteristic representation learning as the input of a vulnerability detection model to obtain a prediction label of the code to be detected;
4.3) outputting the prediction label: if the prediction label is 1, the code to be detected is vulnerability code; if it is 0, non-vulnerability code.
The invention correspondingly provides a source code vulnerability detection system based on contrastive learning, which comprises a data set construction module, a feature learning module, a model construction module and a vulnerability detection module;
the data set construction module is used for collecting vulnerability data in a vulnerability database NVD and an open source code library GitHub to construct a vulnerability data set, and performing data enhancement on the vulnerability data set to form positive and negative sample pairs;
the characteristic learning module is used for carrying out code characteristic representation learning on the positive and negative sample pairs and converting the positive and negative sample pairs into a code attribute graph, and a control flow graph CFG and a program dependency graph PDG contained in the code attribute graph jointly form a joint graph;
the model building module is used for carrying out graph embedding on the joint graph and applying contrast learning to obtain feature joint graph vectors of positive and negative sample pairs, meanwhile, a classifier is added, and the feature joint graph vectors of the positive and negative sample pairs are used as the input of the classifier to obtain a prediction label so as to build a vulnerability detection model;
the vulnerability detection module is used for extracting vulnerability codes from a vulnerability database NVD and an open source code library Github to serve as codes to be detected, and analyzing whether the codes to be detected contain vulnerabilities or not by using a vulnerability detection model.
Furthermore, the data set construction module also comprises an acquisition unit, an extraction unit and a generation unit;
the collecting unit is used for collecting vulnerability data in a vulnerability database NVD and an open source code library GitHub, and extracting vulnerability files and corresponding vulnerability patch files;
the extraction unit is used for preprocessing the extracted vulnerability files and corresponding vulnerability patch files, extracting the vulnerability code in each vulnerability file and the patch code in the corresponding vulnerability patch file at the function level, and removing redundant information (header files, comments and declared global parameters) to obtain a vulnerability data set;
the generating unit is used for selecting the vulnerability codes from the vulnerability data set as hard negative samples, using the corresponding patch codes as original samples, keeping the application semantics of the original samples in a heuristic manner to change the syntactic structure of the original samples without changing the semantics, thereby generating positive samples, and combining the corresponding positive samples, the original samples and the hard negative samples as positive and negative sample pairs.
Furthermore, the feature learning module also comprises a representation unit, a statement embedding unit, a node aggregation unit and a graph embedding preparation unit;
the characterization unit is used for performing code characterization on all samples in each positive and negative sample pair, and generating a code attribute graph CPG for the positive and negative sample pairs through a tool Joern, wherein the code attribute graph CPG comprises an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG;
the sentence embedding unit is used for segmenting the encoding BPE by using bytes from leaf nodes of the abstract syntax tree AST, embedding words into the obtained code token tokens by using Word2vec, and generating sentence embedding vectors according to the hierarchical structure in the abstract syntax tree AST by using a tree-based convolutional neural network TBNN;
the node aggregation unit is used for executing maximum pooling to aggregate the vectors of all nodes of the tree into a single vector containing corresponding statement semantics through a tree aggregation layer, and the single vector is used as an initial feature vector of a corresponding statement node in a control flow graph CFG and a program dependency graph PDG;
the graph embedding preparation unit is used for combining the control flow graph CFG and the program dependency graph PDG into a joint graph, and inputting the vectors of the statements in the control flow graph CFG, the program dependency graph PDG and the joint graph into a graph-level code representation layer.
Furthermore, the model building module also comprises an attention weight calculating unit, a graph attention layer unit, a multi-head attention layer unit, a graph node aggregation unit, a comparison rule unit, a loss minimizing unit and a training classifier unit;
the attention weight calculation unit is used for calculating the attention weight coefficient corresponding to each node in the joint graph according to three edges of control flow, control dependence and data dependence in the control flow graph CFG and the program dependence graph PDG;
the graph attention layer unit is used for forming a plurality of graph attention layers according to a graph level code representation layer, one graph attention layer corresponds to the type of one edge, and the graph attention layer updates the node vector representation in the combined graph according to the attention weight coefficient;
the multi-head attention layer unit is used for averaging a plurality of graph attention layers to form a multi-head attention layer so as to obtain a final vector representation of a node in the graph;
the graph node aggregation unit is used for obtaining a feature joint graph vector of the positive and negative sample pairs according to different node importance in the joint graph;
the comparison rule unit is used for taking the original sample and the positive sample as a similar pair and taking the original sample and the hard negative sample as a dissimilar pair;
the loss minimizing unit is used for minimizing the loss by using the cosine similarity;
the training classifier unit is used for adding a classifier, taking the feature joint graph vector of the positive and negative sample pairs as the input of the classifier, training the classifier by using an activation function softmax to obtain a prediction label, and if the prediction label is 1, determining the prediction label as a vulnerability code, and if the prediction label is 0, determining the prediction label as a non-vulnerability code.
Furthermore, the vulnerability detection module also comprises a graph generation unit, a detection unit and an output unit;
the graph generating unit is used for extracting vulnerability codes from a vulnerability database NVD and an open source code library GitHub as codes to be detected, and executing a feature learning module on a code file to be detected to generate a corresponding code attribute graph;
the detection unit is used for taking a code attribute graph obtained through code characteristic representation learning as the input of a vulnerability detection model to obtain a prediction tag of a code to be detected;
the output unit is used for outputting the prediction tag, if the prediction tag is 1, the code is a bug code, and if the prediction tag is 0, the code is a non-bug code.
Beneficial effects: compared with the prior art, the invention has the notable features that the required sample pairs are generated by customized semantics-preserving heuristics; vulnerability code is represented from the level of code tokens to statements to graphs, so that the code attribute graph can better express the deep semantic information of vulnerabilities and the relation between vulnerability code and its context is fully mined; and by training on small samples with the contrastive learning technique, the high-accuracy requirement under small-sample data is met.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of the architecture of a tree-based convolutional neural network of the present invention;
FIG. 3 is a multi-headed graph attention layer in the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Example 1
The invention provides a source code vulnerability detection method based on contrastive learning, as shown in FIG. 1, comprising the following steps:
1) acquiring vulnerability data in a vulnerability database NVD and an open source code library GitHub to construct a vulnerability data set, and performing data enhancement on the vulnerability data set to form positive and negative sample pairs;
1.1) collecting vulnerability data in a vulnerability database NVD and an open source code library Github, and extracting vulnerability files and corresponding vulnerability patch files;
1.2) preprocessing the extracted vulnerability files and corresponding vulnerability patch files, extracting the vulnerability code in each vulnerability file and the patch code in the corresponding vulnerability patch file at the function level, and removing redundant information (header files, comments and declared global parameters) to obtain a vulnerability data set;
1.3) selecting the vulnerability code from the vulnerability data set as a hard negative sample and the corresponding patch code as an original sample, applying semantics-preserving heuristics to the original sample so that its syntactic structure changes while its semantics do not, thereby generating a positive sample, and combining the corresponding positive sample, original sample and hard negative sample into a positive and negative sample pair;
The semantics-preserving heuristic rules are as follows:
Rule 1: introducing a new else branch. A new else branch statement is introduced into an existing if conditional statement, or the condition of the if statement is inverted;
Rule 2: splitting an AND-ed conditional expression. An if statement whose condition is C1 && C2 is rewritten as a nested if structure;
Rule 3: introducing a guard around a for loop. A program containing a for loop statement is changed into a new program structure by adding an if (true) judgment statement around the outer layer of the loop;
Rule 4: introducing a guard around a while loop. A program containing a while loop statement is changed into a new program structure by adding a guard judgment statement around the outer layer of the loop;
Rule 5: adding a conditional break statement in a while loop. A while loop whose condition is C1 can be changed into an infinite loop, with a break statement conditioned on the negation of C1 added to the loop body.
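As an illustrative check that such rewrites preserve semantics, the following sketch (in Python rather than the C-style code the rules target; all function names and inputs are hypothetical) pairs an original and a rewritten form of rules 2 and 5 and yields identical outputs:

```python
# Hypothetical Python analogues of the semantics-preserving rules:
# each rewritten function changes syntax but not behavior.

def check_orig(a, b):
    # Rule 2 input: a single AND-ed condition C1 && C2.
    if a > 0 and b > 0:
        return "both positive"
    return "not both"

def check_rewritten(a, b):
    # Rule 2 output: the AND-ed condition as a nested if.
    if a > 0:
        if b > 0:
            return "both positive"
    return "not both"

def count_orig(n):
    # Rule 5 input: a while loop with condition i < n.
    total, i = 0, 0
    while i < n:
        total += i
        i += 1
    return total

def count_rewritten(n):
    # Rule 5 output: an infinite loop plus a break statement
    # conditioned on the negation of the loop condition.
    total, i = 0, 0
    while True:
        if not (i < n):
            break
        total += i
        i += 1
    return total
```

The rewritten versions differ syntactically (and thus yield different ASTs and graphs) while computing the same results, which is what makes them usable as positive samples.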
2) Performing code feature representation learning on the positive and negative sample pairs and converting them into a code attribute graph, where the control flow graph CFG and the program dependence graph PDG contained in the code attribute graph together form a joint graph;
2.1) performing code characterization on all samples in each positive and negative sample pair, and generating a code attribute graph CPG for the positive and negative sample pairs through a tool Joern, wherein the code attribute graph CPG comprises an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG;
2.2) please refer to fig. 2, the leaf nodes of the abstract syntax tree AST are tokenized using byte pair encoding BPE, Word2vec is used to embed the obtained code tokens, and a tree-based convolutional neural network TBCNN generates statement embedding vectors according to the hierarchical structure of the abstract syntax tree AST;
the convolutional neural network TBCNN based on the tree executes the following process:
the parent node representation is updated from the child node representations by sliding the convolution kernel of a triangle over the tree. Wherein the updated calculation formula is:
Figure BDA0003720420430000061
in the formula (I), the compound is shown in the specification,
Figure BDA0003720420430000062
is the updated parent node vector representation, H is the set of nodes in the triangle convolution kernel that contain the parent node itself,
Figure BDA0003720420430000063
is a vector representation of nodes in the H set, sigma represents an activation function, W q And b represents a weight matrix and a bias term, respectively; the main components of TBCNN include: representation and coding of vectors, tree-based convolution, dynamic pooling, a fully-connected concealment layer and an output layer;
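A minimal numerical sketch of this parent-node update, assuming a single shared weight matrix for every position in the kernel and ReLU as the activation (both illustrative simplifications; the original TBCNN interpolates position-dependent weight matrices):

```python
import numpy as np

def tbcnn_update(child_vecs, parent_vec, W, b):
    """Update a parent node from the nodes covered by the triangular
    kernel (the parent itself plus its children):
    h_p' = sigma(sum_q W_q h_q + b), here with a shared W and ReLU
    as sigma (illustrative choices)."""
    H = [parent_vec] + child_vecs          # node set H covered by the kernel
    z = sum(W @ h for h in H) + b          # weighted sum over the window
    return np.maximum(z, 0.0)              # sigma = ReLU

rng = np.random.default_rng(0)
d = 4                                      # embedding dimension (illustrative)
W = rng.standard_normal((d, d))
b = np.zeros(d)
parent = rng.standard_normal(d)
children = [rng.standard_normal(d) for _ in range(2)]
updated = tbcnn_update(children, parent, W, b)
```

Sliding this update bottom-up over the AST yields the per-node vectors that the tree aggregation layer then pools.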
2.3) through a tree aggregation layer, maximum pooling is performed to aggregate the vectors of all nodes of the tree into a single vector containing the semantics of the corresponding statement, which is used as the initial feature vector of the corresponding statement node in the control flow graph CFG and the program dependency graph PDG:

$h_G = \{\, \mathrm{maxpool}(TE(AST_n)) \mid n \in V_G \,\}$

where $h_G$ is the set of node vectors of the CFG and PDG after the tree aggregation layer, $TE(\cdot)$ denotes the tree embedding process, $V_G$ is the node set of the CFG and PDG, and $AST_n$ is the AST whose root node is n;
2.4) combining the control flow graph CFG and the program dependence graph PDG into a joint graph, and inputting the vectors of the statements in the control flow graph CFG, the program dependence graph PDG and the joint graph into a code representation layer at the graph level.
3) Performing graph embedding on the joint graph and applying contrast learning to obtain feature joint graph vectors of positive and negative sample pairs, adding a classifier, and taking the feature joint graph vectors of the positive and negative sample pairs as the input of the classifier to obtain a prediction label so as to construct a vulnerability detection model;
3.1) calculating the attention weight coefficient of each node in the joint graph according to the three edge types (control flow, control dependence and data dependence) in the control flow graph CFG and program dependence graph PDG:

$\alpha_{ij}^{e} = \dfrac{\exp\!\big(\sigma\big(a_e^{\top}[W_e h_i \,\|\, W_e h_j]\big)\big)}{\sum_{k \in N_e}\exp\!\big(\sigma\big(a_e^{\top}[W_e h_i \,\|\, W_e h_k]\big)\big)}$

where $N_e$ is the set of neighbors of node i under edge type e (including node i itself), $\alpha_{ij}^{e}$ is the attention weight between node i and node j ($j \in N_e$) under edge type e, $\|$ denotes the concatenation operation, $\exp(\cdot)$ is the exponential function with the natural constant e as base, and $a_e^{\top}$ is the transpose of a learnable weight vector; the term $a_e^{\top}[W_e h_i \,\|\, W_e h_j]$ can be regarded as the degree of association between node i and node j under edge type e, and the larger its value, the stronger the association between the two nodes and the larger the attention weight coefficient;
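A small numerical sketch of the per-edge-type attention computation, with LeakyReLU assumed for σ and randomly initialized parameters (all names, shapes and the neighbor sets here are illustrative):

```python
import numpy as np

def edge_attention(h, neighbors, W_e, a_e):
    """Attention weights of each node over its neighbors under one
    edge type e: a softmax over sigma(a_e^T [W_e h_i || W_e h_j]),
    with LeakyReLU as sigma (an illustrative choice)."""
    def score(i, j):
        z = a_e @ np.concatenate([W_e @ h[i], W_e @ h[j]])
        return np.where(z > 0, z, 0.2 * z)     # LeakyReLU
    alphas = {}
    for i, nbrs in neighbors.items():
        e = np.array([np.exp(score(i, j)) for j in nbrs])
        alphas[i] = e / e.sum()                # normalize over N_e
    return alphas

rng = np.random.default_rng(1)
d = 3
h = rng.standard_normal((4, d))                # 4 statement nodes
W_e = rng.standard_normal((d, d))
a_e = rng.standard_normal(2 * d)
# neighbor sets under edge type e, each including the node itself
neighbors = {0: [0, 1, 2], 1: [1, 0], 2: [2, 0, 3], 3: [3, 2]}
alphas = edge_attention(h, neighbors, W_e, a_e)
```

Each node's weights over its edge-type-specific neighborhood sum to one, so the subsequent update is a convex combination of transformed neighbor vectors.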
3.2) the graph-level code representation layer is composed of a plurality of graph attention layers, one graph attention layer corresponding to one edge type; each graph attention layer updates the node vector representations in the joint graph according to the attention weight coefficients obtained in step 3.1):

$\tilde{h}_i^{e} = \sigma\Big(\sum_{j \in N_e} \alpha_{ij}^{e}\, W_e h_j\Big)$

where $\tilde{h}_i^{e}$ is the updated vector characterization of node i under edge type e, σ is the activation function, and $W_e$ is the weight matrix of edge type e;
3.3) referring to FIG. 3, the multiple graph attention layers are averaged to form a multi-head attention layer, yielding the final vector characterization of each node in the graph:

$h_i = \dfrac{1}{K}\sum_{k=1}^{K} \tilde{h}_i^{(k)}$

where K is the number of attention heads; the multi-head attention mechanism concatenates or averages the features of each head to obtain a new feature representation;
3.4) obtaining the feature joint graph vector of the positive and negative sample pairs according to the differing importance of nodes in the joint graph:

$h_{G} = \mathrm{AGG}\big(\{\alpha_n h_n \mid n \in V_G\}\big)$

where AGG is the aggregation function, $V_G$ is the node set of the graph structure, and $\alpha_n$ is the attention weight of node n, expressed as:

$\alpha_n = \dfrac{\exp\!\big(w^{\top} h_n\big)}{\sum_{m \in V_G}\exp\!\big(w^{\top} h_m\big)}$

where $w^{\top}$ is the transpose of a learnable weight vector;
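A sketch of this attention-weighted graph readout, with summation assumed as the aggregation function AGG (an illustrative choice; the node vectors and weight vector are random stand-ins):

```python
import numpy as np

def graph_readout(node_vecs, w):
    """Aggregate node vectors into one feature joint-graph vector,
    weighting each node by alpha_n = softmax_n(w^T h_n) and summing
    (summation as the illustrative choice of AGG)."""
    scores = node_vecs @ w                     # w^T h_n for each node n
    e = np.exp(scores - scores.max())          # numerically stable softmax
    alpha = e / e.sum()                        # node importance weights
    return (alpha[:, None] * node_vecs).sum(axis=0)

rng = np.random.default_rng(2)
nodes = rng.standard_normal((5, 4))            # 5 nodes, 4-dim vectors
w = rng.standard_normal(4)
h_G = graph_readout(nodes, w)
```

Because the weights form a convex combination, the graph vector stays within the componentwise range of the node vectors while emphasizing high-scoring nodes.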
3.5) taking the original sample and the positive sample as a similar pair, and taking the original sample and the hard negative sample as a dissimilar pair;
3.6) cosine similarity is used to minimize the loss; the specific contrastive learning loss is:

$\mathcal{L} = -\dfrac{1}{N}\sum_{i=1}^{N}\log\dfrac{e^{\mathrm{sim}(h_i,\,h_i^{+})/\tau}}{\sum_{j=1}^{N}\big(e^{\mathrm{sim}(h_i,\,h_j^{+})/\tau} + e^{\mathrm{sim}(h_i,\,h_j^{-})/\tau}\big)}$

where $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function, N is the number of positive and negative sample pairs, $h_i$ is the vector characterization of the original sample graph, $h_i^{+}$ is the vector characterization of the positive sample graph, $h_i^{-}$ is the vector characterization of the hard negative sample graph, and τ is a temperature parameter;
3.7) adding a classifier, taking the feature joint graph vectors of the positive and negative sample pairs as the input of the classifier, and training the classifier with the softmax function to obtain a prediction label: if the prediction label is 1, the code is vulnerability code; if it is 0, non-vulnerability code.
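A minimal sketch of such a softmax classifier over a feature joint-graph vector (the weights here are random stand-ins for trained parameters, and the dimensions are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                # numerically stable softmax
    return e / e.sum()

def predict(h_graph, W, b):
    """Binary classification of a feature joint-graph vector:
    label 1 = vulnerability code, label 0 = non-vulnerability code.
    W and b are hypothetical trained classifier parameters."""
    p = softmax(W @ h_graph + b)           # probabilities for labels [0, 1]
    return int(np.argmax(p)), p

rng = np.random.default_rng(4)
W = rng.standard_normal((2, 4))            # 2 classes, 4-dim graph vector
b = np.zeros(2)
label, probs = predict(rng.standard_normal(4), W, b)
```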
4) And extracting vulnerability codes from the vulnerability database NVD and the open source code library GitHub to serve as codes to be detected, and analyzing whether the codes to be detected contain vulnerabilities or not by using a vulnerability detection model.
4.1) extracting vulnerability codes from a vulnerability database NVD and an open source code library GitHub to serve as codes to be detected, and executing the step 2) on the code files to be detected to generate corresponding code attribute graphs;
4.2) using a code attribute graph obtained through code characteristic representation learning as the input of a vulnerability detection model to obtain a prediction label of the code to be detected;
4.3) outputting the prediction label: if the prediction label is 1, the code to be detected is vulnerability code; if it is 0, non-vulnerability code.
Example 2
Corresponding to the contrastive learning-based source code vulnerability detection method of Embodiment 1, Embodiment 2 provides a source code vulnerability detection system based on contrastive learning, which includes a data set construction module, a feature learning module, a model construction module and a vulnerability detection module, as shown in fig. 1;
the data set construction module is used for collecting vulnerability data in a vulnerability database NVD and an open source code library GitHub to construct a vulnerability data set, and performing data enhancement on the vulnerability data set to form positive and negative sample pairs;
the data set building module also comprises an acquisition unit, an extraction unit and a generation unit;
the collecting unit is used for collecting vulnerability data in a vulnerability database NVD and an open source code library GitHub, and extracting vulnerability files and corresponding vulnerability patch files;
the extraction unit is used for preprocessing the extracted vulnerability files and the corresponding vulnerability patch files, extracting the vulnerability code in each vulnerability file and the patch code in the corresponding patch file at the function level, and removing redundant information, wherein the redundant information comprises header files, comments and declared global parameters, to obtain the vulnerability data set;
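As an illustrative sketch of the preprocessing described above (not the patent's actual implementation; the function name and regular expressions are assumptions), comment and header-file removal for C-like code might look as follows:

```python
import re

def strip_redundant_info(c_source: str) -> str:
    """Remove comments, header-file includes and blank lines from C-like code
    (a rough approximation of the preprocessing step; hypothetical helper)."""
    # Remove block comments /* ... */ and line comments // ...
    text = re.sub(r"/\*.*?\*/", "", c_source, flags=re.DOTALL)
    text = re.sub(r"//[^\n]*", "", text)
    # Drop header-file include lines
    text = re.sub(r"^\s*#include[^\n]*$", "", text, flags=re.MULTILINE)
    # Collapse blank lines
    return "\n".join(ln for ln in text.splitlines() if ln.strip())

sample = """#include <stdio.h>
/* global buffer */
int main() {
    char buf[8]; // fixed-size buffer
    return 0;
}"""
print(strip_redundant_info(sample))
```

A production pipeline would instead operate on a real parser's output (for example, the Joern parse used later in the method), since regular expressions cannot handle every C construct.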
the generating unit is used for selecting vulnerability codes from the vulnerability data set as hard negative samples and the corresponding patch codes as original samples, applying semantics-preserving heuristic rules to the original samples so that their syntactic structure changes while their semantics remain unchanged, thereby generating positive samples, and combining the corresponding positive samples, original samples and hard negative samples into positive and negative sample pairs;
wherein the semantics-preserving heuristic rules are as follows:
rule 1: introducing a new else branch, wherein this rule introduces a new else branch statement into an existing if conditional statement, or reverses the condition of the if statement;
rule 2: splitting a conjunctive conditional expression, wherein an if statement whose condition is the conjunction of two sub-conditions C1 and C2 is rewritten as a nested if structure;
rule 3: introducing a guard at the outer layer of a for loop, wherein a program containing a for loop statement is changed into a new program structure by adding an if (true) judgment statement at the outer layer of the loop;
rule 4: introducing a guard at the outer layer of a while loop, wherein a program containing a while loop statement is changed into a new program structure by adding a while judgment statement at the outer layer of the loop;
rule 5: adding a conditional break statement in a while loop, wherein a while loop whose condition is an expression C1 is changed into an infinite loop statement, and a break statement controlled by the condition C1 is added into the loop body.
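The five rules above target C-like source code; as a hedged illustration, rule 5 (rewriting a guarded loop as an infinite loop with a conditional break) can be mimicked in Python, where both functions below are hypothetical examples:

```python
def count_original(limit):
    # Original form: a while loop guarded by the condition c1 (i < limit).
    i = 0
    while i < limit:
        i += 1
    return i

def count_rule5_variant(limit):
    # Rule 5 variant: the loop becomes an infinite loop, and a break
    # statement controlled by the (negated) condition c1 is added to the
    # body. The syntactic structure changes while the semantics are kept.
    i = 0
    while True:
        if not (i < limit):
            break
        i += 1
    return i

print(count_original(5), count_rule5_variant(5))  # prints: 5 5
```

Because the two functions always return the same value, the transformed variant can serve as a positive sample for the original in contrastive training.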
The characteristic learning module is used for carrying out code characteristic representation learning on the positive and negative sample pairs and converting the positive and negative sample pairs into a code attribute graph, and a control flow graph CFG and a program dependency graph PDG contained in the code attribute graph jointly form a joint graph;
the feature learning module also comprises a representation unit, a statement embedding unit, a node aggregation unit and a graph embedding preparation unit;
the characterization unit is used for performing code characterization on all samples in each positive and negative sample pair, and generating a code attribute graph CPG for the positive and negative sample pairs through a tool Joern, wherein the code attribute graph CPG comprises an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG;
referring to fig. 2, the sentence embedding unit is configured to segment the leaf nodes of the abstract syntax tree AST into subwords by using byte pair encoding BPE, perform word embedding on the obtained code tokens by using Word2vec, and generate sentence embedding vectors according to the hierarchical structure of the AST by using a tree-based convolutional neural network TBCNN;
the tree-based convolutional neural network TBCNN executes the following process:
the parent node representation is updated from the child node representations by sliding a triangular convolution kernel over the tree, wherein the update is computed as:

$$\tilde{h}_p = \sigma\Big(\sum_{q \in H} W_q h_q + b\Big)$$

in the formula, $\tilde{h}_p$ is the updated parent node vector representation, $H$ is the set of nodes covered by the triangular convolution kernel (including the parent node itself), $h_q$ is the vector representation of node $q$ in the set $H$, $\sigma$ denotes an activation function, and $W_q$ and $b$ denote a weight matrix and a bias term, respectively. The main components of TBCNN include: vector representation and coding, tree-based convolution, dynamic pooling, a fully-connected hidden layer and an output layer;
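A minimal numerical sketch of this parent-node update, assuming a tanh activation, a small embedding dimension, and one weight matrix per kernel position (all shapes and values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                          # embedding dimension (assumed)
H = [rng.normal(size=d) for _ in range(3)]     # kernel nodes: parent + two children

# One weight matrix W_q per node position in the triangular kernel, plus a bias b.
W = [rng.normal(size=(d, d)) for _ in H]
b = np.zeros(d)

# Updated parent representation: sigma(sum_q W_q h_q + b), with sigma = tanh here.
h_parent_new = np.tanh(sum(Wq @ hq for Wq, hq in zip(W, H)) + b)
print(h_parent_new.shape)
```

In the full TBCNN the kernel slides over every subtree, and the per-position weights are interpolated by node depth and position; this sketch shows only a single kernel application.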
the node aggregation unit is used for performing max pooling through a tree aggregation layer to aggregate the vectors of all nodes of the tree into a single vector carrying the corresponding statement's semantics, which serves as the initial feature vector of the corresponding statement node in the control flow graph CFG and the program dependency graph PDG:

$$h_G = \{\, \mathrm{TE}(AST_n) \mid n \in V_G \,\}$$

in the formula, $h_G$ is the set of vector representations of the CFG and PDG nodes after the tree aggregation layer, $\mathrm{TE}(\cdot)$ denotes the tree embedding process, $V_G$ is the node set of the CFG and PDG, and $AST_n$ is the AST whose root node is $n$;
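The max-pooling aggregation can be sketched as follows (the node vectors are illustrative values, not learned embeddings):

```python
import numpy as np

# Vectors of all AST nodes belonging to one statement (3 nodes, dimension 4).
node_vectors = np.array([
    [0.1, 0.9, -0.3, 0.0],
    [0.5, 0.2,  0.7, -0.1],
    [-0.2, 0.4, 0.1, 0.6],
])

# Element-wise max pooling collapses the tree into a single statement vector,
# which initializes the corresponding statement node in the CFG and PDG.
statement_vector = node_vectors.max(axis=0)
print(statement_vector)  # elementwise maxima: 0.5, 0.9, 0.7, 0.6
```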
the graph embedding preparation unit is used for combining the control flow graph CFG and the program dependency graph PDG into a joint graph, and inputting the vectors of the statements in the control flow graph CFG, the program dependency graph PDG and the joint graph into a graph-level code representation layer.
The model building module is used for carrying out graph embedding on the joint graph and applying contrast learning to obtain feature joint graph vectors of positive and negative sample pairs, meanwhile, a classifier is added, the feature joint graph vectors of the positive and negative sample pairs are used as the input of the classifier to obtain a prediction label, and therefore a vulnerability detection model is built;
the model building module also comprises an attention weight calculating unit, a graph attention layer unit, a multi-head attention layer unit, a graph node aggregation unit, a comparison rule unit, a loss minimizing unit and a training classifier unit;
the attention weight calculation unit is used for calculating the attention weight coefficient corresponding to each node in the joint graph for each of the three edge types (control flow, control dependence and data dependence) in the control flow graph CFG and the program dependency graph PDG:

$$\alpha_{ij}^{e} = \frac{\exp\!\big(a_e^{\top}[\,h_i \,\|\, h_j\,]\big)}{\sum_{k \in N_e}\exp\!\big(a_e^{\top}[\,h_i \,\|\, h_k\,]\big)}$$

in the formula, $N_e$ is the set of neighbors of node $i$ under edge type $e$ (including node $i$ itself), $\alpha_{ij}^{e}$ is the attention weight between node $i$ and node $j$ ($j \in N_e$) under edge type $e$, $\|$ denotes the concatenation operation, $\exp(\cdot)$ is the exponential function with the natural constant $e$ as its base, and $a_e^{\top}$ is the transpose of a learnable weight vector. The larger the value, the stronger the association between node $i$ and node $j$ under edge type $e$, and the greater the attention weight coefficient;
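A sketch of these per-edge-type attention coefficients, scoring each neighbor by the concatenation-based form reconstructed above and normalizing with a softmax; the dimensions and random initialization are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
h = {n: rng.normal(size=d) for n in range(4)}  # node vector for each node id
a_e = rng.normal(size=2 * d)                   # learnable vector for edge type e

def attention_weights(i, neighbors):
    # Score each neighbor j by a_e^T [h_i || h_j], then softmax over N_e.
    scores = np.array([a_e @ np.concatenate([h[i], h[j]]) for j in neighbors])
    exp = np.exp(scores - scores.max())        # numerically stable softmax
    return exp / exp.sum()

# Neighborhood of node 0 under edge type e, including node 0 itself.
alphas = attention_weights(0, [0, 1, 2])
print(alphas.sum())
```

In the full model one such weight vector $a_e$ is learned per edge type, so the same node pair can receive different attention under control-flow, control-dependence and data-dependence edges.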
the graph attention layer unit is used for forming a plurality of graph attention layers in the graph-level code representation layer, one graph attention layer corresponding to one edge type, wherein each graph attention layer updates the node vector representations in the joint graph according to the attention weight coefficients:

$$h_i^{e} = \sigma\Big(\sum_{j \in N_e} \alpha_{ij}^{e}\, W_e\, h_j\Big)$$

in the formula, $h_i^{e}$ is the updated vector representation of node $i$ under edge type $e$, $\sigma$ is the activation function, and $W_e$ is the weight matrix of edge type $e$;
the multi-head attention layer unit is used for averaging the plurality of graph attention layers to form a multi-head attention layer and obtain the final vector representation of each node in the graph:

$$h_i' = \sigma\Big(\frac{1}{K}\sum_{k=1}^{K}\sum_{j \in N_e} \alpha_{ij}^{k}\, W^{k}\, h_j\Big)$$

in the formula, $K$ is the number of attention heads; the multi-head attention mechanism concatenates or averages the features of the heads to obtain a new feature representation;
the graph node aggregation unit is used for obtaining the feature joint graph vector of the positive and negative sample pairs according to the different importance of the nodes in the joint graph:

$$h_G = \mathrm{AGG}\big(\{\alpha_n h_n \mid n \in V_G\}\big)$$

in the formula, AGG is the aggregation function, $V_G$ is the node set of the graph structure, and $\alpha_n$ is the attention weight of node $n$, expressed as:

$$\alpha_n = \frac{\exp\!\big(a^{\top} h_n\big)}{\sum_{m \in V_G}\exp\!\big(a^{\top} h_m\big)}$$

in the formula, $a^{\top}$ is the transpose of a learnable weight vector;
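The attention-weighted readout can be sketched as follows, with summation assumed as the AGG aggregation function and random vectors standing in for learned node embeddings:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
node_vecs = rng.normal(size=(5, d))  # final vectors of the joint graph's 5 nodes
a = rng.normal(size=d)               # learnable weight vector

# alpha_n = softmax(a^T h_n) over all nodes: more important nodes get more weight.
scores = node_vecs @ a
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# Feature joint graph vector: attention-weighted aggregation (summation assumed).
h_G = (alpha[:, None] * node_vecs).sum(axis=0)
print(h_G.shape)
```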
the comparison rule unit is used for taking the original sample and the positive sample as a similar pair and taking the original sample and the hard negative sample as a dissimilar pair;
the loss minimization unit is used for minimizing the loss by using cosine similarity, wherein the specific contrastive learning loss is:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\!\big(\mathrm{sim}(h,h^{+})/\tau\big)}{\exp\!\big(\mathrm{sim}(h,h^{+})/\tau\big)+\exp\!\big(\mathrm{sim}(h,h^{-})/\tau\big)}$$

in the formula, $\mathrm{sim}(\cdot)$ is the cosine similarity function, $N$ is the number of positive and negative sample pairs, $h$ is the vector representation of the original sample graph, $h^{+}$ is the vector representation of the positive sample graph, $h^{-}$ is the vector representation of the negative sample graph, and $\tau$ is a temperature parameter;
the training classifier unit is used for adding a classifier, taking the feature joint graph vectors of the positive and negative sample pairs as the input of the classifier, and training the classifier by using a softmax activation function to obtain a prediction label, wherein a prediction label of 1 indicates vulnerability code and a prediction label of 0 indicates non-vulnerability code.
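A minimal sketch of the classification head (a hypothetical linear layer with random weights followed by softmax over the two labels):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
W = rng.normal(size=(2, d))  # two classes: 1 = vulnerability code, 0 = non-vulnerability code
b = np.zeros(2)

def predict_label(h_joint):
    # Linear layer over the feature joint graph vector, then softmax over {0, 1}.
    logits = W @ h_joint + b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

label, probs = predict_label(rng.normal(size=d))
print(label, probs.sum())
```

In training, the cross-entropy of these softmax probabilities against the known vulnerable/non-vulnerable labels would be minimized jointly with the contrastive loss above.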
The vulnerability detection module extracts vulnerability codes from a vulnerability database NVD and an open source code library GitHub as codes to be detected, and analyzes whether the codes to be detected contain vulnerabilities or not by using a vulnerability detection model;
the vulnerability detection module also comprises a graph generation unit, a detection unit and an output unit;
the graph generating unit is used for extracting vulnerability codes from a vulnerability database NVD and an open source code library GitHub as codes to be detected, and executing a feature learning module on a code file to be detected to generate a corresponding code attribute graph;
the detection unit is used for taking a code attribute graph obtained through code characteristic representation learning as the input of a vulnerability detection model to obtain a prediction label of a code to be detected;
the output unit is used for outputting the prediction label, wherein a prediction label of 1 indicates vulnerability code and a prediction label of 0 indicates non-vulnerability code.

Claims (10)

1. A source code vulnerability detection method based on comparative learning is characterized by comprising the following steps:
1) acquiring vulnerability data in a vulnerability database NVD and an open source code library GitHub to construct a vulnerability data set, and performing data enhancement on the vulnerability data set to form positive and negative sample pairs;
2) performing code feature representation learning on the positive and negative sample pairs, converting the positive and negative sample pairs into a code attribute graph, and forming a joint graph by a control flow graph CFG and a program dependency graph PDG which are contained in the code attribute graph;
3) performing graph embedding on the joint graph and applying contrast learning to obtain feature joint graph vectors of positive and negative sample pairs, adding a classifier, and taking the feature joint graph vectors of the positive and negative sample pairs as the input of the classifier to obtain a prediction label so as to construct a vulnerability detection model;
4) and extracting vulnerability codes from the vulnerability database NVD and the open source code library Github as codes to be detected, and analyzing whether the codes to be detected contain vulnerabilities or not by using a vulnerability detection model.
2. The source code vulnerability detection method based on comparative learning according to claim 1, wherein the step 1) specifically comprises the following steps:
1.1) collecting vulnerability data in a vulnerability database NVD and an open source code library Github, and extracting vulnerability files and corresponding vulnerability patch files;
1.2) preprocessing the extracted vulnerability file and the corresponding vulnerability patch file, extracting vulnerability codes in the vulnerability file and patch codes in the corresponding vulnerability patch file from the function level, and removing redundant information, wherein the redundant information comprises a header file, comments and declared global parameters to obtain a vulnerability data set;
1.3) selecting vulnerability codes from the vulnerability data set as hard negative samples, using the corresponding patch codes as original samples, applying semantics-preserving heuristic rules to the original samples so that their syntactic structure changes while their semantics remain unchanged, thereby generating positive samples, and combining the corresponding positive samples, original samples and hard negative samples into positive and negative sample pairs.
3. The source code vulnerability detection method based on comparative learning according to claim 2, wherein the step 2) specifically comprises the following steps:
2.1) performing code characterization on all samples in each positive and negative sample pair, and generating a code attribute graph CPG for the positive and negative sample pairs through a tool Joern, wherein the code attribute graph CPG comprises an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG;
2.2) segmenting the leaf nodes of the abstract syntax tree AST into subwords by using byte pair encoding BPE, performing word embedding on the obtained code tokens by using Word2vec, and generating sentence embedding vectors according to the hierarchical structure of the AST by using a tree-based convolutional neural network TBCNN;
2.3) through a tree aggregation layer, performing maximum pooling to aggregate vectors of all nodes of the tree into a single vector containing corresponding statement semantics, and using the single vector as an initial feature vector of a corresponding statement node in a control flow graph CFG and a program dependency graph PDG;
2.4) combining the control flow graph CFG and the program dependence graph PDG into a joint graph, and inputting the vectors of the statements in the control flow graph CFG, the program dependence graph PDG and the joint graph into a code representation layer at the graph level.
4. The source code vulnerability detection method based on comparative learning according to claim 3, wherein the step 3) specifically comprises the following steps:
3.1) calculating attention weight coefficients corresponding to all nodes in the joint graph according to three edges of control flow, control dependence and data dependence in a control flow graph CFG and a program dependence graph PDG;
3.2) the graph level code representation layer is composed of a plurality of graph attention layers, one graph attention layer corresponds to the type of one edge, and the graph attention layer updates the node vector representation in the joint graph according to the attention weight coefficient;
3.3) averaging the plurality of graph attention layers to form a multi-head attention layer to obtain the final vector representation of the nodes in the graph;
3.4) obtaining a feature joint graph vector of the positive and negative sample pairs according to different node importance in the joint graph;
3.5) taking the original sample and the positive sample as a similar pair, and taking the original sample and the hard negative sample as a dissimilar pair;
3.6) using cosine similarity to minimize loss;
3.7) adding a classifier, taking the feature joint graph vectors of the positive and negative sample pairs as the input of the classifier, and training the classifier by using a softmax activation function to obtain a predicted label, wherein a predicted label of 1 indicates vulnerability code and a predicted label of 0 indicates non-vulnerability code.
5. The source code vulnerability detection method based on comparative learning according to claim 1, wherein the step 4) specifically comprises the following steps:
4.1) extracting vulnerability codes from a vulnerability database NVD and an open source code library GitHub to serve as codes to be detected, and executing the step 2) on the code files to be detected to generate corresponding code attribute graphs;
4.2) using a code attribute graph obtained through code characteristic representation learning as the input of a vulnerability detection model to obtain a prediction label of the code to be detected;
4.3) outputting the prediction label, wherein a prediction label of 1 indicates vulnerability code and a prediction label of 0 indicates non-vulnerability code.
6. A source code vulnerability detection system based on contrast learning is characterized by comprising a data set construction module, a feature learning module, a model construction module and a vulnerability detection module;
the data set construction module is used for collecting vulnerability data in a vulnerability database NVD and an open source code library GitHub to construct a vulnerability data set, and performing data enhancement on the vulnerability data set to form positive and negative sample pairs;
the characteristic learning module is used for carrying out code characteristic representation learning on the positive and negative sample pairs and converting the positive and negative sample pairs into a code attribute graph, and a control flow graph CFG and a program dependency graph PDG contained in the code attribute graph jointly form a joint graph;
the model building module is used for carrying out graph embedding on the joint graph and applying contrast learning to obtain feature joint graph vectors of positive and negative sample pairs, meanwhile, a classifier is added, and the feature joint graph vectors of the positive and negative sample pairs are used as the input of the classifier to obtain a prediction label so as to build a vulnerability detection model;
the vulnerability detection module is used for extracting vulnerability codes from the vulnerability database NVD and the open source code library GitHub to serve as codes to be detected, and analyzing whether the codes to be detected contain vulnerabilities or not by using a vulnerability detection model.
7. The source code vulnerability detection system based on comparative learning of claim 6, wherein the data set construction module further comprises a collection unit, an extraction unit, a generation unit;
the collecting unit is used for collecting vulnerability data in a vulnerability database NVD and an open source code library GitHub, and extracting vulnerability files and corresponding vulnerability patch files;
the extraction unit is used for preprocessing the extracted vulnerability file and the corresponding vulnerability patch file, extracting vulnerability codes in the vulnerability file and patch codes in the corresponding vulnerability patch file from the function level, and removing redundant information, wherein the redundant information comprises a header file, comments and declared global parameters to obtain a vulnerability data set;
the generating unit is used for selecting vulnerability codes from the vulnerability data set as hard negative samples and the corresponding patch codes as original samples, applying semantics-preserving heuristic rules to the original samples so that their syntactic structure changes while their semantics remain unchanged, thereby generating positive samples, and combining the corresponding positive samples, original samples and hard negative samples into positive and negative sample pairs.
8. The source code vulnerability detection system based on comparative learning of claim 7, wherein the feature learning module further comprises a characterization unit, a sentence embedding unit, a node aggregation unit, a graph embedding preparation unit;
the characterization unit is used for performing code characterization on all samples in each positive and negative sample pair, and generating a code attribute graph CPG for the positive and negative sample pairs through a tool Joern, wherein the code attribute graph CPG comprises an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG;
the sentence embedding unit is used for segmenting the leaf nodes of the abstract syntax tree AST into subwords by using byte pair encoding BPE, performing word embedding on the obtained code tokens by using Word2vec, and generating sentence embedding vectors according to the hierarchical structure of the AST by using a tree-based convolutional neural network TBCNN;
the node aggregation unit is used for executing maximum pooling to aggregate the vectors of all nodes of the tree into a single vector containing corresponding statement semantics through a tree aggregation layer, and the single vector is used as an initial feature vector of a corresponding statement node in a control flow graph CFG and a program dependency graph PDG;
the graph embedding preparation unit is used for combining the control flow graph CFG and the program dependency graph PDG into a joint graph, and inputting the vectors of the statements in the control flow graph CFG, the program dependency graph PDG and the joint graph into a graph-level code representation layer.
9. The source code vulnerability detection system based on comparative learning of claim 8, wherein the model construction module further comprises an attention weight calculation unit, a graph attention layer unit, a multi-head attention layer unit, a graph node aggregation unit, a comparative rule unit, a loss minimization unit, a training classifier unit;
the attention weight calculation unit is used for calculating the attention weight coefficient corresponding to each node in the joint graph according to three edges of control flow, control dependence and data dependence in a control flow graph CFG and a program dependence graph PDG;
the graph attention layer unit is used for forming a plurality of graph attention layers according to a graph level code representation layer, one graph attention layer corresponds to the type of one edge, and the graph attention layer updates the node vector representation in the combined graph according to the attention weight coefficient;
the multi-head attention layer unit is used for averaging a plurality of graph attention layers to form a multi-head attention layer so as to obtain the final vector representation of the nodes in the graph;
the graph node aggregation unit is used for obtaining a feature joint graph vector of the positive and negative sample pairs according to different node importance in the joint graph;
the comparison rule unit is used for taking the original sample and the positive sample as a similar pair and taking the original sample and the hard negative sample as a dissimilar pair;
the loss minimizing unit is used for minimizing loss by using cosine similarity;
the training classifier unit is used for adding a classifier, taking the feature joint graph vectors of the positive and negative sample pairs as the input of the classifier, and training the classifier by using a softmax activation function to obtain a prediction label, wherein a prediction label of 1 indicates vulnerability code and a prediction label of 0 indicates non-vulnerability code.
10. The source code vulnerability detection system based on comparative learning of claim 6, wherein the vulnerability detection module further comprises a graph generation unit, a detection unit, an output unit;
the graph generating unit is used for extracting vulnerability codes from a vulnerability database NVD and an open source code library GitHub as codes to be detected, and executing a feature learning module on a code file to be detected to generate a corresponding code attribute graph;
the detection unit is used for taking a code attribute graph obtained through code characteristic representation learning as the input of a vulnerability detection model to obtain a prediction tag of a code to be detected;
the output unit is used for outputting the prediction label, wherein a prediction label of 1 indicates vulnerability code and a prediction label of 0 indicates non-vulnerability code.
CN202210748624.6A 2022-06-29 2022-06-29 Comparison learning-based source code vulnerability detection method and system Pending CN115033890A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210748624.6A CN115033890A (en) 2022-06-29 2022-06-29 Comparison learning-based source code vulnerability detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210748624.6A CN115033890A (en) 2022-06-29 2022-06-29 Comparison learning-based source code vulnerability detection method and system

Publications (1)

Publication Number Publication Date
CN115033890A true CN115033890A (en) 2022-09-09

Family

ID=83126921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210748624.6A Pending CN115033890A (en) 2022-06-29 2022-06-29 Comparison learning-based source code vulnerability detection method and system

Country Status (1)

Country Link
CN (1) CN115033890A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115658514A (en) * 2022-10-28 2023-01-31 苏州棱镜七彩信息科技有限公司 Vulnerability patch positioning method
CN115859307A (en) * 2022-12-26 2023-03-28 哈尔滨工业大学 Similar vulnerability detection method based on tree attention and weighted graph matching
CN116048454A (en) * 2023-03-06 2023-05-02 山东师范大学 Code rearrangement method and system based on iterative comparison learning
CN116048454B (en) * 2023-03-06 2023-06-16 山东师范大学 Code rearrangement method and system based on iterative comparison learning
CN117056940A (en) * 2023-10-12 2023-11-14 中关村科学城城市大脑股份有限公司 Method, device, electronic equipment and medium for repairing loopholes of server system
CN117056940B (en) * 2023-10-12 2024-01-16 中关村科学城城市大脑股份有限公司 Method, device, electronic equipment and medium for repairing loopholes of server system
CN117473510A (en) * 2023-12-26 2024-01-30 南京邮电大学 Automatic vulnerability discovery technology based on relationship between graph neural network and vulnerability patch
CN117473510B (en) * 2023-12-26 2024-03-26 南京邮电大学 Automatic vulnerability discovery technology based on relationship between graph neural network and vulnerability patch
CN118672594A (en) * 2024-08-26 2024-09-20 山东大学 Software defect prediction method and system
CN118672594B (en) * 2024-08-26 2024-10-29 山东大学 Software defect prediction method and system

Similar Documents

Publication Publication Date Title
CN111428044B (en) Method, device, equipment and storage medium for acquiring supervision and identification results in multiple modes
CN115033890A (en) Comparison learning-based source code vulnerability detection method and system
CN111639344B (en) Vulnerability detection method and device based on neural network
Le et al. Deep learning for source code modeling and generation: Models, applications, and challenges
CN110765265B (en) Information classification extraction method and device, computer equipment and storage medium
CN107516041B (en) WebShell detection method and system based on deep neural network
CN108664512B (en) Text object classification method and device
WO2022174496A1 (en) Data annotation method and apparatus based on generative model, and device and storage medium
CN115357904B (en) Multi-class vulnerability detection method based on program slicing and graph neural network
CN114201406B (en) Code detection method, system, equipment and storage medium based on open source component
CN116305158A (en) Vulnerability identification method based on slice code dependency graph semantic learning
US12008341B2 (en) Systems and methods for generating natural language using language models trained on computer code
CN112613040A (en) Vulnerability detection method based on binary program and related equipment
CN113553052A (en) Method for automatically recognizing security-related code submissions using an Attention-coded representation
CN117573084B (en) Code complement method based on layer-by-layer fusion abstract syntax tree
CN117370980A (en) Malicious code detection model generation and detection method, device, equipment and medium
CN110674497B (en) Malicious program similarity calculation method and device
CN115129896B (en) Network security emergency response knowledge graph relation extraction method based on comparison learning
CN116595537A (en) Vulnerability detection method of generated intelligent contract based on multi-mode features
CN116702765A (en) Event extraction method and device and electronic equipment
CN115859307A (en) Similar vulnerability detection method based on tree attention and weighted graph matching
CN111562943B (en) Code clone detection method and device based on event embedded tree and GAT network
CN116628695A (en) Vulnerability discovery method and device based on multitask learning
CN111860662B (en) Training method and device, application method and device of similarity detection model
Miao et al. AST2Vec: A Robust Neural Code Representation for Malicious PowerShell Detection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination