CN112560049A - Vulnerability detection method and device and storage medium - Google Patents

Vulnerability detection method and device and storage medium Download PDF

Info

Publication number
CN112560049A
Authority
CN
China
Prior art keywords
neural network
cfg
data
training
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011577796.9A
Other languages
Chinese (zh)
Inventor
冯继强
蒋磊
朱鲲鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Aurora Infinite Information Technology Co ltd
Original Assignee
Suzhou Aurora Infinite Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Aurora Infinite Information Technology Co ltd filed Critical Suzhou Aurora Infinite Information Technology Co ltd
Priority to CN202011577796.9A priority Critical patent/CN112560049A/en
Publication of CN112560049A publication Critical patent/CN112560049A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F 21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/42 Syntactic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning

Abstract

The invention discloses a vulnerability detection method, a vulnerability detection device and a storage medium, wherein the method comprises: acquiring data to be detected, the data to be detected being assembly data included in a binary program file; and identifying the data to be detected by using a preset detection model to determine vulnerability data in the data to be detected. The preset detection model is obtained by training a BERT neural network, a graph neural network and a classifier. The BERT neural network is used for identifying assembly instruction features in a control flow graph (CFG); the graph neural network is used for identifying graph structure features of the CFG; and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.

Description

Vulnerability detection method and device and storage medium
Technical Field
The present invention relates to the field of computers, and in particular, to a vulnerability detection method, device and storage medium.
Background
Existing technical schemes for vulnerability detection include: source code review, fuzz testing, dynamic analysis, static analysis, and the like.
Source code review is a vulnerability detection technique that can be performed when the software source code is available. It typically requires security technologists to inspect the software source code. Through source code review, potential safety hazards such as memory leaks and stack overflows can be found.
Fuzz testing is a software testing technique that inputs automatically or semi-automatically generated random data into a program, monitors the program for anomalies such as crashes, assertion failures, etc. to discover possible program errors.
Static analysis refers to scanning program codes by lexical analysis, syntactic analysis, control flow, data flow analysis, and other techniques to find various defects of a program without running the code. The static program analysis can help software developers and quality assurance personnel to find structural errors, security holes and other problems in the codes, so that the overall quality of the software is guaranteed.
Dynamic analysis refers to the manner in which analysis is performed by executing a program on a processor. In order for the dynamic analysis to be effective, the target program must be executed using enough test inputs to cover almost all possible outputs.
The above solutions each have different drawbacks, specifically:
Source code review: it requires a detailed review of the source code by a security specialist or developer, roughly doubling the time a developer spends on each function; for some very large projects, a detailed review of the source code is an almost impossible task.
Fuzz testing: fuzz testing is currently the most widely used way to detect vulnerabilities, but it runs very slowly, usually taking months, and is limited to vulnerabilities that lead to crashes.
Static analysis: very slow, taking at least days; it requires security experts, is limited to given rules, and is difficult to scale.
Dynamic analysis: it can be partially aided by tools, but is generally difficult to implement and has a limited range of application, since covering all possible outputs is a very difficult task.
Disclosure of Invention
In view of the above, the present invention provides a vulnerability detection method, apparatus and storage medium.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
the embodiment of the invention provides a vulnerability detection method, which comprises the following steps:
acquiring data to be detected; the data to be detected is assembly data included in the binary program file;
identifying the data to be detected by using a preset detection model, and determining vulnerability data in the data to be detected;
the preset detection model is obtained based on BERT neural network, graph neural network and classifier training;
the BERT neural network is used for identifying assembly instruction features in the CFG;
the graph neural network is used for identifying graph structure characteristics of the CFG;
and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
Further, the method further comprises: separately training a BERT neural network, a graph neural network and a classifier; alternatively, the method comprises: jointly training a BERT neural network, a graph neural network and a classifier.
Further, training the BERT neural network includes:
acquiring a training data set; the training data set includes: at least one CFG and a label corresponding to each CFG, the label characterizing whether the CFG contains a vulnerability; the CFG includes: multiple CFG nodes and edges between the CFG nodes; each CFG node is an assembly instruction, and each edge is a conversion instruction; the CFG is obtained by splitting a program into individual functions using a disassembler;
converting the tokens of the assembly instructions in the CFG nodes into one-hot codes, inputting the codes into a BERT neural network, and training with MLM as the task to obtain the trained BERT neural network as a part of the preset detection model;
wherein training the BERT neural network comprises: initial training and secondary training;
the initial training includes: in the BERT neural network, the input layer randomly masks a preset proportion of the input, and the output layer predicts the original input, so as to perform initial training on the network;
the secondary training includes: after the features of each CFG node are obtained, a next-CFG-node prediction task is performed, in which upstream and downstream nodes, together with randomly constructed CFG nodes, are taken as input to predict whether the paired CFG nodes actually have an upstream-downstream relationship.
Further, training the graph neural network includes:
acquiring a training data set; the training data set includes: at least one CFG and a label corresponding to each CFG, the label characterizing whether the CFG contains a vulnerability; each CFG node has a defined node embedding, the embedding characterizing a vector;
and inputting each CFG in the training data set and its corresponding label into a graph neural network, and training the graph neural network to obtain the trained graph neural network.
Further, training the classifier includes:
inputting the assembly instruction features and graph structure features obtained by processing the CFGs in the training data set, together with the labels corresponding to the CFGs, into a fully-connected neural network, and training the fully-connected neural network to obtain the trained fully-connected neural network as the classifier.
Further, the classifier adopts an encoder model stacked from multilayer perceptron (MLP) layers and performs classification training using a softmax activation function; the trained classifier may be used to determine the location of the vulnerability data.
Further, after the data to be detected is acquired, the method further includes: converting the data to be detected into at least one CFG to be detected;
correspondingly, the identifying the data to be detected by using a preset detection model and determining the vulnerability data in the data to be detected comprise:
and identifying the at least one CFG to be detected by using a preset detection model, and determining vulnerability data in the data to be detected.
The embodiment of the invention provides a vulnerability detection device, which comprises:
the acquisition module is used for acquiring data to be detected; the data to be detected is assembly data included in the binary program file;
the detection module is used for identifying the data to be detected by using a preset detection model and determining vulnerability data in the data to be detected;
the preset detection model is obtained based on BERT neural network, graph neural network and classifier training;
the BERT neural network is used for identifying assembly instruction features in the CFG;
the graph neural network is used for identifying graph structure characteristics of the CFG;
and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
The embodiment of the invention provides a vulnerability detection device, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the program to realize the steps of the vulnerability detection method.
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of any one of the vulnerability detection methods described above.
The embodiment of the invention provides a vulnerability detection method, which comprises the following steps: acquiring data to be detected; the data to be detected is assembly data included in the binary program file; identifying the data to be detected by using a preset detection model, and determining vulnerability data in the data to be detected; the preset detection model is obtained based on BERT neural network, graph neural network and classifier training; the BERT neural network is used for identifying assembly instruction features in the CFG; the graph neural network is used for identifying graph structure characteristics of the CFG; the classifier is used for identifying vulnerability data based on the assembly instruction features and the graph structure features; therefore, rapid, efficient and accurate vulnerability detection is realized.
Drawings
Fig. 1 is a schematic diagram of a vulnerability detection method according to an embodiment of the present invention;
FIG. 2 is a diagram of a vocabulary according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a DUGNN model architecture according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a vulnerability detection apparatus according to an embodiment of the present invention;
fig. 5 is a schematic diagram of another vulnerability detection apparatus according to an embodiment of the present invention.
Detailed Description
The method provided by the embodiment of the invention obtains the data to be detected; the data to be detected is assembly data included in the binary program file; identifying the data to be detected by using a preset detection model, and determining vulnerability data in the data to be detected; the preset detection model is obtained based on BERT neural network, graph neural network and classifier training; the BERT neural network is used for identifying assembly instruction features in the CFG; the graph neural network is used for identifying graph structure characteristics of the CFG; and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
The present invention will be described in further detail with reference to examples.
Fig. 1 is a schematic flowchart of a vulnerability detection method according to an embodiment of the present invention; as shown in fig. 1, the present invention is applied to electronic devices, such as computers, servers, and the like; the method comprises the following steps:
step 101, acquiring data to be detected; the data to be detected is assembly data included in the binary program file;
step 102, identifying the data to be detected by using a preset detection model, and determining vulnerability data in the data to be detected;
the preset detection model is obtained based on BERT neural network, graph neural network and classifier training;
the BERT neural network is used for identifying assembly instruction features in the CFG;
the graph neural network is used for identifying graph structure characteristics of the CFG;
and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
In this way, it can be identified whether the corresponding data to be detected contains vulnerability data.
Specifically, the method further comprises: separately training a BERT neural network, a graph neural network and a classifier; alternatively, the method comprises: jointly training a BERT neural network, a graph neural network and a classifier.
The three can be trained simultaneously or separately.
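To make the overall flow concrete, the following is a minimal sketch of the detection pipeline in Python; the helper names (extract_cfgs, bert_encoder, graph_encoder, classifier, cfg.node_tokens, cfg.adjacency) are hypothetical placeholders for illustration, not APIs defined by this patent:

```python
# Minimal sketch of the detection flow, assuming hypothetical helpers:
# extract_cfgs() yields one CFG object per function of the binary, and the
# three trained components are available as callables.
import torch

def detect(binary_path, bert_encoder, graph_encoder, classifier, extract_cfgs):
    """Classify every function (CFG) of a binary as vulnerable or not."""
    results = []
    for cfg in extract_cfgs(binary_path):             # disassemble -> per-function CFGs
        node_feats = bert_encoder(cfg.node_tokens)    # assembly instruction features
        graph_vec = graph_encoder(node_feats, cfg.adjacency)  # graph structure features
        probs = torch.softmax(classifier(graph_vec), dim=-1)
        results.append((cfg.name, probs[1].item()))   # P(contains vulnerability)
    return results
```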
Specifically, training the BERT neural network includes:
acquiring a training data set; the training data set includes: at least one CFG and a label corresponding to each CFG, the label characterizing whether the CFG contains a vulnerability; the CFG includes: multiple CFG nodes and edges between the CFG nodes; each CFG node is an assembly instruction, and each edge is a conversion instruction; the CFG is obtained by splitting a program into individual functions using a disassembler;
converting the tokens of the assembly instructions in the CFG nodes into one-hot codes, inputting the codes into a BERT neural network, and training with a Masked Language Model (MLM) as the task to obtain a trained BERT neural network as a part of the preset detection model;
wherein training the BERT neural network comprises: initial training and secondary training;
the initial training includes: in the BERT neural network, the input layer randomly masks a preset proportion of the input, and the output layer predicts the original input, so as to perform initial training on the network;
the secondary training includes: after the features of each CFG node are obtained, a next-CFG-node prediction task is performed, in which upstream and downstream nodes, together with randomly constructed CFG nodes, are taken as input to predict whether the paired CFG nodes actually have an upstream-downstream relationship.
Specifically, training the graph neural network comprises:
acquiring a training data set; the training data set includes: at least one CFG and a label corresponding to each CFG, the label characterizing whether the CFG contains a vulnerability; each CFG node has a defined node embedding;
and inputting each CFG in the training data set and its corresponding label into a graph neural network, and training the graph neural network to obtain the trained graph neural network.
Here, the embedding characterizes a vector; that is, each CFG node corresponds to a defined node vector.
Specifically, training the classifier includes:
inputting the assembly instruction features and graph structure features obtained by processing the CFGs in the training data set, together with the labels corresponding to the CFGs, into a fully-connected neural network, and training the fully-connected neural network to obtain the trained fully-connected neural network as the classifier.
That is, a fully-connected neural network is adopted as the classifier to be trained, and classification training is carried out on it based on the labeled vulnerability data set;
the classifier obtained after classification training is taken as the classifier in the detection model.
The classifier adopts an encoder model stacked from multilayer perceptron (MLP) layers and performs classification training using a softmax activation function; the trained classifier may be used to determine the location of the vulnerability data.
Specifically, after the data to be detected is acquired, the method further includes: converting the data to be detected into at least one CFG to be detected;
correspondingly, the identifying the data to be detected by using a preset detection model and determining the vulnerability data in the data to be detected comprise:
and identifying the at least one CFG to be detected by using a preset detection model, and determining vulnerability data in the data to be detected.
The following is detailed with respect to the training data set:
a computer (subject) can take a binary program and disassemble it, and the program can be split into many functions using the disassembler, i.e., a Control-Flow-Graph (CFG) is created for each function.
A CFG is a special graph in which consistent assembly instructions form nodes and conversion instructions are edges (i.e., the CFG includes edges between multiple CFG nodes and CFG nodes; the CFG nodes are consistent assembly instructions (hereinafter assembly instructions) and the edges are conversion instructions). A consistent assembly instruction is an instruction that does not change the flow of execution (e.g., add, mov), while a translate instruction is an instruction that changes the flow of execution (e.g., jmp, jme, jne, call, ret).
Here, each function in the binary system is represented by a CFG, and each CFG is assigned with a label (the label includes two types: with a bug and without a bug). If there is a vulnerability, marking out the corresponding universal vulnerability Enumeration (CWE) label. The training data set consists of various open source programs with known vulnerabilities. The accuracy of the data set may also be verified by fuzz testing. There are at least 100 examples in each CWE tag category.
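Purely by way of illustration, such per-function CFG extraction can be sketched with the open-source angr framework (the patent does not name a specific disassembler; the choice of angr and the code below are assumptions):

```python
# Sketch of per-function CFG extraction using angr (an assumed tool choice).
import angr

proj = angr.Project("target_binary", auto_load_libs=False)
cfg = proj.analyses.CFGFast()                  # recover the whole-binary CFG

for func in cfg.kb.functions.values():         # one CFG per recovered function
    for block in func.blocks:                  # nodes: straight-line assembly blocks
        for insn in block.capstone.insns:
            print(insn.mnemonic, insn.op_str)  # e.g. "mov eax, 0x1"
    # edges between blocks correspond to conversion instructions (jmp/je/call/ret)
```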
The CFG comprises a plurality of CFG nodes, and each CFG node is an assembly instruction; that is, the nodes in the CFG are assembly instructions, and the assembly instructions are then split to obtain a plurality of tokens.
That is, each assembly instruction may be broken into smaller units, a process referred to as tokenization. Each smaller unit is called a token. All the assembly instructions in a node need to be tokenized. Assuming that an instruction has exactly k (k = 3) operands, extra operands are pruned and missing operands are filled with empty tokens. One operand is represented as m (m = 13) tokens.
The graph neural network is described below.
The neural network should learn assembly instructions before learning binary vulnerabilities. In order to properly learn the assembly instructions in a deep learning model, preprocessing is required; the preprocessing comprises the following steps:
a vocabulary is constructed for the assembly instructions with an appropriate morpheme parser, by extracting library functions, opcodes, registers and hexadecimal values from the assembly instructions and treating them as words.
Assembly instructions typically have a fixed syntax, using 1 opcode and 3 operands. The tokenizer represents each operand as 13 fields. Thus, a tokenized instruction has length 40 (3 operands, each represented as 13 fields, plus one field for the opcode).
The set of all possible tokens is referred to herein as a vocabulary (e.g., fig. 2), the size of the vocabulary being Vt; the vocabulary is much smaller than the word sets used in general NLP. Since assembly instructions typically have a fixed syntax, meaning that each uses one opcode and three operands, a vector of fixed dimension can be created to represent an assembly instruction. In application, the assembly instruction is read by a parser and an appropriate vector representation is constructed for the syntax; each opcode and operand is represented by a vector, which facilitates subsequent processing. This process is called "instruction embedding".
It should be explained that, in a neural network, an embedding is a special kind of vector; thus an embedding must be a vector, but a vector is not necessarily an embedding. Instruction embedding is a process, and the resulting embedding is a vector.
The following is described with respect to the library functions, opcodes, registers, and hexadecimal values:
library functions (Library functions) are one way to put functions into a Library for use by others. The method is to compile and put some commonly used functions into a file for different people to call. When the file name is called, the file name of the file name is added into the file name by using a # include < >. Typically into a lib file.
Opcode refers to the portion of an instruction or field (usually in code) specified in a computer program that is to perform an operation, and is simply an assembly instruction number that tells the CPU which instruction to execute.
The register functions to store binary codes and is formed by combining flip-flops having a storage function. One flip-flop can store 1-bit binary codes, so a register for storing n-bit binary codes needs to be formed by n flip-flops.
Are represented by the numbers 0 to 9 and the letters A to F (or a to F), wherein A to F represent 10 to 15, which are referred to as hexadecimal numbers.
Regarding Instruction Embedding: to obtain embeddings of tokenized instructions, a word2vec model is typically employed, for example the skip-gram version of word2vec, i.e., predicting the context tokens given an input token. The model is trained in an unsupervised manner on the tokenized assembly instruction data set from the training data set section.
The next step is to create token embeddings. The input to this step is a tokenized instruction. After the model is trained, a real-valued vector is obtained for each token; these real-valued vectors are called embeddings. The above creation process is called "token embedding".
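As an illustrative sketch, such token embeddings can be trained with gensim's word2vec in skip-gram mode (sg=1); the hyperparameters and the assembly_instructions corpus variable are assumptions:

```python
# Sketch of unsupervised skip-gram token embedding with gensim's word2vec.
from gensim.models import Word2Vec

corpus = [tokenize_instruction(i) for i in assembly_instructions]  # assumed corpus
w2v = Word2Vec(sentences=corpus, sg=1, vector_size=128, window=5, min_count=1)
vec = w2v.wv["mov"]   # real-valued embedding (the "token embedding") for "mov"
```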
One-hot encoding is employed to vectorize instructions. Here, one-hot encoding is a method of quantizing categorical data that generates a vector whose length equals the vocabulary size Vt. If the data belongs to the i-th category, the i-th bit is set to 1 and the other bits to 0. In this way the tokenized instructions can be converted into vectors.
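A minimal sketch of this one-hot vectorization, reusing the corpus variable from the word2vec sketch above:

```python
# Sketch of one-hot encoding over a vocabulary of size Vt.
import numpy as np

vocab = {tok: i for i, tok in enumerate(sorted({t for ins in corpus for t in ins}))}

def one_hot(token: str) -> np.ndarray:
    v = np.zeros(len(vocab), dtype=np.float32)   # vector of length Vt
    v[vocab[token]] = 1.0                        # i-th bit set for the i-th category
    return v
```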
To obtain the most effective representation for detecting CWEs, token embedding is implemented using the BERT (Bidirectional Encoder Representations from Transformers) model. In particular, BERT is trained in an unsupervised manner (without the need to label data) using the MLM (Masked Language Model) and NSP (Next Sentence Prediction) tasks, where a sentence is one line of an assembly function obtained from a binary file, and the one-hot token vectors are fed into the model to be embedded:
v_hot ∈ {0,1}^(Vt×1)
v = BERT(v_hot) ∈ R^(n×1)
where n = 510 is the instruction embedding size, v is the instruction embedding, and Vt is the vocabulary size. As a result, a real-valued vector is obtained for each token.
The BERT is used to learn the linguistic features of the assembly instructions in the CFG nodes.
The specific scheme is as follows: the tokens of the assembly instructions in the CFG nodes are converted into one-hot codes, input into the BERT neural network, and trained with MLM as the task.
In the BERT neural network, the input layer randomly masks a portion of the input (30%), and the output layer predicts the original input, so as to perform initial training on the network. After the linguistic features of the assembly instructions of each CFG node are obtained, the next-node prediction task is performed, i.e., upstream and downstream nodes are put together with some randomly constructed nodes as input to predict whether pairs of nodes indeed have an upstream-downstream relationship. The neural network is trained a second time using this task. In this way, an instruction-embedding Encoder is obtained; this encoder accepts assembly instructions and outputs their embeddings.
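A minimal PyTorch sketch of the masking step described above (30% of input tokens masked, originals predicted); the bert model, the token_ids batch and vocab_size are assumed to exist:

```python
# Sketch of the MLM step: mask 30% of tokens and predict the originals.
import torch
import torch.nn.functional as F

def mlm_mask(token_ids: torch.Tensor, mask_id: int, ratio: float = 0.3):
    mask = torch.rand(token_ids.shape) < ratio                   # pick ~30% of positions
    labels = torch.where(mask, token_ids, torch.full_like(token_ids, -100))
    return token_ids.masked_fill(mask, mask_id), labels          # -100 = ignored by CE

masked, labels = mlm_mask(token_ids, mask_id=vocab_size)         # assumed inputs
logits = bert(masked)                                            # (batch, seq, vocab)
loss = F.cross_entropy(logits.transpose(1, 2), labels)           # predict originals
```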
Then, the next step is to create the Graph Embedding. The description of Graph Embedding is as follows:
In this step, a CFG graph is taken as input, where each CFG node has a defined node embedding. The output of this step is the embedding of the entire CFG graph. This embedding is further used to classify the entire graph (each graph is a function) as vulnerable or not vulnerable; therefore, a graph encoder is used to generate this embedding.
The graph encoder is based on the DUGNN model. First, the node embeddings are computed; then, the node embedding matrix and the graph adjacency matrix are fed into DUGNN to obtain a graph embedding matrix, which is converted into a graph embedding vector using an attention layer.
Fig. 3 shows a simplified DUGNN model architecture, in which X (the node feature matrix) and A (the adjacency matrix) represent the input data, and Graph embedding represents the output data. Attention denotes a neural network layer; GCN and Pooling form a block that is repeated many times. Arrows indicate the flow of data and operations.
The DUGNN model is defined below (fig. 3). The DUGNN model uses graph convolution layers (here called gconv layers), where M is the number of gconv layers in the model. The gconv layer is defined according to formula (1). Each gconv layer consists of N convolutional sublayers called gfc, which are defined in a nested manner in formulas (2) and (3). (Formulas (1) to (3) appear only as images in the original publication and are not reproduced here.)
MLP will be used to represent fully-connected layers.
To increase the model complexity of the graph encoder, higher-order statistics of the features are captured during the graph convolution (GCN) operations, similar to graph capsule networks. Besides the need to increase model complexity, there is also a Laplacian smoothing problem in GCN models. To overcome the smoothing problem, the output of the intermediate coding layers of the graph convolution (GCN) is concatenated and fed into the subsequent layers. In this way, the graph encoder can learn multi-scale smoothing characteristics of the nodes, thereby avoiding under-smoothing or over-smoothing problems in GCN-based models. In addition, this has the side benefits of alleviating the vanishing gradient problem and enhancing feature propagation in deep networks.
To obtain the graph embedding, the CFG node representations in the CFG graph need to be obtained as described above. The first m node instructions ("tokenization instructions") are vectorized using one-hot encoding:
v_hot^(i) = one-hot(instruction_i) ∈ {0,1}^(Vt×1)
This formula represents the one-hot encoding of tokenized instruction i, resulting in a vector. The tokenized instructions in the same node are concatenated (concat) to obtain the instruction vector v_hot. Then, the BERT model is applied to obtain the node embedding:
X = BERT(v_hot) ∈ R^(m×1)
where m = 510 is the size of the embedding. Thus, X is the embedding of a node in the graph.
In the above formulas, k denotes the node embedding size;
M denotes the number of gconv layers in DUGNN;
N denotes the number of gfc sublayers in a gconv layer;
P denotes the polynomial index in the gconv layer;
X denotes the graph node embeddings (and, after encoding, the resulting node embeddings);
A denotes the graph adjacency matrix;
n denotes the number of nodes in the graph;
l denotes the graph embedding size;
R denotes the real number space.
The graph filter function with adjacency matrix A and identity matrix I will be denoted below using g(A):
g(A) = A + I
The graph embedding v_graph is defined as the output of the last convolution. The convolution is denoted F:
v_graph = F^M(X, A) ∈ R^(l×n)
During the convolution operation, the output of each intermediate layer is fed to the subsequent layers:
F^l(X, A) = concat(F^(l-1)(X, A), selu(gconv(F^(l-1)(X, A), A)))
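A minimal PyTorch sketch of this recursion; the gconv modules are assumed to be graph-convolution layers taking (features, adjacency):

```python
# Sketch of F^l(X, A) = concat(F^(l-1)(X, A), selu(gconv(F^(l-1)(X, A), A))).
import torch
import torch.nn.functional as F

def graph_encoder_forward(X, A, gconv_layers):
    h = X                                     # F^0 = node embeddings
    for gconv in gconv_layers:                # M gconv layers (input dims assumed to match)
        h = torch.cat([h, F.selu(gconv(h, A))], dim=-1)  # concat intermediate output
    return h                                  # multi-scale node features (v_graph)
```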
each convolutional gconv layer is applied to X1,X2,...XPGfc is calculated for all possible combinations of i and P, where i ∈ { 1., N } and P ∈ { 1.,. P }.
Figure BDA0002864477640000121
gfc layers are defined as:
Figure BDA0002864477640000131
Figure BDA0002864477640000132
using attention layer (MLP)attentionHandle vgraph∈Rl×|nodes|Conversion to v'graph∈Rl ×1
wattention=softmax(MLPattention(vgraph))
wattention∈Rn×1
Wherein, wattentionIs the attention score.
Summing the attention scores of the nodes to obtain the graph vector embedding:
v′graph=sum(vgraph·wattention)∈Rl×1
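A minimal PyTorch sketch of this attention pooling; a single linear layer stands in for MLP_attention here, which is an assumption:

```python
# Sketch of w = softmax(MLP_attention(v_graph)), v'_graph = sum(v_graph * w).
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, l: int):                       # l = graph embedding size
        super().__init__()
        self.mlp = nn.Linear(l, 1)                    # stands in for MLP_attention

    def forward(self, v_graph):                       # v_graph: (n_nodes, l)
        w = torch.softmax(self.mlp(v_graph), dim=0)   # attention scores: (n_nodes, 1)
        return (v_graph * w).sum(dim=0), w.squeeze(-1)  # v'_graph: (l,), scores: (n_nodes,)
```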
further description regarding the classifier follows.
An encoder model stacked from multilayer perceptron (MLP) layers is used, and classification is done using the softmax activation function:
ŷ = softmax(MLP(v'_graph))
The node more likely to be the vulnerability location is selected according to the attention score:
node = argmax(|w_attention|)
that is, the node at which the absolute value of the attention score is largest.
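A minimal sketch of the classifier head and the localization step, reusing v'_graph (here v_graph_pooled) and the attention scores from the pooling sketch above; the hidden size is an assumption:

```python
# Sketch of the MLP-stacked classifier head with softmax, plus localization.
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Linear(l, 64), nn.ReLU(), nn.Linear(64, 2))

logits = classifier(v_graph_pooled)        # v'_graph from the attention pooling
probs = torch.softmax(logits, dim=-1)      # [P(no vulnerability), P(vulnerability)]
node = torch.argmax(w_attention.abs())     # index of the most likely vulnerable node
```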
The description of classification training is as follows:
consider that the losses in the model include two parts: CELoss (cross entry loss) of the label, BCELoss (binary cross entry loss) of the attention layer.
If the node is the location where the vulnerability occurs, then BCELoss is applied
Figure BDA0002864477640000141
One-hot one-vector, where 1 appears at the index of the vulnerability location and the other index locations are 0. If not, then,
Figure BDA0002864477640000142
is a position of
Figure BDA0002864477640000143
The vector of (2).
Namely, the following loss function is adopted for classification training:
Figure BDA0002864477640000144
wherein class _ frac and attribute _ frac correspond to scores of CELOSs and BCELOSs losses, respectively.
Figure BDA0002864477640000145
And
Figure BDA0002864477640000146
respectively, a label and a prediction with a vulnerability node.
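A minimal PyTorch sketch of this combined loss, assuming the attention scores are already in (0, 1) from the softmax and vuln_index is None for non-vulnerable graphs:

```python
# Sketch of loss = class_frac * CELoss + attention_frac * BCELoss.
import torch
import torch.nn.functional as F

def combined_loss(logits, label, w_attention, vuln_index, class_frac, attention_frac):
    ce = F.cross_entropy(logits.unsqueeze(0), label.view(1))   # label CE loss
    target = torch.zeros_like(w_attention)                     # all-zero if no vulnerability
    if vuln_index is not None:
        target[vuln_index] = 1.0                               # one-hot location vector
    bce = F.binary_cross_entropy(w_attention, target)          # attention BCE loss
    return class_frac * ce + attention_frac * bce
```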
In summary, the method provided in the embodiment of the present invention for training a preset detection model (i.e. a neural network) includes the following steps:
first, a BERT model is trained on unlabeled assembly instructions;
then, the DUGNN model is trained on a large training data set, whose labels may not be accurate enough;
training then continues on a small training data set with accurate labels;
finally, the model is checked on the Juliet dataset (synthetic data not suitable for training).
The network structures and weights of the neural assembly language encoder and the graph neural network encoder obtained by training are fixed (no weight updating during back propagation) and used as the CFG backbone network encoder; then a fully-connected neural network is connected as the classifier, classification training is performed based on the labeled vulnerability data set, and the parameters of this neural network are updated during training. The trained network can be used for vulnerability detection.
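A minimal PyTorch sketch of this final stage, with assumed bert_encoder, graph_encoder and classifier modules and an assumed labeled_loader over the labeled vulnerability data set:

```python
# Sketch: freeze the two trained encoders, train only the classifier.
import torch

for p in list(bert_encoder.parameters()) + list(graph_encoder.parameters()):
    p.requires_grad = False                    # no weight updates in back propagation

optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
for cfg_tokens, adjacency, labels in labeled_loader:   # labeled vulnerability data set
    graph_vec = graph_encoder(bert_encoder(cfg_tokens), adjacency)
    loss = torch.nn.functional.cross_entropy(classifier(graph_vec), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```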
Fig. 4 is a schematic structural diagram of a vulnerability detection apparatus according to an embodiment of the present invention; as shown in fig. 4, the apparatus includes:
the acquisition module is used for acquiring data to be detected; the data to be detected is assembly data included in the binary program file;
the detection module is used for identifying the data to be detected by using a preset detection model and determining vulnerability data in the data to be detected;
the preset detection model is obtained based on BERT neural network, graph neural network and classifier training;
the BERT neural network is used for identifying assembly instruction features in the CFG;
the graph neural network is used for identifying graph structure characteristics of the CFG;
and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
Specifically, the apparatus further comprises: a preprocessing module, configured to separately train a BERT neural network, a graph neural network and a classifier, or to jointly train a BERT neural network, a graph neural network and a classifier.
Specifically, the preprocessing module is configured to acquire a training data set; the training data set includes: at least one CFG and a label corresponding to each CFG, the label characterizing whether the CFG contains a vulnerability; the CFG includes: multiple CFG nodes and edges between the CFG nodes; each CFG node is an assembly instruction, and each edge is a conversion instruction; the CFG is obtained by splitting a program into individual functions using a disassembler;
converting the tokens of the assembly instructions in the CFG nodes into one-hot codes, inputting the codes into a BERT neural network, and training with MLM as the task to obtain the trained BERT neural network as a part of the preset detection model;
wherein training the BERT neural network comprises: initial training and secondary training;
the initial training includes: in the BERT neural network, the input layer randomly masks a preset proportion of the input, and the output layer predicts the original input, so as to perform initial training on the network;
the secondary training includes: after the features of each CFG node are obtained, a next-CFG-node prediction task is performed, in which upstream and downstream nodes, together with randomly constructed CFG nodes, are taken as input to predict whether the paired CFG nodes actually have an upstream-downstream relationship.
Specifically, the preprocessing module is configured to acquire a training data set; the training data set includes: at least one CFG and a label corresponding to each CFG, the label characterizing whether the CFG contains a vulnerability; each CFG node has a defined node embedding, the embedding characterizing a vector;
and the preprocessing module is configured to input each CFG in the training data set and its corresponding label into a graph neural network, and to train the graph neural network to obtain the trained graph neural network.
Specifically, the preprocessing module is configured to input the assembly instruction features and graph structure features obtained by processing the CFGs in the training data set, together with the labels corresponding to the CFGs, into a fully-connected neural network, and to train the fully-connected neural network to obtain the trained fully-connected neural network as the classifier.
Specifically, the classifier adopts an encoder model stacked from multilayer perceptron (MLP) layers and performs classification training using a softmax activation function; the trained classifier may be used to determine the location of the vulnerability data.
Specifically, the acquiring module is configured to, after acquiring data to be detected, convert the data to be detected into at least one CFG to be detected;
correspondingly, the detection module is configured to identify the at least one to-be-detected CFG by using a preset detection model, and determine vulnerability data in the to-be-detected data.
It should be noted that when the vulnerability detection apparatus provided above performs vulnerability detection, the division into the program modules described above is merely illustrative; in practical applications, the processing may be distributed among different program modules as needed, that is, the internal structure of the electronic device may be divided into different program modules to complete all or part of the processing described above. In addition, the apparatus provided by the above embodiment and the embodiment of the corresponding method belong to the same concept; the specific implementation process thereof is described in the method embodiment and is not repeated here.
Fig. 5 is a schematic structural diagram of a communication device according to an embodiment of the present invention, and as shown in fig. 5, the communication device 50 includes: a processor 501 and a memory 502 for storing computer programs executable on the processor; the processor 501 is configured to, when running the computer program, perform: acquiring data to be detected; the data to be detected is assembly data included in the binary program file; identifying the data to be detected by using a preset detection model, and determining vulnerability data in the data to be detected; the preset detection model is obtained based on BERT neural network, graph neural network and classifier training; the BERT neural network is used for identifying assembly instruction features in the CFG; the graph neural network is used for identifying graph structure characteristics of the CFG; and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
The related operations have been described in the above embodiments, and are not described again here.
In practical applications, the communication device 50 may further include: at least one network interface 503. The various components in the communication device 50 are coupled together by a bus system 504. It is understood that the bus system 504 is used to enable communications among the components. The bus system 504 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 504 in fig. 5. The number of the processors 501 may be at least one. The network interface 503 is used for communication between the communication device 50 and other devices in a wired or wireless manner. Memory 502 is used to store various types of data to support the operation of communication device 50.
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored; the computer program, when executed by a processor, performs: acquiring data to be detected; the data to be detected is assembly data included in the binary program file; identifying the data to be detected by using a preset detection model, and determining vulnerability data in the data to be detected; the preset detection model is obtained based on BERT neural network, graph neural network and classifier training; the BERT neural network is used for identifying assembly instruction features in the CFG; the graph neural network is used for identifying graph structure characteristics of the CFG; and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
The related operations have been described in the above embodiments, and are not described again here.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A vulnerability detection method, the method comprising:
acquiring data to be detected; the data to be detected is assembly data included in the binary program file;
identifying the data to be detected by using a preset detection model, and determining vulnerability data in the data to be detected;
the preset detection model is obtained based on BERT neural network, graph neural network and classifier training;
the BERT neural network is used for identifying assembly instruction features in the CFG;
the graph neural network is used for identifying graph structure characteristics of the CFG;
and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
2. The method of claim 1, further comprising: separately training a BERT neural network, a graph neural network and a classifier; alternatively, the method comprises: jointly training a BERT neural network, a graph neural network and a classifier.
3. The method of claim 2, wherein the training the BERT neural network comprises:
acquiring a training data set; the training data set includes: at least one CFG and a label corresponding to each CFG, the label characterizing whether the CFG contains a vulnerability; the CFG includes: multiple CFG nodes and edges between the CFG nodes; each CFG node is an assembly instruction, and each edge is a conversion instruction; the CFG is obtained by splitting a program into individual functions using a disassembler;
converting the tokens of the assembly instructions in the CFG nodes into one-hot codes, inputting the codes into a BERT neural network, and training with MLM as the task to obtain the trained BERT neural network as a part of the preset detection model;
wherein training the BERT neural network comprises: initial training and secondary training;
the initial training includes: in the BERT neural network, the input layer randomly masks a preset proportion of the input, and the output layer predicts the original input, so as to perform initial training on the network;
the secondary training includes: after the features of each CFG node are obtained, a next-CFG-node prediction task is performed, in which upstream and downstream nodes, together with randomly constructed CFG nodes, are taken as input to predict whether the paired CFG nodes actually have an upstream-downstream relationship.
4. The method of claim 3, wherein the training the graph neural network comprises:
acquiring a training data set; the training data set includes: at least one CFG and a label corresponding to each CFG, the label characterizing whether the CFG contains a vulnerability; each CFG node has a defined node embedding, the embedding characterizing a vector;
and inputting each CFG in the training data set and its corresponding label into a graph neural network, and training the graph neural network to obtain the trained graph neural network.
5. The method of claim 3, wherein training the classifier comprises:
inputting the assembly instruction features and graph structure features obtained by processing the CFGs in the training data set, together with the labels corresponding to the CFGs, into a fully-connected neural network, and training the fully-connected neural network to obtain the trained fully-connected neural network as the classifier.
6. The method of claim 5, wherein the classifier employs an encoder model stacked from multilayer perceptron (MLP) layers and performs classification training using a softmax activation function; the trained classifier may be used to determine the location of the vulnerability data.
7. The method of claim 1, wherein after the acquiring the data to be detected, the method further comprises: converting the data to be detected into at least one CFG to be detected;
correspondingly, the identifying the data to be detected by using a preset detection model and determining the vulnerability data in the data to be detected comprise:
and identifying the at least one CFG to be detected by using a preset detection model, and determining vulnerability data in the data to be detected.
8. A vulnerability detection apparatus, the apparatus comprising:
the acquisition module is used for acquiring data to be detected; the data to be detected is assembly data included in the binary program file;
the detection module is used for identifying the data to be detected by using a preset detection model and determining vulnerability data in the data to be detected;
the preset detection model is obtained based on BERT neural network, graph neural network and classifier training;
the BERT neural network is used for identifying assembly instruction features in the CFG;
the graph neural network is used for identifying graph structure characteristics of the CFG;
and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
9. A vulnerability detection apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011577796.9A 2020-12-28 2020-12-28 Vulnerability detection method and device and storage medium Pending CN112560049A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011577796.9A CN112560049A (en) 2020-12-28 2020-12-28 Vulnerability detection method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011577796.9A CN112560049A (en) 2020-12-28 2020-12-28 Vulnerability detection method and device and storage medium

Publications (1)

Publication Number Publication Date
CN112560049A true CN112560049A (en) 2021-03-26

Family

ID=75033816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011577796.9A Pending CN112560049A (en) 2020-12-28 2020-12-28 Vulnerability detection method and device and storage medium

Country Status (1)

Country Link
CN (1) CN112560049A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326187A (en) * 2021-05-25 2021-08-31 扬州大学 Data-driven intelligent detection method and system for memory leakage
CN113326187B (en) * 2021-05-25 2023-11-24 扬州大学 Data-driven memory leakage intelligent detection method and system
CN113378176A (en) * 2021-06-11 2021-09-10 大连海事大学 Software vulnerability identification method with weight deviation based on graph neural network detection
CN113378176B (en) * 2021-06-11 2023-06-23 大连海事大学 Software vulnerability identification method based on graph neural network detection with weight deviation
CN113821723A (en) * 2021-09-22 2021-12-21 广州博冠信息科技有限公司 Searching method and device and electronic equipment
CN113821723B (en) * 2021-09-22 2024-04-12 广州博冠信息科技有限公司 Searching method and device and electronic equipment
CN113792820A (en) * 2021-11-15 2021-12-14 航天宏康智能科技(北京)有限公司 Countermeasure training method and device for user behavior log anomaly detection model
CN115495755A (en) * 2022-11-15 2022-12-20 四川大学 Codebert and R-GCN-based source code vulnerability multi-classification detection method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination