CN112560049A - Vulnerability detection method and device and storage medium - Google Patents

Vulnerability detection method and device and storage medium Download PDF

Info

Publication number
CN112560049A
Authority
CN
China
Prior art keywords
neural network
cfg
data
training
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011577796.9A
Other languages
Chinese (zh)
Inventor
冯继强
蒋磊
朱鲲鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Aurora Infinite Information Technology Co ltd
Original Assignee
Suzhou Aurora Infinite Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Aurora Infinite Information Technology Co ltd filed Critical Suzhou Aurora Infinite Information Technology Co ltd
Priority to CN202011577796.9A priority Critical patent/CN112560049A/en
Publication of CN112560049A publication Critical patent/CN112560049A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F 21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/42 Syntactic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning

Abstract

The invention discloses a vulnerability detection method, a vulnerability detection device and a storage medium, wherein the method comprises: acquiring data to be detected, the data to be detected being assembly data included in a binary program file; and identifying the data to be detected by using a preset detection model to determine vulnerability data in the data to be detected. The preset detection model is obtained by training a BERT neural network, a graph neural network and a classifier. The BERT neural network is used for identifying assembly instruction features in a control flow graph (CFG); the graph neural network is used for identifying graph structure features of the CFG; and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.

Description

Vulnerability detection method and device and storage medium
Technical Field
The present invention relates to the field of computers, and in particular, to a vulnerability detection method, device and storage medium.
Background
Existing technical schemes for vulnerability detection include: source code review, fuzz testing, dynamic analysis, static analysis, and the like.
Source code review is a vulnerability detection technique that can be performed when the software source code is available. It typically requires security technologists to inspect the software source code. Through source code review, potential safety hazards such as memory leaks and stack overflows can be found.
Fuzz testing is a software testing technique that inputs automatically or semi-automatically generated random data into a program, monitors the program for anomalies such as crashes, assertion failures, etc. to discover possible program errors.
Static analysis refers to scanning program codes by lexical analysis, syntactic analysis, control flow, data flow analysis, and other techniques to find various defects of a program without running the code. The static program analysis can help software developers and quality assurance personnel to find structural errors, security holes and other problems in the codes, so that the overall quality of the software is guaranteed.
Dynamic analysis refers to the manner in which analysis is performed by executing a program on a processor. In order for the dynamic analysis to be effective, the target program must be executed using enough test inputs to cover almost all possible outputs.
The above solutions each have different drawbacks, specifically:
Source code review: it requires a detailed review of the source code by a security specialist or developer, roughly doubling the time a developer spends on each function; for some very large projects, a detailed review of the source code is an almost impossible task.
Fuzz testing: fuzz testing is currently the most widely used way to detect vulnerabilities, but it runs very slowly, usually taking months, and is limited to vulnerabilities that lead to crashes.
Static analysis: very slow, taking at least days; it requires security experts, is limited to given rules, and is difficult to scale.
Dynamic analysis: it can be partially aided by tools, but is generally difficult to implement and has a limited range of application, since covering all possible outputs is a very difficult task.
Disclosure of Invention
In view of the above, the present invention provides a vulnerability detection method, apparatus and storage medium.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
the embodiment of the invention provides a vulnerability detection method, which comprises the following steps:
acquiring data to be detected; the data to be detected is assembly data included in the binary program file;
identifying the data to be detected by using a preset detection model, and determining vulnerability data in the data to be detected;
the preset detection model is obtained based on BERT neural network, graph neural network and classifier training;
the BERT neural network is used for identifying assembly instruction features in the CFG;
the graph neural network is used for identifying graph structure characteristics of the CFG;
and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
Further, the method further comprises: separately training a BERT neural network, a graph neural network and a classifier; alternatively, the method comprises: jointly training a BERT neural network, a graph neural network and a classifier.
Further, training the BERT neural network includes:
acquiring a training data set; the training data set includes: at least one CFG and a label corresponding to each CFG, the label characterizing whether the CFG contains a vulnerability; the CFG includes: multiple CFG nodes and edges between the CFG nodes; each CFG node is an assembly instruction, and each edge is a conversion instruction; the CFG is obtained by splitting a program into individual functions using a disassembler;
converting the tokens of the assembly instructions in the CFG nodes into one-hot codes, inputting the codes into a BERT neural network, and training with MLM as the task to obtain the trained BERT neural network as a part of the preset detection model;
wherein training the BERT neural network comprises: initial training and secondary training;
the initial training includes: in the BERT neural network, the input layer randomly masks a preset proportion of the input, and the output layer predicts the original input, so as to perform initial training on the network;
the secondary training includes: after the features of each CFG node are obtained, a next-CFG-node prediction task is performed, in which upstream and downstream nodes, together with randomly constructed CFG nodes, are taken as input to predict whether the paired CFG nodes actually have an upstream-downstream relationship.
Further, training the graph neural network includes:
acquiring a training data set; the training data set includes: at least one CFG and a label corresponding to each CFG, the label characterizing whether the CFG contains a vulnerability; each CFG node has a defined node embedding, the embedding characterizing a vector;
and inputting each CFG in the training data set and its corresponding label into a graph neural network, and training the graph neural network to obtain the trained graph neural network.
Further, training the classifier includes:
inputting the assembly instruction features and graph structure features obtained by processing the CFGs in the training data set, together with the labels corresponding to the CFGs, into a fully-connected neural network, and training the fully-connected neural network to obtain the trained fully-connected neural network as the classifier.
Further, the classifier adopts an encoder model stacked from multilayer perceptron (MLP) layers and performs classification training using a softmax activation function; the trained classifier may be used to determine the location of the vulnerability data.
Further, after the data to be detected is acquired, the method further includes: converting the data to be detected into at least one CFG to be detected;
correspondingly, the identifying the data to be detected by using a preset detection model and determining the vulnerability data in the data to be detected comprise:
and identifying the at least one CFG to be detected by using a preset detection model, and determining vulnerability data in the data to be detected.
The embodiment of the invention provides a vulnerability detection device, which comprises:
the acquisition module is used for acquiring data to be detected; the data to be detected is assembly data included in the binary program file;
the detection module is used for identifying the data to be detected by using a preset detection model and determining vulnerability data in the data to be detected;
the preset detection model is obtained based on BERT neural network, graph neural network and classifier training;
the BERT neural network is used for identifying assembly instruction features in the CFG;
the graph neural network is used for identifying graph structure characteristics of the CFG;
and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
The embodiment of the invention provides a vulnerability detection device, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the program to realize the steps of the vulnerability detection method.
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of any one of the vulnerability detection methods described above.
The embodiment of the invention provides a vulnerability detection method, which comprises the following steps: acquiring data to be detected; the data to be detected is assembly data included in the binary program file; identifying the data to be detected by using a preset detection model, and determining vulnerability data in the data to be detected; the preset detection model is obtained based on BERT neural network, graph neural network and classifier training; the BERT neural network is used for identifying assembly instruction features in the CFG; the graph neural network is used for identifying graph structure characteristics of the CFG; the classifier is used for identifying vulnerability data based on the assembly instruction features and the graph structure features; therefore, rapid, efficient and accurate vulnerability detection is realized.
Drawings
Fig. 1 is a schematic diagram of a vulnerability detection method according to an embodiment of the present invention;
FIG. 2 is a diagram of a vocabulary according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a DUGNN model architecture according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a vulnerability detection apparatus according to an embodiment of the present invention;
fig. 5 is a schematic diagram of another vulnerability detection apparatus according to an embodiment of the present invention.
Detailed Description
The method provided by the embodiment of the invention obtains the data to be detected; the data to be detected is assembly data included in the binary program file; identifying the data to be detected by using a preset detection model, and determining vulnerability data in the data to be detected; the preset detection model is obtained based on BERT neural network, graph neural network and classifier training; the BERT neural network is used for identifying assembly instruction features in the CFG; the graph neural network is used for identifying graph structure characteristics of the CFG; and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
The present invention will be described in further detail with reference to examples.
Fig. 1 is a schematic flowchart of a vulnerability detection method according to an embodiment of the present invention; as shown in fig. 1, the present invention is applied to electronic devices, such as computers, servers, and the like; the method comprises the following steps:
step 101, acquiring data to be detected; the data to be detected is assembly data included in the binary program file;
step 102, identifying the data to be detected by using a preset detection model, and determining vulnerability data in the data to be detected;
the preset detection model is obtained based on BERT neural network, graph neural network and classifier training;
the BERT neural network is used for identifying assembly instruction features in the CFG;
the graph neural network is used for identifying graph structure characteristics of the CFG;
and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
In this way, it can be identified whether the corresponding data to be detected contains vulnerability data.
Specifically, the method further comprises: separately training a BERT neural network, a graph neural network and a classifier; alternatively, the method comprises: jointly training a BERT neural network, a graph neural network and a classifier.
The three can be trained simultaneously or separately.
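To make the overall flow concrete, the following is a minimal sketch of the detection pipeline in Python; the helper names (extract_cfgs, bert_encoder, graph_encoder, classifier, cfg.node_tokens, cfg.adjacency) are hypothetical placeholders for illustration, not APIs defined by this patent:

```python
# Minimal sketch of the detection flow, assuming hypothetical helpers:
# extract_cfgs() yields one CFG object per function of the binary, and the
# three trained components are available as callables.
import torch

def detect(binary_path, bert_encoder, graph_encoder, classifier, extract_cfgs):
    """Classify every function (CFG) of a binary as vulnerable or not."""
    results = []
    for cfg in extract_cfgs(binary_path):             # disassemble -> per-function CFGs
        node_feats = bert_encoder(cfg.node_tokens)    # assembly instruction features
        graph_vec = graph_encoder(node_feats, cfg.adjacency)  # graph structure features
        probs = torch.softmax(classifier(graph_vec), dim=-1)
        results.append((cfg.name, probs[1].item()))   # P(contains vulnerability)
    return results
```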
Specifically, training the BERT neural network includes:
acquiring a training data set; the training data set includes: at least one CFG and a label corresponding to each CFG, the label characterizing whether the CFG contains a vulnerability; the CFG includes: multiple CFG nodes and edges between the CFG nodes; each CFG node is an assembly instruction, and each edge is a conversion instruction; the CFG is obtained by splitting a program into individual functions using a disassembler;
converting the tokens of the assembly instructions in the CFG nodes into one-hot codes, inputting the codes into a BERT neural network, and training with a Masked Language Model (MLM) as the task to obtain a trained BERT neural network as a part of the preset detection model;
wherein training the BERT neural network comprises: initial training and secondary training;
the initial training includes: in the BERT neural network, the input layer randomly masks a preset proportion of the input, and the output layer predicts the original input, so as to perform initial training on the network;
the secondary training includes: after the features of each CFG node are obtained, a next-CFG-node prediction task is performed, in which upstream and downstream nodes, together with randomly constructed CFG nodes, are taken as input to predict whether the paired CFG nodes actually have an upstream-downstream relationship.
Specifically, training the graph neural network comprises:
acquiring a training data set; the training data set includes: at least one CFG and a label corresponding to each CFG, the label characterizing whether the CFG contains a vulnerability; each CFG node has a defined node embedding;
and inputting each CFG in the training data set and its corresponding label into a graph neural network, and training the graph neural network to obtain the trained graph neural network.
Here, the embedding characterizes a vector; that is, each CFG node corresponds to a defined node vector.
Specifically, training the classifier includes:
inputting the assembly instruction features and graph structure features obtained by processing the CFGs in the training data set, together with the labels corresponding to the CFGs, into a fully-connected neural network, and training the fully-connected neural network to obtain the trained fully-connected neural network as the classifier.
That is, a fully-connected neural network is adopted as the classifier to be trained, and classification training is carried out on it based on the labeled vulnerability data set;
the classifier obtained after classification training is taken as the classifier in the detection model.
The classifier adopts an encoder model stacked from multilayer perceptron (MLP) layers and performs classification training using a softmax activation function; the trained classifier may be used to determine the location of the vulnerability data.
Specifically, after the data to be detected is acquired, the method further includes: converting the data to be detected into at least one CFG to be detected;
correspondingly, the identifying the data to be detected by using a preset detection model and determining the vulnerability data in the data to be detected comprise:
and identifying the at least one CFG to be detected by using a preset detection model, and determining vulnerability data in the data to be detected.
The following is detailed with respect to the training data set:
a computer (subject) can take a binary program and disassemble it, and the program can be split into many functions using the disassembler, i.e., a Control-Flow-Graph (CFG) is created for each function.
A CFG is a special graph in which consistent assembly instructions form nodes and conversion instructions are edges (i.e., the CFG includes edges between multiple CFG nodes and CFG nodes; the CFG nodes are consistent assembly instructions (hereinafter assembly instructions) and the edges are conversion instructions). A consistent assembly instruction is an instruction that does not change the flow of execution (e.g., add, mov), while a translate instruction is an instruction that changes the flow of execution (e.g., jmp, jme, jne, call, ret).
Here, each function in the binary system is represented by a CFG, and each CFG is assigned with a label (the label includes two types: with a bug and without a bug). If there is a vulnerability, marking out the corresponding universal vulnerability Enumeration (CWE) label. The training data set consists of various open source programs with known vulnerabilities. The accuracy of the data set may also be verified by fuzz testing. There are at least 100 examples in each CWE tag category.
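Purely by way of illustration, such per-function CFG extraction can be sketched with the open-source angr framework (the patent does not name a specific disassembler; the choice of angr and the code below are assumptions):

```python
# Sketch of per-function CFG extraction using angr (an assumed tool choice).
import angr

proj = angr.Project("target_binary", auto_load_libs=False)
cfg = proj.analyses.CFGFast()                  # recover the whole-binary CFG

for func in cfg.kb.functions.values():         # one CFG per recovered function
    for block in func.blocks:                  # nodes: straight-line assembly blocks
        for insn in block.capstone.insns:
            print(insn.mnemonic, insn.op_str)  # e.g. "mov eax, 0x1"
    # edges between blocks correspond to conversion instructions (jmp/je/call/ret)
```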
The CFG comprises a plurality of CFG nodes, and each CFG node is an assembly instruction; that is, the nodes in the CFG are assembly instructions, and the assembly instructions are then split to obtain a plurality of tokens.
That is, each assembly instruction may be broken into smaller units, a process referred to as tokenization. Each smaller unit is called a token. All the assembly instructions in a node need to be tokenized. Assuming that an instruction has exactly k (k = 3) operands, extra operands are pruned and missing operands are filled with empty tokens. One operand is represented as m (m = 13) tokens.
The graph neural network is described below.
The neural network should learn assembly instructions before learning binary vulnerabilities. In order to properly learn the assembly instructions in a deep learning model, preprocessing is required; the preprocessing comprises the following steps:
a vocabulary is constructed for the assembly instructions with an appropriate morpheme parser, by extracting library functions, opcodes, registers and hexadecimal values from the assembly instructions and treating them as words.
Assembly instructions typically have a fixed syntax, using 1 opcode and 3 operands. The tokenizer represents each operand as 13 fields. Thus, a tokenized instruction has length 40 (3 operands, each represented as 13 fields, plus one field for the opcode).
The set of all possible tokens is referred to herein as a vocabulary (e.g., fig. 2), the size of the vocabulary being Vt; the vocabulary is much smaller than the word sets used in general NLP. Since assembly instructions typically have a fixed syntax, meaning that each uses one opcode and three operands, a vector of fixed dimension can be created to represent an assembly instruction. In application, the assembly instruction is read by a parser and an appropriate vector representation is constructed for the syntax; each opcode and operand is represented by a vector, which facilitates subsequent processing. This process is called "instruction embedding".
It should be explained that, in a neural network, an embedding is a special kind of vector; thus an embedding must be a vector, but a vector is not necessarily an embedding. Instruction embedding is a process, and the resulting embedding is a vector.
The following is described with respect to the library functions, opcodes, registers, and hexadecimal values:
library functions (Library functions) are one way to put functions into a Library for use by others. The method is to compile and put some commonly used functions into a file for different people to call. When the file name is called, the file name of the file name is added into the file name by using a # include < >. Typically into a lib file.
Opcode refers to the portion of an instruction or field (usually in code) specified in a computer program that is to perform an operation, and is simply an assembly instruction number that tells the CPU which instruction to execute.
The register functions to store binary codes and is formed by combining flip-flops having a storage function. One flip-flop can store 1-bit binary codes, so a register for storing n-bit binary codes needs to be formed by n flip-flops.
Are represented by the numbers 0 to 9 and the letters A to F (or a to F), wherein A to F represent 10 to 15, which are referred to as hexadecimal numbers.
Regarding Instruction Embedding: to obtain embeddings of tokenized instructions, a word2vec model is typically employed, for example the skip-gram version of word2vec, i.e., predicting the context tokens given an input token. The model is trained in an unsupervised manner on the tokenized assembly instruction data set from the training data set section.
The next step is to create token embeddings. The input to this step is a tokenized instruction. After the model is trained, a real-valued vector is obtained for each token; these real-valued vectors are called embeddings. The above creation process is called "token embedding".
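As an illustrative sketch, such token embeddings can be trained with gensim's word2vec in skip-gram mode (sg=1); the hyperparameters and the assembly_instructions corpus variable are assumptions:

```python
# Sketch of unsupervised skip-gram token embedding with gensim's word2vec.
from gensim.models import Word2Vec

corpus = [tokenize_instruction(i) for i in assembly_instructions]  # assumed corpus
w2v = Word2Vec(sentences=corpus, sg=1, vector_size=128, window=5, min_count=1)
vec = w2v.wv["mov"]   # real-valued embedding (the "token embedding") for "mov"
```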
One-hot encoding is employed to vectorize instructions. Here, one-hot encoding is a method of quantizing categorical data that generates a vector whose length equals the vocabulary size Vt. If the data belongs to the i-th category, the i-th bit is set to 1 and the other bits to 0. In this way the tokenized instructions can be converted into vectors.
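A minimal sketch of this one-hot vectorization, reusing the corpus variable from the word2vec sketch above:

```python
# Sketch of one-hot encoding over a vocabulary of size Vt.
import numpy as np

vocab = {tok: i for i, tok in enumerate(sorted({t for ins in corpus for t in ins}))}

def one_hot(token: str) -> np.ndarray:
    v = np.zeros(len(vocab), dtype=np.float32)   # vector of length Vt
    v[vocab[token]] = 1.0                        # i-th bit set for the i-th category
    return v
```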
To obtain the most effective representation for detecting CWEs, token embedding is implemented using the BERT (Bidirectional Encoder Representations from Transformers) model. In particular, BERT is trained in an unsupervised manner (without the need to label data) using the MLM (Masked Language Model) and NSP (Next Sentence Prediction) tasks, where a sentence is one line of an assembly function obtained from a binary file, and the one-hot token vectors are fed into the model to be embedded:
v_hot ∈ {0,1}^(Vt×1)
v = BERT(v_hot) ∈ R^(n×1)
where n = 510 is the instruction embedding size, v is the instruction embedding, and Vt is the vocabulary size. As a result, a real-valued vector is obtained for each token.
The BERT is used to learn the linguistic features of the assembly instructions in the CFG nodes.
The specific scheme is as follows: the tokens of the assembly instructions in the CFG nodes are converted into one-hot codes, input into the BERT neural network, and trained with MLM as the task.
In the BERT neural network, the input layer randomly masks a portion of the input (30%), and the output layer predicts the original input, so as to perform initial training on the network. After the linguistic features of the assembly instructions of each CFG node are obtained, the next-node prediction task is performed, i.e., upstream and downstream nodes are put together with some randomly constructed nodes as input to predict whether pairs of nodes indeed have an upstream-downstream relationship. The neural network is trained a second time using this task. In this way, an instruction-embedding Encoder is obtained; this encoder accepts assembly instructions and outputs their embeddings.
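A minimal PyTorch sketch of the masking step described above (30% of input tokens masked, originals predicted); the bert model, the token_ids batch and vocab_size are assumed to exist:

```python
# Sketch of the MLM step: mask 30% of tokens and predict the originals.
import torch
import torch.nn.functional as F

def mlm_mask(token_ids: torch.Tensor, mask_id: int, ratio: float = 0.3):
    mask = torch.rand(token_ids.shape) < ratio                   # pick ~30% of positions
    labels = torch.where(mask, token_ids, torch.full_like(token_ids, -100))
    return token_ids.masked_fill(mask, mask_id), labels          # -100 = ignored by CE

masked, labels = mlm_mask(token_ids, mask_id=vocab_size)         # assumed inputs
logits = bert(masked)                                            # (batch, seq, vocab)
loss = F.cross_entropy(logits.transpose(1, 2), labels)           # predict originals
```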
Then, the next step is to create the Graph Embedding. The description of Graph Embedding is as follows:
In this step, a CFG graph is taken as input, where each CFG node has a defined node embedding. The output of this step is the embedding of the entire CFG graph. This embedding is further used to classify the entire graph (each graph is a function) as vulnerable or not vulnerable; therefore, a graph encoder is used to generate this embedding.
The graph encoder is based on the DUGNN model. First, the node embeddings are computed; then, the node embedding matrix and the graph adjacency matrix are fed into DUGNN to obtain a graph embedding matrix, which is converted into a graph embedding vector using an attention layer.
Fig. 3 shows a simplified DUGNN model architecture, in which X (the node feature matrix) and A (the adjacency matrix) represent the input data, and Graph embedding represents the output data. Attention denotes a neural network layer; GCN and Pooling form a block that is repeated many times. Arrows indicate the flow of data and operations.
The DUGNN model is defined below (fig. 3). The DUGNN model uses graph convolution layers (here called gconv layers), where M is the number of gconv layers in the model. The gconv layer is defined according to formula (1). Each gconv layer consists of N convolutional sublayers called gfc, which are defined in a nested manner in formulas (2) and (3). (Formulas (1) to (3) appear only as images in the original publication and are not reproduced here.)
MLP will be used to represent fully-connected layers.
To increase the model complexity of the graph encoder, higher-order statistics of the features are captured during the graph convolution (GCN) operations, similar to graph capsule networks. Besides the need to increase model complexity, there is also a Laplacian smoothing problem in GCN models. To overcome the smoothing problem, the output of the intermediate coding layers of the graph convolution (GCN) is concatenated and fed into the subsequent layers. In this way, the graph encoder can learn multi-scale smoothing characteristics of the nodes, thereby avoiding under-smoothing or over-smoothing problems in GCN-based models. In addition, this has the side benefits of alleviating the vanishing gradient problem and enhancing feature propagation in deep networks.
To obtain the graph embedding, the CFG node representations in the CFG graph need to be obtained as described above. The first m node instructions ("tokenization instructions") are vectorized using one-hot encoding:
v_hot^(i) = one-hot(instruction_i) ∈ {0,1}^(Vt×1)
This formula represents the one-hot encoding of tokenized instruction i, resulting in a vector. The tokenized instructions in the same node are concatenated (concat) to obtain the instruction vector v_hot. Then, the BERT model is applied to obtain the node embedding:
X = BERT(v_hot) ∈ R^(m×1)
where m = 510 is the size of the embedding. Thus, X is the embedding of a node in the graph.
In the above formulas, k denotes the node embedding size;
M denotes the number of gconv layers in DUGNN;
N denotes the number of gfc sublayers in a gconv layer;
P denotes the polynomial index in the gconv layer;
X denotes the graph node embeddings (and, after encoding, the resulting node embeddings);
A denotes the graph adjacency matrix;
n denotes the number of nodes in the graph;
l denotes the graph embedding size;
R denotes the real number space.
The graph filter function with adjacency matrix A and identity matrix I will be denoted below using g(A):
g(A) = A + I
The graph embedding v_graph is defined as the output of the last convolution. The convolution is denoted F:
v_graph = F^M(X, A) ∈ R^(l×n)
During the convolution operation, the output of each intermediate layer is fed to the subsequent layers:
F^l(X, A) = concat(F^(l-1)(X, A), selu(gconv(F^(l-1)(X, A), A)))
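A minimal PyTorch sketch of this recursion; the gconv modules are assumed to be graph-convolution layers taking (features, adjacency):

```python
# Sketch of F^l(X, A) = concat(F^(l-1)(X, A), selu(gconv(F^(l-1)(X, A), A))).
import torch
import torch.nn.functional as F

def graph_encoder_forward(X, A, gconv_layers):
    h = X                                     # F^0 = node embeddings
    for gconv in gconv_layers:                # M gconv layers (input dims assumed to match)
        h = torch.cat([h, F.selu(gconv(h, A))], dim=-1)  # concat intermediate output
    return h                                  # multi-scale node features (v_graph)
```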
each convolutional gconv layer is applied to X1,X2,...XPGfc is calculated for all possible combinations of i and P, where i ∈ { 1., N } and P ∈ { 1.,. P }.
Figure BDA0002864477640000121
gfc layers are defined as:
Figure BDA0002864477640000131
Figure BDA0002864477640000132
using attention layer (MLP)attentionHandle vgraph∈Rl×|nodes|Conversion to v'graph∈Rl ×1
wattention=softmax(MLPattention(vgraph))
wattention∈Rn×1
Wherein, wattentionIs the attention score.
Summing the attention scores of the nodes to obtain the graph vector embedding:
v′graph=sum(vgraph·wattention)∈Rl×1
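A minimal PyTorch sketch of this attention pooling; a single linear layer stands in for MLP_attention here, which is an assumption:

```python
# Sketch of w = softmax(MLP_attention(v_graph)), v'_graph = sum(v_graph * w).
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, l: int):                       # l = graph embedding size
        super().__init__()
        self.mlp = nn.Linear(l, 1)                    # stands in for MLP_attention

    def forward(self, v_graph):                       # v_graph: (n_nodes, l)
        w = torch.softmax(self.mlp(v_graph), dim=0)   # attention scores: (n_nodes, 1)
        return (v_graph * w).sum(dim=0), w.squeeze(-1)  # v'_graph: (l,), scores: (n_nodes,)
```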
further description regarding the classifier follows.
An encoder model stacked from multilayer perceptron (MLP) layers is used, and classification is done using the softmax activation function:
ŷ = softmax(MLP(v'_graph))
The node more likely to be the vulnerability location is selected according to the attention score:
node = argmax(|w_attention|)
that is, the node at which the absolute value of the attention score is largest.
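A minimal sketch of the classifier head and the localization step, reusing v'_graph (here v_graph_pooled) and the attention scores from the pooling sketch above; the hidden size is an assumption:

```python
# Sketch of the MLP-stacked classifier head with softmax, plus localization.
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Linear(l, 64), nn.ReLU(), nn.Linear(64, 2))

logits = classifier(v_graph_pooled)        # v'_graph from the attention pooling
probs = torch.softmax(logits, dim=-1)      # [P(no vulnerability), P(vulnerability)]
node = torch.argmax(w_attention.abs())     # index of the most likely vulnerable node
```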
The description of classification training is as follows:
consider that the losses in the model include two parts: CELoss (cross entry loss) of the label, BCELoss (binary cross entry loss) of the attention layer.
If the node is the location where the vulnerability occurs, then BCELoss is applied
Figure BDA0002864477640000141
One-hot one-vector, where 1 appears at the index of the vulnerability location and the other index locations are 0. If not, then,
Figure BDA0002864477640000142
is a position of
Figure BDA0002864477640000143
The vector of (2).
Namely, the following loss function is adopted for classification training:
Figure BDA0002864477640000144
wherein class _ frac and attribute _ frac correspond to scores of CELOSs and BCELOSs losses, respectively.
Figure BDA0002864477640000145
And
Figure BDA0002864477640000146
respectively, a label and a prediction with a vulnerability node.
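A minimal PyTorch sketch of this combined loss, assuming the attention scores are already in (0, 1) from the softmax and vuln_index is None for non-vulnerable graphs:

```python
# Sketch of loss = class_frac * CELoss + attention_frac * BCELoss.
import torch
import torch.nn.functional as F

def combined_loss(logits, label, w_attention, vuln_index, class_frac, attention_frac):
    ce = F.cross_entropy(logits.unsqueeze(0), label.view(1))   # label CE loss
    target = torch.zeros_like(w_attention)                     # all-zero if no vulnerability
    if vuln_index is not None:
        target[vuln_index] = 1.0                               # one-hot location vector
    bce = F.binary_cross_entropy(w_attention, target)          # attention BCE loss
    return class_frac * ce + attention_frac * bce
```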
In summary, the method provided in the embodiment of the present invention for training a preset detection model (i.e. a neural network) includes the following steps:
first, a BERT model is trained on unlabeled assembly instructions;
then, the DUGNN model is trained on a large training data set, whose labels may not be accurate enough;
training then continues on a small training data set with accurate labels;
finally, the model is checked on the Juliet dataset (synthetic data not suitable for training).
The network structures and weights of the neural assembly language encoder and the graph neural network encoder obtained by training are fixed (no weight updating during back propagation) and used as the CFG backbone network encoder; then a fully-connected neural network is connected as the classifier, classification training is performed based on the labeled vulnerability data set, and the parameters of this neural network are updated during training. The trained network can be used for vulnerability detection.
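A minimal PyTorch sketch of this final stage, with assumed bert_encoder, graph_encoder and classifier modules and an assumed labeled_loader over the labeled vulnerability data set:

```python
# Sketch: freeze the two trained encoders, train only the classifier.
import torch

for p in list(bert_encoder.parameters()) + list(graph_encoder.parameters()):
    p.requires_grad = False                    # no weight updates in back propagation

optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
for cfg_tokens, adjacency, labels in labeled_loader:   # labeled vulnerability data set
    graph_vec = graph_encoder(bert_encoder(cfg_tokens), adjacency)
    loss = torch.nn.functional.cross_entropy(classifier(graph_vec), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```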
Fig. 4 is a schematic structural diagram of a vulnerability detection apparatus according to an embodiment of the present invention; as shown in fig. 4, the apparatus includes:
the acquisition module is used for acquiring data to be detected; the data to be detected is assembly data included in the binary program file;
the detection module is used for identifying the data to be detected by using a preset detection model and determining vulnerability data in the data to be detected;
the preset detection model is obtained based on BERT neural network, graph neural network and classifier training;
the BERT neural network is used for identifying assembly instruction features in the CFG;
the graph neural network is used for identifying graph structure characteristics of the CFG;
and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
Specifically, the apparatus further comprises: a preprocessing module, configured to separately train a BERT neural network, a graph neural network and a classifier, or to jointly train a BERT neural network, a graph neural network and a classifier.
Specifically, the preprocessing module is configured to acquire a training data set; the training data set includes: at least one CFG and a label corresponding to each CFG, the label characterizing whether the CFG contains a vulnerability; the CFG includes: multiple CFG nodes and edges between the CFG nodes; each CFG node is an assembly instruction, and each edge is a conversion instruction; the CFG is obtained by splitting a program into individual functions using a disassembler;
converting the tokens of the assembly instructions in the CFG nodes into one-hot codes, inputting the codes into a BERT neural network, and training with MLM as the task to obtain the trained BERT neural network as a part of the preset detection model;
wherein training the BERT neural network comprises: initial training and secondary training;
the initial training includes: in the BERT neural network, the input layer randomly masks a preset proportion of the input, and the output layer predicts the original input, so as to perform initial training on the network;
the secondary training includes: after the features of each CFG node are obtained, a next-CFG-node prediction task is performed, in which upstream and downstream nodes, together with randomly constructed CFG nodes, are taken as input to predict whether the paired CFG nodes actually have an upstream-downstream relationship.
Specifically, the preprocessing module is configured to acquire a training data set; the training data set includes: at least one CFG and a label corresponding to each CFG, the label characterizing whether the CFG contains a vulnerability; each CFG node has a defined node embedding, the embedding characterizing a vector;
and the preprocessing module is configured to input each CFG in the training data set and its corresponding label into a graph neural network, and to train the graph neural network to obtain the trained graph neural network.
Specifically, the preprocessing module is configured to input the assembly instruction features and graph structure features obtained by processing the CFGs in the training data set, together with the labels corresponding to the CFGs, into a fully-connected neural network, and to train the fully-connected neural network to obtain the trained fully-connected neural network as the classifier.
Specifically, the classifier adopts an encoder model stacked from multilayer perceptron (MLP) layers and performs classification training using a softmax activation function; the trained classifier may be used to determine the location of the vulnerability data.
Specifically, the acquiring module is configured to, after acquiring data to be detected, convert the data to be detected into at least one CFG to be detected;
correspondingly, the detection module is configured to identify the at least one to-be-detected CFG by using a preset detection model, and determine vulnerability data in the to-be-detected data.
It should be noted that when the vulnerability detection apparatus provided above performs vulnerability detection, the division into the program modules described above is merely illustrative; in practical applications, the processing may be distributed among different program modules as needed, that is, the internal structure of the electronic device may be divided into different program modules to complete all or part of the processing described above. In addition, the apparatus provided by the above embodiment and the embodiment of the corresponding method belong to the same concept; the specific implementation process thereof is described in the method embodiment and is not repeated here.
Fig. 5 is a schematic structural diagram of a communication device according to an embodiment of the present invention, and as shown in fig. 5, the communication device 50 includes: a processor 501 and a memory 502 for storing computer programs executable on the processor; the processor 501 is configured to, when running the computer program, perform: acquiring data to be detected; the data to be detected is assembly data included in the binary program file; identifying the data to be detected by using a preset detection model, and determining vulnerability data in the data to be detected; the preset detection model is obtained based on BERT neural network, graph neural network and classifier training; the BERT neural network is used for identifying assembly instruction features in the CFG; the graph neural network is used for identifying graph structure characteristics of the CFG; and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
The related operations have been described in the above embodiments, and are not described again here.
In practical applications, the communication device 50 may further include: at least one network interface 503. The various components in the communication device 50 are coupled together by a bus system 504. It is understood that the bus system 504 is used to enable communications among the components. The bus system 504 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 504 in fig. 5. The number of the processors 501 may be at least one. The network interface 503 is used for communication between the communication device 50 and other devices in a wired or wireless manner. Memory 502 is used to store various types of data to support the operation of communication device 50.
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored; the computer program, when executed by a processor, performs: acquiring data to be detected; the data to be detected is assembly data included in the binary program file; identifying the data to be detected by using a preset detection model, and determining vulnerability data in the data to be detected; the preset detection model is obtained based on BERT neural network, graph neural network and classifier training; the BERT neural network is used for identifying assembly instruction features in the CFG; the graph neural network is used for identifying graph structure characteristics of the CFG; and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
The related operations have been described in the above embodiments, and are not described again here.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A vulnerability detection method, the method comprising:
acquiring data to be detected; the data to be detected is assembly data included in the binary program file;
identifying the data to be detected by using a preset detection model, and determining vulnerability data in the data to be detected;
the preset detection model is obtained based on BERT neural network, graph neural network and classifier training;
the BERT neural network is used for identifying assembly instruction features in the CFG;
the graph neural network is used for identifying graph structure characteristics of the CFG;
and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
2. The method of claim 1, further comprising: separately training a BERT neural network, a graph neural network and a classifier; alternatively, the method comprises: jointly training a BERT neural network, a graph neural network and a classifier.
3. The method of claim 2, wherein the training the BERT neural network comprises:
acquiring a training data set; the training data set includes: at least one CFG and a label corresponding to each CFG, the label characterizing whether the CFG contains a vulnerability; the CFG includes: multiple CFG nodes and edges between the CFG nodes; each CFG node is an assembly instruction, and each edge is a conversion instruction; the CFG is obtained by splitting a program into individual functions using a disassembler;
converting the tokens of the assembly instructions in the CFG nodes into one-hot codes, inputting the codes into a BERT neural network, and training with MLM as the task to obtain the trained BERT neural network as a part of the preset detection model;
wherein training the BERT neural network comprises: initial training and secondary training;
the initial training includes: in the BERT neural network, the input layer randomly masks a preset proportion of the input, and the output layer predicts the original input, so as to perform initial training on the network;
the secondary training includes: after the features of each CFG node are obtained, a next-CFG-node prediction task is performed, in which upstream and downstream nodes, together with randomly constructed CFG nodes, are taken as input to predict whether the paired CFG nodes actually have an upstream-downstream relationship.
4. The method of claim 3, wherein the training the graph neural network comprises:
acquiring a training data set; the training data set includes: at least one CFG and a label corresponding to each CFG, the label characterizing whether the CFG contains a vulnerability; each CFG node has a defined node embedding, the embedding characterizing a vector;
and inputting each CFG in the training data set and its corresponding label into a graph neural network, and training the graph neural network to obtain the trained graph neural network.
5. The method of claim 3, wherein training the classifier comprises:
inputting the assembly instruction features and graph structure features obtained by processing the CFGs in the training data set, together with the labels corresponding to the CFGs, into a fully-connected neural network, and training the fully-connected neural network to obtain the trained fully-connected neural network as the classifier.
6. The method of claim 5, wherein the classifier employs an encoder model stacked from multilayer perceptron (MLP) layers and performs classification training using a softmax activation function; the trained classifier may be used to determine the location of the vulnerability data.
7. The method of claim 1, wherein after the acquiring the data to be detected, the method further comprises: converting the data to be detected into at least one CFG to be detected;
correspondingly, the identifying the data to be detected by using a preset detection model and determining the vulnerability data in the data to be detected comprise:
and identifying the at least one CFG to be detected by using a preset detection model, and determining vulnerability data in the data to be detected.
8. A vulnerability detection apparatus, the apparatus comprising:
the acquisition module is used for acquiring data to be detected; the data to be detected is assembly data included in the binary program file;
the detection module is used for identifying the data to be detected by using a preset detection model and determining vulnerability data in the data to be detected;
the preset detection model is obtained based on BERT neural network, graph neural network and classifier training;
the BERT neural network is used for identifying assembly instruction features in the CFG;
the graph neural network is used for identifying graph structure characteristics of the CFG;
and the classifier is used for identifying the vulnerability data based on the assembly instruction features and the graph structure features.
9. A vulnerability detection apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011577796.9A 2020-12-28 2020-12-28 Vulnerability detection method and device and storage medium Pending CN112560049A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011577796.9A CN112560049A (en) 2020-12-28 2020-12-28 Vulnerability detection method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011577796.9A CN112560049A (en) 2020-12-28 2020-12-28 Vulnerability detection method and device and storage medium

Publications (1)

Publication Number Publication Date
CN112560049A true CN112560049A (en) 2021-03-26

Family

ID=75033816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011577796.9A Pending CN112560049A (en) 2020-12-28 2020-12-28 Vulnerability detection method and device and storage medium

Country Status (1)

Country Link
CN (1) CN112560049A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326187A (en) * 2021-05-25 2021-08-31 扬州大学 Data-driven intelligent detection method and system for memory leakage
CN113326187B (en) * 2021-05-25 2023-11-24 扬州大学 Data-driven memory leakage intelligent detection method and system
CN113378176A (en) * 2021-06-11 2021-09-10 大连海事大学 Software vulnerability identification method with weight deviation based on graph neural network detection
CN113378176B (en) * 2021-06-11 2023-06-23 大连海事大学 Software vulnerability identification method based on graph neural network detection with weight deviation
CN113821723A (en) * 2021-09-22 2021-12-21 广州博冠信息科技有限公司 Searching method and device and electronic equipment
CN113821723B (en) * 2021-09-22 2024-04-12 广州博冠信息科技有限公司 Searching method and device and electronic equipment
CN113792820A (en) * 2021-11-15 2021-12-14 航天宏康智能科技(北京)有限公司 Countermeasure training method and device for user behavior log anomaly detection model
CN115495755A (en) * 2022-11-15 2022-12-20 四川大学 Codebert and R-GCN-based source code vulnerability multi-classification detection method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination