CN114969763A - Fine-grained vulnerability detection method based on seq2seq code representation learning - Google Patents

Fine-grained vulnerability detection method based on seq2seq code representation learning

Info

Publication number
CN114969763A
CN114969763A (application CN202210700763.1A)
Authority
CN
China
Prior art keywords
statement
sentence
vector
code
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210700763.1A
Other languages
Chinese (zh)
Inventor
苏小红
蒋远
郑伟宁
陶文鑫
王甜甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202210700763.1A
Publication of CN114969763A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 - Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/57 - Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F 21/577 - Assessing vulnerabilities and evaluating computer system security
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a fine-grained vulnerability detection method based on seq2seq code representation learning. The method first extracts vulnerability candidate key nodes as slicing criteria and then uses program slicing to extract slice code segments from the program. A seq2seq-based deep learning model then performs representation learning on the slice code segments, producing a sequence of statement vector representations that captures long-range dependencies between statements; the vector representation of each statement in the sequence is fed to a detector, which decides whether the statement is a vulnerable statement. The method makes full use of the global and local semantic information in the code to learn vulnerability-related features within and between statements, avoids the difficulty traditional deep learning classification models have in capturing long-range dependencies between a vulnerable statement and its context, and, by using the seq2seq model to generate the statement vector representation sequence, is better suited to statement-level fine-grained vulnerability detection.

Description

Fine-grained vulnerability detection method based on seq2seq code representation learning
Technical Field
The invention relates to a fine-grained software vulnerability detection method, and in particular to a fine-grained vulnerability detection method based on seq2seq code representation learning.
Background
A vulnerability is a security flaw in a computer system that not only threatens the system itself but also undermines the confidentiality, integrity, and availability of application data. In recent years, growing system scale and complexity and the introduction of new technologies have increased the likelihood of software vulnerabilities. Automatic vulnerability detection is an important means of reducing them. However, most current deep-learning-based vulnerability detection research operates at coarse granularity, such as the file, function, or code-fragment level: it can only predict whether a file, function, or code fragment is likely to contain a vulnerability. Developers must then locate the vulnerable statements manually, which is difficult, so vulnerabilities may not be repaired in time. A statement-level fine-grained detection method therefore helps developers understand and quickly repair vulnerabilities, and is the trend of current vulnerability detection research. Although a small amount of research on fine-grained detection methods has appeared in recent years, localization accuracy still needs improvement. The sparsity and discontinuity of vulnerability information, the contextual dependency of vulnerable statements, and the complexity and concealment of vulnerability features all pose challenges for fine-grained vulnerability detection.
Disclosure of Invention
The aim of the invention is to provide a fine-grained vulnerability detection method based on seq2seq code representation learning that makes full use of the global and local semantic information in the code to learn vulnerability-related features within and between statements, avoids the difficulty traditional deep learning classification models have in capturing long-range dependencies between a vulnerable statement and its context, and uses the statement vector representation sequence generated by seq2seq representation learning, making it better suited to statement-level fine-grained vulnerability detection.
The purpose of the invention is realized by the following technical scheme:
a fine-grained vulnerability detection method based on seq2seq code representation learning comprises the steps of firstly, extracting vulnerability candidate key nodes as slicing criteria, and then extracting slicing code segments in a program by using a program slicing technology. Then, the slice code segment is representation-learned by using a seq2 seq-based deep learning model. Specifically, firstly, a sentence coding network in a seq2seq model encoder is used to perform representation learning on token sequences in the slice code segment sentences, and sentence primary vector representation containing local semantic information is generated. The term vector sequence is then representation-learned using a program coding network in the encoder, using as input a term sequence formed from the obtained primary vector representation for each term, to generate a term high-level vector representation containing term context information. Then, global semantic information related to dependencies and vulnerabilities between sentences in the learner is learned in tandem using a dual attention mechanism based on self-attention and textual attention. And finally, using the statement high-level vector representation generated by the encoder as input and combining global semantic information, generating a statement final vector representation sequence containing long dependency relations among statements by using a decoder network in a seq2seq model encoder, sending the final vector representation of each statement in the sequence into a detector, and detecting whether each statement is a bug statement. The method specifically comprises the following steps:
Step 1: analyze the source code with a static analysis tool to generate an abstract syntax tree and a program dependency graph;
Step 2: use the abstract syntax tree to extract vulnerability candidate key nodes of the source code as slicing criteria, generate slice code segments of the source code, and normalize the slice code segments to obtain name-normalized slice code segments;
Step 3: use the statement encoding network in the seq2seq deep learning model encoder to perform representation learning on the token sequences of the statements in the slice code segments, generating primary statement vector representations that contain local semantic information;
Step 4: taking the sequence formed by the primary vector representations of the statements obtained in step 3 as input, use the program encoding network in the seq2seq encoder to perform representation learning on the statement sequence, generating high-level statement vector representations that contain statement context information;
Step 5: feed the sequence of high-level statement vector representations obtained in step 4 into a dual attention module based on self-attention and text attention, learn the dependencies between statements through self-attention, and then learn the vulnerability-related global semantic information through text attention;
Step 6: taking the vulnerability-related global semantic information obtained in step 5 and the sequence of high-level statement vector representations obtained in step 4 as input, feed them into the decoder network of the seq2seq deep learning model to learn long-range dependency information between statements and generate the final vector representation of each statement;
Step 7: feed the final vector representation of each statement obtained in step 6 into a detector network consisting of a multilayer perceptron (MLP) and a softmax layer to obtain a prediction of whether the statement contains a vulnerability, compute a cross-entropy loss using the statement label information, and adjust the network parameters via error back-propagation until the loss no longer decreases, completing training;
Step 8: use the trained model to perform statement-level fine-grained vulnerability detection on code.
In the seq2seq-based deep learning model used for code representation learning, the statement encoding network in the encoder extracts dependencies between tokens within a statement to obtain local semantic information; the program encoding network learns context information between statements; the dual attention mechanism acquires vulnerability-related global semantic information; the decoder network acquires long-range dependencies between statements; and the detector network outputs a prediction of whether each statement contains a vulnerability.
Existing vulnerability detection methods realize detection through code representation learning based on deep learning models such as convolutional neural networks, recurrent neural networks, or graph neural networks. In contrast, the invention provides a fine-grained vulnerability detection method based on seq2seq code representation learning, applying a seq2seq model to the statement-level fine-grained vulnerability detection task for the first time. Unlike traditional deep-learning-based binary classification models, the seq2seq model commonly used in machine translation can directly realize sequence-to-sequence mapping. A seq2seq model consists of an encoder and a decoder: the encoder encodes the variable-length input sequence of program statements into fixed-length high-level statement vector representations, and the decoder decodes the encoder output back into a variable-length sequence of program statement vector representations. In terms of model structure, the generated sequence can therefore be applied directly to statement-level vulnerability detection. Moreover, the seq2seq model can consider both the local and the global semantic information of statements in a sample program. Because vulnerable statements depend strongly on their context, conventional RNN or CNN models, which only extract short-range dependencies between adjacent statements, have difficulty capturing long-range dependency information between distant statements. In contrast, the seq2seq model encodes all vulnerability-related statements in the sample program and, together with the global semantic information acquired by the dual attention mechanism, uses them to guide the decoder's generation of the statement vector sequence; this yields a more accurate statement vector representation sequence whose generation result can be used directly for statement-level fine-grained vulnerability detection.
Compared with the prior art, the invention has the following advantages:
(1) Directly performing representation learning with a seq2seq deep learning model on slice code segments generated by program slicing makes full use of the local and global semantic information of vulnerable code, and the statement vector representations obtained from the sequence generation model are better suited to realizing statement-level fine-grained vulnerability detection.
(2) The method introduces a dual attention mechanism based on self-attention and text attention between the encoder and the decoder of the seq2seq model, so that while learning global semantic information it can also effectively learn the importance of each program statement to the vulnerability, improving the accuracy of statement-level fine-grained vulnerability detection.
Drawings
FIG. 1 is a schematic flow diagram of the source code fine-grained vulnerability detection method according to the present invention.
FIG. 2 is an example of vulnerability code.
FIG. 3 is the statement-level fine-grained vulnerability detection process.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but is not limited thereto; any modification or equivalent replacement that does not depart from the spirit and scope of the technical solution of the present invention shall be covered by the protection scope of the present invention.
The invention provides a fine-grained vulnerability detection method based on seq2seq code representation learning. First, vulnerability candidate key nodes are extracted as slicing criteria, and program slicing is used to extract slice code segments from the program. Second, a seq2seq deep learning model performs representation learning on the slice code segments, turning the representation learning problem for statements in code into a statement sequence generation problem and realizing end-to-end learning of the global and local semantic information of the code. An encoder consisting of a statement encoding network and a program encoding network performs staged representation learning on the statements in the slice code segments: the statement encoding network extracts dependencies between tokens within a statement to learn local semantic information, and the program encoding network then learns context information between statements. To better learn the vulnerability-related global semantics, the invention introduces a dual attention mechanism based on self-attention and text attention between the encoder and the decoder: self-attention learns the dependencies between statements in the program, and text attention acquires the vulnerability-related global semantic information. Finally, the decoder network, further combining the global semantic information, learns long-range dependency information between statements to obtain the final vector representation of each statement, which is fed to the detector to realize statement-level fine-grained vulnerability detection. As shown in fig. 1, the method comprises the following specific steps:
Step 1: analyze the source code with a static analysis tool to generate an abstract syntax tree and a program dependency graph.
Step 2: use the abstract syntax tree to extract vulnerability candidate key nodes of the source code as slicing criteria, generate slice code segments of the source code, and normalize the slice code segments to obtain name-normalized slice code segments.
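The patent does not specify the normalization scheme; the following is a minimal, hypothetical Python sketch of the kind of name normalization commonly applied to slice code segments (VulDeePecker-style), mapping user-defined variables to VAR1, VAR2, ... and user-defined functions to FUN1, FUN2, .... The keyword and API lists here are illustrative assumptions, not the patent's actual configuration.

```python
import re

# Illustrative (incomplete) lists; a real implementation would use the full
# C keyword set and a curated library-API whitelist.
C_KEYWORDS = {
    "if", "else", "for", "while", "return", "int", "char", "void", "float",
    "double", "unsigned", "sizeof", "struct", "static", "const", "break",
}
KNOWN_APIS = {"memcpy", "strcpy", "malloc", "free", "printf", "strlen"}

def normalize_slice(lines):
    """Rename user-defined identifiers in a slice code segment."""
    var_map, fun_map = {}, {}
    ident = re.compile(r"\b[A-Za-z_]\w*\b")
    out = []
    for line in lines:
        def rename(m, line=line):
            name = m.group(0)
            if name in C_KEYWORDS or name in KNOWN_APIS:
                return name
            # Heuristic: an identifier directly followed by '(' is a function.
            if line[m.end():m.end() + 1] == "(":
                return fun_map.setdefault(name, f"FUN{len(fun_map) + 1}")
            return var_map.setdefault(name, f"VAR{len(var_map) + 1}")
        out.append(ident.sub(rename, line))
    return out

print(normalize_slice(["char buf[16];", "strcpy(buf, src);"]))
# -> ['char VAR1[16];', 'strcpy(VAR1, VAR2);']
```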
Step 3: perform representation learning on the token sequences of the statements in the slice code segments using the statement encoding network in the seq2seq deep learning model encoder, generating primary statement vector representations that contain local semantic information. The specific steps are as follows:
Step 31: split the statements in the slice code segment into tokens, and use a pre-trained word2vec word embedding model to obtain the vector representation of each token, forming a token vector matrix.
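As an illustration of step 31, the following sketch builds token embeddings with gensim's Word2Vec; the toy corpus, vector size, and tokenization are assumptions for demonstration, not the patent's actual settings.

```python
from gensim.models import Word2Vec
import numpy as np

# Toy corpus of tokenized (already normalized) statements.
corpus = [["char", "VAR1", "[", "16", "]", ";"],
          ["strcpy", "(", "VAR1", ",", "VAR2", ")", ";"]]
w2v = Word2Vec(sentences=corpus, vector_size=64, window=5, min_count=1, sg=1)

def token_matrix(statement_tokens, model):
    """Stack the embedding of each token into a (num_tokens, dim) matrix."""
    return np.stack([model.wv[t] for t in statement_tokens])

M = token_matrix(corpus[1], w2v)
print(M.shape)  # (7, 64)
```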
Step 32: feed the token vector matrix generated in step 31 into a statement encoding network implemented with a GRU, learn the hidden vector representation of each token, and obtain the primary statement vector representation by weighted summation of the hidden vector representations of all tokens with learnable weights.
The specific calculation formula is as follows:
$$z = \sigma(W_z \cdot w_t + U_z \cdot h_{t-1} + b_z)$$
$$r = \sigma(W_r \cdot w_t + U_r \cdot h_{t-1} + b_r)$$
$$\tilde{h}_t = \tanh\!\left(W_h \cdot w_t + U_h \cdot (r \odot h_{t-1}) + b_h\right)$$
$$h_t = (1 - z) \odot h_{t-1} + z \odot \tilde{h}_t$$
$$x = \sum_{t=1}^{n} U_t \cdot h_t$$

where $z$ and $r$ represent the update gate and reset gate respectively, $\sigma$ is the activation function, $w_t$ is the initial vector representation of the $t$-th token in the statement, $h_t$ and $\tilde{h}_t$ represent the hidden state and intermediate temporary state of the $t$-th token, $W_z$, $W_r$, $W_h$, $U_z$, $U_r$, $U_h$, $U_t$ are learnable weight parameters, $b_z$, $b_r$, $b_h$ are bias terms, $x$ is the statement vector representation, and $n$ is the total number of tokens in the statement.
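The following is a minimal PyTorch sketch of the statement encoding network of step 32: a GRU over the token vector matrix followed by a learnable weighted sum of the token hidden states. Layer sizes are illustrative assumptions, and the softmax-normalized scoring is one possible realization of the learnable weights $U_t$, not necessarily the patent's.

```python
import torch
import torch.nn as nn

class StatementEncoder(nn.Module):
    def __init__(self, emb_dim=64, hid_dim=128):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.score = nn.Linear(hid_dim, 1, bias=False)  # learnable weights

    def forward(self, tokens):                    # tokens: (batch, n, emb_dim)
        h, _ = self.gru(tokens)                   # hidden state h_t per token
        w = torch.softmax(self.score(h), dim=1)   # per-token weight
        return (w * h).sum(dim=1)                 # x = weighted sum over t

x = StatementEncoder()(torch.randn(2, 7, 64))
print(x.shape)  # torch.Size([2, 128])
```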
Step 4: feed the sequence formed by the primary vector representations of the statements obtained in step 3 into the program encoding network in the seq2seq deep learning model encoder, and learn high-level statement vector representations that contain statement context information. The specific steps are as follows:
Step 41: pad the vector representations of the statements in the slice code segment to obtain an initialized statement vector matrix composed of statement vectors.
Step 42: feed the initialized statement vector matrix generated in step 41 into a program encoding network implemented with a BiGRU, and learn the hidden vector representation of each statement in the program.
The specific calculation formula is as follows:
$$\overrightarrow{h_i} = \overrightarrow{\mathrm{GRU}}(x_i), \quad i \in [1, L]$$
$$\overleftarrow{h_i} = \overleftarrow{\mathrm{GRU}}(x_i), \quad i \in [L, 1]$$
$$e_i = \left[\overrightarrow{h_i} \,\|\, \overleftarrow{h_i}\right]$$

where $x_i$ is the vector representation of the $i$-th statement in the slice code segment, obtained by the statement encoding network in step 3; $L$ is the total number of statements in the slice code segment; $\overrightarrow{h_i}$ is the forward hidden state of the statement produced by the forward GRU unit $\overrightarrow{\mathrm{GRU}}$, $\overleftarrow{h_i}$ is the backward hidden state, and $e_i$ is the high-level statement vector representation finally obtained in this step.
Step 5: feed the sequence of high-level statement vector representations obtained in step 4 into a dual attention module based on self-attention and text attention, learn the dependencies between statements through self-attention, and learn the vulnerability-related global semantic information through text attention. The specific steps are as follows:
Step 51: learn the dependencies between statements through a self-attention mechanism to obtain the hidden vector matrix of the statements. The specific calculation formula is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{c}}\right) V$$
$$\mathrm{MultiHead}(Q, K, V) = \left(\mathrm{head}_1 \,\|\, \mathrm{head}_2 \,\|\, \cdots \,\|\, \mathrm{head}_a\right) W^{O}$$
$$\mathrm{head}_p = \mathrm{Attention}\!\left(Q W_p^{Q},\, K W_p^{K},\, V W_p^{V}\right)$$
$$X^{se} = \mathrm{MultiHead}(E, E, E)$$

where $E = [e_1, e_2, \ldots, e_L]$ is the matrix composed of the high-level statement vector representations; MultiHead is the multi-head self-attention method, which can map a statement vector representation to another fixed-length statement vector; $\mathrm{head}_p$ is the $p$-th head function of the multi-head attention method and $a$ is the total number of heads; $X^{se}$ is the statement vector matrix of the slice code segment obtained after self-attention extraction; $Q$, $K$, $V$ represent the query, key, and value matrices of the Attention function, and because a self-attention mechanism is adopted, $Q$, $K$, $V$ are all the statement vector matrix $E$; $c$ is the vector dimension; $W^{O}$, $W_p^{Q}$, $W_p^{K}$, $W_p^{V}$ are learnable weight matrices.
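Step 51 can be realized directly with PyTorch's built-in multi-head attention, as in the sketch below; the head count and dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

c = 256                                    # statement vector dimension
attn = nn.MultiheadAttention(embed_dim=c, num_heads=4, batch_first=True)

E = torch.randn(2, 10, c)                  # statement matrix (batch, L, c)
X_se, _ = attn(E, E, E)                    # self-attention: Q = K = V = E
print(X_se.shape)                          # torch.Size([2, 10, 256])
```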
Step 52: compute the global semantic information using the text attention mechanism and the statement vector matrix obtained in step 51. The specific calculation formula is as follows:
$$u_{txt} = \omega u_{rand} + b$$
$$\alpha_i = \frac{\exp\!\left(x_i^{se} \cdot u_{txt}\right)}{\sum_{j=1}^{L} \exp\!\left(x_j^{se} \cdot u_{txt}\right)}$$
$$g = \sum_{i=1}^{L} \alpha_i \, x_i^{se}$$

where $u_{rand}$ is a randomly initialized vector that yields $u_{txt}$ after a linear layer; $x_i^{se}$ is the statement vector obtained in step 51, which combined with $u_{txt}$ is used to compute the text attention value $\alpha_i$; $g$ is the global semantic information finally obtained in this step; $\omega$ is a learnable weight parameter and $b$ is a bias term.
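A minimal sketch of the text attention of step 52, under illustrative dimensions: a randomly initialized vector passes through a linear layer to give the context vector u_txt, which scores each statement; the softmax-weighted sum of the statement vectors yields the global semantic vector g.

```python
import torch
import torch.nn as nn

class TextAttention(nn.Module):
    def __init__(self, c=256):
        super().__init__()
        self.u_rand = nn.Parameter(torch.randn(c))  # randomly initialized
        self.linear = nn.Linear(c, c)               # u_txt = w * u_rand + b

    def forward(self, X_se):                        # X_se: (batch, L, c)
        u_txt = self.linear(self.u_rand)
        alpha = torch.softmax(X_se @ u_txt, dim=1)  # text attention values
        return (alpha.unsqueeze(-1) * X_se).sum(1)  # g: (batch, c)

g = TextAttention()(torch.randn(2, 10, 256))
print(g.shape)  # torch.Size([2, 256])
```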
Step 6: taking the vulnerability-related global semantic information obtained in step 5 and the sequence of high-level statement vector representations obtained in step 4 as input, feed them into the decoder network of the seq2seq deep learning model to learn long-range dependency information between statements and generate the final vector representation of each statement. The specific steps are as follows:
Step 61: in the decoder network, a BiGRU is used as the main model, but the computation of the network unit differs from that of a traditional GRU. The specific calculation formula is as follows:
$$z' = \sigma\!\left(W'_z \cdot e_i + U'_z \cdot h'_{i-1} + C_z g + b'_z\right)$$
$$r' = \sigma\!\left(W'_r \cdot e_i + U'_r \cdot h'_{i-1} + C_r g + b'_r\right)$$
$$\tilde{h}'_i = \tanh\!\left(W'_h \cdot e_i + U'_h \cdot (r' \odot h'_{i-1}) + C_h g + b'_h\right)$$
$$h'_i = (1 - z') \odot h'_{i-1} + z' \odot \tilde{h}'_i$$

The prime mark (') distinguishes these parameters from the similar parameters in the encoder: $z'$ and $r'$ represent the update gate and reset gate respectively; $e_i$ is the high-level vector representation of the $i$-th statement in the slice code segment, obtained by the program encoding network in step 4; $g$ is the global semantic information vector obtained in step 5; $h'_i$ and $\tilde{h}'_i$ represent the hidden state and intermediate temporary state of the generated $i$-th statement; $\sigma$ is the activation function; $W'_z$, $W'_r$, $W'_h$, $U'_z$, $U'_r$, $U'_h$, $C_z$, $C_r$, $C_h$ are learnable weight parameters; $b'_z$, $b'_r$, $b'_h$ are bias terms.
Step 62: denoting the GRU unit modified in step 61 as $\mathrm{GRU}_D$, the formula of the decoder network is as follows:

$$\overrightarrow{h'_i} = \overrightarrow{\mathrm{GRU}_D}(e_i, g), \quad i \in [1, L]$$
$$\overleftarrow{h'_i} = \overleftarrow{\mathrm{GRU}_D}(e_i, g), \quad i \in [L, 1]$$
$$d_i = \left[\overrightarrow{h'_i} \,\|\, \overleftarrow{h'_i}\right]$$

where $\overrightarrow{h'_i}$ is the forward hidden state of the statement produced by the forward decoder GRU unit $\overrightarrow{\mathrm{GRU}_D}$, $\overleftarrow{h'_i}$ is the backward hidden state, and $d_i$ is the final vector representation of the statement obtained in this step.
Step 7: feed the final vector representation of each statement obtained in step 6 into a detector network consisting of a multilayer perceptron (MLP) and a softmax layer to obtain a prediction of whether the statement contains a vulnerability, compute a cross-entropy loss using the statement label information, and adjust the network parameters via error back-propagation until the loss no longer decreases, completing training.
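A minimal sketch of the step 7 detector and training objective, with illustrative shapes and hyperparameters: an MLP over each statement's final vector $d_i$, with the softmax folded into PyTorch's cross-entropy loss over the per-statement labels.

```python
import torch
import torch.nn as nn

detector = nn.Sequential(
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 2),                    # logits; softmax applied in the loss
)
criterion = nn.CrossEntropyLoss()        # combines log-softmax and NLL
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-3)

D = torch.randn(2, 10, 256)              # final statement vectors d_i
labels = torch.randint(0, 2, (2, 10))    # 1 = vulnerable statement

logits = detector(D)                     # (batch, L, 2)
loss = criterion(logits.view(-1, 2), labels.view(-1))
loss.backward()                          # error back-propagation
optimizer.step()
```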
Step 8: perform statement-level fine-grained vulnerability detection on code using the trained model. The specific steps are as follows:
and after the sample to be detected is input into the model, the probability that each statement in the program contains a bug can be output, when the probability is less than 0.5, the statement is a non-bug statement, otherwise, the statement is a bug statement.
Example:
taking the code bug shown in fig. 2 as an example, two statements in the code are bug statements, which are respectively line 8 and line 13. In a seq2seq model testing stage, firstly, a word2vec model is established through a source code corpus in a preprocessing stage to obtain a vector representation corresponding to token of each statement. Then, a program slicing technology and a static analysis tool are used for extracting slicing code segments in a source code, vector representation of token sequences in the slicing code segments is used as input of an encoder, the vector representation passes through an encoder network, an attention mechanism module, a decoder network and a detector network, and finally a vulnerability prediction result of each statement is obtained.

Claims (9)

1. A fine-grained vulnerability detection method based on seq2seq code representation learning, characterized by comprising the following steps:
Step 1: analyze the source code with a static analysis tool to generate an abstract syntax tree and a program dependency graph;
Step 2: use the abstract syntax tree to extract vulnerability candidate key nodes of the source code as slicing criteria, generate slice code segments of the source code, and normalize the slice code segments to obtain name-normalized slice code segments;
Step 3: use the statement encoding network in the seq2seq deep learning model encoder to perform representation learning on the token sequences of the statements in the slice code segments, generating primary statement vector representations that contain local semantic information;
Step 4: taking the sequence formed by the primary vector representations of the statements obtained in step 3 as input, use the program encoding network in the seq2seq encoder to perform representation learning on the statement sequence, generating high-level statement vector representations that contain statement context information;
Step 5: feed the sequence of high-level statement vector representations obtained in step 4 into a dual attention module based on self-attention and text attention, learn the dependencies between statements through self-attention, and then learn the vulnerability-related global semantic information through text attention;
Step 6: taking the vulnerability-related global semantic information obtained in step 5 and the sequence of high-level statement vector representations obtained in step 4 as input, feed them into the decoder network of the seq2seq deep learning model to learn long-range dependency information between statements and generate the final vector representation of each statement;
Step 7: feed the final vector representation of each statement obtained in step 6 into a detector network consisting of a multilayer perceptron (MLP) and a softmax layer to obtain a prediction of whether the statement contains a vulnerability, compute a cross-entropy loss using the statement label information, and adjust the network parameters via error back-propagation until the loss no longer decreases, completing training;
Step 8: use the trained model to perform statement-level fine-grained vulnerability detection on code.
2. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 1, wherein the specific steps of step 3 are as follows:
Step 31: split the statements in the slice code segment into tokens, and use a pre-trained word2vec word embedding model to obtain the vector representation of each token, forming a token vector matrix;
Step 32: feed the token vector matrix generated in step 31 into a statement encoding network implemented with a GRU, learn the hidden vector representation of each token, and obtain the primary statement vector representation by weighted summation of the hidden vector representations of all tokens with learnable weights.
3. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 2, wherein the specific calculation formula of step 32 is as follows:
$$z = \sigma(W_z \cdot w_t + U_z \cdot h_{t-1} + b_z)$$
$$r = \sigma(W_r \cdot w_t + U_r \cdot h_{t-1} + b_r)$$
$$\tilde{h}_t = \tanh\!\left(W_h \cdot w_t + U_h \cdot (r \odot h_{t-1}) + b_h\right)$$
$$h_t = (1 - z) \odot h_{t-1} + z \odot \tilde{h}_t$$
$$x = \sum_{t=1}^{n} U_t \cdot h_t$$

where $z$ and $r$ represent the update gate and reset gate respectively, $\sigma$ is the activation function, $w_t$ is the initial vector representation of the $t$-th token in the statement, $h_t$ and $\tilde{h}_t$ represent the hidden state and intermediate temporary state of the $t$-th token, $W_z$, $W_r$, $W_h$, $U_z$, $U_r$, $U_h$, $U_t$ are learnable weight parameters, $b_z$, $b_r$, $b_h$ are bias terms, $x$ is the statement vector representation, and $n$ is the total number of tokens in the statement.
4. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 1, wherein the specific steps of step 4 are as follows:
Step 41: pad the vector representations of the statements in the slice code segment to obtain an initialized statement vector matrix composed of statement vectors;
Step 42: feed the initialized statement vector matrix generated in step 41 into a program encoding network implemented with a BiGRU, and learn the hidden vector representation of each statement in the program.
5. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 4, wherein the specific calculation formula of step 42 is as follows:
$$\overrightarrow{h_i} = \overrightarrow{\mathrm{GRU}}(x_i), \quad i \in [1, L]$$
$$\overleftarrow{h_i} = \overleftarrow{\mathrm{GRU}}(x_i), \quad i \in [L, 1]$$
$$e_i = \left[\overrightarrow{h_i} \,\|\, \overleftarrow{h_i}\right]$$

where $x_i$ is the vector representation of the $i$-th statement in the slice code segment and $L$ is the total number of statements in the slice code segment; $\overrightarrow{h_i}$ is the forward hidden state of the statement produced by the forward GRU unit $\overrightarrow{\mathrm{GRU}}$, $\overleftarrow{h_i}$ is the backward hidden state, and $e_i$ is the high-level statement vector representation.
6. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 1, wherein the specific steps of step 5 are as follows:
Step 51: learn the dependencies between statements through a self-attention mechanism to obtain the hidden vector matrix of the statements;
Step 52: compute the global semantic information using the text attention mechanism and the statement vector matrix obtained in step 51.
7. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 6, wherein the specific calculation formula of step 51 is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{c}}\right) V$$
$$\mathrm{MultiHead}(Q, K, V) = \left(\mathrm{head}_1 \,\|\, \mathrm{head}_2 \,\|\, \cdots \,\|\, \mathrm{head}_a\right) W^{O}$$
$$\mathrm{head}_p = \mathrm{Attention}\!\left(Q W_p^{Q},\, K W_p^{K},\, V W_p^{V}\right)$$
$$X^{se} = \mathrm{MultiHead}(E, E, E)$$

where $E = [e_1, e_2, \ldots, e_L]$ is the matrix composed of the high-level statement vector representations; MultiHead is the multi-head self-attention method, $\mathrm{head}_p$ is the $p$-th head function of the multi-head attention method, and $a$ is the total number of heads; $X^{se}$ is the statement vector matrix of the slice code segment obtained after self-attention extraction; $Q$, $K$, $V$ denote the query, key, and value matrices of the Attention function respectively; $c$ is the vector dimension; $W^{O}$, $W_p^{Q}$, $W_p^{K}$, $W_p^{V}$ are learnable weight matrices.
8. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 6, wherein the specific calculation formula of step 52 is as follows:
$$u_{txt} = \omega u_{rand} + b$$
$$\alpha_i = \frac{\exp\!\left(x_i^{se} \cdot u_{txt}\right)}{\sum_{j=1}^{L} \exp\!\left(x_j^{se} \cdot u_{txt}\right)}$$
$$g = \sum_{i=1}^{L} \alpha_i \, x_i^{se}$$

where $u_{rand}$ is a randomly initialized vector that yields $u_{txt}$ after a linear layer; $x_i^{se}$ is the statement vector obtained in step 51, which combined with $u_{txt}$ is used to compute the text attention value $\alpha_i$; $g$ is the global semantic information; $\omega$ is a learnable weight parameter and $b$ is a bias term.
9. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 1, wherein the specific steps of step 6 are as follows:
Step 61: in the decoder network, using a BiGRU as the main model, the specific calculation formula of the network unit is as follows:
$$z' = \sigma\!\left(W'_z \cdot e_i + U'_z \cdot h'_{i-1} + C_z g + b'_z\right)$$
$$r' = \sigma\!\left(W'_r \cdot e_i + U'_r \cdot h'_{i-1} + C_r g + b'_r\right)$$
$$\tilde{h}'_i = \tanh\!\left(W'_h \cdot e_i + U'_h \cdot (r' \odot h'_{i-1}) + C_h g + b'_h\right)$$
$$h'_i = (1 - z') \odot h'_{i-1} + z' \odot \tilde{h}'_i$$

where $z'$ and $r'$ represent the update gate and reset gate respectively; $e_i$ is the high-level vector representation of the $i$-th statement in the slice code segment and $g$ is the global semantic information vector; $h'_i$ and $\tilde{h}'_i$ represent the hidden state and intermediate temporary state of the generated $i$-th statement; $\sigma$ is the activation function; $W'_z$, $W'_r$, $W'_h$, $U'_z$, $U'_r$, $U'_h$, $C_z$, $C_r$, $C_h$ are learnable weight parameters; $b'_z$, $b'_r$, $b'_h$ are bias terms;
Step 62: denoting the GRU unit modified in step 61 as $\mathrm{GRU}_D$, the formula of the decoder network is as follows:

$$\overrightarrow{h'_i} = \overrightarrow{\mathrm{GRU}_D}(e_i, g), \quad i \in [1, L]$$
$$\overleftarrow{h'_i} = \overleftarrow{\mathrm{GRU}_D}(e_i, g), \quad i \in [L, 1]$$
$$d_i = \left[\overrightarrow{h'_i} \,\|\, \overleftarrow{h'_i}\right]$$

where $\overrightarrow{h'_i}$ is the forward hidden state of the statement produced by the forward decoder GRU unit $\overrightarrow{\mathrm{GRU}_D}$, $\overleftarrow{h'_i}$ is the backward hidden state, and $d_i$ is the final vector representation of the statement obtained in this step.
CN202210700763.1A 2022-06-20 2022-06-20 Fine-grained vulnerability detection method based on seq2seq code representation learning Pending CN114969763A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210700763.1A CN114969763A (en) 2022-06-20 2022-06-20 Fine-grained vulnerability detection method based on seq2seq code representation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210700763.1A CN114969763A (en) 2022-06-20 2022-06-20 Fine-grained vulnerability detection method based on seq2seq code representation learning

Publications (1)

Publication Number Publication Date
CN114969763A (en) 2022-08-30

Family

ID=82964430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210700763.1A Pending CN114969763A (en) 2022-06-20 2022-06-20 Fine-grained vulnerability detection method based on seq2seq code representation learning

Country Status (1)

Country Link
CN (1) CN114969763A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201917161D0 (en) * 2019-08-23 2020-01-08 Praetorian System and method for automatically detecting a security vulnerability in a source code using a machine learning model
CN111753303A (en) * 2020-07-29 2020-10-09 哈尔滨工业大学 Multi-granularity code vulnerability detection method based on deep learning and reinforcement learning
CN112699377A (en) * 2020-12-30 2021-04-23 哈尔滨工业大学 Function-level code vulnerability detection method based on slice attribute graph representation learning
CN114064487A (en) * 2021-11-18 2022-02-18 北京京航计算通讯研究所 Code defect detection method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DAM H, TRAN T, PHAM T: "Automatic Feature Learning for Vulnerability Prediction", Retrieved from the Internet <URL:http://arxiv.org/abs/1708.02368> *
WEINING ZHENG; YUAN JIANG; XIAOHONG SU: "Vu1SPG: Vulnerability detection based on slice property graph representation learning", 2021 IEEE 32ND INTERNATIONAL SYMPOSIUM ON SOFTWARE RELIABILITY ENGINEERING (ISSRE), 11 February 2022 (2022-02-11) *
ZOU DEQING: "Intelligent vulnerability detection system based on graph-structured source code slicing", Chinese Journal of Network and Information Security (网络与信息安全学报), 31 October 2021 (2021-10-31) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115422092A (en) * 2022-11-03 2022-12-02 杭州金衡和信息科技有限公司 Software bug positioning method based on multi-method fusion

Similar Documents

Publication Publication Date Title
CN110929030B (en) Text abstract and emotion classification combined training method
US20210232773A1 (en) Unified Vision and Dialogue Transformer with BERT
CN107506414B (en) Code recommendation method based on long-term and short-term memory network
CN112215013B (en) Clone code semantic detection method based on deep learning
CN112487807B (en) Text relation extraction method based on expansion gate convolutional neural network
CN109800434B (en) Method for generating abstract text title based on eye movement attention
CN110210032A (en) Text handling method and device
CN113672931B (en) Software vulnerability automatic detection method and device based on pre-training
CN113011191B (en) Knowledge joint extraction model training method
CN110427619B (en) Chinese text automatic proofreading method based on multi-channel fusion and reordering
CN116127952A (en) Multi-granularity Chinese text error correction method and device
Nagaraj et al. Kannada to English Machine Translation Using Deep Neural Network.
CN110569505A (en) text input method and device
CN117390141B (en) Agricultural socialization service quality user evaluation data analysis method
CN114064487A (en) Code defect detection method
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN114969763A (en) Fine-grained vulnerability detection method based on seq2seq code representation learning
CN116661805A (en) Code representation generation method and device, storage medium and electronic equipment
Ma et al. Progressive multi-task learning framework for chinese text error correction
Eyraud et al. TAYSIR Competition: Transformer+\textscrnn: Algorithms to Yield Simple and Interpretable Representations
CN117725211A (en) Text classification method and system based on self-constructed prompt template
Wakchaure et al. A scheme of answer selection in community question answering using machine learning techniques
CN112035347B (en) Automatic exception handling method for source code
Ansari et al. Hindi to English transliteration using multilayer gated recurrent units
CN116561323B (en) Emotion analysis method based on aspect word embedding graph convolution network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination