CN114969763A - Fine-grained vulnerability detection method based on seq2seq code representation learning - Google Patents
- Publication number
- CN114969763A (application CN202210700763.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F21/577—Assessing vulnerabilities and evaluating computer system security (G—PHYSICS; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F21/00—Security arrangements; G06F21/57—Certifying or maintaining trusted computer platforms)
- G06N3/045—Combinations of networks (G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/02—Neural networks; G06N3/04—Architecture, e.g. interconnection topology)
- G06N3/047—Probabilistic or stochastic networks
- G06N3/08—Learning methods
Abstract
The invention discloses a fine-grained vulnerability detection method based on seq2seq code representation learning. The method first extracts vulnerability candidate key nodes as slicing criteria, then uses program slicing technology to extract slice code segments from the program. A seq2seq-based deep learning model then performs representation learning on the slice code segments, generating a sequence of statement vector representations that captures long-range dependencies between statements; the vector representation of each statement in the sequence is fed to a detector, which decides whether the statement is a vulnerable statement. The method can fully exploit the global and local semantic information in the code to learn vulnerability-related features within and between statements, and avoids the difficulty that traditional deep learning classification models have in capturing long-range dependency information between a vulnerable statement and its context when performing code representation learning. The statement vector representation sequence generated by the seq2seq model's representation learning is well suited to statement-level fine-grained vulnerability detection.
Description
Technical Field
The invention relates to a fine-grained detection method for software vulnerabilities, and in particular to a fine-grained vulnerability detection method based on seq2seq code representation learning.
Background
A vulnerability is a security flaw in a computer system that not only threatens the system itself but also undermines the confidentiality, integrity and availability of application data, compromising overall system security. In recent years, growing system scale and complexity and the introduction of new technologies have increased the likelihood of software vulnerabilities, and automatic vulnerability detection has become an important means of reducing them. However, most current deep-learning-based vulnerability detection research targets coarse-grained levels such as files, functions or code fragments, and can only predict whether a file, function or fragment is likely to contain a vulnerability. In that situation, developers must locate the vulnerable statements manually, which is difficult, so vulnerabilities cannot be repaired promptly. A statement-level fine-grained detection method therefore helps developers understand and quickly repair vulnerabilities, and is the current trend in vulnerability detection research. Although a small body of work on fine-grained detection exists, its localization accuracy still needs improvement: the sparsity and discontinuity of vulnerability information, the context dependencies among vulnerable statements, and the complexity and concealment of vulnerability characteristics all pose challenges for fine-grained vulnerability detection.
Disclosure of Invention
The invention aims to provide a fine-grained vulnerability detection method based on seq2seq code representation learning that can fully exploit the global and local semantic information in code to learn vulnerability-related features within and between statements, avoids the difficulty traditional deep learning classification models have in capturing long-range dependency information between a vulnerable statement and its context, and uses the statement vector representation sequence generated by the seq2seq model's representation learning, making it well suited to statement-level fine-grained vulnerability detection.
The purpose of the invention is realized by the following technical scheme:
A fine-grained vulnerability detection method based on seq2seq code representation learning first extracts vulnerability candidate key nodes as slicing criteria, then uses program slicing technology to extract slice code segments from the program. A seq2seq-based deep learning model then performs representation learning on the slice code segments. Specifically, a statement encoding network in the seq2seq encoder first performs representation learning on the token sequences within the statements of a slice code segment, producing a primary vector representation of each statement that contains local semantic information. Taking the sequence formed by these primary statement vectors as input, a program encoding network in the encoder then performs representation learning on the statement sequence, producing a high-level vector representation of each statement that contains statement context information. Next, a dual attention mechanism based on self-attention and text attention learns, in tandem, the dependencies between statements and the vulnerability-related global semantic information. Finally, taking the high-level statement vectors produced by the encoder as input and combining them with the global semantic information, the decoder network of the seq2seq model generates the final statement vector representation sequence containing long-range dependencies between statements; the final vector representation of each statement in the sequence is fed into a detector, which determines whether each statement is a vulnerable statement. The method specifically comprises the following steps:
Step 1: analyzing the source code using a static analysis tool to generate an abstract syntax tree and a program dependence graph;
Step 2: extracting vulnerability candidate key nodes of the source code from the abstract syntax tree to serve as slicing criteria, generating slice code segments of the source code, and normalizing the slice code segments to obtain name-normalized slice code segments;
Step 3: using the statement encoding network in the seq2seq deep learning model encoder to perform representation learning on the token sequences within the slice code segment statements, generating statement primary vector representations containing local semantic information;
Step 4: taking the statement sequence formed by the primary vector representations of the statements obtained in step 3 as input, performing representation learning on the statement sequence using the program encoding network in the seq2seq encoder, and generating statement high-level vector representations containing statement context information;
Step 5: feeding the statement sequence formed by the high-level vector representations obtained in step 4 into a dual attention module based on self-attention and text attention, learning the dependencies between statements through self-attention, and then learning vulnerability-related global semantic information through text attention;
Step 6: taking the vulnerability-related global semantic information obtained in step 5 and the statement sequence formed by the high-level vector representations obtained in step 4 as input, feeding them into the decoder network of the seq2seq deep learning model to learn long-range dependency information between statements, and generating the final vector representation of each statement;
Step 7: sending the final vector representation of each statement obtained in step 6 into a detector network consisting of a multilayer perceptron (MLP) and a softmax layer to obtain a prediction of whether the statement is vulnerable, computing the cross-entropy loss using the statement label information, and adjusting the network parameters by error back-propagation until the loss no longer decreases, at which point training is complete;
Step 8: performing statement-level fine-grained vulnerability detection on code using the trained model.
In the seq2seq-based deep learning model used for code representation learning, the statement encoding network in the encoder extracts the dependencies between tokens within a statement to obtain local semantic information; the program encoding network learns the context information between statements; the dual attention mechanism acquires vulnerability-related global semantic information; the decoder network captures long-range dependencies between statements; and the detector network outputs the prediction of whether a statement is vulnerable.
Existing vulnerability detection methods realize detection through code representation learning with deep learning models such as convolutional neural networks, recurrent neural networks or graph neural networks. Unlike these methods, the invention provides a fine-grained vulnerability detection method based on seq2seq code representation learning, using a seq2seq model for the statement-level fine-grained detection task for the first time. Unlike traditional deep-learning-based binary classification models, the seq2seq model commonly used in machine translation can directly realize sequence-to-sequence mapping. The model consists of an encoder and a decoder: the encoder encodes a variable-length input sequence of program statements into fixed-length statement high-level vector representations, and the decoder re-decodes the encoder output into a variable-length sequence of program statement vector representations. In terms of model structure, the generated sequence can therefore be applied directly to statement-level vulnerability detection. Moreover, the seq2seq model can consider both the local and the global semantic information of the statements in a sample program. Because vulnerable statements depend heavily on their context, conventional RNN or CNN models, which extract only short-range dependencies between adjacent statements, struggle to capture long-range dependency information between statements that are far apart.
In contrast, the seq2seq model encodes all vulnerability-related statements in the sample program and uses them, together with the global semantic information acquired by the dual attention mechanism, to guide the decoder's generation of the statement vector sequence. A more accurate statement vector representation sequence can thus be generated, and the sequence generation result can be used directly for statement-level fine-grained vulnerability detection.
Compared with the prior art, the invention has the following advantages:
(1) Using the seq2seq deep learning model to directly perform representation learning on the slice code segments generated by program slicing makes full use of the local and global semantic information of vulnerable code, and the statement vector representations obtained from the sequence generation model are better suited to realizing statement-level fine-grained vulnerability detection.
(2) The method introduces a dual attention mechanism based on self-attention and text attention between the encoder and the decoder of the seq2seq model, so that the importance of each program statement to the vulnerability is learned along with the global semantic information, improving the accuracy of statement-level fine-grained vulnerability detection.
Drawings
Fig. 1 is a schematic flow diagram of a source code fine-grained vulnerability detection method according to the present invention.
FIG. 2 is an example of vulnerability code.
FIG. 3 is a statement-level fine-grained vulnerability detection process.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but not limited thereto, and any modification or equivalent replacement of the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention shall be covered by the protection scope of the present invention.
The invention provides a fine-grained vulnerability detection method based on seq2seq code representation learning. The method first extracts vulnerability candidate key nodes as slicing criteria and then uses program slicing technology to extract slice code segments from the program. Next, it performs representation learning on the slice code segments with a seq2seq deep learning model, converting the statement representation learning problem into a statement sequence generation problem and realizing end-to-end learning of the code's global and local semantic information. An encoder consisting of a statement encoding network and a program encoding network performs staged representation learning on the statements of the slice code segments: the statement encoding network extracts the dependencies between tokens within a statement and learns local semantic information, after which the program encoding network learns the context information between statements. To better learn the vulnerability-related global semantics, the invention introduces a dual attention mechanism based on self-attention and text attention between the encoder and the decoder: the self-attention mechanism learns the dependencies between the statements of the program, and the text attention mechanism acquires vulnerability-related global semantic information. Finally, the decoder network, combining the global semantic information, learns the long-range dependency information between statements to obtain the final vector representation of each statement, which is fed into the detector to realize statement-level fine-grained vulnerability detection. As shown in fig. 1, the specific steps are as follows:
Step 1: analyzing the source code using a static analysis tool to generate an abstract syntax tree and a program dependence graph.
Step 2: extracting vulnerability candidate key nodes of the source code from the abstract syntax tree to serve as slicing criteria, generating slice code segments of the source code, and normalizing the slice code segments to obtain name-normalized slice code segments;
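As an illustrative sketch of key-node extraction in step 2, the snippet below walks an abstract syntax tree and collects calls to risky functions as candidate slicing criteria. Python's `ast` module stands in for the static analyzer the patent would use on real source code, and `RISKY_CALLS` is a hypothetical list; the patent does not name the criteria functions.

```python
import ast

# Hypothetical, illustrative list of risky calls; the patent does not
# specify which API calls serve as vulnerability candidate key nodes.
RISKY_CALLS = {"strcpy", "memcpy", "system", "eval", "exec"}

def candidate_key_nodes(source: str):
    """Walk the abstract syntax tree and return (line, name) pairs for
    calls that could serve as slicing criteria (candidate key nodes)."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            fn = node.func
            name = fn.id if isinstance(fn, ast.Name) else getattr(fn, "attr", None)
            if name in RISKY_CALLS:
                hits.append((node.lineno, name))
    return hits
```

Each returned location would then seed a program slice via the program dependence graph.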
Step 3: performing representation learning on the token sequences within the slice code segment statements using the statement encoding network in the seq2seq deep learning model encoder, generating statement primary vector representations containing local semantic information. The specific steps are as follows:
Step 31: splitting the statements of the slice code segment into tokens and obtaining the vector representation of each token with a pre-trained word2vec word embedding model, forming a token vector matrix.
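As an illustration of step 31, the sketch below tokenizes one statement and builds its token vector matrix. A deterministic hash-seeded random vector stands in for the pre-trained word2vec embedding; only the matrix shape and the token-to-vector determinism are faithful to the step.

```python
import re
import numpy as np

def tokenize(statement: str):
    """Split one code statement into tokens: identifiers, numbers,
    and single punctuation/operator characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", statement)

def token_matrix(statement: str, dim: int = 32):
    """Build the token vector matrix for one statement. The hash-seeded
    random vector is a stand-in for a word2vec lookup (assumption)."""
    rows = []
    for tok in tokenize(statement):
        seed = int.from_bytes(tok.encode(), "little") % (2**32)
        rng = np.random.default_rng(seed)   # same token -> same vector
        rows.append(rng.normal(0.0, 0.1, dim))
    return np.stack(rows)                   # shape: (n_tokens, dim)
```

In the actual method a word2vec model trained on the source code corpus would supply the vectors.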
Step 32: sending the token vector matrix generated in step 31 into a statement encoding network implemented with a GRU, learning the hidden vector representation of each token, and computing a weighted sum of all token hidden vectors through learnable weights to obtain the primary vector representation of the statement.
The specific calculation formula is as follows:
z = σ(W_z · w_t + U_z · h_{t-1} + b_z)
r = σ(W_r · w_t + U_r · h_{t-1} + b_r)
h̃_t = tanh(W_h · w_t + U_h · (r ⊙ h_{t-1}) + b_h)
h_t = (1 - z) ⊙ h_{t-1} + z ⊙ h̃_t
x = Σ_{t=1}^{n} U_t · h_t
where z and r denote the update gate and the reset gate respectively, σ is the activation function, w_t is the initial vector representation of the t-th token in the statement, h_t and h̃_t are the hidden state and the intermediate temporary state of the t-th token, W_z, W_r, W_h, U_z, U_r, U_h, U_t are learnable weight parameters, b_z, b_r, b_h are bias terms, x is the statement vector representation, and n is the total number of tokens in the statement.
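The GRU-based statement encoder of step 32 can be sketched in plain numpy as follows. The weights are randomly initialized stand-ins for learned parameters, and the final weighted sum over token hidden states is sketched as softmax attention scores, which is our assumption; the patent only states that the weights are learnable.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(d_in, d_h, seed=0):
    """Parameters matching the symbols in the formulas (W*, U*, b*)."""
    r = np.random.default_rng(seed)
    p = {k: r.normal(0, 0.1, (d_h, d_in)) for k in ("Wz", "Wr", "Wh")}
    p.update({k: r.normal(0, 0.1, (d_h, d_h)) for k in ("Uz", "Ur", "Uh")})
    p.update({k: np.zeros(d_h) for k in ("bz", "br", "bh")})
    p["u"] = r.normal(0, 0.1, d_h)   # scoring vector for the weighted sum
    return p

def encode_statement(p, tokens):
    """tokens: (n, d_in) token vectors of one statement.
    Returns the statement primary vector x (weighted sum of the token
    hidden states)."""
    h = np.zeros(p["Uz"].shape[0])
    hs = []
    for w_t in tokens:
        z = sigmoid(p["Wz"] @ w_t + p["Uz"] @ h + p["bz"])        # update gate
        r = sigmoid(p["Wr"] @ w_t + p["Ur"] @ h + p["br"])        # reset gate
        h_tilde = np.tanh(p["Wh"] @ w_t + p["Uh"] @ (r * h) + p["bh"])
        h = (1 - z) * h + z * h_tilde                             # hidden state
        hs.append(h)
    H = np.stack(hs)                        # (n, d_h) token hidden states
    a = np.exp(H @ p["u"]); a /= a.sum()    # learnable weights as softmax scores
    return a @ H                            # statement vector x, shape (d_h,)
```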
Step 4: feeding the statement sequence formed by the primary vector representations obtained in step 3 into the program encoding network in the seq2seq deep learning model encoder, and learning the statement high-level vector representations containing statement context information. The specific steps are as follows:
Step 41: padding the vector representations of the statements in the slice code segment to obtain an initialized statement vector matrix composed of the statement vectors.
Step 42: sending the initialized statement vector matrix generated in step 41 into a program encoding network implemented with a BiGRU, and learning the hidden vector representations of the statements in the program.
The specific calculation formula is as follows:
h→_i = GRU→(x_i, h→_{i-1}),  i = 1, 2, …, L
h←_i = GRU←(x_i, h←_{i+1})
e_i = h→_i || h←_i
where x_i is the vector representation of the i-th statement in the slice code segment, obtained by the statement encoding network in step 3; L is the total number of statements in the slice code segment; h→_i is the forward hidden state of the statement produced by the forward GRU unit and h←_i the backward hidden state produced by the backward GRU unit; and e_i is the statement high-level vector representation finally obtained in this step.
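A minimal numpy sketch of the BiGRU program encoding network of step 42: a forward and a backward GRU pass over the statement sequence, with the two hidden states concatenated into e_i. The randomly initialized weights stand in for learned ones.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(p, x, h):
    """One GRU update (same gating as the statement encoder)."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    return (1 - z) * h + z * h_tilde

def init_gru(d_in, d_h, seed):
    r = np.random.default_rng(seed)
    p = {k: r.normal(0, 0.1, (d_h, d_in)) for k in ("Wz", "Wr", "Wh")}
    p.update({k: r.normal(0, 0.1, (d_h, d_h)) for k in ("Uz", "Ur", "Uh")})
    p.update({k: np.zeros(d_h) for k in ("bz", "br", "bh")})
    return p

def bigru_encode_program(X, d_h=8):
    """X: (L, d_in) statement primary vectors. Returns (L, 2*d_h):
    e_i = forward hidden state || backward hidden state."""
    L, d_in = X.shape
    pf, pb = init_gru(d_in, d_h, 0), init_gru(d_in, d_h, 1)
    hf = np.zeros(d_h); fwd = []
    for x in X:                      # forward pass over statements
        hf = gru_step(pf, x, hf); fwd.append(hf)
    hb = np.zeros(d_h); bwd = [None] * L
    for i in range(L - 1, -1, -1):   # backward pass over statements
        hb = gru_step(pb, X[i], hb); bwd[i] = hb
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])
```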
Step 5: feeding the statement sequence formed by the high-level vector representations obtained in step 4 into the dual attention module based on self-attention and text attention, learning the dependencies between statements through self-attention, and learning vulnerability-related global semantic information through text attention. The specific steps are as follows:
Step 51: learning the dependencies between statements through the self-attention mechanism to obtain the hidden vector matrix of the statements. The specific calculation formulas are as follows:
Attention(Q, K, V) = softmax(Q·Kᵀ/√c)·V
head_p = Attention(E·W_p^Q, E·W_p^K, E·W_p^V)
MultiHead(Q, K, V) = (head_1 || head_2 || … || head_a)·W^O
X_se = MultiHead(E, E, E)
where E = [e_1, e_2, …, e_L] is the matrix composed of the statement high-level vector representations; MultiHead is the multi-head self-attention function, which maps a statement vector representation to another fixed-length statement vector; head_p is the p-th head function in the multi-head attention and a is the total number of heads; X_se ∈ R^{L×c} is the statement vector matrix of the slice code segment obtained after self-attention; Q, K and V denote the query, key and value matrices of the Attention function, and because self-attention is used, Q = K = V = E; c is the vector dimension; and W_p^Q, W_p^K, W_p^V, W^O are learnable weight matrices.
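The multi-head self-attention of step 51 can be sketched as below. Head count, dimensions and the random weight matrices are illustrative stand-ins for learned parameters; per head, scaled dot-product attention is applied to a slice of the projected statement matrix, and the heads are concatenated and projected by W_O.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(E, a=4, seed=0):
    """E: (L, c) statement high-level vectors; a: number of heads.
    Computes MultiHead(E, E, E) and returns X_se of shape (L, c)."""
    L, c = E.shape
    assert c % a == 0, "head count must divide the vector dimension"
    dk = c // a
    r = np.random.default_rng(seed)
    Wq, Wk, Wv, Wo = (r.normal(0, 0.1, (c, c)) for _ in range(4))
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    heads = []
    for p in range(a):
        s = slice(p * dk, (p + 1) * dk)
        A = softmax(Q[:, s] @ K[:, s].T / np.sqrt(dk))  # (L, L) statement scores
        heads.append(A @ V[:, s])
    return np.concatenate(heads, axis=1) @ Wo           # X_se, (L, c)
```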
Step 52: computing the global semantic information using the text attention mechanism and the statement vector matrix obtained in step 51. The specific calculation formulas are as follows:
u_txt = ω · u_rand + b
α_i = exp(x_i^se · u_txt) / Σ_{j=1}^{L} exp(x_j^se · u_txt)
g = Σ_{i=1}^{L} α_i · x_i^se
where u_rand is a randomly initialized vector that yields u_txt after passing through a linear layer; x_i^se is the i-th statement vector obtained in step 51, which combined with u_txt gives the text attention value α_i; g is the global semantic information finally obtained in this step; ω is a learnable weight parameter and b is a bias term.
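A small numpy sketch of the text attention pooling of step 52: a randomly initialized u_rand passes through a linear layer to give u_txt, the attention weights α_i over the statements are softmax scores, and g is the weighted sum. The weights here are random stand-ins for learned parameters.

```python
import numpy as np

def text_attention_global(X_se, seed=0):
    """X_se: (L, c) statement vectors after self-attention.
    Returns the global semantic vector g of shape (c,)."""
    L, c = X_se.shape
    r = np.random.default_rng(seed)
    u_rand = r.normal(0, 0.1, c)                 # randomly initialized vector
    omega, b = r.normal(0, 0.1, (c, c)), np.zeros(c)
    u_txt = omega @ u_rand + b                   # linear layer
    scores = X_se @ u_txt                        # per-statement relevance
    alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()
    return alpha @ X_se                          # g = sum_i alpha_i * x_i^se
```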
Step 6: taking the vulnerability-related global semantic information obtained in step 5 and the statement sequence formed by the high-level vector representations obtained in step 4 as input, feeding them into the decoder network of the seq2seq deep learning model to learn the long-range dependency information between statements, and generating the final vector representation of each statement. The specific steps are as follows:
step 61: in a decoder network, BiGRU is used as a main model, but the calculation method of the network unit is different from that of the traditional GRU, and the specific calculation formula is as follows:
z'=σ(W' Z ·e i +U' Z ·h' (i-1) +C z g+b' z )
r′=σ(W′ r ·e i +U′ r ·h′ (i-1) +C r g+b′ r )
we use- ' to distinguish between similar parameters as in the encoder, z ' and r ' represent the update gate and reset gate, respectively, e i The high-level vector representation of the ith statement in the slice code segment is obtained by the program coding network in the step 4, and g is the global semantic information vector h 'obtained in the step 5' i Andrespectively representing the hidden state and the intermediate temporary state of the generated ith statement, sigma being an activation function, W' Z 、W' r 、W' h 、U' Z 、U' r 、U' h 、C z 、C r 、C h Is a learnable weight parameter, b' z 、b' r 、b' h Is the bias term.
Step 62: denoting the GRU unit modified in step 61 as GRU_D, the decoder network is formulated as follows:
h'→_i = GRU_D→(e_i, g, h'→_{i-1})
h'←_i = GRU_D←(e_i, g, h'←_{i+1})
d_i = h'→_i || h'←_i
where h'→_i is the forward hidden state of the statement produced by the forward decoder GRU_D unit, h'←_i is the backward hidden state produced by the backward unit, and d_i is the final vector representation of the statement obtained in this step.
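The decoder of steps 61 and 62 can be sketched as follows: a GRU cell with the extra C·g terms injecting the global semantic vector, run bidirectionally, with d_i the concatenation of the forward and backward hidden states. Weights are random stand-ins for learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_dec(d_in, d_h, d_g, seed):
    r = np.random.default_rng(seed)
    p = {k: r.normal(0, 0.1, (d_h, d_in)) for k in ("Wz", "Wr", "Wh")}
    p.update({k: r.normal(0, 0.1, (d_h, d_h)) for k in ("Uz", "Ur", "Uh")})
    p.update({k: r.normal(0, 0.1, (d_h, d_g)) for k in ("Cz", "Cr", "Ch")})
    p.update({k: np.zeros(d_h) for k in ("bz", "br", "bh")})
    return p

def gru_d_step(p, e, g, h):
    """GRU_D update: standard GRU gating plus the C*g global-context terms."""
    z = sigmoid(p["Wz"] @ e + p["Uz"] @ h + p["Cz"] @ g + p["bz"])
    r = sigmoid(p["Wr"] @ e + p["Ur"] @ h + p["Cr"] @ g + p["br"])
    h_tilde = np.tanh(p["Wh"] @ e + p["Uh"] @ (r * h) + p["Ch"] @ g + p["bh"])
    return (1 - z) * h + z * h_tilde

def decode(Es, g, d_h=8):
    """Es: (L, d_in) statement high-level vectors, g: global semantic vector.
    Returns (L, 2*d_h): d_i = forward || backward hidden state."""
    L, d_in = Es.shape
    pf, pb = init_dec(d_in, d_h, g.size, 0), init_dec(d_in, d_h, g.size, 1)
    hf = np.zeros(d_h); fwd = []
    for e in Es:
        hf = gru_d_step(pf, e, g, hf); fwd.append(hf)
    hb = np.zeros(d_h); bwd = [None] * L
    for i in range(L - 1, -1, -1):
        hb = gru_d_step(pb, Es[i], g, hb); bwd[i] = hb
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])
```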
Step 7: sending the final vector representation of each statement obtained in step 6 into a detector network consisting of a multilayer perceptron (MLP) and a softmax layer to obtain the prediction of whether the statement is vulnerable, computing the cross-entropy loss using the statement label information, and adjusting the network parameters by error back-propagation until the loss no longer decreases, at which point training is complete.
Step 8: performing statement-level fine-grained vulnerability detection on code using the trained model, specifically as follows:
After a sample to be detected is input into the model, the model outputs the probability that each statement in the program contains a vulnerability; when the probability is less than 0.5 the statement is classified as non-vulnerable, otherwise it is classified as vulnerable.
Example:
Take the vulnerable code shown in fig. 2, in which two statements, lines 8 and 13, are vulnerable. In the testing stage of the seq2seq model, a word2vec model built from the source code corpus in the preprocessing stage first provides the vector representation of each statement's tokens. Program slicing and a static analysis tool then extract the slice code segments from the source code; the vector representations of the token sequences in the slice code segments serve as encoder input and pass through the encoder network, the attention module, the decoder network and the detector network, finally yielding the vulnerability prediction for each statement.
Claims (9)
1. A fine-grained vulnerability detection method based on seq2seq code representation learning is characterized by comprising the following steps:
step 1: analyzing the source code using a static analysis tool to generate an abstract syntax tree and a program dependence graph;
step 2: extracting vulnerability candidate key nodes of the source code from the abstract syntax tree to serve as slicing criteria, generating slice code segments of the source code, and normalizing the slice code segments to obtain name-normalized slice code segments;
step 3: using the statement encoding network in the seq2seq deep learning model encoder to perform representation learning on the token sequences within the slice code segment statements, generating statement primary vector representations containing local semantic information;
step 4: taking the statement sequence formed by the primary vector representations of the statements obtained in step 3 as input, performing representation learning on the statement sequence using the program encoding network in the seq2seq encoder, and generating statement high-level vector representations containing statement context information;
step 5: feeding the statement sequence formed by the high-level vector representations obtained in step 4 into a dual attention module based on self-attention and text attention, learning the dependencies between statements through self-attention, and then learning vulnerability-related global semantic information through text attention;
step 6: taking the vulnerability-related global semantic information obtained in step 5 and the statement sequence formed by the high-level vector representations obtained in step 4 as input, feeding them into the decoder network of the seq2seq deep learning model to learn long-range dependency information between statements, and generating the final vector representation of each statement;
step 7: sending the final vector representation of each statement obtained in step 6 into a detector network consisting of a multilayer perceptron (MLP) and a softmax layer to obtain a prediction of whether the statement is vulnerable, computing the cross-entropy loss using the statement label information, and adjusting the network parameters by error back-propagation until the loss no longer decreases, at which point training is complete;
step 8: performing statement-level fine-grained vulnerability detection on code using the trained model.
2. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 1, wherein the specific steps of step 3 are as follows:
step 31: splitting the statements of the slice code segment into tokens and obtaining the vector representation of each token with a pre-trained word2vec word embedding model, forming a token vector matrix;
step 32: sending the token vector matrix generated in step 31 into a statement encoding network implemented with a GRU, learning the hidden vector representation of each token, and computing a weighted sum of all token hidden vectors through learnable weights to obtain the primary vector representation of the statement.
3. The method for detecting the fine-grained vulnerability based on seq2seq code expression learning according to claim 2, wherein the specific calculation formula of the step 32 is as follows:
z = σ(W_z·w_t + U_z·h_(t−1) + b_z)
r = σ(W_r·w_t + U_r·h_(t−1) + b_r)
h̃_t = tanh(W_h·w_t + U_h·(r⊙h_(t−1)) + b_h)
h_t = (1−z)⊙h_(t−1) + z⊙h̃_t
x = Σ_(t=1)^n U_t·h_t
where z and r denote the update gate and the reset gate respectively, σ is the activation function, w_t is the initial vector representation of the t-th token in the statement, h_t and h̃_t denote the hidden state and the intermediate temporary state of the t-th token respectively, W_z, W_r, W_h, U_z, U_r, U_h, U_t are learnable weight parameters, b_z, b_r, b_h are bias terms, x is the statement vector representation, and n is the total number of tokens in the statement.
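A minimal numpy sketch of this statement-encoding GRU. The dimensions, the random initialization, and the treatment of U_t as per-position scalar weights for the weighted sum are illustrative assumptions, not fixed by the claim:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode_statement(tokens, P, u):
    """Run the GRU of step 32 over a statement's token vectors and return
    the weighted sum of all hidden states as the statement vector x.

    tokens: list of token vectors w_t; P: dict of weight/bias parameters;
    u: per-token weights for the weighted sum (stands in for U_t)."""
    h = np.zeros(P["Wz"].shape[0])
    hs = []
    for w in tokens:
        z = sigmoid(P["Wz"] @ w + P["Uz"] @ h + P["bz"])       # update gate
        r = sigmoid(P["Wr"] @ w + P["Ur"] @ h + P["br"])       # reset gate
        h_cand = np.tanh(P["Wh"] @ w + P["Uh"] @ (r * h) + P["bh"])
        h = (1 - z) * h + z * h_cand                           # hidden state
        hs.append(h)
    return sum(u_t * h_t for u_t, h_t in zip(u, hs))           # x = sum U_t·h_t
```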
4. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 1, wherein step 4 specifically comprises:
step 41: padding the vector representations of the statements in the sliced code segment to obtain an initialized statement vector matrix composed of statement vectors;
step 42: feeding the initialized statement vector matrix generated in step 41 into a program encoding network implemented with a BiGRU, and learning the hidden vector representation of each statement in the program.
5. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 4, wherein the calculation in step 42 is as follows:
→h_i = GRU_fwd(x_i, →h_(i−1)), i = 1, …, L
←h_i = GRU_bwd(x_i, ←h_(i+1)), i = L, …, 1
e_i = →h_i || ←h_i
where x_i is the vector representation of the i-th statement in the sliced code segment, L is the total number of statements in the sliced code segment, →h_i is the forward hidden state of the statement obtained through the forward GRU unit, ←h_i is the corresponding backward hidden state, and e_i is the high-level statement vector representation.
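The BiGRU program encoder of step 42 can be sketched in numpy as below; the GRU step is the standard cell, and all shapes and initializations are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, P):
    """Standard GRU update, shared by both directions."""
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h + P["bz"])
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h + P["br"])
    h_cand = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * h) + P["bh"])
    return (1 - z) * h + z * h_cand

def bigru_encode(statements, Pf, Pb):
    """Return e_i = [forward_h_i || backward_h_i] for each statement vector."""
    d = Pf["Wz"].shape[0]
    hf, fwd = np.zeros(d), []
    for x in statements:                 # forward pass: i = 1 .. L
        hf = gru_step(x, hf, Pf)
        fwd.append(hf)
    hb, bwd = np.zeros(d), []
    for x in reversed(statements):       # backward pass: i = L .. 1
        hb = gru_step(x, hb, Pb)
        bwd.append(hb)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```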
6. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 1, wherein step 5 specifically comprises:
step 51: learning the dependency relationships among the statements through a self-attention mechanism to obtain the hidden vector matrix of the statements;
step 52: computing the global semantic information using the text attention mechanism and the statement vector matrix obtained in step 51.
7. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 6, wherein the calculation in step 51 is as follows:
Attention(Q, K, V) = softmax(Q·K^T / √c)·V
head_p = Attention(E·W_p^Q, E·W_p^K, E·W_p^V)
MultiHead(Q, K, V) = (head_1 || head_2 || … || head_a)·W^O
X_se = MultiHead(E, E, E)
where E = [e_1, e_2, …, e_L] is the matrix composed of the high-level statement vector representations; MultiHead is the multi-head self-attention method, head_p is the p-th head function in the multi-head attention method, and a is the total number of heads; X_se is the statement vector matrix of the sliced code segment obtained after self-attention; Q, K, V denote the query, key and value matrices of the Attention function respectively; c is the vector dimension; W_p^Q, W_p^K, W_p^V and W^O are learnable weight matrices.
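A compact numpy sketch of the multi-head self-attention of step 51 over the statement matrix E; the per-head projection shapes and the omission of the output projection W^O are simplifying assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(c)) V."""
    c = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(c)) @ V

def multi_head_self_attention(E, heads):
    """heads: list of (Wq, Wk, Wv) projection triples, one per head.
    Returns X_se: the concatenation of all head outputs."""
    outs = [attention(E @ Wq, E @ Wk, E @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=-1)
```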
8. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 6, wherein the calculation in step 52 is as follows:
u_txt = ω·u_rand + b
α_i = exp(x_se,i·u_txt) / Σ_(j=1)^L exp(x_se,j·u_txt)
g = Σ_(i=1)^L α_i·x_se,i
where u_rand is a randomly initialized vector which, after linear-layer processing, yields u_txt; x_se,i is the i-th statement vector obtained in step 51, which is combined with u_txt to compute the text attention value α_i; g is the global semantic information; ω is a learnable weight parameter, and b is a bias term.
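The text-attention pooling of step 52 can be sketched as follows; treating ω as a square weight matrix (i.e. a full linear layer) is an assumption of the sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_semantic(X_se, u_rand, omega, b):
    """Text attention (step 52): u_txt = omega·u_rand + b, one attention
    weight alpha_i per statement, g = weighted sum of statement vectors."""
    u_txt = omega @ u_rand + b
    alpha = softmax(X_se @ u_txt)   # text attention value per statement
    g = alpha @ X_se                # global semantic information vector
    return g, alpha
```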
9. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 1, wherein step 6 specifically comprises:
step 61: in the decoder network, using a BiGRU as the main model, where the network unit is computed as follows:
z′ = σ(W′_z·e_i + U′_z·h′_(i−1) + C_z·g + b′_z)
r′ = σ(W′_r·e_i + U′_r·h′_(i−1) + C_r·g + b′_r)
h̃′_i = tanh(W′_h·e_i + U′_h·(r′⊙h′_(i−1)) + C_h·g + b′_h)
h′_i = (1−z′)⊙h′_(i−1) + z′⊙h̃′_i
where z′ and r′ denote the update gate and the reset gate respectively, e_i is the high-level vector representation of the i-th statement in the sliced code segment, g is the global semantic information vector, h′_i and h̃′_i denote the hidden state and the intermediate temporary state for the i-th statement respectively, σ is the activation function, W′_z, W′_r, W′_h, U′_z, U′_r, U′_h, C_z, C_r, C_h are learnable weight parameters, and b′_z, b′_r, b′_h are bias terms;
step 62: denoting the GRU unit modified in step 61 as GRU_D, running the decoder network over the statement sequence in both directions:
→h′_i = GRU_D_fwd(e_i, g), i = 1, …, L
←h′_i = GRU_D_bwd(e_i, g), i = L, …, 1
s_i = →h′_i || ←h′_i
where s_i is the final vector representation of the i-th statement.
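The modified GRU cell of step 61, with the global semantic vector g injected into each gate through the C matrices, can be sketched as follows (all shapes are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_d_step(e, h, g, P):
    """Decoder unit GRU_D (step 61): a GRU whose gates and candidate state
    each receive an extra context term C·g, where g is the global semantic
    information vector; e is a high-level statement vector."""
    z = sigmoid(P["Wz"] @ e + P["Uz"] @ h + P["Cz"] @ g + P["bz"])
    r = sigmoid(P["Wr"] @ e + P["Ur"] @ h + P["Cr"] @ g + P["br"])
    h_cand = np.tanh(P["Wh"] @ e + P["Uh"] @ (r * h) + P["Ch"] @ g + P["bh"])
    return (1 - z) * h + z * h_cand
```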
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210700763.1A CN114969763A (en) | 2022-06-20 | 2022-06-20 | Fine-grained vulnerability detection method based on seq2seq code representation learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114969763A true CN114969763A (en) | 2022-08-30 |
Family
ID=82964430
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115422092A (en) * | 2022-11-03 | 2022-12-02 | 杭州金衡和信息科技有限公司 | Software bug positioning method based on multi-method fusion |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB201917161D0 (en) * | 2019-08-23 | 2020-01-08 | Praetorian | System and method for automatically detecting a security vulnerability in a source code using a machine learning model |
CN111753303A (en) * | 2020-07-29 | 2020-10-09 | 哈尔滨工业大学 | Multi-granularity code vulnerability detection method based on deep learning and reinforcement learning |
CN112699377A (en) * | 2020-12-30 | 2021-04-23 | 哈尔滨工业大学 | Function-level code vulnerability detection method based on slice attribute graph representation learning |
CN114064487A (en) * | 2021-11-18 | 2022-02-18 | 北京京航计算通讯研究所 | Code defect detection method |
Non-Patent Citations (3)
Title |
---|
DAM H.; TRAN T.; PHAM T.: "Automatic Feature Learning for Vulnerability Prediction", Retrieved from the Internet <URL:http://arxiv.org/abs/1708.02368> *
WEINING ZHENG; YUAN JIANG; XIAOHONG SU: "VulSPG: Vulnerability detection based on slice property graph representation learning", 2021 IEEE 32ND INTERNATIONAL SYMPOSIUM ON SOFTWARE RELIABILITY ENGINEERING (ISSRE), 11 February 2022 (2022-02-11) *
ZOU DEQING: "Intelligent vulnerability detection system based on graph-structured source code slices", Chinese Journal of Network and Information Security, 31 October 2021 (2021-10-31) *
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||