CN114969763A - Fine-grained vulnerability detection method based on seq2seq code representation learning - Google Patents

Fine-grained vulnerability detection method based on seq2seq code representation learning

Info

Publication number
CN114969763A
CN114969763A (application CN202210700763.1A)
Authority
CN
China
Prior art keywords
statement
sentence
vector
code
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210700763.1A
Other languages
Chinese (zh)
Inventor
苏小红
蒋远
郑伟宁
陶文鑫
王甜甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202210700763.1A
Publication of CN114969763A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 - Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/57 - Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F 21/577 - Assessing vulnerabilities and evaluating computer system security
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a fine-grained vulnerability detection method based on seq2seq code representation learning. The method first extracts vulnerability candidate key nodes as slicing criteria and then uses program slicing to extract slice code segments from the program. A seq2seq-based deep learning model then performs representation learning on the slice code segments, producing a sequence of statement vector representations that captures long-range dependencies between statements; the vector representation of each statement in the sequence is fed to a detector, which decides whether the statement is a vulnerable statement. The method makes full use of the global and local semantic information in the code to learn vulnerability-related features within and between statements, avoids the difficulty traditional deep learning classification models have in capturing long-range dependencies between a vulnerable statement and its context, and, by using the seq2seq model to generate the statement vector representation sequence, is better suited to statement-level fine-grained vulnerability detection.

Description

Fine-grained vulnerability detection method based on seq2seq code representation learning
Technical Field
The invention relates to a fine-grained software vulnerability detection method, and in particular to a fine-grained vulnerability detection method based on seq2seq code representation learning.
Background
A vulnerability is a security flaw in a computer system that not only threatens the system itself but also undermines the confidentiality, integrity, and availability of application data. In recent years, growing system scale and complexity and the introduction of new technologies have increased the likelihood of software vulnerabilities. Automatic vulnerability detection is an important means of reducing them. However, most current deep-learning-based vulnerability detection research operates at coarse granularity, such as the file, function, or code-fragment level: it can only predict whether a file, function, or code fragment is likely to contain a vulnerability. Developers must then locate the vulnerable statements manually, which is difficult, so vulnerabilities may not be repaired in time. A statement-level fine-grained detection method therefore helps developers understand and quickly repair vulnerabilities, and is the trend of current vulnerability detection research. Although a small amount of research on fine-grained detection methods has appeared in recent years, localization accuracy still needs improvement. The sparsity and discontinuity of vulnerability information, the contextual dependency of vulnerable statements, and the complexity and concealment of vulnerability features all pose challenges for fine-grained vulnerability detection.
Disclosure of Invention
The aim of the invention is to provide a fine-grained vulnerability detection method based on seq2seq code representation learning that makes full use of the global and local semantic information in the code to learn vulnerability-related features within and between statements, avoids the difficulty traditional deep learning classification models have in capturing long-range dependencies between a vulnerable statement and its context, and uses the statement vector representation sequence generated by seq2seq representation learning, making it better suited to statement-level fine-grained vulnerability detection.
The purpose of the invention is realized by the following technical scheme:
a fine-grained vulnerability detection method based on seq2seq code representation learning comprises the steps of firstly, extracting vulnerability candidate key nodes as slicing criteria, and then extracting slicing code segments in a program by using a program slicing technology. Then, the slice code segment is representation-learned by using a seq2 seq-based deep learning model. Specifically, firstly, a sentence coding network in a seq2seq model encoder is used to perform representation learning on token sequences in the slice code segment sentences, and sentence primary vector representation containing local semantic information is generated. The term vector sequence is then representation-learned using a program coding network in the encoder, using as input a term sequence formed from the obtained primary vector representation for each term, to generate a term high-level vector representation containing term context information. Then, global semantic information related to dependencies and vulnerabilities between sentences in the learner is learned in tandem using a dual attention mechanism based on self-attention and textual attention. And finally, using the statement high-level vector representation generated by the encoder as input and combining global semantic information, generating a statement final vector representation sequence containing long dependency relations among statements by using a decoder network in a seq2seq model encoder, sending the final vector representation of each statement in the sequence into a detector, and detecting whether each statement is a bug statement. The method specifically comprises the following steps:
Step 1: analyze the source code with a static analysis tool to generate an abstract syntax tree and a program dependency graph;
Step 2: use the abstract syntax tree to extract vulnerability candidate key nodes of the source code as slicing criteria, generate slice code segments of the source code, and normalize the slice code segments to obtain name-normalized slice code segments;
Step 3: use the statement encoding network in the seq2seq deep learning model encoder to perform representation learning on the token sequences of the statements in the slice code segments, generating primary statement vector representations that contain local semantic information;
Step 4: taking the sequence formed by the primary vector representations of the statements obtained in step 3 as input, use the program encoding network in the seq2seq encoder to perform representation learning on the statement sequence, generating high-level statement vector representations that contain statement context information;
Step 5: feed the sequence of high-level statement vector representations obtained in step 4 into a dual attention module based on self-attention and text attention, learn the dependencies between statements through self-attention, and then learn the vulnerability-related global semantic information through text attention;
Step 6: taking the vulnerability-related global semantic information obtained in step 5 and the sequence of high-level statement vector representations obtained in step 4 as input, feed them into the decoder network of the seq2seq deep learning model to learn long-range dependency information between statements and generate the final vector representation of each statement;
Step 7: feed the final vector representation of each statement obtained in step 6 into a detector network consisting of a multilayer perceptron (MLP) and a softmax layer to obtain a prediction of whether the statement contains a vulnerability, compute a cross-entropy loss using the statement label information, and adjust the network parameters via error back-propagation until the loss no longer decreases, completing training;
Step 8: use the trained model to perform statement-level fine-grained vulnerability detection on code.
In the seq2seq-based deep learning model used for code representation learning, the statement encoding network in the encoder extracts dependencies between tokens within a statement to obtain local semantic information; the program encoding network learns context information between statements; the dual attention mechanism acquires vulnerability-related global semantic information; the decoder network acquires long-range dependencies between statements; and the detector network outputs a prediction of whether each statement contains a vulnerability.
Existing vulnerability detection methods realize detection through code representation learning based on deep learning models such as convolutional neural networks, recurrent neural networks, or graph neural networks. In contrast, the invention provides a fine-grained vulnerability detection method based on seq2seq code representation learning, applying a seq2seq model to the statement-level fine-grained vulnerability detection task for the first time. Unlike traditional deep-learning-based binary classification models, the seq2seq model commonly used in machine translation can directly realize sequence-to-sequence mapping. A seq2seq model consists of an encoder and a decoder: the encoder encodes the variable-length input sequence of program statements into fixed-length high-level statement vector representations, and the decoder decodes the encoder output back into a variable-length sequence of program statement vector representations. In terms of model structure, the generated sequence can therefore be applied directly to statement-level vulnerability detection. Moreover, the seq2seq model can consider both the local and the global semantic information of statements in a sample program. Because vulnerable statements depend strongly on their context, conventional RNN or CNN models, which only extract short-range dependencies between adjacent statements, have difficulty capturing long-range dependency information between distant statements. In contrast, the seq2seq model encodes all vulnerability-related statements in the sample program and, together with the global semantic information acquired by the dual attention mechanism, uses them to guide the decoder's generation of the statement vector sequence; this yields a more accurate statement vector representation sequence whose generation result can be used directly for statement-level fine-grained vulnerability detection.
Compared with the prior art, the invention has the following advantages:
(1) Directly performing representation learning with a seq2seq deep learning model on slice code segments generated by program slicing makes full use of the local and global semantic information of vulnerable code, and the statement vector representations obtained from the sequence generation model are better suited to realizing statement-level fine-grained vulnerability detection.
(2) The method introduces a dual attention mechanism based on self-attention and text attention between the encoder and the decoder of the seq2seq model, so that while learning global semantic information it can also effectively learn the importance of each program statement to the vulnerability, improving the accuracy of statement-level fine-grained vulnerability detection.
Drawings
FIG. 1 is a schematic flow diagram of the source code fine-grained vulnerability detection method according to the present invention.
FIG. 2 is an example of vulnerability code.
FIG. 3 is the statement-level fine-grained vulnerability detection process.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but is not limited thereto; any modification or equivalent replacement that does not depart from the spirit and scope of the technical solution of the present invention shall be covered by the protection scope of the present invention.
The invention provides a fine-grained vulnerability detection method based on seq2seq code representation learning. First, vulnerability candidate key nodes are extracted as slicing criteria, and program slicing is used to extract slice code segments from the program. Second, a seq2seq deep learning model performs representation learning on the slice code segments, turning the representation learning problem for statements in code into a statement sequence generation problem and realizing end-to-end learning of the global and local semantic information of the code. An encoder consisting of a statement encoding network and a program encoding network performs staged representation learning on the statements in the slice code segments: the statement encoding network extracts dependencies between tokens within a statement to learn local semantic information, and the program encoding network then learns context information between statements. To better learn the vulnerability-related global semantics, the invention introduces a dual attention mechanism based on self-attention and text attention between the encoder and the decoder: self-attention learns the dependencies between statements in the program, and text attention acquires the vulnerability-related global semantic information. Finally, the decoder network, further combining the global semantic information, learns long-range dependency information between statements to obtain the final vector representation of each statement, which is fed to the detector to realize statement-level fine-grained vulnerability detection. As shown in fig. 1, the method comprises the following specific steps:
Step 1: analyze the source code with a static analysis tool to generate an abstract syntax tree and a program dependency graph.
Step 2: use the abstract syntax tree to extract vulnerability candidate key nodes of the source code as slicing criteria, generate slice code segments of the source code, and normalize the slice code segments to obtain name-normalized slice code segments.
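The patent does not specify the normalization scheme; the following is a minimal, hypothetical Python sketch of the kind of name normalization commonly applied to slice code segments (VulDeePecker-style), mapping user-defined variables to VAR1, VAR2, ... and user-defined functions to FUN1, FUN2, .... The keyword and API lists here are illustrative assumptions, not the patent's actual configuration.

```python
import re

# Illustrative (incomplete) lists; a real implementation would use the full
# C keyword set and a curated library-API whitelist.
C_KEYWORDS = {
    "if", "else", "for", "while", "return", "int", "char", "void", "float",
    "double", "unsigned", "sizeof", "struct", "static", "const", "break",
}
KNOWN_APIS = {"memcpy", "strcpy", "malloc", "free", "printf", "strlen"}

def normalize_slice(lines):
    """Rename user-defined identifiers in a slice code segment."""
    var_map, fun_map = {}, {}
    ident = re.compile(r"\b[A-Za-z_]\w*\b")
    out = []
    for line in lines:
        def rename(m, line=line):
            name = m.group(0)
            if name in C_KEYWORDS or name in KNOWN_APIS:
                return name
            # Heuristic: an identifier directly followed by '(' is a function.
            if line[m.end():m.end() + 1] == "(":
                return fun_map.setdefault(name, f"FUN{len(fun_map) + 1}")
            return var_map.setdefault(name, f"VAR{len(var_map) + 1}")
        out.append(ident.sub(rename, line))
    return out

print(normalize_slice(["char buf[16];", "strcpy(buf, src);"]))
# -> ['char VAR1[16];', 'strcpy(VAR1, VAR2);']
```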
Step 3: perform representation learning on the token sequences of the statements in the slice code segments using the statement encoding network in the seq2seq deep learning model encoder, generating primary statement vector representations that contain local semantic information. The specific steps are as follows:
Step 31: split the statements in the slice code segment into tokens, and use a pre-trained word2vec word embedding model to obtain the vector representation of each token, forming a token vector matrix.
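As an illustration of step 31, the following sketch builds token embeddings with gensim's Word2Vec; the toy corpus, vector size, and tokenization are assumptions for demonstration, not the patent's actual settings.

```python
from gensim.models import Word2Vec
import numpy as np

# Toy corpus of tokenized (already normalized) statements.
corpus = [["char", "VAR1", "[", "16", "]", ";"],
          ["strcpy", "(", "VAR1", ",", "VAR2", ")", ";"]]
w2v = Word2Vec(sentences=corpus, vector_size=64, window=5, min_count=1, sg=1)

def token_matrix(statement_tokens, model):
    """Stack the embedding of each token into a (num_tokens, dim) matrix."""
    return np.stack([model.wv[t] for t in statement_tokens])

M = token_matrix(corpus[1], w2v)
print(M.shape)  # (7, 64)
```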
Step 32: feed the token vector matrix generated in step 31 into a statement encoding network implemented with a GRU, learn the hidden vector representation of each token, and obtain the primary statement vector representation by weighted summation of the hidden vector representations of all tokens with learnable weights.
The specific calculation formula is as follows:
$$z = \sigma(W_z \cdot w_t + U_z \cdot h_{t-1} + b_z)$$
$$r = \sigma(W_r \cdot w_t + U_r \cdot h_{t-1} + b_r)$$
$$\tilde{h}_t = \tanh\!\left(W_h \cdot w_t + U_h \cdot (r \odot h_{t-1}) + b_h\right)$$
$$h_t = (1 - z) \odot h_{t-1} + z \odot \tilde{h}_t$$
$$x = \sum_{t=1}^{n} U_t \cdot h_t$$

where $z$ and $r$ represent the update gate and reset gate respectively, $\sigma$ is the activation function, $w_t$ is the initial vector representation of the $t$-th token in the statement, $h_t$ and $\tilde{h}_t$ represent the hidden state and intermediate temporary state of the $t$-th token, $W_z$, $W_r$, $W_h$, $U_z$, $U_r$, $U_h$, $U_t$ are learnable weight parameters, $b_z$, $b_r$, $b_h$ are bias terms, $x$ is the statement vector representation, and $n$ is the total number of tokens in the statement.
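The following is a minimal PyTorch sketch of the statement encoding network of step 32: a GRU over the token vector matrix followed by a learnable weighted sum of the token hidden states. Layer sizes are illustrative assumptions, and the softmax-normalized scoring is one possible realization of the learnable weights $U_t$, not necessarily the patent's.

```python
import torch
import torch.nn as nn

class StatementEncoder(nn.Module):
    def __init__(self, emb_dim=64, hid_dim=128):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.score = nn.Linear(hid_dim, 1, bias=False)  # learnable weights

    def forward(self, tokens):                    # tokens: (batch, n, emb_dim)
        h, _ = self.gru(tokens)                   # hidden state h_t per token
        w = torch.softmax(self.score(h), dim=1)   # per-token weight
        return (w * h).sum(dim=1)                 # x = weighted sum over t

x = StatementEncoder()(torch.randn(2, 7, 64))
print(x.shape)  # torch.Size([2, 128])
```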
Step 4: feed the sequence formed by the primary vector representations of the statements obtained in step 3 into the program encoding network in the seq2seq deep learning model encoder, and learn high-level statement vector representations that contain statement context information. The specific steps are as follows:
Step 41: pad the vector representations of the statements in the slice code segment to obtain an initialized statement vector matrix composed of statement vectors.
Step 42: feed the initialized statement vector matrix generated in step 41 into a program encoding network implemented with a BiGRU, and learn the hidden vector representation of each statement in the program.
The specific calculation formula is as follows:
$$\overrightarrow{h_i} = \overrightarrow{\mathrm{GRU}}(x_i), \quad i \in [1, L]$$
$$\overleftarrow{h_i} = \overleftarrow{\mathrm{GRU}}(x_i), \quad i \in [L, 1]$$
$$e_i = \left[\overrightarrow{h_i} \,\|\, \overleftarrow{h_i}\right]$$

where $x_i$ is the vector representation of the $i$-th statement in the slice code segment, obtained by the statement encoding network in step 3; $L$ is the total number of statements in the slice code segment; $\overrightarrow{h_i}$ is the forward hidden state of the statement produced by the forward GRU unit $\overrightarrow{\mathrm{GRU}}$, $\overleftarrow{h_i}$ is the backward hidden state, and $e_i$ is the high-level statement vector representation finally obtained in this step.
Step 5: feed the sequence of high-level statement vector representations obtained in step 4 into a dual attention module based on self-attention and text attention, learn the dependencies between statements through self-attention, and learn the vulnerability-related global semantic information through text attention. The specific steps are as follows:
Step 51: learn the dependencies between statements through a self-attention mechanism to obtain the hidden vector matrix of the statements. The specific calculation formula is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{c}}\right) V$$
$$\mathrm{MultiHead}(Q, K, V) = \left(\mathrm{head}_1 \,\|\, \mathrm{head}_2 \,\|\, \cdots \,\|\, \mathrm{head}_a\right) W^{O}$$
$$\mathrm{head}_p = \mathrm{Attention}\!\left(Q W_p^{Q},\, K W_p^{K},\, V W_p^{V}\right)$$
$$X^{se} = \mathrm{MultiHead}(E, E, E)$$

where $E = [e_1, e_2, \ldots, e_L]$ is the matrix composed of the high-level statement vector representations; MultiHead is the multi-head self-attention method, which can map a statement vector representation to another fixed-length statement vector; $\mathrm{head}_p$ is the $p$-th head function of the multi-head attention method and $a$ is the total number of heads; $X^{se}$ is the statement vector matrix of the slice code segment obtained after self-attention extraction; $Q$, $K$, $V$ represent the query, key, and value matrices of the Attention function, and because a self-attention mechanism is adopted, $Q$, $K$, $V$ are all the statement vector matrix $E$; $c$ is the vector dimension; $W^{O}$, $W_p^{Q}$, $W_p^{K}$, $W_p^{V}$ are learnable weight matrices.
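Step 51 can be realized directly with PyTorch's built-in multi-head attention, as in the sketch below; the head count and dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

c = 256                                    # statement vector dimension
attn = nn.MultiheadAttention(embed_dim=c, num_heads=4, batch_first=True)

E = torch.randn(2, 10, c)                  # statement matrix (batch, L, c)
X_se, _ = attn(E, E, E)                    # self-attention: Q = K = V = E
print(X_se.shape)                          # torch.Size([2, 10, 256])
```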
Step 52: compute the global semantic information using the text attention mechanism and the statement vector matrix obtained in step 51. The specific calculation formula is as follows:
$$u_{txt} = \omega u_{rand} + b$$
$$\alpha_i = \frac{\exp\!\left(x_i^{se} \cdot u_{txt}\right)}{\sum_{j=1}^{L} \exp\!\left(x_j^{se} \cdot u_{txt}\right)}$$
$$g = \sum_{i=1}^{L} \alpha_i \, x_i^{se}$$

where $u_{rand}$ is a randomly initialized vector that yields $u_{txt}$ after a linear layer; $x_i^{se}$ is the statement vector obtained in step 51, which combined with $u_{txt}$ is used to compute the text attention value $\alpha_i$; $g$ is the global semantic information finally obtained in this step; $\omega$ is a learnable weight parameter and $b$ is a bias term.
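A minimal sketch of the text attention of step 52, under illustrative dimensions: a randomly initialized vector passes through a linear layer to give the context vector u_txt, which scores each statement; the softmax-weighted sum of the statement vectors yields the global semantic vector g.

```python
import torch
import torch.nn as nn

class TextAttention(nn.Module):
    def __init__(self, c=256):
        super().__init__()
        self.u_rand = nn.Parameter(torch.randn(c))  # randomly initialized
        self.linear = nn.Linear(c, c)               # u_txt = w * u_rand + b

    def forward(self, X_se):                        # X_se: (batch, L, c)
        u_txt = self.linear(self.u_rand)
        alpha = torch.softmax(X_se @ u_txt, dim=1)  # text attention values
        return (alpha.unsqueeze(-1) * X_se).sum(1)  # g: (batch, c)

g = TextAttention()(torch.randn(2, 10, 256))
print(g.shape)  # torch.Size([2, 256])
```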
Step 6: taking the vulnerability-related global semantic information obtained in step 5 and the sequence of high-level statement vector representations obtained in step 4 as input, feed them into the decoder network of the seq2seq deep learning model to learn long-range dependency information between statements and generate the final vector representation of each statement. The specific steps are as follows:
Step 61: in the decoder network, a BiGRU is used as the main model, but the computation of the network unit differs from that of a traditional GRU. The specific calculation formula is as follows:
$$z' = \sigma\!\left(W'_z \cdot e_i + U'_z \cdot h'_{i-1} + C_z g + b'_z\right)$$
$$r' = \sigma\!\left(W'_r \cdot e_i + U'_r \cdot h'_{i-1} + C_r g + b'_r\right)$$
$$\tilde{h}'_i = \tanh\!\left(W'_h \cdot e_i + U'_h \cdot (r' \odot h'_{i-1}) + C_h g + b'_h\right)$$
$$h'_i = (1 - z') \odot h'_{i-1} + z' \odot \tilde{h}'_i$$

The prime mark (') distinguishes these parameters from the similar parameters in the encoder: $z'$ and $r'$ represent the update gate and reset gate respectively; $e_i$ is the high-level vector representation of the $i$-th statement in the slice code segment, obtained by the program encoding network in step 4; $g$ is the global semantic information vector obtained in step 5; $h'_i$ and $\tilde{h}'_i$ represent the hidden state and intermediate temporary state of the generated $i$-th statement; $\sigma$ is the activation function; $W'_z$, $W'_r$, $W'_h$, $U'_z$, $U'_r$, $U'_h$, $C_z$, $C_r$, $C_h$ are learnable weight parameters; $b'_z$, $b'_r$, $b'_h$ are bias terms.
Step 62: denoting the GRU unit modified in step 61 as $\mathrm{GRU}_D$, the formula of the decoder network is as follows:

$$\overrightarrow{h'_i} = \overrightarrow{\mathrm{GRU}_D}(e_i, g), \quad i \in [1, L]$$
$$\overleftarrow{h'_i} = \overleftarrow{\mathrm{GRU}_D}(e_i, g), \quad i \in [L, 1]$$
$$d_i = \left[\overrightarrow{h'_i} \,\|\, \overleftarrow{h'_i}\right]$$

where $\overrightarrow{h'_i}$ is the forward hidden state of the statement produced by the forward decoder GRU unit $\overrightarrow{\mathrm{GRU}_D}$, $\overleftarrow{h'_i}$ is the backward hidden state, and $d_i$ is the final vector representation of the statement obtained in this step.
Step 7: feed the final vector representation of each statement obtained in step 6 into a detector network consisting of a multilayer perceptron (MLP) and a softmax layer to obtain a prediction of whether the statement contains a vulnerability, compute a cross-entropy loss using the statement label information, and adjust the network parameters via error back-propagation until the loss no longer decreases, completing training.
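A minimal sketch of the step 7 detector and training objective, with illustrative shapes and hyperparameters: an MLP over each statement's final vector $d_i$, with the softmax folded into PyTorch's cross-entropy loss over the per-statement labels.

```python
import torch
import torch.nn as nn

detector = nn.Sequential(
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 2),                    # logits; softmax applied in the loss
)
criterion = nn.CrossEntropyLoss()        # combines log-softmax and NLL
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-3)

D = torch.randn(2, 10, 256)              # final statement vectors d_i
labels = torch.randint(0, 2, (2, 10))    # 1 = vulnerable statement

logits = detector(D)                     # (batch, L, 2)
loss = criterion(logits.view(-1, 2), labels.view(-1))
loss.backward()                          # error back-propagation
optimizer.step()
```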
Step 8: perform statement-level fine-grained vulnerability detection on code using the trained model. The specific steps are as follows:
and after the sample to be detected is input into the model, the probability that each statement in the program contains a bug can be output, when the probability is less than 0.5, the statement is a non-bug statement, otherwise, the statement is a bug statement.
Example:
taking the code bug shown in fig. 2 as an example, two statements in the code are bug statements, which are respectively line 8 and line 13. In a seq2seq model testing stage, firstly, a word2vec model is established through a source code corpus in a preprocessing stage to obtain a vector representation corresponding to token of each statement. Then, a program slicing technology and a static analysis tool are used for extracting slicing code segments in a source code, vector representation of token sequences in the slicing code segments is used as input of an encoder, the vector representation passes through an encoder network, an attention mechanism module, a decoder network and a detector network, and finally a vulnerability prediction result of each statement is obtained.

Claims (9)

1. A fine-grained vulnerability detection method based on seq2seq code representation learning, characterized by comprising the following steps:
Step 1: analyze the source code with a static analysis tool to generate an abstract syntax tree and a program dependency graph;
Step 2: use the abstract syntax tree to extract vulnerability candidate key nodes of the source code as slicing criteria, generate slice code segments of the source code, and normalize the slice code segments to obtain name-normalized slice code segments;
Step 3: use the statement encoding network in the seq2seq deep learning model encoder to perform representation learning on the token sequences of the statements in the slice code segments, generating primary statement vector representations that contain local semantic information;
Step 4: taking the sequence formed by the primary vector representations of the statements obtained in step 3 as input, use the program encoding network in the seq2seq encoder to perform representation learning on the statement sequence, generating high-level statement vector representations that contain statement context information;
Step 5: feed the sequence of high-level statement vector representations obtained in step 4 into a dual attention module based on self-attention and text attention, learn the dependencies between statements through self-attention, and then learn the vulnerability-related global semantic information through text attention;
Step 6: taking the vulnerability-related global semantic information obtained in step 5 and the sequence of high-level statement vector representations obtained in step 4 as input, feed them into the decoder network of the seq2seq deep learning model to learn long-range dependency information between statements and generate the final vector representation of each statement;
Step 7: feed the final vector representation of each statement obtained in step 6 into a detector network consisting of a multilayer perceptron (MLP) and a softmax layer to obtain a prediction of whether the statement contains a vulnerability, compute a cross-entropy loss using the statement label information, and adjust the network parameters via error back-propagation until the loss no longer decreases, completing training;
Step 8: use the trained model to perform statement-level fine-grained vulnerability detection on code.
2. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 1, wherein the specific steps of step 3 are as follows:
Step 31: split the statements in the slice code segment into tokens, and use a pre-trained word2vec word embedding model to obtain the vector representation of each token, forming a token vector matrix;
Step 32: feed the token vector matrix generated in step 31 into a statement encoding network implemented with a GRU, learn the hidden vector representation of each token, and obtain the primary statement vector representation by weighted summation of the hidden vector representations of all tokens with learnable weights.
3. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 2, wherein the specific calculation formula of step 32 is as follows:
$$z = \sigma(W_z \cdot w_t + U_z \cdot h_{t-1} + b_z)$$
$$r = \sigma(W_r \cdot w_t + U_r \cdot h_{t-1} + b_r)$$
$$\tilde{h}_t = \tanh\!\left(W_h \cdot w_t + U_h \cdot (r \odot h_{t-1}) + b_h\right)$$
$$h_t = (1 - z) \odot h_{t-1} + z \odot \tilde{h}_t$$
$$x = \sum_{t=1}^{n} U_t \cdot h_t$$

where $z$ and $r$ represent the update gate and reset gate respectively, $\sigma$ is the activation function, $w_t$ is the initial vector representation of the $t$-th token in the statement, $h_t$ and $\tilde{h}_t$ represent the hidden state and intermediate temporary state of the $t$-th token, $W_z$, $W_r$, $W_h$, $U_z$, $U_r$, $U_h$, $U_t$ are learnable weight parameters, $b_z$, $b_r$, $b_h$ are bias terms, $x$ is the statement vector representation, and $n$ is the total number of tokens in the statement.
4. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 1, wherein the specific steps of step 4 are as follows:
Step 41: pad the vector representations of the statements in the slice code segment to obtain an initialized statement vector matrix composed of statement vectors;
Step 42: feed the initialized statement vector matrix generated in step 41 into a program encoding network implemented with a BiGRU, and learn the hidden vector representation of each statement in the program.
5. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 4, wherein the specific calculation formula of step 42 is as follows:
$$\overrightarrow{h_i} = \overrightarrow{\mathrm{GRU}}(x_i), \quad i \in [1, L]$$
$$\overleftarrow{h_i} = \overleftarrow{\mathrm{GRU}}(x_i), \quad i \in [L, 1]$$
$$e_i = \left[\overrightarrow{h_i} \,\|\, \overleftarrow{h_i}\right]$$

where $x_i$ is the vector representation of the $i$-th statement in the slice code segment and $L$ is the total number of statements in the slice code segment; $\overrightarrow{h_i}$ is the forward hidden state of the statement produced by the forward GRU unit $\overrightarrow{\mathrm{GRU}}$, $\overleftarrow{h_i}$ is the backward hidden state, and $e_i$ is the high-level statement vector representation.
6. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 1, wherein the specific steps of step 5 are as follows:
Step 51: learn the dependencies between statements through a self-attention mechanism to obtain the hidden vector matrix of the statements;
Step 52: compute the global semantic information using the text attention mechanism and the statement vector matrix obtained in step 51.
7. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 6, wherein the specific calculation formula of step 51 is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{c}}\right) V$$
$$\mathrm{MultiHead}(Q, K, V) = \left(\mathrm{head}_1 \,\|\, \mathrm{head}_2 \,\|\, \cdots \,\|\, \mathrm{head}_a\right) W^{O}$$
$$\mathrm{head}_p = \mathrm{Attention}\!\left(Q W_p^{Q},\, K W_p^{K},\, V W_p^{V}\right)$$
$$X^{se} = \mathrm{MultiHead}(E, E, E)$$

where $E = [e_1, e_2, \ldots, e_L]$ is the matrix composed of the high-level statement vector representations; MultiHead is the multi-head self-attention method, $\mathrm{head}_p$ is the $p$-th head function of the multi-head attention method, and $a$ is the total number of heads; $X^{se}$ is the statement vector matrix of the slice code segment obtained after self-attention extraction; $Q$, $K$, $V$ denote the query, key, and value matrices of the Attention function respectively; $c$ is the vector dimension; $W^{O}$, $W_p^{Q}$, $W_p^{K}$, $W_p^{V}$ are learnable weight matrices.
8. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 6, wherein the specific calculation formula of step 52 is as follows:
$$u_{txt} = \omega u_{rand} + b$$
$$\alpha_i = \frac{\exp\!\left(x_i^{se} \cdot u_{txt}\right)}{\sum_{j=1}^{L} \exp\!\left(x_j^{se} \cdot u_{txt}\right)}$$
$$g = \sum_{i=1}^{L} \alpha_i \, x_i^{se}$$

where $u_{rand}$ is a randomly initialized vector that yields $u_{txt}$ after a linear layer; $x_i^{se}$ is the statement vector obtained in step 51, which combined with $u_{txt}$ is used to compute the text attention value $\alpha_i$; $g$ is the global semantic information; $\omega$ is a learnable weight parameter and $b$ is a bias term.
9. The fine-grained vulnerability detection method based on seq2seq code representation learning according to claim 1, wherein the specific steps of step 6 are as follows:
Step 61: in the decoder network, using a BiGRU as the main model, the specific calculation formula of the network unit is as follows:
$$z' = \sigma\!\left(W'_z \cdot e_i + U'_z \cdot h'_{i-1} + C_z g + b'_z\right)$$
$$r' = \sigma\!\left(W'_r \cdot e_i + U'_r \cdot h'_{i-1} + C_r g + b'_r\right)$$
$$\tilde{h}'_i = \tanh\!\left(W'_h \cdot e_i + U'_h \cdot (r' \odot h'_{i-1}) + C_h g + b'_h\right)$$
$$h'_i = (1 - z') \odot h'_{i-1} + z' \odot \tilde{h}'_i$$

where $z'$ and $r'$ represent the update gate and reset gate respectively; $e_i$ is the high-level vector representation of the $i$-th statement in the slice code segment and $g$ is the global semantic information vector; $h'_i$ and $\tilde{h}'_i$ represent the hidden state and intermediate temporary state of the generated $i$-th statement; $\sigma$ is the activation function; $W'_z$, $W'_r$, $W'_h$, $U'_z$, $U'_r$, $U'_h$, $C_z$, $C_r$, $C_h$ are learnable weight parameters; $b'_z$, $b'_r$, $b'_h$ are bias terms;
Step 62: denoting the GRU unit modified in step 61 as $\mathrm{GRU}_D$, the formula of the decoder network is as follows:

$$\overrightarrow{h'_i} = \overrightarrow{\mathrm{GRU}_D}(e_i, g), \quad i \in [1, L]$$
$$\overleftarrow{h'_i} = \overleftarrow{\mathrm{GRU}_D}(e_i, g), \quad i \in [L, 1]$$
$$d_i = \left[\overrightarrow{h'_i} \,\|\, \overleftarrow{h'_i}\right]$$

where $\overrightarrow{h'_i}$ is the forward hidden state of the statement produced by the forward decoder GRU unit $\overrightarrow{\mathrm{GRU}_D}$, $\overleftarrow{h'_i}$ is the backward hidden state, and $d_i$ is the final vector representation of the statement obtained in this step.
CN202210700763.1A 2022-06-20 2022-06-20 Fine-grained vulnerability detection method based on seq2seq code representation learning Pending CN114969763A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210700763.1A CN114969763A (en) 2022-06-20 2022-06-20 Fine-grained vulnerability detection method based on seq2seq code representation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210700763.1A CN114969763A (en) 2022-06-20 2022-06-20 Fine-grained vulnerability detection method based on seq2seq code representation learning

Publications (1)

Publication Number Publication Date
CN114969763A (en) 2022-08-30

Family

ID=82964430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210700763.1A Pending CN114969763A (en) 2022-06-20 2022-06-20 Fine-grained vulnerability detection method based on seq2seq code representation learning

Country Status (1)

Country Link
CN (1) CN114969763A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201917161D0 (en) * 2019-08-23 2020-01-08 Praetorian System and method for automatically detecting a security vulnerability in a source code using a machine learning model
CN111753303A (en) * 2020-07-29 2020-10-09 哈尔滨工业大学 Multi-granularity code vulnerability detection method based on deep learning and reinforcement learning
CN112699377A (en) * 2020-12-30 2021-04-23 哈尔滨工业大学 Function-level code vulnerability detection method based on slice attribute graph representation learning
CN114064487A (en) * 2021-11-18 2022-02-18 北京京航计算通讯研究所 Code defect detection method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DAM H, TRAN T, PHAM T: "Automatic Feature Learning for Vulnerability Prediction", Retrieved from the Internet <URL:http://arxiv.org/abs/1708.02368> *
WEINING ZHENG; YUAN JIANG; XIAOHONG SU: "Vu1SPG: Vulnerability detection based on slice property graph representation learning", 2021 IEEE 32ND INTERNATIONAL SYMPOSIUM ON SOFTWARE RELIABILITY ENGINEERING (ISSRE), 11 February 2022 (2022-02-11) *
ZOU DEQING: "Intelligent vulnerability detection system based on graph-structured source code slicing", Chinese Journal of Network and Information Security (网络与信息安全学报), 31 October 2021 (2021-10-31) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115422092A (en) * 2022-11-03 2022-12-02 杭州金衡和信息科技有限公司 Software bug positioning method based on multi-method fusion

Similar Documents

Publication Publication Date Title
CN110929030B (en) Text abstract and emotion classification combined training method
US20210232773A1 (en) Unified Vision and Dialogue Transformer with BERT
CN107506414B (en) Code recommendation method based on long-term and short-term memory network
CN112215013B (en) Clone code semantic detection method based on deep learning
CN112487807B (en) Text relation extraction method based on expansion gate convolutional neural network
CN109800434B (en) Method for generating abstract text title based on eye movement attention
CN110210032A (en) Text handling method and device
CN113672931B (en) Software vulnerability automatic detection method and device based on pre-training
CN113011191B (en) Knowledge joint extraction model training method
CN110427619B (en) Chinese text automatic proofreading method based on multi-channel fusion and reordering
CN116127952A (en) Multi-granularity Chinese text error correction method and device
Nagaraj et al. Kannada to English Machine Translation Using Deep Neural Network.
CN110569505A (en) text input method and device
CN117390141B (en) Agricultural socialization service quality user evaluation data analysis method
CN114064487A (en) Code defect detection method
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN114969763A (en) Fine-grained vulnerability detection method based on seq2seq code representation learning
CN116661805A (en) Code representation generation method and device, storage medium and electronic equipment
Ma et al. Progressive multi-task learning framework for chinese text error correction
Eyraud et al. TAYSIR Competition: Transformer+\textscrnn: Algorithms to Yield Simple and Interpretable Representations
CN117725211A (en) Text classification method and system based on self-constructed prompt template
Wakchaure et al. A scheme of answer selection in community question answering using machine learning techniques
CN112035347B (en) Automatic exception handling method for source code
Ansari et al. Hindi to English transliteration using multilayer gated recurrent units
CN116561323B (en) Emotion analysis method based on aspect word embedding graph convolution network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination