CN117113359A - Pre-trained vulnerability repair method based on adversarial transfer learning - Google Patents
Pre-trained vulnerability repair method based on adversarial transfer learning
- Publication number
- CN117113359A CN117113359A CN202311135429.7A CN202311135429A CN117113359A CN 117113359 A CN117113359 A CN 117113359A CN 202311135429 A CN202311135429 A CN 202311135429A CN 117113359 A CN117113359 A CN 117113359A
- Authority
- CN
- China
- Prior art keywords
- token
- code
- vulnerability
- code generator
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/577—Assessing vulnerabilities and evaluating computer system security
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/096—Transfer learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/03—Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
- G06F2221/033—Test or assess software
Abstract
The invention discloses a pre-trained vulnerability repair method based on adversarial transfer learning, comprising the following steps: constructing a code generator with a shallow-encoder/deep-decoder architecture; tokenizing a function-level large code dataset with a Unigram LM tokenizer; pre-training the code generator with improved causal language modeling and span denoising techniques; extracting the encoder stack of the pre-trained code generator to construct a discriminator; combining the pre-trained code generator and the discriminator into a generative adversarial network; tokenizing the function-level vulnerability repair dataset with the pre-trained tokenizer; obtaining an optimal code generator through adversarial training of the generative adversarial network; and inputting the function-level vulnerable code to be repaired into the pre-trained tokenizer and obtaining a repair sequence with the optimal code generator. By performing vulnerability repair with adversarial transfer learning, the method improves the generalization and robustness of the model, reduces the cost of software vulnerability repair, and improves its accuracy.
Description
Technical Field
The invention belongs to the field of software debugging, and in particular relates to a pre-trained vulnerability repair method based on adversarial transfer learning.
Background
As software vulnerabilities grow in number and complexity, developers must understand them in depth to minimize the impact on system functionality, which greatly increases the cost of repairing them. To reduce this cost, researchers have proposed techniques for automatically repairing software vulnerabilities. However, the vulnerability repair datasets that can be collected from the Internet are small in scale, which poses a significant challenge.
In patent application No. 202210027014.7 (publication No. CN114547619A, "Tree-based vulnerability repair system and repair method"), Yangzhou University proposed a technique that characterizes code with syntax trees to automatically repair vulnerable code. The method first collects a vulnerability repair dataset from GitHub, converts the code in it into abstract syntax trees (ASTs) carrying data-flow and control-flow dependencies, abstracts and normalizes the ASTs to obtain token sequences, splits the token sequences into a training set and a test set, and inputs them into a Transformer model with equal numbers of encoders and decoders for training and testing. That invention uses syntax trees and a Transformer model to repair code automatically and improves repair efficiency. However, the method still has the following shortcomings:
(1) It trains the model only on the vulnerability repair dataset. Given the small scale of such datasets, vulnerabilities of some CWE types are few in number or strongly noisy; when the model has not fully learned the characteristics of a vulnerability, it performs poorly and its generalization and robustness are weakened;
(2) When abstracting and normalizing the code dataset, it replaces function names, variables, and values, so the model cannot learn the latent semantics of the code and its code understanding is poor;
(3) It relies excessively on the Transformer model to generate repair code, and the model may erroneously "repair" code that is already correct, leading to overfitting.
The present invention provides a pre-trained vulnerability repair method based on adversarial transfer learning, which has the following advantages:
(1) A pre-trained code generator model is obtained by pre-training on a large code dataset, giving the model better code understanding, code generation, and code completion capabilities;
(2) The pre-trained code generator model is fine-tuned on the vulnerability repair dataset through a generative adversarial network (GAN) architecture; the adversarial training mechanism of the GAN improves the model's resistance to interference and its repair capability, giving it higher robustness and generalization while alleviating overfitting.
In this method, the code generator model is first pre-trained on the large code dataset and then adversarially trained directly on the vulnerability repair dataset, so the model depends less on source-domain data, adapts better to the data and feature distribution of the target domain, narrows the gap between source and target domains, trains faster, and ultimately repairs vulnerabilities more accurately.
Disclosure of Invention
Purpose of the invention: to design a vulnerability repair method with strong generalization, strong robustness, and high repair accuracy, suited to the current situation in which vulnerability repair datasets are small.
Technical solution: to solve the above technical problems, the invention provides a pre-trained vulnerability repair method based on adversarial transfer learning, comprising the following steps:
S100, constructing a code generator model with a shallow-encoder/deep-decoder architecture;
S200, based on step S100, pre-training the code generator model on a function-level large code dataset using improved pre-training techniques to obtain a pre-trained code generator model;
S300, based on step S200, extracting the encoder stack of the code generator model to construct a discriminator model;
S400, based on steps S200 and S300, constructing a generative adversarial network from the pre-trained code generator model and the discriminator model; retraining the generative adversarial network on a function-level vulnerability repair dataset to obtain an optimal code generator model suitable for repairing vulnerable code;
S500, based on step S400, inputting the function-level vulnerable code into the optimal code generator model to obtain the repaired code.
Further, step S100 specifically includes:
the encoder and decoder are based on those of the CodeT5 model, and the shallow-encoder/deep-decoder architecture means the code generator model contains more decoders than encoders.
Further, step S200 includes the following steps:
S210, converting the function-level large code dataset into code token sequences with an initial Unigram LM (unigram language model) tokenizer, obtaining a pre-trained tokenizer and the code token sequences;
S220, based on steps S100 and S210, performing the first pre-training step on the code generator model with an improved causal language modeling technique to obtain a preliminarily pre-trained code generator model;
S230, based on steps S210 and S220, performing the second pre-training step on the preliminarily pre-trained code generator model with an improved span denoising technique to obtain the pre-trained code generator model;
wherein the improved span denoising technique comprises:
with 50% probability, replacing 10% of the tokens "[TOKEN 0], ..., [TOKEN n]" in the encoder's input token sequence with predefined tokens "[LABEL 0], ..., [LABEL n]" and adding a special token "[SOM]" before them; adding a special token "[EOM]" before the correct token sequence that serves as the decoder's target output; and having the decoder generate the replaced token sequence "[TOKEN 0], ..., [TOKEN n]", thereby obtaining the pre-trained code generator model.
Further, step S220 includes the following steps:
S221, selecting a token within the 5%–100% span of the code token sequence with 50% probability; adding a special token "[GOB]" after the token sequence preceding the selected token; using the sequence with the added special token as the model input and the token sequence after the selected token as the model output;
S222, selecting a token within the 5%–100% span of the code token sequence with 50% probability; adding a special token "[GOF]" before the token sequence following the selected token; using the sequence with the added special token as the model input and the token sequence before the selected token as the model output, obtaining a preliminarily pre-trained code generator model.
Further, step S300 includes the following steps:
S310, based on step S200, extracting the encoders of the pre-trained code generator model to obtain an encoder stack;
wherein the encoder stack carries the parameters of the pre-trained code generator model's encoders;
S320, based on step S310, combining the encoder stack with a linear layer and an output layer to obtain the discriminator model.
Further, step S400 includes the following steps:
S410, based on steps S200 and S300, constructing the generative adversarial network from the pre-trained code generator model and the discriminator model;
S420, based on step S210, tokenizing the function-level vulnerability repair dataset with the pre-trained tokenizer to obtain vulnerable-code token sequences and repair-code token sequences;
S430, based on steps S410 and S420, simultaneously inputting the vulnerable-code token sequence and the repair-code token sequence into the generative adversarial network's code generator model to obtain a generated probability sequence;
meanwhile, the code generator model learns the difference between the generated probability sequence and the input repair-code token sequence, yielding loss value a;
S440, based on steps S410, S420, and S430, ordering the generated probability sequence with the Nucleus Sampling algorithm to obtain a vulnerable-code repair token sequence;
meanwhile, inputting the repair-code token sequence and the vulnerable-code repair token sequence into the generative adversarial network's discriminator model, which learns the difference between the two sequences, yielding loss value b;
S450, optimizing the code generator model with an optimizer according to loss values a and b to obtain the optimal code generator model.
Further, step S500 includes the following steps:
S510, based on step S210, tokenizing the function-level vulnerable code with the pre-trained tokenizer to obtain the token sequence of the vulnerable code to be repaired;
S520, based on steps S400 and S510, inputting the to-be-repaired vulnerable-code token sequence into the optimal code generator model to obtain a repair-code probability sequence;
S530, based on step S520, ordering the repair-code probability sequence with the Nucleus Sampling algorithm again to obtain the repaired code.
Compared with the prior art, the invention has the following beneficial effects:
(1) The shallow-encoder/deep-decoder architecture used here for probability-sequence generation performs better on code generation tasks than a Transformer model with equal numbers of encoders and decoders;
(2) Compared with abstracting and normalizing the vulnerability repair dataset and then training a Transformer model on it, pre-training on a function-level large code dataset lets the model learn broader code structures, semantics, and features, so it can adapt to the repair task even though the vulnerability repair dataset is small;
(3) Compared with directly training a Transformer model, adversarial transfer learning lets the generated incorrect repair code be used to train the code generator model in the reverse direction, and transfer learning migrates knowledge from the code generation domain to the vulnerability repair domain, improving the robustness and generalization of the model.
Drawings
FIG. 1 is a system flow diagram of the present invention;
FIG. 2 is a flow chart of one embodiment of pre-training the code generator model in the present invention;
FIG. 3 is a flow chart of one embodiment of constructing the generative adversarial network and training the optimal code generator model in the present invention;
FIG. 4 is a schematic diagram of one embodiment of the generative adversarial network constructed in the present invention;
FIG. 5 is a flow chart of one embodiment of repairing the vulnerable code to be repaired in the present invention.
Detailed Description
The technical solution of the present invention will be described clearly and completely below with reference to the accompanying drawings and examples. The following examples and figures illustrate the invention and are not intended to limit its scope.
Referring to FIGS. 1-5, this embodiment provides a pre-trained vulnerability repair method based on adversarial transfer learning, including:
In one embodiment, as shown in FIG. 1, FIG. 1 is the system flow of the present invention.
S100, constructing a code generator model with a shallow-encoder/deep-decoder architecture.
Specifically, the encoder and decoder are based on those of the CodeT5 model and follow the T5 architecture. The shallow-encoder/deep-decoder architecture means the code generator model contains more decoders than encoders; the encoders and decoders are connected through cross-attention layers, and the last decoder is followed by a linear layer and an output layer.
For example, the shallow-encoder/deep-decoder architecture may set the number of encoders to 12 and the number of decoders to 18;
the linear layer may adopt a neural network suited to regression tasks, and the output layer may adopt a neural network suited to text generation tasks.
For example, the linear layer may be a fully connected layer with a ReLU activation function, and the output layer may be a Softmax probability output layer.
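As an illustration, the defining property of step S100's architecture can be sketched with a minimal configuration object. The class and field names below are hypothetical, not from the patent; CodeT5 is a T5-style model whose published configurations do expose separate encoder and decoder layer counts:

```python
from dataclasses import dataclass

@dataclass
class GeneratorConfig:
    # Hypothetical config for the code generator sketch.
    num_encoder_layers: int = 12   # shallow encoder
    num_decoder_layers: int = 18   # deep decoder
    d_model: int = 768
    vocab_size: int = 32100

    def is_shallow_encoder_deep_decoder(self) -> bool:
        # The defining property of the architecture in step S100:
        # strictly more decoder layers than encoder layers.
        return self.num_decoder_layers > self.num_encoder_layers

cfg = GeneratorConfig()
print(cfg.is_shallow_encoder_deep_decoder())  # True for 12 encoders vs. 18 decoders
```

A symmetric Transformer (e.g. 12/12, as in the prior art discussed above) would fail this check, which is the architectural contrast the patent draws.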
S200, based on step S100, pre-training the code generator model on the function-level large code dataset using improved pre-training techniques to obtain the pre-trained code generator model.
In one embodiment, as shown in FIG. 2, which illustrates how the code generator model is pre-trained, step S200 includes the following steps:
S210, converting the function-level large code dataset into code token sequences with an initial Unigram LM (unigram language model) tokenizer, obtaining a pre-trained tokenizer and the code token sequences;
for example, the function-level large code dataset includes the CodeSearchNet and GitHub-Code datasets published in the software engineering field, or a dataset obtained by merging and de-duplicating several such large code datasets;
for example, inputting the sentence "This is a test" into the tokenizer yields the code token sequence: '_Thi', 's', '_is', '_a', '_t', 'est'.
S220, based on steps S100 and S210, performing the first pre-training step on the code generator model with an improved causal language modeling technique to obtain a preliminarily pre-trained code generator model;
wherein the improved causal language modeling technique is divided into two steps:
S221, selecting a token within the 5%–100% span of the code token sequence with 50% probability; adding a special token "[GOB]" after the token sequence preceding the selected token; using the sequence with the added special token as the model input and the token sequence after the selected token as the model output;
for example, for the code token sequence '_Thi', 's', '_is', '_a', '_t', 'est', if the selected token is "_is", the model's input token sequence is '_Thi', 's', '[GOB]' and the model's output token sequence is '_a', '_t', 'est'.
S222, selecting a token within the 5%–100% span of the code token sequence with 50% probability; adding a special token "[GOF]" before the token sequence following the selected token; using the sequence with the added special token as the model input and the token sequence before the selected token as the model output, obtaining the preliminarily pre-trained code generator model;
for example, for the code token sequence '_Thi', 's', '_is', '_a', '_t', 'est', if the selected token is "_a", the model's input token sequence is '[GOF]', '_t', 'est' and the model's output token sequence is '_Thi', 's', '_is'.
S230, based on steps S210 and S220, performing the second pre-training step on the preliminarily pre-trained code generator model with an improved span denoising technique to obtain the pre-trained code generator model;
wherein the improved span denoising technique comprises:
with 50% probability, replacing 10% of the tokens "[TOKEN 0], ..., [TOKEN n]" in the encoder's input token sequence with predefined tokens "[LABEL 0], ..., [LABEL n]" and adding a special token "[SOM]" before them; adding a special token "[EOM]" before the correct token sequence that serves as the decoder's target output; and having the decoder generate the replaced token sequence "[TOKEN 0], ..., [TOKEN n]", thereby obtaining the pre-trained code generator model.
For example, if the encoder's input token sequence is '_Thi', 's', '_is', '_a', '_t', 'est' and the replaced token is '_is', the sequence after replacement is '_Thi', 's', '[SOM]', '[LABEL 0]', '_a', '_t', 'est', and the target token sequence output by the decoder is '_Thi', 's', '[EOM]', '_is', '_a', '_t', 'est'.
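The span-denoising pair in this example can be sketched as follows. This is a single-span sketch with the span position given explicitly rather than sampled with the 10%/50% rates, and it follows the placement of [SOM]/[LABEL]/[EOM] shown in the example above:

```python
def span_denoise(tokens: list[str], start: int, length: int = 1, label: int = 0):
    """Build one span-denoising training pair: the span is replaced by
    [SOM] [LABEL i] in the encoder input, and [EOM] is inserted before
    the original span in the decoder target."""
    enc_in = tokens[:start] + ["[SOM]", f"[LABEL {label}]"] + tokens[start + length:]
    dec_target = tokens[:start] + ["[EOM]"] + tokens[start:]
    return enc_in, dec_target

toks = ["_Thi", "s", "_is", "_a", "_t", "est"]
enc, dec = span_denoise(toks, toks.index("_is"))
print(enc)  # ['_Thi', 's', '[SOM]', '[LABEL 0]', '_a', '_t', 'est']
print(dec)  # ['_Thi', 's', '[EOM]', '_is', '_a', '_t', 'est']
```

The decoder thus learns to regenerate the masked span in context, which is the span-denoising objective named in step S230.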
S300, based on step S200, extracting the encoder stack of the code generator model to construct the discriminator model.
In one embodiment, step S300 includes the following steps:
S310, based on step S200, extracting the encoders of the pre-trained code generator model to obtain an encoder stack;
wherein the encoder stack carries the parameters of the pre-trained code generator model's encoders;
S320, based on step S310, combining the encoder stack with a linear layer and an output layer to obtain the discriminator model.
For example, the linear layer and the output layer may use the same fully connected layer and Softmax probability output layer as in step S100.
S400, based on steps S200 and S300, constructing the generative adversarial network from the pre-trained code generator model and the discriminator model; retraining the generative adversarial network on the function-level vulnerability repair dataset to obtain the optimal code generator model suitable for repairing vulnerable code.
In one embodiment, as shown in FIGS. 3 and 4, FIG. 3 illustrates how the generative adversarial network is constructed and the optimal code generator model is trained, and FIG. 4 is a schematic diagram of one embodiment of the generative adversarial network constructed in this step; step S400 includes the following steps:
S410, based on steps S200 and S300, constructing the generative adversarial network from the pre-trained code generator model and the discriminator model;
S420, based on step S210, tokenizing the function-level vulnerability repair dataset with the pre-trained tokenizer to obtain vulnerable-code token sequences and repair-code token sequences;
for example, the function-level vulnerability repair dataset includes the CVEfixes and Big-Vul datasets published in the software security field, or a vulnerability repair dataset obtained by merging and de-duplicating several such datasets.
S430, based on steps S410 and S420, simultaneously inputting the vulnerable-code token sequence and the repair-code token sequence into the generative adversarial network's code generator model to obtain a generated probability sequence;
meanwhile, the code generator model learns the difference between the generated probability sequence and the input repair-code token sequence, yielding loss value a.
For example, loss value a may be computed with a cross-entropy loss function.
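A minimal sketch of such a token-level cross-entropy follows, assuming the generator outputs one probability distribution per output position (the function name is illustrative, not from the patent):

```python
import math

def token_cross_entropy(prob_seq, target_ids):
    """Average negative log-likelihood of the target tokens under the
    generator's per-position probability distributions (loss value a)."""
    return -sum(math.log(p[t]) for p, t in zip(prob_seq, target_ids)) / len(target_ids)

# Two positions over a toy 3-token vocabulary; the targets are ids 0 and 2.
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
loss_a = token_cross_entropy(probs, [0, 2])  # -(ln 0.7 + ln 0.6) / 2 ≈ 0.4338
```

In a real training loop this would be the standard sequence-to-sequence cross-entropy between the generated distribution and the ground-truth repair token sequence.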
S440, based on steps S410, S420, and S430, ordering the generated probability sequence with the Nucleus Sampling algorithm to obtain a vulnerable-code repair token sequence;
meanwhile, inputting the repair-code token sequence and the vulnerable-code repair token sequence into the generative adversarial network's discriminator model, which learns the difference between the two sequences, yielding loss value b.
For example, when ordering the probability sequence with the Nucleus Sampling algorithm to obtain the vulnerable-code repair token sequence, set top_p=0.9 (the cumulative probability threshold), max_length=50 (the maximum length of a generated sequence), temperature=0.8 (controlling the smoothness of the probability distribution during sampling), and num_return_sequences=50 (the number of generated sequences); with these parameters, 50 vulnerability repair sequences are generated, each at most 50 tokens long. Loss value b may be computed with a cross-entropy loss function.
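The core of a single Nucleus (top-p) Sampling step with temperature can be sketched in plain Python. This is a didactic re-implementation, not the patent's code; generation libraries expose the same `top_p`, `temperature`, `max_length`, and `num_return_sequences` knobs:

```python
import math
import random

def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize — the core of Nucleus Sampling."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample_token(probs, top_p=0.9, temperature=0.8, rng=random):
    """One sampling step: sharpen with temperature, then top-p filter."""
    logits = [math.log(p) / temperature for p in probs]
    z = sum(math.exp(l) for l in logits)
    tempered = [math.exp(l) / z for l in logits]
    dist = nucleus_filter(tempered, top_p)
    ids, weights = zip(*dist.items())
    return rng.choices(ids, weights=weights, k=1)[0]
```

Generating `num_return_sequences=50` candidates of up to `max_length=50` tokens then amounts to repeating `sample_token` per position for 50 independent sequences.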
S450, optimizing the code generator model with an optimizer according to loss values a and b to obtain the optimal code generator model.
For example, the optimizer may be AdamW, training the above generative adversarial network for 100 epochs with a batch size of 8, a learning rate of 2e-5, a weight decay of 1e-4, 200 warm-up steps, and 8 gradient accumulation steps, yielding the optimal code generator model.
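The schedule hyper-parameters above can be sketched as follows. The linear warm-up shape and the constant rate after warm-up are my assumptions; the patent only specifies the counts:

```python
BATCH_SIZE, GRAD_ACCUM_STEPS = 8, 8
# With gradient accumulation, 8 mini-batches of 8 are accumulated before
# each optimizer step, giving an effective batch of 64 sequences.
EFFECTIVE_BATCH = BATCH_SIZE * GRAD_ACCUM_STEPS

def lr_with_warmup(step: int, base_lr: float = 2e-5, warmup_steps: int = 200) -> float:
    """Linear warm-up to base_lr over warmup_steps optimizer steps,
    then a constant rate; AdamW would consume this value each step."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

Warm-up keeps early adversarial updates small, which helps stabilize GAN fine-tuning on the relatively small vulnerability repair dataset.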
S500, based on step S400, inputting the function-level vulnerable code into the optimal code generator model to obtain the repaired code.
In one embodiment, as shown in FIG. 5, which illustrates how the vulnerable code to be repaired is repaired, step S500 includes the following steps:
S510, based on step S210, tokenizing the function-level vulnerable code with the pre-trained tokenizer to obtain the token sequence of the vulnerable code to be repaired;
S520, based on steps S400 and S510, inputting the to-be-repaired vulnerable-code token sequence into the optimal code generator model to obtain a repair-code probability sequence;
S530, based on step S520, ordering the repair-code probability sequence with the Nucleus Sampling algorithm again to obtain the repaired code.
For example, when obtaining the repaired code with the Nucleus Sampling algorithm again, set top_p=0.9 (the cumulative probability threshold), max_length=50 (the maximum length of a generated sequence), temperature=0.8 (controlling the smoothness of the probability distribution during sampling), and num_return_sequences=5 (the number of generated sequences); with these parameters, 5 vulnerability repair sequences are generated, each at most 50 tokens long.
Finally, it should be noted that the foregoing describes only one embodiment of the present invention; the invention is not limited to this embodiment, and those skilled in the art may modify or substitute some of the technical features described above.
Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (7)
1. The pre-training vulnerability restoration method based on the challenge migration learning is characterized by comprising the following steps of:
s100, constructing a code generator model of a shallow encoder-deep decoder architecture;
s200, based on the step S100, using a large code data set of a function level to pretrain the code generator model by using an improved pretraining technology, so as to obtain a pretrained code generator model;
s300, based on the step S200, extracting an encoder set of the code generator model to construct a discriminator model;
s400, constructing and generating an countermeasure network by utilizing the pre-trained code generator model and the discriminant model based on the step S200 and the step S300; retraining the generated countermeasure network by utilizing the vulnerability restoration data set of the function level to obtain an optimal code generator model suitable for restoring vulnerability codes;
s500, inputting the vulnerability codes of the function level into the optimal code generator model based on the step S400 to obtain the repaired codes.
2. The vulnerability repair method based on adversarial transfer learning of claim 1, wherein the code generator model has a shallow encoder-deep decoder architecture;
wherein the encoder and decoder are based on the encoder and decoder of the CodeT5 model, and the shallow encoder-deep decoder architecture means that the code generator model contains more decoder layers than encoder layers.
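For illustration only, the "more decoders than encoders" constraint of claim 2 can be captured in a small configuration sketch; the layer counts below are assumptions (the claim fixes only the inequality), and `GeneratorConfig` is a hypothetical name, not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class GeneratorConfig:
    """Illustrative configuration for a shallow encoder-deep decoder
    generator built from CodeT5-style Transformer blocks (assumed)."""
    d_model: int = 768
    num_encoder_layers: int = 6    # shallow encoder (hypothetical count)
    num_decoder_layers: int = 18   # deep decoder (hypothetical count)

    def is_shallow_deep(self) -> bool:
        # The architectural requirement stated in claim 2.
        return self.num_decoder_layers > self.num_encoder_layers
```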
3. The vulnerability repair method based on adversarial transfer learning of claim 2, wherein step S200 comprises the following steps:
s210, converting the function-level large code dataset into code token sequences using an initial Unigram LM (unigram language model) tokenizer, to obtain a pre-trained tokenizer and the code token sequences;
s220, based on steps S100 and S210, performing a first pre-training step on the code generator model using an improved causal language modeling technique, to obtain a preliminarily pre-trained code generator model;
s230, based on steps S210 and S220, performing a second pre-training step on the preliminarily pre-trained code generator model using an improved span denoising technique, to obtain the pre-trained code generator model;
wherein the improved span denoising technique comprises:
in the input code token sequence of the encoder, replacing 10% of the tokens "[TOKEN 0], ..., [TOKEN n]", each with 50% probability, by predefined tokens "[LABEL 0], ..., [LABEL n]", and adding a special token "[SOM]" before them; adding a special token "[EOM]" before the correct token sequence as the target token sequence output by the decoder; and having the decoder generate the replaced tokens "[TOKEN 0], ..., [TOKEN n]", to obtain the pre-trained code generator model.
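The span-denoising data construction in claim 3 can be sketched as follows. This is one possible reading of the claim, not a definitive implementation: a fraction of tokens is replaced, each with a given probability, by a sentinel "[LABEL i]" preceded by "[SOM]", and the decoder target is "[EOM]" followed by the original replaced tokens. The function name and rates are labeled assumptions where the claim leaves them open.

```python
import random

def make_span_denoising_example(tokens, mask_rate=0.10, swap_prob=0.5, rng=random):
    """Build one (encoder_input, decoder_target) pair for the improved
    span-denoising objective, as interpreted from claim 3."""
    n_candidates = max(1, int(len(tokens) * mask_rate))   # ~10% of tokens
    positions = set(rng.sample(range(len(tokens)), n_candidates))
    enc_input, replaced = [], []
    label_id = 0
    for i, tok in enumerate(tokens):
        if i in positions and rng.random() < swap_prob:   # 50% probability
            # Replace the token by "[SOM]" + a predefined "[LABEL i]" sentinel.
            enc_input.extend(["[SOM]", f"[LABEL {label_id}]"])
            replaced.append(tok)
            label_id += 1
        else:
            enc_input.append(tok)
    # Decoder target: "[EOM]" followed by the original replaced tokens.
    dec_target = ["[EOM]"] + replaced
    return enc_input, dec_target
```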
4. The vulnerability repair method based on adversarial transfer learning of claim 2, wherein the improved causal language modeling technique of step S220 comprises the following steps:
s221, selecting, with 50% probability, a token at a position between 5% and 100% of the code token sequence; adding a special token "[GOB]" after the token sequence preceding the selected token; and taking the token sequence with the special token added as the model input and the token sequence following the selected token as the model output;
s222, selecting, with 50% probability, a token at a position between 5% and 100% of the code token sequence; adding a special token "[GOF]" before the token sequence following the selected token; and taking the token sequence with the special token added as the model input and the token sequence preceding the selected token as the model output, to obtain the preliminarily pre-trained code generator model.
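The two sample-construction steps of claim 4 can be sketched as one helper; this is an illustrative reading under stated assumptions (the 50%-probability selection of whether to build a sample at all is left to the caller, and `make_causal_lm_pair` is a hypothetical name).

```python
import random

def make_causal_lm_pair(tokens, direction, rng=random):
    """Build one (model_input, model_output) pair for the improved
    causal language modeling objective of claim 4."""
    # Choose a split token at a position between 5% and 100% of the sequence.
    lo = max(1, int(len(tokens) * 0.05))
    split = rng.randint(lo, len(tokens) - 1)
    if direction == "forward":
        # Step S221: "[GOB]" after the prefix; model predicts the suffix.
        model_input = tokens[:split] + ["[GOB]"]
        model_output = tokens[split:]
    else:
        # Step S222: "[GOF]" before the suffix; model predicts the prefix.
        model_input = ["[GOF]"] + tokens[split:]
        model_output = tokens[:split]
    return model_input, model_output
```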
5. The vulnerability repair method based on adversarial transfer learning of claim 3, wherein step S300 comprises the following steps:
s310, based on step S200, extracting the encoders of the pre-trained code generator model to obtain an encoder set;
wherein the encoder set comprises the parameters of the encoders in the pre-trained code generator model;
s320, based on step S310, combining the encoder set, a linear transformation layer, and an output layer to obtain the discriminator model.
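A minimal sketch of the assembly in step S320, assuming the encoder is reduced to a callable that produces a feature vector; the class name, the stand-in encoder, and the sigmoid output layer are illustrative assumptions, not details fixed by the claim.

```python
import math

class Discriminator:
    """Sketch of claim 5: a pre-trained encoder set feeding a linear
    transformation layer and an output layer (sigmoid assumed here)."""

    def __init__(self, encoder, weights, bias):
        self.encoder = encoder    # reused pre-trained encoder; assumed to return list[float]
        self.weights = weights    # linear transformation layer weights
        self.bias = bias          # linear transformation layer bias

    def __call__(self, token_ids):
        features = self.encoder(token_ids)
        score = sum(w * f for w, f in zip(self.weights, features)) + self.bias
        # Output layer: probability that the input is a genuine repair sequence.
        return 1.0 / (1.0 + math.exp(-score))
```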
6. The vulnerability repair method based on adversarial transfer learning of claim 4, wherein step S400 comprises the following steps:
s410, based on steps S200 and S300, constructing a generative adversarial network from the pre-trained code generator model and the discriminator model;
s420, based on step S210, tokenizing the function-level vulnerability repair dataset with the pre-trained tokenizer to obtain vulnerability code token sequences and repair code token sequences;
s430, based on steps S410 and S420, inputting a vulnerability code token sequence and a repair code token sequence simultaneously into the code generator model of the generative adversarial network to obtain a generated probability sequence;
meanwhile, the code generator model learns the difference between the generated probability sequence and the input repair code token sequence to obtain a loss value a;
s440, based on steps S410, S420 and S430, arranging the generated probability sequence optimally using the Nucleus Sampling algorithm to obtain a vulnerability code repair token sequence;
meanwhile, inputting the repair code token sequence and the vulnerability code repair token sequence into the discriminator model of the generative adversarial network, and the discriminator model learning the difference between the two sequences to obtain a loss value b;
s450, optimizing the code generator model with the optimizer according to the loss value a and the loss value b, to obtain the optimal code generator model.
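Step S450 combines the supervised loss a and the adversarial loss b in the generator update. The claim does not fix how the two are combined, so the weighted sum and the plain SGD step below are explicitly assumptions used only to make the idea concrete.

```python
def combined_generator_loss(loss_a, loss_b, alpha=0.5):
    """Illustrative combination of loss a (generator vs. reference repair
    tokens) and loss b (from the discriminator); `alpha` is an assumption."""
    return alpha * loss_a + (1.0 - alpha) * loss_b

def sgd_step(param, grad, lr=1e-3):
    """One plain SGD update, standing in for the optimizer of step S450."""
    return param - lr * grad
```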
7. The vulnerability repair method based on adversarial transfer learning of claim 5, wherein step S500 comprises the following steps:
s510, based on step S210, tokenizing the function-level vulnerability code with the pre-trained tokenizer to obtain a token sequence of the vulnerability code to be repaired;
s520, based on steps S400 and S510, inputting the token sequence of the vulnerability code to be repaired into the optimal code generator model to obtain a repair code probability sequence;
s530, based on step S520, arranging the repair code probability sequence optimally using the Nucleus Sampling algorithm again to obtain the repaired code.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311135429.7A CN117113359B (en) | 2023-09-05 | 2023-09-05 | Pre-training vulnerability restoration method based on countermeasure migration learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117113359A true CN117113359A (en) | 2023-11-24 |
CN117113359B CN117113359B (en) | 2024-03-19 |
Family
ID=88805401
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311135429.7A Active CN117113359B (en) | 2023-09-05 | 2023-09-05 | Pre-training vulnerability restoration method based on countermeasure migration learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117113359B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200151081A1 (en) * | 2017-11-13 | 2020-05-14 | The Charles Stark Draper Laboratory, Inc. | Automated Repair Of Bugs And Security Vulnerabilities In Software |
WO2021148625A1 (en) * | 2020-01-23 | 2021-07-29 | Debricked Ab | A method for identifying vulnerabilities in computer program code and a system thereof |
US20210357307A1 (en) * | 2020-05-15 | 2021-11-18 | Microsoft Technology Licensing, Llc. | Automated program repair tool |
CN114048464A (en) * | 2022-01-12 | 2022-02-15 | 北京大学 | Ether house intelligent contract security vulnerability detection method and system based on deep learning |
US20220092411A1 (en) * | 2020-09-21 | 2022-03-24 | Samsung Sds Co., Ltd. | Data prediction method based on generative adversarial network and apparatus implementing the same method |
CN114547619A (en) * | 2022-01-11 | 2022-05-27 | 扬州大学 | Vulnerability repairing system and method based on tree |
US20220292200A1 (en) * | 2021-03-10 | 2022-09-15 | Huazhong University Of Science And Technology | Deep-learning based device and method for detecting source-code vulnerability with improved robustness |
CN115168865A (en) * | 2022-06-28 | 2022-10-11 | 南京大学 | Cross-item vulnerability detection model based on domain self-adaptation |
CN115396156A (en) * | 2022-07-29 | 2022-11-25 | 中国人民解放军国防科技大学 | Vulnerability priority processing method based on deep reinforcement learning |
CN116595530A (en) * | 2022-12-08 | 2023-08-15 | 北京工业大学 | Intelligent contract vulnerability detection method combining countermeasure migration learning and multitask learning |
CN116628707A (en) * | 2023-07-19 | 2023-08-22 | 山东省计算中心(国家超级计算济南中心) | Interpretable multitasking-based source code vulnerability detection method |
Non-Patent Citations (5)
Title |
---|
ZHAO QIANCHONG ET AL.: "VULDEFF: Vulnerability detection method based on function fingerprints and code differences", KNOWLEDGE-BASED SYSTEMS, vol. 260, 25 January 2023 (2023-01-25) * |
LIU JIAYONG ET AL.: "Static analysis techniques for source code vulnerabilities", JOURNAL OF CYBER SECURITY, vol. 7, no. 4, 15 July 2022 (2022-07-15) *
LI YUANCHENG; CUI YAQI; LYU JUNFENG; LAI FENGGANG; ZHANG PAN: "A hybrid deep learning method for open-source software vulnerability detection", COMPUTER ENGINEERING AND APPLICATIONS, no. 11, 17 December 2018 (2018-12-17) *
LI YUN; HUANG CHENLIN; WANG ZHONGFENG; YUAN LU; WANG XIAOCHUAN: "A survey of software vulnerability mining methods based on machine learning", JOURNAL OF SOFTWARE, no. 07, 15 July 2020 (2020-07-15) *
CHEN ZHAOXUAN; ZOU DEQING; LI ZHEN; JIN HAI: "An intelligent vulnerability detection system based on abstract syntax trees", JOURNAL OF CYBER SECURITY, no. 04, 15 July 2020 (2020-07-15) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||