CN117113359A - Pre-training vulnerability repair method based on adversarial transfer learning - Google Patents

Pre-training vulnerability repair method based on adversarial transfer learning

Info

Publication number
CN117113359A
Authority
CN
China
Prior art keywords
token
code
vulnerability
code generator
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311135429.7A
Other languages
Chinese (zh)
Other versions
CN117113359B (en)
Inventor
黄诚
侯靖
韦英炜
李乐融
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202311135429.7A priority Critical patent/CN117113359B/en
Publication of CN117113359A publication Critical patent/CN117113359A/en
Application granted granted Critical
Publication of CN117113359B publication Critical patent/CN117113359B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577Assessing vulnerabilities and evaluating computer system security
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0475Generative networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033Test or assess software

Abstract

The invention discloses a pre-training vulnerability repair method based on adversarial transfer learning, which comprises the following steps: constructing a code generator with a shallow encoder-deep decoder architecture; tokenizing a function-level large code dataset with a Unigram LM tokenizer; pre-training the code generator with improved causal language modeling and span denoising techniques; extracting the encoder set of the pre-trained code generator to construct a discriminator; combining the pre-trained code generator and the discriminator into a generative adversarial network; tokenizing a function-level vulnerability repair dataset with the pre-trained tokenizer; adversarially training the generative adversarial network to obtain an optimal code generator; and feeding the function-level vulnerability code to be repaired through the pre-trained tokenizer and the optimal code generator to obtain a repair sequence. By repairing vulnerabilities with adversarial transfer learning, the method improves the generalization and robustness of the model, reduces the cost of software vulnerability repair, and improves the accuracy of software vulnerability repair.

Description

Pre-training vulnerability repair method based on adversarial transfer learning
Technical Field
The invention belongs to the field of software debugging, and particularly relates to a pre-training vulnerability repair method based on adversarial transfer learning.
Background
As software vulnerabilities grow in number and complexity, developers must understand them in depth while minimizing the impact on system functionality, which greatly increases the cost of repairing software vulnerabilities. To reduce this cost, researchers have proposed techniques for automatically repairing software vulnerabilities. However, the vulnerability repair datasets that can be collected from the Internet are small in scale, which poses a major challenge to researchers.
In the patent document "Tree-based vulnerability repair system and repair method" (application number: 202210027014.7, publication number: CN114547619A) filed by Yangzhou University, a technique for automatically repairing vulnerable code using a syntax-tree representation of the code is proposed. A vulnerability repair dataset is first collected from GitHub; the code in the dataset is converted into abstract syntax trees (ASTs) with data-flow and control-flow dependencies, which are abstracted and normalized to obtain token sequences; the token sequences are split into a training set and a test set and fed into a Transformer model with equal numbers of encoders and decoders for training and testing. That invention automatically repairs code using a syntax tree and a Transformer model and improves the efficiency of code repair. However, the method still has the following drawbacks:
(1) The method trains the model only on the vulnerability repair dataset. Given the small size of such datasets, some CWE types have few samples or strong interference; when the model has not fully learned the characteristics of a vulnerability, its performance degrades and its generalization and robustness weaken;
(2) When the code dataset is abstracted and normalized, function names, variables and values are replaced, so the model cannot learn the latent semantics of the code, and its code-understanding ability suffers;
(3) The method relies too heavily on the Transformer model to generate repair code; the model may incorrectly "repair" code that is already correct, leading to overfitting.
The invention provides a pre-training vulnerability repair method based on adversarial transfer learning, with the following advantages:
(1) A pre-trained code generator model is obtained by pre-training on a large code dataset, giving the model better code understanding, code generation and code completion capabilities;
(2) The pre-trained code generator model is fine-tuned on the vulnerability repair dataset within a generative adversarial network architecture; the adversarial training mechanism of the generative adversarial network improves the model's resistance to interference and its repair capability, giving it higher robustness and generalization while alleviating overfitting.
In this method, the pre-trained code generator model obtained from the large code dataset is directly adversarially trained on the vulnerability repair dataset, which reduces the model's dependence on source-domain data, lets it adapt better to the data and feature distribution of the target domain, narrows the gap between the source and target domains, speeds up training, and ultimately improves the vulnerability repair accuracy of the model.
Disclosure of Invention
Purpose of the invention: to design a vulnerability repair method with strong generalization, strong robustness and high repair accuracy, suited to the current situation of small vulnerability repair datasets.
Technical solution: to solve the above technical problems, the invention designs a pre-training vulnerability repair method based on adversarial transfer learning, comprising the following steps:
S100, constructing a code generator model with a shallow encoder-deep decoder architecture;
S200, based on step S100, pre-training the code generator model on a function-level large code dataset using improved pre-training techniques to obtain a pre-trained code generator model;
S300, based on step S200, extracting the encoder set of the code generator model to construct a discriminator model;
S400, based on steps S200 and S300, constructing a generative adversarial network from the pre-trained code generator model and the discriminator model; retraining the generative adversarial network on a function-level vulnerability repair dataset to obtain an optimal code generator model suitable for repairing vulnerable code;
S500, based on step S400, inputting the function-level vulnerability code into the optimal code generator model to obtain the repaired code.
Further, step S100 specifically includes:
the encoder and decoder are based on the encoder and decoder of the CodeT5 model, and the shallow encoder-deep decoder architecture means that the code generator model has more decoder layers than encoder layers.
Further, step S200 includes the following steps:
S210, converting the function-level large code dataset into code token sequences using an initial Unigram LM (unigram language model) tokenizer to obtain a pre-trained tokenizer and the code token sequences;
S220, based on steps S100 and S210, performing the first pre-training stage on the code generator model using an improved causal language modeling technique to obtain a preliminarily pre-trained code generator model;
S230, based on steps S210 and S220, performing the second pre-training stage on the preliminarily pre-trained code generator model using an improved span denoising technique to obtain the pre-trained code generator model;
wherein the improved span denoising technique comprises:
in the encoder's input token sequence, 10% of the tokens "[TOKEN 0], ..., [TOKEN n]" are, with 50% probability, replaced by predefined tokens "[LABEL 0], ..., [LABEL n]", and a special token "[SOM]" is added before them; a special token "[EOM]" is added before the correct token sequence, which serves as the target token sequence output by the decoder; the decoder is made to generate the replaced token sequence "[TOKEN 0], ..., [TOKEN n]", yielding the pre-trained code generator model.
Further, step S220 includes the following steps:
S221, within the range from 5% to 100% of the code token sequence, selecting a token with 50% probability; adding a special token "[GOB]" after the token sequence preceding the selected token; taking the token sequence with the added special token as the model input, and the token sequence after the selected token as the model output;
S222, within the range from 5% to 100% of the code token sequence, selecting a token with 50% probability; adding a special token "[GOF]" before the token sequence following the selected token; and taking the token sequence with the added special token as the model input, and the token sequence before the selected token as the model output, to obtain a preliminarily pre-trained code generator model.
Further, step S300 includes the following steps:
S310, extracting the encoders of the pre-trained code generator model based on step S200 to obtain an encoder set;
wherein the encoder set carries the parameters of the encoders of the pre-trained code generator model;
S320, based on step S310, combining the encoder set with a linear transformation layer and an output layer to obtain the discriminator model.
further, in step S400, step S400 includes the steps of:
s410, constructing and generating an countermeasure network by utilizing the pre-training code generator model and the discriminator model based on the step S200 and the step S300;
s420, based on the step S210, utilizing the pre-trained word segmentation device to segment the vulnerability repair data set at the function level to obtain a vulnerability code token sequence and a repair code token sequence;
s430, inputting the vulnerability code token sequence and the repair code token sequence into the code generator model of the generated countermeasure network at the same time based on the step S410 and the step S420 to obtain a generated probability sequence;
meanwhile, the code generator model learns the difference between the generated probability sequence and the input repair code token sequence to obtain a loss value a;
s440, based on the step S410, the step S420 and the step S430, utilizing a nucleic Sampling algorithm to optimally arrange the generated probability sequence to obtain a bug code repairing token sequence;
meanwhile, inputting the repair code token sequence and the vulnerability code repair token sequence into a discriminator model for generating an countermeasure network, and learning the difference between the repair code token sequence and the vulnerability code repair token sequence by the discriminator model to obtain a loss value b;
s450, optimizing the code generator model by the optimizer according to the loss value a and the loss value b to obtain an optimal code generator model.
Further, step S500 includes the following steps:
S510, based on step S210, tokenizing the function-level vulnerability code with the pre-trained tokenizer to obtain the token sequence of the vulnerability code to be repaired;
S520, based on steps S400 and S510, inputting the token sequence of the vulnerability code to be repaired into the optimal code generator model to obtain a repair code probability sequence;
S530, based on step S520, again sampling the repair code probability sequence with the Nucleus Sampling algorithm to obtain the repaired code.
Compared with the prior art, the invention has the following beneficial effects:
(1) The shallow encoder-deep decoder architecture used in the invention generates the probability sequences and performs better on code generation tasks than a Transformer model with equal numbers of encoders and decoders;
(2) Compared with abstracting and normalizing the vulnerability repair dataset before training a Transformer model, pre-training on a function-level large code dataset lets the model learn a wider range of code structures, semantics and features, so that it can adapt to the vulnerability repair task even when the vulnerability repair dataset is small;
(3) Compared with directly training a Transformer model, model training with adversarial transfer learning can use the generated erroneous repair code to train the code generator model in reverse, and transfer learning migrates knowledge from the code generation domain to the vulnerability code repair domain, improving the robustness and generalization of the model.
Drawings
FIG. 1 is a system flow diagram of the present invention;
FIG. 2 is a flow chart of an embodiment of pre-training the code generator model in the present invention;
FIG. 3 is a flow chart of an embodiment of constructing the generative adversarial network and training the optimal code generator model in the present invention;
FIG. 4 is a schematic diagram of an embodiment of the generative adversarial network constructed in the present invention;
FIG. 5 is a flow chart of an embodiment of repairing vulnerability code to be repaired in accordance with the present invention.
Detailed Description
The technical scheme of the present invention will be described clearly and completely below with reference to the accompanying drawings and examples. The following examples and figures are illustrative of the invention and are not intended to limit its scope.
Referring to FIGS. 1-5, the present embodiment provides a pre-training vulnerability repair method based on adversarial transfer learning, including:
In one embodiment, as shown in FIG. 1, which is the system flow of the present invention:
S100, constructing a code generator model with a shallow encoder-deep decoder architecture.
Specifically, the encoder and decoder are based on the encoder and decoder of the CodeT5 model, which follows the T5 architecture. The shallow encoder-deep decoder architecture means that the code generator model contains more decoder layers than encoder layers; the encoders and decoders are connected through cross-attention layers, and the last decoder is followed by a linear transformation layer and an output layer.
For example, the shallow encoder-deep decoder architecture of the code generator model may use 12 encoder layers and 18 decoder layers;
the linear transformation layer may be a neural network suited to regression tasks, and the output layer a neural network suited to text generation tasks.
For example, the linear transformation layer may be a fully connected layer with a ReLU activation function, and the output layer a Softmax probability output layer.
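For illustration only, the following sketch shows how such a generator might be instantiated, assuming the Hugging Face transformers T5 implementation as a stand-in for the CodeT5 backbone; the vocabulary size, hidden size and all identifier names are assumptions rather than part of the invention.

from transformers import T5Config, T5ForConditionalGeneration

config = T5Config(
    vocab_size=32100,        # assumed vocabulary size of the pre-trained tokenizer
    d_model=768,             # assumed hidden size
    num_layers=12,           # 12 encoder layers (shallow encoder)
    num_decoder_layers=18,   # 18 decoder layers (deep decoder)
)
# T5ForConditionalGeneration couples the last decoder layer to a linear projection
# over the vocabulary, to which a softmax is applied during generation.
generator = T5ForConditionalGeneration(config)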
S200, based on step S100, the code generator model is pre-trained on a function-level large code dataset using improved pre-training techniques to obtain the pre-trained code generator model.
In one embodiment, as shown in FIG. 2, which illustrates how the code generator model is pre-trained, step S200 includes the following steps:
S210, converting the function-level large code dataset into code token sequences using an initial Unigram LM (unigram language model) tokenizer to obtain a pre-trained tokenizer and the code token sequences;
For example, the function-level large code dataset may be a publicly available dataset from the software engineering field such as CodeSearchNet or GitHub-Code, or a dataset obtained by merging and deduplicating several such datasets (e.g., CodeSearchNet and GitHub-Code);
for example, the sentence "This is a test" is input to a word segmentation device, and the code token sequence obtained after word segmentation is: "'_ Thi','s','_ is','_ a','_ t', 'est', 'term'.
S220, based on steps S100 and S210, performing the first pre-training stage on the code generator model using an improved causal language modeling technique to obtain a preliminarily pre-trained code generator model;
wherein the improved causal language modeling technique consists of two steps:
S221, within the range from 5% to 100% of the code token sequence, selecting a token with 50% probability; adding a special token "[GOB]" after the token sequence preceding the selected token; taking the token sequence with the added special token as the model input, and the token sequence after the selected token as the model output;
for example, for the code token sequence "_thi ','s ', ' is ', ' a ','t ', ' est ', ' term ', the selected token is" _is ", the input token sequence of the model is: the output token sequence of the model is "_Thi ','s ', ' GOB ] '", which is: "_ a '," _ t ', "est '," v.
S222, within the range from 5% to 100% of the code token sequence, selecting a token with 50% probability; adding a special token "[GOF]" before the token sequence following the selected token; taking the token sequence with the added special token as the model input, and the token sequence before the selected token as the model output, to obtain a preliminarily pre-trained code generator model;
for example, for the code token sequence "_thi ','s ', ' is ', ' a ','t ', ' est ', ' a ', the selected token is" _a ", the input token sequence of the model is: the output token sequence of the model is "" ' [ GOF ] "," t ', "est '," v: "'_ Thi','s','_ is'".
S230, based on steps S210 and S220, performing the second pre-training stage on the preliminarily pre-trained code generator model using an improved span denoising technique to obtain the pre-trained code generator model;
wherein the improved span denoising technique comprises:
in the encoder's input token sequence, 10% of the tokens "[TOKEN 0], ..., [TOKEN n]" are, with 50% probability, replaced by predefined tokens "[LABEL 0], ..., [LABEL n]", and a special token "[SOM]" is added before them; a special token "[EOM]" is added before the correct token sequence, which serves as the target token sequence output by the decoder; the decoder is made to generate the replaced token sequence "[TOKEN 0], ..., [TOKEN n]", yielding the pre-trained code generator model.
For example, if the input token sequence of the encoder is: "_Thi", "s", "_is", "_a", "_t", "est", ".", and the replaced token is "_is", the token sequence after substitution is: "_Thi", "s", "[SOM]", "[LABEL 0]", "_a", "_t", "est", ".", and the target token sequence output by the decoder is: "_Thi", "s", "[EOM]", "_is", "_a", "_t", "est", ".".
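For illustration only, a sketch of this span denoising corruption is given below; the choice of a single contiguous span and the exact placement of the 50% coin flip are assumptions beyond the patent text.

import random

def span_denoise(tokens, replace_prob=0.5, frac=0.10):
    n_replace = max(1, int(len(tokens) * frac))            # roughly 10% of the tokens
    start = random.randrange(0, len(tokens) - n_replace + 1)
    if random.random() >= replace_prob:                     # with 50% probability, leave the sequence untouched
        return list(tokens), list(tokens)
    # encoder input: the replaced span becomes "[SOM]" followed by predefined "[LABEL i]" tokens
    encoder_input = (tokens[:start] + ["[SOM]"]
                     + [f"[LABEL {i}]" for i in range(n_replace)]
                     + tokens[start + n_replace:])
    # decoder target: the correct sequence with "[EOM]" inserted before the replaced span
    decoder_target = tokens[:start] + ["[EOM]"] + tokens[start:]
    return encoder_input, decoder_target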
S300, based on step S200, extracting the encoder set of the code generator model to construct a discriminator model.
In one embodiment, step S300 includes the following steps:
S310, extracting the encoders of the pre-trained code generator model based on step S200 to obtain an encoder set;
wherein the encoder set carries the parameters of the encoders of the pre-trained code generator model;
S320, based on step S310, combining the encoder set with a linear transformation layer and an output layer to obtain the discriminator model.
For example, the linear transformation layer and the output layer may use the same fully connected layer and Softmax probability output layer as in step S100.
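For illustration only, the discriminator might be assembled as sketched below, assuming PyTorch and the generator sketched earlier; the pooling strategy, hidden size, class count and identifier names are assumptions.

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Pre-trained encoder stack followed by a linear layer and a softmax output."""
    def __init__(self, pretrained_encoder, d_model=768, num_classes=2):
        super().__init__()
        self.encoder = pretrained_encoder          # reuses the pre-trained encoder parameters
        self.linear = nn.Linear(d_model, num_classes)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        pooled = hidden.mean(dim=1)                # mean pooling over the token dimension (assumed)
        return torch.softmax(self.linear(pooled), dim=-1)

discriminator = Discriminator(generator.get_encoder())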
S400, constructing a generative adversarial network from the pre-trained code generator model and the discriminator model based on steps S200 and S300; and retraining the generative adversarial network on the function-level vulnerability repair dataset to obtain an optimal code generator model suitable for repairing vulnerable code.
In one embodiment, as shown in FIGS. 3 and 4, where FIG. 3 illustrates how the generative adversarial network is constructed and the optimal code generator model is trained, and FIG. 4 shows one embodiment of the generative adversarial network constructed in this step, step S400 includes the following steps:
S410, constructing a generative adversarial network from the pre-trained code generator model and the discriminator model based on steps S200 and S300;
S420, based on step S210, tokenizing the function-level vulnerability repair dataset with the pre-trained tokenizer to obtain vulnerability code token sequences and repair code token sequences;
For example, the function-level vulnerability repair dataset may be a publicly available dataset from the software security field such as CVEfix or Big-Vul, or a vulnerability repair dataset obtained by merging and deduplicating several such datasets (e.g., CVEfix and Big-Vul).
S430, based on steps S410 and S420, simultaneously inputting the vulnerability code token sequence and the repair code token sequence into the code generator model of the generative adversarial network to obtain a generated probability sequence;
meanwhile, the code generator model learns the difference between the generated probability sequence and the input repair code token sequence to obtain loss value a.
For example, loss value a may be computed using a cross-entropy loss function.
S440, based on steps S410, S420 and S430, sampling the generated probability sequence with the Nucleus Sampling algorithm to obtain vulnerability code repair token sequences;
meanwhile, inputting the repair code token sequence and the vulnerability code repair token sequence into the discriminator model of the generative adversarial network, which learns the difference between the two sequences to obtain loss value b.
For example, when the generated probability sequence is sampled with the Nucleus Sampling algorithm to obtain vulnerability code repair token sequences, set top_p=0.9 (the cumulative probability threshold), max_length=50 (the maximum length of a generated sequence), temperature=0.8 (controlling the smoothness of the probability distribution during sampling) and num_return_sequences=50 (the number of generated sequences); with these parameters, 50 vulnerability repair sequences are generated, each at most 50 tokens long. Loss value b may be computed using a cross-entropy loss function.
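For illustration only, candidate repair sequences might be drawn with these parameters via the transformers generate() API, as sketched below; the tensor name vuln_token_ids and the reliance on the generator sketched earlier are assumptions about the implementation.

candidate_ids = generator.generate(
    input_ids=vuln_token_ids,        # tokenized vulnerable function (assumed tensor)
    do_sample=True,                  # enables Nucleus (top-p) Sampling
    top_p=0.9,                       # cumulative probability threshold
    temperature=0.8,                 # smooths the sampling distribution
    max_length=50,                   # maximum length of each generated sequence
    num_return_sequences=50,         # number of candidate repair sequences
)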
S450, optimizing the code generator model with the optimizer according to loss value a and loss value b to obtain the optimal code generator model.
For example, an AdamW optimizer may be used to train the above generative adversarial network for 100 epochs with a batch size of 8, a learning rate of 2e-5, a weight decay of 1e-4, 200 warm-up steps and 8 gradient accumulation steps, yielding the optimal code generator model.
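For illustration only, the optimizer setup and the combined update might look like the sketch below, assuming PyTorch and transformers; equal weighting of the two losses and the exact accumulation logic are assumptions.

import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(generator, total_steps):
    # AdamW with the hyperparameters quoted above; total_steps depends on dataset size
    optimizer = torch.optim.AdamW(generator.parameters(), lr=2e-5, weight_decay=1e-4)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=200, num_training_steps=total_steps)
    return optimizer, scheduler

def adversarial_step(loss_a, loss_b, optimizer, scheduler, step, accum_steps=8):
    # combine generator loss a and discriminator loss b (equal weighting assumed)
    loss = (loss_a + loss_b) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:    # gradient accumulation over 8 steps
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()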
S500, based on step S400, inputting the function-level vulnerability code into the optimal code generator model to obtain the repaired code.
In one embodiment, as shown in FIG. 5, which illustrates how the vulnerability code to be repaired is repaired, step S500 includes the following steps:
S510, based on step S210, tokenizing the function-level vulnerability code with the pre-trained tokenizer to obtain the token sequence of the vulnerability code to be repaired;
S520, based on steps S400 and S510, inputting the token sequence of the vulnerability code to be repaired into the optimal code generator model to obtain a repair code probability sequence;
S530, based on step S520, again sampling the repair code probability sequence with the Nucleus Sampling algorithm to obtain the repaired code.
For example, when the Nucleus Sampling algorithm is used again to obtain the repaired code, set top_p=0.9 (the cumulative probability threshold), max_length=50 (the maximum length of a generated sequence), temperature=0.8 (controlling the smoothness of the probability distribution during sampling) and num_return_sequences=5 (the number of generated sequences); with these parameters, 5 vulnerability repair sequences are generated, each at most 50 tokens long.
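For illustration only, the inference stage might be wired together as sketched below, assuming the generator and the sentencepiece tokenizer sketched earlier; all identifier names are illustrative.

import torch

def repair(vuln_code, sp, generator):
    ids = torch.tensor([sp.encode(vuln_code)])               # token sequence of the code to repair
    outputs = generator.generate(
        input_ids=ids, do_sample=True, top_p=0.9,
        temperature=0.8, max_length=50, num_return_sequences=5)
    return [sp.decode(seq.tolist()) for seq in outputs]       # 5 candidate repaired functions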
Finally, it should be noted that the foregoing description is only one embodiment of the present invention; the invention is not limited to this embodiment, and those skilled in the art may modify or substitute some of the technical features described therein.
Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A pre-training vulnerability repair method based on adversarial transfer learning, characterized by comprising the following steps:
S100, constructing a code generator model with a shallow encoder-deep decoder architecture;
S200, based on step S100, pre-training the code generator model on a function-level large code dataset using improved pre-training techniques to obtain a pre-trained code generator model;
S300, based on step S200, extracting the encoder set of the code generator model to construct a discriminator model;
S400, based on steps S200 and S300, constructing a generative adversarial network from the pre-trained code generator model and the discriminator model; retraining the generative adversarial network on a function-level vulnerability repair dataset to obtain an optimal code generator model suitable for repairing vulnerable code;
S500, based on step S400, inputting the function-level vulnerability code into the optimal code generator model to obtain the repaired code.
2. The vulnerability repair method based on adversarial transfer learning of claim 1, wherein the code generator model has a shallow encoder-deep decoder architecture;
wherein the encoder and decoder are based on the encoder and decoder of the CodeT5 model, and the shallow encoder-deep decoder architecture means that the code generator model has more decoder layers than encoder layers.
3. The vulnerability repair method based on adversarial transfer learning of claim 2, wherein step S200 comprises the following steps:
S210, converting the function-level large code dataset into code token sequences using an initial Unigram LM (unigram language model) tokenizer to obtain a pre-trained tokenizer and the code token sequences;
S220, based on steps S100 and S210, performing the first pre-training stage on the code generator model using an improved causal language modeling technique to obtain a preliminarily pre-trained code generator model;
S230, based on steps S210 and S220, performing the second pre-training stage on the preliminarily pre-trained code generator model using an improved span denoising technique to obtain the pre-trained code generator model;
wherein the improved span denoising technique comprises:
in the encoder's input code token sequence, 10% of the tokens "[TOKEN 0], ..., [TOKEN n]" are, with 50% probability, replaced by predefined tokens "[LABEL 0], ..., [LABEL n]", and a special token "[SOM]" is added before them; a special token "[EOM]" is added before the correct token sequence, which serves as the target token sequence output by the decoder; the decoder is made to generate the replaced token sequence "[TOKEN 0], ..., [TOKEN n]", yielding the pre-trained code generator model.
4. The vulnerability repair method based on adversarial transfer learning of claim 2, wherein the improved causal language modeling technique of step S220 comprises the following steps:
S221, within the range from 5% to 100% of the code token sequence, selecting a token with 50% probability; adding a special token "[GOB]" after the token sequence preceding the selected token; taking the token sequence with the added special token as the model input, and the token sequence after the selected token as the model output;
S222, within the range from 5% to 100% of the code token sequence, selecting a token with 50% probability; adding a special token "[GOF]" before the token sequence following the selected token; and taking the token sequence with the added special token as the model input, and the token sequence before the selected token as the model output, to obtain a preliminarily pre-trained code generator model.
5. The vulnerability repair method based on adversarial transfer learning of claim 3, wherein step S300 comprises the following steps:
S310, extracting the encoders of the pre-trained code generator model based on step S200 to obtain an encoder set;
wherein the encoder set carries the parameters of the encoders in the pre-trained code generator model;
S320, based on step S310, combining the encoder set with a linear transformation layer and an output layer to obtain the discriminator model.
6. The vulnerability repair method based on adversarial transfer learning of claim 4, wherein step S400 comprises the following steps:
S410, constructing a generative adversarial network from the pre-trained code generator model and the discriminator model based on steps S200 and S300;
S420, based on step S210, tokenizing the function-level vulnerability repair dataset with the pre-trained tokenizer to obtain vulnerability code token sequences and repair code token sequences;
S430, based on steps S410 and S420, simultaneously inputting the vulnerability code token sequence and the repair code token sequence into the code generator model of the generative adversarial network to obtain a generated probability sequence;
meanwhile, the code generator model learns the difference between the generated probability sequence and the input repair code token sequence to obtain loss value a;
S440, based on steps S410, S420 and S430, sampling the generated probability sequence with the Nucleus Sampling algorithm to obtain a vulnerability code repair token sequence;
meanwhile, inputting the repair code token sequence and the vulnerability code repair token sequence into the discriminator model of the generative adversarial network, which learns the difference between the two sequences to obtain loss value b;
S450, optimizing the code generator model with the optimizer according to loss value a and loss value b to obtain the optimal code generator model.
7. The vulnerability repair method based on adversarial transfer learning of claim 5, wherein step S500 comprises the following steps:
S510, based on step S210, tokenizing the function-level vulnerability code with the pre-trained tokenizer to obtain the token sequence of the vulnerability code to be repaired;
S520, based on steps S400 and S510, inputting the token sequence of the vulnerability code to be repaired into the optimal code generator model to obtain a repair code probability sequence;
S530, based on step S520, again sampling the repair code probability sequence with the Nucleus Sampling algorithm to obtain the repaired code.
CN202311135429.7A 2023-09-05 2023-09-05 Pre-training vulnerability repair method based on adversarial transfer learning Active CN117113359B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311135429.7A CN117113359B (en) 2023-09-05 2023-09-05 Pre-training vulnerability repair method based on adversarial transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311135429.7A CN117113359B (en) 2023-09-05 2023-09-05 Pre-training vulnerability repair method based on adversarial transfer learning

Publications (2)

Publication Number Publication Date
CN117113359A true CN117113359A (en) 2023-11-24
CN117113359B CN117113359B (en) 2024-03-19

Family

ID=88805401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311135429.7A Active CN117113359B (en) 2023-09-05 2023-09-05 Pre-training vulnerability restoration method based on countermeasure migration learning

Country Status (1)

Country Link
CN (1) CN117113359B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200151081A1 (en) * 2017-11-13 2020-05-14 The Charles Stark Draper Laboratory, Inc. Automated Repair Of Bugs And Security Vulnerabilities In Software
WO2021148625A1 (en) * 2020-01-23 2021-07-29 Debricked Ab A method for identifying vulnerabilities in computer program code and a system thereof
US20210357307A1 (en) * 2020-05-15 2021-11-18 Microsoft Technology Licensing, Llc. Automated program repair tool
US20220092411A1 (en) * 2020-09-21 2022-03-24 Samsung Sds Co., Ltd. Data prediction method based on generative adversarial network and apparatus implementing the same method
US20220292200A1 (en) * 2021-03-10 2022-09-15 Huazhong University Of Science And Technology Deep-learning based device and method for detecting source-code vulnerability with improved robustness
CN114547619A (en) * 2022-01-11 2022-05-27 扬州大学 Vulnerability repairing system and method based on tree
CN114048464A (en) * 2022-01-12 2022-02-15 北京大学 Ether house intelligent contract security vulnerability detection method and system based on deep learning
CN115168865A (en) * 2022-06-28 2022-10-11 南京大学 Cross-item vulnerability detection model based on domain self-adaptation
CN115396156A (en) * 2022-07-29 2022-11-25 中国人民解放军国防科技大学 Vulnerability priority processing method based on deep reinforcement learning
CN116595530A (en) * 2022-12-08 2023-08-15 北京工业大学 Intelligent contract vulnerability detection method combining countermeasure migration learning and multitask learning
CN116628707A (en) * 2023-07-19 2023-08-22 山东省计算中心(国家超级计算济南中心) Interpretable multitasking-based source code vulnerability detection method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ZHAO QIANCHONG ET AL.: "VULDEFF: Vulnerability detection method based on function fingerprints and code differences", KNOWLEDGE-BASED SYSTEMS, vol. 260, 25 January 2023 (2023-01-25) *
LIU Jiayong et al.: "Static analysis techniques for source code vulnerabilities", Journal of Cyber Security, vol. 7, no. 4, 15 July 2022 (2022-07-15) *
LI Yuancheng; CUI Yaqi; LYU Junfeng; LAI Fenggang; ZHANG Pan: "A hybrid deep learning method for open-source software vulnerability detection", Computer Engineering and Applications, no. 11, 17 December 2018 (2018-12-17) *
LI Yun; HUANG Chenlin; WANG Zhongfeng; YUAN Lu; WANG Xiaochuan: "A survey of software vulnerability discovery methods based on machine learning", Journal of Software, no. 07, 15 July 2020 (2020-07-15) *
CHEN Zhaoxuan; ZOU Deqing; LI Zhen; JIN Hai: "An intelligent vulnerability detection system based on abstract syntax trees", Journal of Cyber Security, no. 04, 15 July 2020 (2020-07-15) *

Also Published As

Publication number Publication date
CN117113359B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
CN112215013B (en) Clone code semantic detection method based on deep learning
CN112364125B (en) Text information extraction system and method combining reading course learning mechanism
WO2021015936A1 (en) Word-overlap-based clustering cross-modal retrieval
CN112507337A (en) Implementation method of malicious JavaScript code detection model based on semantic analysis
CN101751385A (en) Multilingual information extraction method adopting hierarchical pipeline filter system structure
CA3135717A1 (en) System and method for transferable natural language interface
CN113822054A (en) Chinese grammar error correction method and device based on data enhancement
CN114742069A (en) Code similarity detection method and device
CN114048314A (en) Natural language steganalysis method
CN117113359B (en) Pre-training vulnerability repair method based on adversarial transfer learning
CN113741886A (en) Statement level program repairing method and system based on graph
CN117421595A (en) System log anomaly detection method and system based on deep learning technology
CN116882402A (en) Multi-task-based electric power marketing small sample named entity identification method
CN116484851A (en) Pre-training model training method and device based on variant character detection
CN115495085A (en) Generation method and device based on deep learning fine-grained code template
CN116483314A (en) Automatic intelligent activity diagram generation method
CN115455945A (en) Entity-relationship-based vulnerability data error correction method and system
CN114528459A (en) Semantic-based webpage information extraction method and system
CN114519104A (en) Action label labeling method and device
CN114881010A (en) Chinese grammar error correction method based on Transformer and multitask learning
Kalyon et al. A two phase smart code editor
Hung et al. Application of Adaptive Neural Network Algorithm Model in English Text Analysis
CN113672737A (en) Knowledge graph entity concept description generation system
CN116958752B (en) Power grid infrastructure archiving method, device and equipment based on IPKCNN-SVM
CN115169330B (en) Chinese text error correction and verification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant