CN117113359A - Pre-trained vulnerability repair method based on adversarial transfer learning - Google Patents
Pre-trained vulnerability repair method based on adversarial transfer learning
- Publication number
- CN117113359A CN117113359A CN202311135429.7A CN202311135429A CN117113359A CN 117113359 A CN117113359 A CN 117113359A CN 202311135429 A CN202311135429 A CN 202311135429A CN 117113359 A CN117113359 A CN 117113359A
- Authority
- CN
- China
- Prior art keywords
- token
- code
- vulnerability
- code generator
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/577—Assessing vulnerabilities and evaluating computer system security
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/096—Transfer learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/03—Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
- G06F2221/033—Test or assess software
Abstract
The invention discloses a pre-trained vulnerability repair method based on adversarial transfer learning, comprising the following steps: constructing a code generator with a shallow-encoder/deep-decoder architecture; tokenizing a function-level large code dataset with a Unigram LM tokenizer; pre-training the code generator with improved causal language modeling and span denoising techniques; extracting the encoder stack of the pre-trained code generator to construct a discriminator; combining the pre-trained code generator and the discriminator into a generative adversarial network; tokenizing the function-level vulnerability repair dataset with the pre-trained tokenizer; obtaining an optimal code generator through adversarial training of the generative adversarial network; and inputting the function-level vulnerable code to be repaired into the pre-trained tokenizer and obtaining a repair sequence with the optimal code generator. By performing vulnerability repair with adversarial transfer learning, the method improves the generalization and robustness of the model, reduces the cost of software vulnerability repair, and improves its accuracy.
Description
Technical Field
The invention belongs to the field of software debugging, and in particular relates to a pre-trained vulnerability repair method based on adversarial transfer learning.
Background
As software vulnerabilities grow in number and complexity, developers must understand them in depth to minimize the impact on system functionality, which greatly increases the cost of repairing them. To reduce this cost, researchers have proposed techniques for automatically repairing software vulnerabilities. However, the vulnerability repair datasets that can be collected from the Internet are small in scale, which poses a significant challenge.
In patent application No. 202210027014.7 (publication No. CN114547619A, "Tree-based vulnerability repair system and repair method"), Yangzhou University proposed a technique that characterizes code with syntax trees to automatically repair vulnerable code. The method first collects a vulnerability repair dataset from GitHub, converts the code in it into abstract syntax trees (ASTs) carrying data-flow and control-flow dependencies, abstracts and normalizes the ASTs to obtain token sequences, splits the token sequences into a training set and a test set, and inputs them into a Transformer model with equal numbers of encoders and decoders for training and testing. That invention uses syntax trees and a Transformer model to repair code automatically and improves repair efficiency. However, the method still has the following shortcomings:
(1) It trains the model only on the vulnerability repair dataset. Given the small scale of such datasets, vulnerabilities of some CWE types are few in number or strongly noisy; when the model has not fully learned the characteristics of a vulnerability, it performs poorly and its generalization and robustness are weakened;
(2) When abstracting and normalizing the code dataset, it replaces function names, variables, and values, so the model cannot learn the latent semantics of the code and its code understanding is poor;
(3) It relies excessively on the Transformer model to generate repair code, and the model may erroneously "repair" code that is already correct, leading to overfitting.
The present invention provides a pre-trained vulnerability repair method based on adversarial transfer learning, which has the following advantages:
(1) A pre-trained code generator model is obtained by pre-training on a large code dataset, giving the model better code understanding, code generation, and code completion capabilities;
(2) The pre-trained code generator model is fine-tuned on the vulnerability repair dataset through a generative adversarial network (GAN) architecture; the adversarial training mechanism of the GAN improves the model's resistance to interference and its repair capability, giving it higher robustness and generalization while alleviating overfitting.
In this method, the code generator model is first pre-trained on the large code dataset and then adversarially trained directly on the vulnerability repair dataset, so the model depends less on source-domain data, adapts better to the data and feature distribution of the target domain, narrows the gap between source and target domains, trains faster, and ultimately repairs vulnerabilities more accurately.
Disclosure of Invention
Purpose of the invention: to design a vulnerability repair method with strong generalization, strong robustness, and high repair accuracy, suited to the current situation in which vulnerability repair datasets are small.
Technical solution: to solve the above technical problems, the invention provides a pre-trained vulnerability repair method based on adversarial transfer learning, comprising the following steps:
S100, constructing a code generator model with a shallow-encoder/deep-decoder architecture;
S200, based on step S100, pre-training the code generator model on a function-level large code dataset using improved pre-training techniques to obtain a pre-trained code generator model;
S300, based on step S200, extracting the encoder stack of the code generator model to construct a discriminator model;
S400, based on steps S200 and S300, constructing a generative adversarial network from the pre-trained code generator model and the discriminator model; retraining the generative adversarial network on a function-level vulnerability repair dataset to obtain an optimal code generator model suitable for repairing vulnerable code;
S500, based on step S400, inputting the function-level vulnerable code into the optimal code generator model to obtain the repaired code.
Further, step S100 specifically includes:
the encoder and decoder are based on those of the CodeT5 model, and the shallow-encoder/deep-decoder architecture means the code generator model contains more decoders than encoders.
Further, step S200 includes the following steps:
S210, converting the function-level large code dataset into code token sequences with an initial Unigram LM (unigram language model) tokenizer, obtaining a pre-trained tokenizer and the code token sequences;
S220, based on steps S100 and S210, performing the first pre-training step on the code generator model with an improved causal language modeling technique to obtain a preliminarily pre-trained code generator model;
S230, based on steps S210 and S220, performing the second pre-training step on the preliminarily pre-trained code generator model with an improved span denoising technique to obtain the pre-trained code generator model;
wherein the improved span denoising technique comprises:
with 50% probability, replacing 10% of the tokens "[TOKEN 0], ..., [TOKEN n]" in the encoder's input token sequence with predefined tokens "[LABEL 0], ..., [LABEL n]" and adding a special token "[SOM]" before them; adding a special token "[EOM]" before the correct token sequence that serves as the decoder's target output; and having the decoder generate the replaced token sequence "[TOKEN 0], ..., [TOKEN n]", thereby obtaining the pre-trained code generator model.
Further, step S220 includes the following steps:
S221, selecting a token within the 5%–100% span of the code token sequence with 50% probability; adding a special token "[GOB]" after the token sequence preceding the selected token; using the sequence with the added special token as the model input and the token sequence after the selected token as the model output;
S222, selecting a token within the 5%–100% span of the code token sequence with 50% probability; adding a special token "[GOF]" before the token sequence following the selected token; using the sequence with the added special token as the model input and the token sequence before the selected token as the model output, obtaining a preliminarily pre-trained code generator model.
Further, step S300 includes the following steps:
S310, based on step S200, extracting the encoders of the pre-trained code generator model to obtain an encoder stack;
wherein the encoder stack carries the parameters of the pre-trained code generator model's encoders;
S320, based on step S310, combining the encoder stack with a linear layer and an output layer to obtain the discriminator model.
Further, step S400 includes the following steps:
S410, based on steps S200 and S300, constructing the generative adversarial network from the pre-trained code generator model and the discriminator model;
S420, based on step S210, tokenizing the function-level vulnerability repair dataset with the pre-trained tokenizer to obtain vulnerable-code token sequences and repair-code token sequences;
S430, based on steps S410 and S420, simultaneously inputting the vulnerable-code token sequence and the repair-code token sequence into the generative adversarial network's code generator model to obtain a generated probability sequence;
meanwhile, the code generator model learns the difference between the generated probability sequence and the input repair-code token sequence, yielding loss value a;
S440, based on steps S410, S420, and S430, ordering the generated probability sequence with the Nucleus Sampling algorithm to obtain a vulnerable-code repair token sequence;
meanwhile, inputting the repair-code token sequence and the vulnerable-code repair token sequence into the generative adversarial network's discriminator model, which learns the difference between the two sequences, yielding loss value b;
S450, optimizing the code generator model with an optimizer according to loss values a and b to obtain the optimal code generator model.
Further, step S500 includes the following steps:
S510, based on step S210, tokenizing the function-level vulnerable code with the pre-trained tokenizer to obtain the token sequence of the vulnerable code to be repaired;
S520, based on steps S400 and S510, inputting the to-be-repaired vulnerable-code token sequence into the optimal code generator model to obtain a repair-code probability sequence;
S530, based on step S520, ordering the repair-code probability sequence with the Nucleus Sampling algorithm again to obtain the repaired code.
Compared with the prior art, the invention has the following beneficial effects:
(1) The shallow-encoder/deep-decoder architecture used here for probability-sequence generation performs better on code generation tasks than a Transformer model with equal numbers of encoders and decoders;
(2) Compared with abstracting and normalizing the vulnerability repair dataset and then training a Transformer model on it, pre-training on a function-level large code dataset lets the model learn broader code structures, semantics, and features, so it can adapt to the repair task even though the vulnerability repair dataset is small;
(3) Compared with directly training a Transformer model, adversarial transfer learning lets the generated incorrect repair code be used to train the code generator model in the reverse direction, and transfer learning migrates knowledge from the code generation domain to the vulnerability repair domain, improving the robustness and generalization of the model.
Drawings
FIG. 1 is a system flow diagram of the present invention;
FIG. 2 is a flow chart of one embodiment of pre-training the code generator model in the present invention;
FIG. 3 is a flow chart of one embodiment of constructing the generative adversarial network and training the optimal code generator model in the present invention;
FIG. 4 is a schematic diagram of one embodiment of the generative adversarial network constructed in the present invention;
FIG. 5 is a flow chart of one embodiment of repairing the vulnerable code to be repaired in the present invention.
Detailed Description
The technical solution of the present invention will be described clearly and completely below with reference to the accompanying drawings and examples. The following examples and figures illustrate the invention and are not intended to limit its scope.
Referring to FIGS. 1-5, this embodiment provides a pre-trained vulnerability repair method based on adversarial transfer learning, including:
In one embodiment, as shown in FIG. 1, FIG. 1 is the system flow of the present invention.
S100, constructing a code generator model with a shallow-encoder/deep-decoder architecture.
Specifically, the encoder and decoder are based on those of the CodeT5 model and follow the T5 architecture. The shallow-encoder/deep-decoder architecture means the code generator model contains more decoders than encoders; the encoders and decoders are connected through cross-attention layers, and the last decoder is followed by a linear layer and an output layer.
For example, the shallow-encoder/deep-decoder architecture may set the number of encoders to 12 and the number of decoders to 18;
the linear layer may adopt a neural network suited to regression tasks, and the output layer may adopt a neural network suited to text generation tasks.
For example, the linear layer may be a fully connected layer with a ReLU activation function, and the output layer may be a Softmax probability output layer.
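As an illustration, the defining property of step S100's architecture can be sketched with a minimal configuration object. The class and field names below are hypothetical, not from the patent; CodeT5 is a T5-style model whose published configurations do expose separate encoder and decoder layer counts:

```python
from dataclasses import dataclass

@dataclass
class GeneratorConfig:
    # Hypothetical config for the code generator sketch.
    num_encoder_layers: int = 12   # shallow encoder
    num_decoder_layers: int = 18   # deep decoder
    d_model: int = 768
    vocab_size: int = 32100

    def is_shallow_encoder_deep_decoder(self) -> bool:
        # The defining property of the architecture in step S100:
        # strictly more decoder layers than encoder layers.
        return self.num_decoder_layers > self.num_encoder_layers

cfg = GeneratorConfig()
print(cfg.is_shallow_encoder_deep_decoder())  # True for 12 encoders vs. 18 decoders
```

A symmetric Transformer (e.g. 12/12, as in the prior art discussed above) would fail this check, which is the architectural contrast the patent draws.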
S200, based on step S100, pre-training the code generator model on the function-level large code dataset using improved pre-training techniques to obtain the pre-trained code generator model.
In one embodiment, as shown in FIG. 2, which illustrates how the code generator model is pre-trained, step S200 includes the following steps:
S210, converting the function-level large code dataset into code token sequences with an initial Unigram LM (unigram language model) tokenizer, obtaining a pre-trained tokenizer and the code token sequences;
for example, the function-level large code dataset includes the CodeSearchNet and GitHub-Code datasets published in the software engineering field, or a dataset obtained by merging and de-duplicating several such large code datasets;
for example, inputting the sentence "This is a test" into the tokenizer yields the code token sequence: '_Thi', 's', '_is', '_a', '_t', 'est'.
S220, based on steps S100 and S210, performing the first pre-training step on the code generator model with an improved causal language modeling technique to obtain a preliminarily pre-trained code generator model;
wherein the improved causal language modeling technique is divided into two steps:
S221, selecting a token within the 5%–100% span of the code token sequence with 50% probability; adding a special token "[GOB]" after the token sequence preceding the selected token; using the sequence with the added special token as the model input and the token sequence after the selected token as the model output;
for example, for the code token sequence '_Thi', 's', '_is', '_a', '_t', 'est', if the selected token is "_is", the model's input token sequence is '_Thi', 's', '[GOB]' and the model's output token sequence is '_a', '_t', 'est'.
S222, selecting a token within the 5%–100% span of the code token sequence with 50% probability; adding a special token "[GOF]" before the token sequence following the selected token; using the sequence with the added special token as the model input and the token sequence before the selected token as the model output, obtaining the preliminarily pre-trained code generator model;
for example, for the code token sequence '_Thi', 's', '_is', '_a', '_t', 'est', if the selected token is "_a", the model's input token sequence is '[GOF]', '_t', 'est' and the model's output token sequence is '_Thi', 's', '_is'.
S230, based on steps S210 and S220, performing the second pre-training step on the preliminarily pre-trained code generator model with an improved span denoising technique to obtain the pre-trained code generator model;
wherein the improved span denoising technique comprises:
with 50% probability, replacing 10% of the tokens "[TOKEN 0], ..., [TOKEN n]" in the encoder's input token sequence with predefined tokens "[LABEL 0], ..., [LABEL n]" and adding a special token "[SOM]" before them; adding a special token "[EOM]" before the correct token sequence that serves as the decoder's target output; and having the decoder generate the replaced token sequence "[TOKEN 0], ..., [TOKEN n]", thereby obtaining the pre-trained code generator model.
For example, if the encoder's input token sequence is '_Thi', 's', '_is', '_a', '_t', 'est' and the replaced token is '_is', the sequence after replacement is '_Thi', 's', '[SOM]', '[LABEL 0]', '_a', '_t', 'est', and the target token sequence output by the decoder is '_Thi', 's', '[EOM]', '_is', '_a', '_t', 'est'.
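The span-denoising pair in this example can be sketched as follows. This is a single-span sketch with the span position given explicitly rather than sampled with the 10%/50% rates, and it follows the placement of [SOM]/[LABEL]/[EOM] shown in the example above:

```python
def span_denoise(tokens: list[str], start: int, length: int = 1, label: int = 0):
    """Build one span-denoising training pair: the span is replaced by
    [SOM] [LABEL i] in the encoder input, and [EOM] is inserted before
    the original span in the decoder target."""
    enc_in = tokens[:start] + ["[SOM]", f"[LABEL {label}]"] + tokens[start + length:]
    dec_target = tokens[:start] + ["[EOM]"] + tokens[start:]
    return enc_in, dec_target

toks = ["_Thi", "s", "_is", "_a", "_t", "est"]
enc, dec = span_denoise(toks, toks.index("_is"))
print(enc)  # ['_Thi', 's', '[SOM]', '[LABEL 0]', '_a', '_t', 'est']
print(dec)  # ['_Thi', 's', '[EOM]', '_is', '_a', '_t', 'est']
```

The decoder thus learns to regenerate the masked span in context, which is the span-denoising objective named in step S230.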
S300, based on step S200, extracting the encoder stack of the code generator model to construct the discriminator model.
In one embodiment, step S300 includes the following steps:
S310, based on step S200, extracting the encoders of the pre-trained code generator model to obtain an encoder stack;
wherein the encoder stack carries the parameters of the pre-trained code generator model's encoders;
S320, based on step S310, combining the encoder stack with a linear layer and an output layer to obtain the discriminator model.
For example, the linear layer and the output layer may use the same fully connected layer and Softmax probability output layer as in step S100.
S400, based on steps S200 and S300, constructing the generative adversarial network from the pre-trained code generator model and the discriminator model; retraining the generative adversarial network on the function-level vulnerability repair dataset to obtain the optimal code generator model suitable for repairing vulnerable code.
In one embodiment, as shown in FIGS. 3 and 4, FIG. 3 illustrates how the generative adversarial network is constructed and the optimal code generator model is trained, and FIG. 4 is a schematic diagram of one embodiment of the generative adversarial network constructed in this step; step S400 includes the following steps:
S410, based on steps S200 and S300, constructing the generative adversarial network from the pre-trained code generator model and the discriminator model;
S420, based on step S210, tokenizing the function-level vulnerability repair dataset with the pre-trained tokenizer to obtain vulnerable-code token sequences and repair-code token sequences;
for example, the function-level vulnerability repair dataset includes the CVEfixes and Big-Vul datasets published in the software security field, or a vulnerability repair dataset obtained by merging and de-duplicating several such datasets.
S430, based on steps S410 and S420, simultaneously inputting the vulnerable-code token sequence and the repair-code token sequence into the generative adversarial network's code generator model to obtain a generated probability sequence;
meanwhile, the code generator model learns the difference between the generated probability sequence and the input repair-code token sequence, yielding loss value a.
For example, loss value a may be computed with a cross-entropy loss function.
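A minimal sketch of such a token-level cross-entropy follows, assuming the generator outputs one probability distribution per output position (the function name is illustrative, not from the patent):

```python
import math

def token_cross_entropy(prob_seq, target_ids):
    """Average negative log-likelihood of the target tokens under the
    generator's per-position probability distributions (loss value a)."""
    return -sum(math.log(p[t]) for p, t in zip(prob_seq, target_ids)) / len(target_ids)

# Two positions over a toy 3-token vocabulary; the targets are ids 0 and 2.
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
loss_a = token_cross_entropy(probs, [0, 2])  # -(ln 0.7 + ln 0.6) / 2 ≈ 0.4338
```

In a real training loop this would be the standard sequence-to-sequence cross-entropy between the generated distribution and the ground-truth repair token sequence.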
S440, based on steps S410, S420, and S430, ordering the generated probability sequence with the Nucleus Sampling algorithm to obtain a vulnerable-code repair token sequence;
meanwhile, inputting the repair-code token sequence and the vulnerable-code repair token sequence into the generative adversarial network's discriminator model, which learns the difference between the two sequences, yielding loss value b.
For example, when ordering the probability sequence with the Nucleus Sampling algorithm to obtain the vulnerable-code repair token sequence, set top_p=0.9 (the cumulative probability threshold), max_length=50 (the maximum length of a generated sequence), temperature=0.8 (controlling the smoothness of the probability distribution during sampling), and num_return_sequences=50 (the number of generated sequences); with these parameters, 50 vulnerability repair sequences are generated, each at most 50 tokens long. Loss value b may be computed with a cross-entropy loss function.
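The core of a single Nucleus (top-p) Sampling step with temperature can be sketched in plain Python. This is a didactic re-implementation, not the patent's code; generation libraries expose the same `top_p`, `temperature`, `max_length`, and `num_return_sequences` knobs:

```python
import math
import random

def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize — the core of Nucleus Sampling."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample_token(probs, top_p=0.9, temperature=0.8, rng=random):
    """One sampling step: sharpen with temperature, then top-p filter."""
    logits = [math.log(p) / temperature for p in probs]
    z = sum(math.exp(l) for l in logits)
    tempered = [math.exp(l) / z for l in logits]
    dist = nucleus_filter(tempered, top_p)
    ids, weights = zip(*dist.items())
    return rng.choices(ids, weights=weights, k=1)[0]
```

Generating `num_return_sequences=50` candidates of up to `max_length=50` tokens then amounts to repeating `sample_token` per position for 50 independent sequences.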
S450, optimizing the code generator model with an optimizer according to loss values a and b to obtain the optimal code generator model.
For example, the optimizer may be AdamW, training the above generative adversarial network for 100 epochs with a batch size of 8, a learning rate of 2e-5, a weight decay of 1e-4, 200 warm-up steps, and 8 gradient accumulation steps, yielding the optimal code generator model.
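The schedule hyper-parameters above can be sketched as follows. The linear warm-up shape and the constant rate after warm-up are my assumptions; the patent only specifies the counts:

```python
BATCH_SIZE, GRAD_ACCUM_STEPS = 8, 8
# With gradient accumulation, 8 mini-batches of 8 are accumulated before
# each optimizer step, giving an effective batch of 64 sequences.
EFFECTIVE_BATCH = BATCH_SIZE * GRAD_ACCUM_STEPS

def lr_with_warmup(step: int, base_lr: float = 2e-5, warmup_steps: int = 200) -> float:
    """Linear warm-up to base_lr over warmup_steps optimizer steps,
    then a constant rate; AdamW would consume this value each step."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

Warm-up keeps early adversarial updates small, which helps stabilize GAN fine-tuning on the relatively small vulnerability repair dataset.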
S500, based on step S400, inputting the function-level vulnerable code into the optimal code generator model to obtain the repaired code.
In one embodiment, as shown in FIG. 5, which illustrates how the vulnerable code to be repaired is repaired, step S500 includes the following steps:
S510, based on step S210, tokenizing the function-level vulnerable code with the pre-trained tokenizer to obtain the token sequence of the vulnerable code to be repaired;
S520, based on steps S400 and S510, inputting the to-be-repaired vulnerable-code token sequence into the optimal code generator model to obtain a repair-code probability sequence;
S530, based on step S520, ordering the repair-code probability sequence with the Nucleus Sampling algorithm again to obtain the repaired code.
For example, when obtaining the repaired code with the Nucleus Sampling algorithm again, set top_p=0.9 (the cumulative probability threshold), max_length=50 (the maximum length of a generated sequence), temperature=0.8 (controlling the smoothness of the probability distribution during sampling), and num_return_sequences=5 (the number of generated sequences); with these parameters, 5 vulnerability repair sequences are generated, each at most 50 tokens long.
Finally, it should be noted that the foregoing describes only one embodiment of the present invention; the invention is not limited to this embodiment, and those skilled in the art may modify or substitute some of the technical features described above.
Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (7)
1. The pre-training vulnerability restoration method based on the challenge migration learning is characterized by comprising the following steps of:
s100, constructing a code generator model of a shallow encoder-deep decoder architecture;
s200, based on the step S100, using a large code data set of a function level to pretrain the code generator model by using an improved pretraining technology, so as to obtain a pretrained code generator model;
s300, based on the step S200, extracting an encoder set of the code generator model to construct a discriminator model;
s400, constructing and generating an countermeasure network by utilizing the pre-trained code generator model and the discriminant model based on the step S200 and the step S300; retraining the generated countermeasure network by utilizing the vulnerability restoration data set of the function level to obtain an optimal code generator model suitable for restoring vulnerability codes;
s500, inputting the vulnerability codes of the function level into the optimal code generator model based on the step S400 to obtain the repaired codes.
2. The vulnerability repair method based on adversarial transfer learning of claim 1, wherein the code generator model has a shallow encoder-deep decoder architecture;
wherein the encoder and decoder are based on the encoder and decoder of the CodeT5 model, and the shallow encoder-deep decoder architecture means that the code generator model contains more decoder layers than encoder layers.
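For illustration only, the "more decoders than encoders" constraint of claim 2 can be captured in a small configuration sketch; the layer counts below are assumptions (the claim fixes only the inequality), and `GeneratorConfig` is a hypothetical name, not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class GeneratorConfig:
    """Illustrative configuration for a shallow encoder-deep decoder
    generator built from CodeT5-style Transformer blocks (assumed)."""
    d_model: int = 768
    num_encoder_layers: int = 6    # shallow encoder (hypothetical count)
    num_decoder_layers: int = 18   # deep decoder (hypothetical count)

    def is_shallow_deep(self) -> bool:
        # The architectural requirement stated in claim 2.
        return self.num_decoder_layers > self.num_encoder_layers
```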
3. The vulnerability repair method based on adversarial transfer learning of claim 2, wherein step S200 comprises the following steps:
s210, converting the function-level large code dataset into code token sequences using an initial Unigram LM (unigram language model) tokenizer, to obtain a pre-trained tokenizer and the code token sequences;
s220, based on steps S100 and S210, performing a first pre-training step on the code generator model using an improved causal language modeling technique, to obtain a preliminarily pre-trained code generator model;
s230, based on steps S210 and S220, performing a second pre-training step on the preliminarily pre-trained code generator model using an improved span denoising technique, to obtain the pre-trained code generator model;
wherein the improved span denoising technique comprises:
in the input code token sequence of the encoder, replacing 10% of the tokens "[TOKEN 0], ..., [TOKEN n]", each with 50% probability, by predefined tokens "[LABEL 0], ..., [LABEL n]", and adding a special token "[SOM]" before them; adding a special token "[EOM]" before the correct token sequence as the target token sequence output by the decoder; and having the decoder generate the replaced tokens "[TOKEN 0], ..., [TOKEN n]", to obtain the pre-trained code generator model.
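The span-denoising data construction in claim 3 can be sketched as follows. This is one possible reading of the claim, not a definitive implementation: a fraction of tokens is replaced, each with a given probability, by a sentinel "[LABEL i]" preceded by "[SOM]", and the decoder target is "[EOM]" followed by the original replaced tokens. The function name and rates are labeled assumptions where the claim leaves them open.

```python
import random

def make_span_denoising_example(tokens, mask_rate=0.10, swap_prob=0.5, rng=random):
    """Build one (encoder_input, decoder_target) pair for the improved
    span-denoising objective, as interpreted from claim 3."""
    n_candidates = max(1, int(len(tokens) * mask_rate))   # ~10% of tokens
    positions = set(rng.sample(range(len(tokens)), n_candidates))
    enc_input, replaced = [], []
    label_id = 0
    for i, tok in enumerate(tokens):
        if i in positions and rng.random() < swap_prob:   # 50% probability
            # Replace the token by "[SOM]" + a predefined "[LABEL i]" sentinel.
            enc_input.extend(["[SOM]", f"[LABEL {label_id}]"])
            replaced.append(tok)
            label_id += 1
        else:
            enc_input.append(tok)
    # Decoder target: "[EOM]" followed by the original replaced tokens.
    dec_target = ["[EOM]"] + replaced
    return enc_input, dec_target
```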
4. The vulnerability repair method based on adversarial transfer learning of claim 2, wherein the improved causal language modeling technique of step S220 comprises the following steps:
s221, selecting, with 50% probability, a token at a position between 5% and 100% of the code token sequence; adding a special token "[GOB]" after the token sequence preceding the selected token; and taking the token sequence with the special token added as the model input and the token sequence following the selected token as the model output;
s222, selecting, with 50% probability, a token at a position between 5% and 100% of the code token sequence; adding a special token "[GOF]" before the token sequence following the selected token; and taking the token sequence with the special token added as the model input and the token sequence preceding the selected token as the model output, to obtain the preliminarily pre-trained code generator model.
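The two sample-construction steps of claim 4 can be sketched as one helper; this is an illustrative reading under stated assumptions (the 50%-probability selection of whether to build a sample at all is left to the caller, and `make_causal_lm_pair` is a hypothetical name).

```python
import random

def make_causal_lm_pair(tokens, direction, rng=random):
    """Build one (model_input, model_output) pair for the improved
    causal language modeling objective of claim 4."""
    # Choose a split token at a position between 5% and 100% of the sequence.
    lo = max(1, int(len(tokens) * 0.05))
    split = rng.randint(lo, len(tokens) - 1)
    if direction == "forward":
        # Step S221: "[GOB]" after the prefix; model predicts the suffix.
        model_input = tokens[:split] + ["[GOB]"]
        model_output = tokens[split:]
    else:
        # Step S222: "[GOF]" before the suffix; model predicts the prefix.
        model_input = ["[GOF]"] + tokens[split:]
        model_output = tokens[:split]
    return model_input, model_output
```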
5. The vulnerability repair method based on adversarial transfer learning of claim 3, wherein step S300 comprises the following steps:
s310, based on step S200, extracting the encoders of the pre-trained code generator model to obtain an encoder set;
wherein the encoder set comprises the parameters of the encoders in the pre-trained code generator model;
s320, based on step S310, combining the encoder set, a linear transformation layer, and an output layer to obtain the discriminator model.
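A minimal sketch of the assembly in step S320, assuming the encoder is reduced to a callable that produces a feature vector; the class name, the stand-in encoder, and the sigmoid output layer are illustrative assumptions, not details fixed by the claim.

```python
import math

class Discriminator:
    """Sketch of claim 5: a pre-trained encoder set feeding a linear
    transformation layer and an output layer (sigmoid assumed here)."""

    def __init__(self, encoder, weights, bias):
        self.encoder = encoder    # reused pre-trained encoder; assumed to return list[float]
        self.weights = weights    # linear transformation layer weights
        self.bias = bias          # linear transformation layer bias

    def __call__(self, token_ids):
        features = self.encoder(token_ids)
        score = sum(w * f for w, f in zip(self.weights, features)) + self.bias
        # Output layer: probability that the input is a genuine repair sequence.
        return 1.0 / (1.0 + math.exp(-score))
```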
6. The vulnerability repair method based on adversarial transfer learning of claim 4, wherein step S400 comprises the following steps:
s410, based on steps S200 and S300, constructing a generative adversarial network from the pre-trained code generator model and the discriminator model;
s420, based on step S210, tokenizing the function-level vulnerability repair dataset with the pre-trained tokenizer to obtain vulnerability code token sequences and repair code token sequences;
s430, based on steps S410 and S420, inputting a vulnerability code token sequence and a repair code token sequence simultaneously into the code generator model of the generative adversarial network to obtain a generated probability sequence;
meanwhile, the code generator model learns the difference between the generated probability sequence and the input repair code token sequence to obtain a loss value a;
s440, based on steps S410, S420 and S430, arranging the generated probability sequence optimally using the Nucleus Sampling algorithm to obtain a vulnerability code repair token sequence;
meanwhile, inputting the repair code token sequence and the vulnerability code repair token sequence into the discriminator model of the generative adversarial network, and the discriminator model learning the difference between the two sequences to obtain a loss value b;
s450, optimizing the code generator model with the optimizer according to the loss value a and the loss value b, to obtain the optimal code generator model.
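Step S450 combines the supervised loss a and the adversarial loss b in the generator update. The claim does not fix how the two are combined, so the weighted sum and the plain SGD step below are explicitly assumptions used only to make the idea concrete.

```python
def combined_generator_loss(loss_a, loss_b, alpha=0.5):
    """Illustrative combination of loss a (generator vs. reference repair
    tokens) and loss b (from the discriminator); `alpha` is an assumption."""
    return alpha * loss_a + (1.0 - alpha) * loss_b

def sgd_step(param, grad, lr=1e-3):
    """One plain SGD update, standing in for the optimizer of step S450."""
    return param - lr * grad
```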
7. The vulnerability repair method based on adversarial transfer learning of claim 5, wherein step S500 comprises the following steps:
s510, based on step S210, tokenizing the function-level vulnerability code with the pre-trained tokenizer to obtain a token sequence of the vulnerability code to be repaired;
s520, based on steps S400 and S510, inputting the token sequence of the vulnerability code to be repaired into the optimal code generator model to obtain a repair code probability sequence;
s530, based on step S520, arranging the repair code probability sequence optimally using the Nucleus Sampling algorithm again to obtain the repaired code.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311135429.7A CN117113359B (en) | 2023-09-05 | 2023-09-05 | Pre-training vulnerability restoration method based on countermeasure migration learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117113359A true CN117113359A (en) | 2023-11-24 |
CN117113359B CN117113359B (en) | 2024-03-19 |
Family
ID=88805401
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311135429.7A Active CN117113359B (en) | 2023-09-05 | 2023-09-05 | Pre-training vulnerability restoration method based on countermeasure migration learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117113359B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200151081A1 (en) * | 2017-11-13 | 2020-05-14 | The Charles Stark Draper Laboratory, Inc. | Automated Repair Of Bugs And Security Vulnerabilities In Software |
WO2021148625A1 (en) * | 2020-01-23 | 2021-07-29 | Debricked Ab | A method for identifying vulnerabilities in computer program code and a system thereof |
US20210357307A1 (en) * | 2020-05-15 | 2021-11-18 | Microsoft Technology Licensing, Llc. | Automated program repair tool |
CN114048464A (en) * | 2022-01-12 | 2022-02-15 | 北京大学 | Ether house intelligent contract security vulnerability detection method and system based on deep learning |
US20220092411A1 (en) * | 2020-09-21 | 2022-03-24 | Samsung Sds Co., Ltd. | Data prediction method based on generative adversarial network and apparatus implementing the same method |
CN114547619A (en) * | 2022-01-11 | 2022-05-27 | 扬州大学 | Vulnerability repairing system and method based on tree |
US20220292200A1 (en) * | 2021-03-10 | 2022-09-15 | Huazhong University Of Science And Technology | Deep-learning based device and method for detecting source-code vulnerability with improved robustness |
CN115168865A (en) * | 2022-06-28 | 2022-10-11 | 南京大学 | Cross-item vulnerability detection model based on domain self-adaptation |
CN115396156A (en) * | 2022-07-29 | 2022-11-25 | 中国人民解放军国防科技大学 | Vulnerability priority processing method based on deep reinforcement learning |
CN116595530A (en) * | 2022-12-08 | 2023-08-15 | 北京工业大学 | Intelligent contract vulnerability detection method combining countermeasure migration learning and multitask learning |
CN116628707A (en) * | 2023-07-19 | 2023-08-22 | 山东省计算中心(国家超级计算济南中心) | Interpretable multitasking-based source code vulnerability detection method |
Non-Patent Citations (5)
Title |
---|
ZHAO QIANCHONG ET AL.: "VULDEFF: Vulnerability detection method based on function fingerprints and code differences", KNOWLEDGE-BASED SYSTEMS, vol. 260, 25 January 2023 (2023-01-25) * |
LIU JIAYONG ET AL.: "Static analysis techniques for source code vulnerabilities", JOURNAL OF CYBER SECURITY, vol. 7, no. 4, 15 July 2022 (2022-07-15) *
LI YUANCHENG; CUI YAQI; LYU JUNFENG; LAI FENGGANG; ZHANG PAN: "A hybrid deep learning method for open-source software vulnerability detection", COMPUTER ENGINEERING AND APPLICATIONS, no. 11, 17 December 2018 (2018-12-17) *
LI YUN; HUANG CHENLIN; WANG ZHONGFENG; YUAN LU; WANG XIAOCHUAN: "A survey of software vulnerability mining methods based on machine learning", JOURNAL OF SOFTWARE, no. 07, 15 July 2020 (2020-07-15) *
CHEN ZHAOXUAN; ZOU DEQING; LI ZHEN; JIN HAI: "An intelligent vulnerability detection system based on abstract syntax trees", JOURNAL OF CYBER SECURITY, no. 04, 15 July 2020 (2020-07-15) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||