CN112270358A - Code annotation generation model robustness improving method based on deep learning - Google Patents

Code annotation generation model robustness improving method based on deep learning

Info

Publication number
CN112270358A
CN112270358A (application CN202011178831.XA)
Authority
CN
China
Prior art keywords
code
model
data set
training
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011178831.XA
Other languages
Chinese (zh)
Inventor
周宇 (Zhou Yu)
张晓晴 (Zhang Xiaoqing)
沈娟娟 (Shen Juanjuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202011178831.XA priority Critical patent/CN112270358A/en
Publication of CN112270358A publication Critical patent/CN112270358A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a deep-learning-based method for improving the robustness of a code annotation generation model, comprising the following steps: train a deep-learning-based code annotation generation model and a deep-neural-network-based code encoding network on a public data set; for each item of data in the training data set, generate an adversarial example using the trained code annotation generation model and code encoding network with a variable-name-replacement-based method; mix the adversarial examples generated on the training set with the training data set samples at a 1:1 ratio and retrain the model while keeping the model structure and training parameters unchanged. The method improves the code annotation generation model's ability to defend against adversarial examples, improves its reliability under abnormal conditions, and thereby safeguards the quality of the generated annotations.

Description

Code annotation generation model robustness improving method based on deep learning
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a deep-learning-based method for improving the robustness of a code annotation generation model.
Background
In software development and maintenance, program understanding is a very time-consuming task. Annotations written in natural language can greatly reduce the burden of understanding code segments and improve development efficiency, so annotations play an important role in program understanding. In recent years, deep learning has achieved great success in application fields such as computer vision and natural language processing, and researchers have gradually applied deep neural networks to the task of automatic code annotation generation, making considerable progress.
However, as deep learning technology develops rapidly, the reliability of deep learning models has gradually been exposed as a problem, and their quality has drawn growing attention. Because of their complex structure, deep learning models are extremely vulnerable to adversarial examples: small perturbations of the data, imperceptible to humans, can cause a model to make wrong judgments and output wrong results. Adversarial examples undoubtedly pose a huge threat to practical applications. For example, in payment tools such as Alipay, an adversarial example can cause the face recognition system to misidentify the payer as someone else, even a specific person; high-speed rail ticket gates rely on face recognition, and an adversarial attack can paralyze the ticket-checking system or even let a criminal escape pursuit.
Many researchers at home and abroad have discussed and studied how to improve a model's defense against adversarial examples and thus the robustness of deep learning models, and a large body of results has emerged. However, most current work targets only images, natural language, and the like; research on code-related tasks, especially the code annotation generation task, remains scarce. The main lines of research on improving the robustness of models for code-related tasks are presented below.
To improve a deep learning model's defense against adversarial examples, one must first study how to generate them. Yefet et al. proposed a gradient-based adversarial example generation method for the method name prediction task: gradient ascent is used to perform variable name replacement or dead-code insertion on the original code segment, thereby generating adversarial examples. In addition, Yefet et al. introduced a detection-based defense into their model, checking whether a sample is adversarial before it is fed to the model.
Zhang et al generated countermeasure samples for the code classification task by using Metropolis-Hastings algorithm for identifier renaming based on black box approach. Zhang et al introduced countermeasure training based on data enhancement to improve the robustness of the model, mixing the original samples with the clean samples and retraining the model.
Ramakrishnan et al generate a task for code annotation, adopt a gradient-based method, consider eight code transformations to generate a countermeasure sample, and generate the countermeasure sample by using a robust optimization-based countermeasure training method.
Research on the robustness of code annotation generation models is still lacking: Ramakrishnan et al. only used the method name in the code as the annotation, so the short annotations the model is trained to generate are method names rather than annotations written by programmers.
Disclosure of Invention
Aiming at the deficiencies of the prior art, the invention provides a deep-learning-based method for improving the robustness of a code annotation generation model, so as to solve the problems in the prior art that adversarial examples for code segments are difficult to generate, the search space is large, and many queries are required. The method improves the code annotation generation model's defense against adversarial examples and its reliability under abnormal conditions, thereby safeguarding the quality of the generated annotations.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
the invention discloses a code annotation generation model robustness improving method based on deep learning, which comprises the following steps:
step 1) model training: training a deep learning-based code annotation generation model and a deep neural network-based code coding network by using the disclosed data set;
step 2) challenge sample generation: generating a model and a code coding network by using the code annotation trained in the step 1), and generating a countermeasure sample for each piece of data on the training data set in the data set used in the step 1) by adopting a variable name replacement-based method;
step 3), training a new model: matching the confrontation samples on the training set generated in the step 2) with the training data set samples according to the ratio of 1: 1 mixing to ensure that the generated model is annotated by retraining the code under the condition of not changing the model parameters.
Further, the public data set in step 1) comprises: a training data set, a validation data set, and a test data set.
Further, the step 1) specifically includes:
11) for any deep-learning-based code annotation generation model, train the model without changing the original parameters, model structure, or data sets (training, validation, and test data sets);
12) apply camelCase (hump) decomposition to each code in the training data set, build a sequence-to-sequence model based on a long short-term memory network (LSTM), and, after the sequence-to-sequence model is trained, extract the encoder output of the model as the code encoding result, thereby constructing the code encoding network.
Further, step 2) generates adversarial examples with a black-box attack method, specifically comprising:
21) extract the code segments in the training data set used in step 1), encode each code segment with the code encoding network obtained in step 1), and represent each code as a fixed-dimension vector;
22) extract the identifiers (local variable names and method names) in each code segment as replacement candidates, ensuring code syntax correctness;
23) form an identifier set from the identifiers of each code segment extracted in step 22); for each identifier Id in a code segment, compute its similarity to all identifiers in the set, excluding the identifier itself and the code's formal parameters, and select the K most similar identifiers as candidate identifiers;
24) from the K candidate identifiers of each identifier selected in step 23), choose the best one as its optimal replacement, and determine the replacement order of all identifiers in each code segment;
25) according to the replacement order and best replacements of all identifiers determined in step 24), select the first M identifiers for replacement; during replacement, every position where an identifier appears is replaced, generating an adversarial example.
Further, step 3) specifically comprises:
31) mix the adversarial examples generated on the training data set in step 2) with the training data set samples at a 1:1 ratio, constructing a new mixed training data set;
32) retrain with the mixed training data set, keeping the structure and training parameters used when training the code annotation generation model in step 1) unchanged;
33) evaluate the robustness of the retrained model.
The invention has the beneficial effects that:
the invention generates a confrontation sample based on an identifier replacement method, analyzes the difference of a model between the confrontation sample and an original sample; by utilizing the generated countermeasure sample and combining with a countermeasure training method, the robustness of the code annotation generation model based on deep learning is further improved, and the method has the following advantages:
aiming at the code annotation generation task, the invention adopts a black box attack mode, and the generation of the countermeasure sample does not need to know the internal structure information of the model, and only needs the final output of the interface access model, thereby improving the speed and quality of the generation of the countermeasure sample, and having stronger migration performance.
The countermeasure sample generated by the invention solves the problem of large search space for identifier replacement, and the generated countermeasure sample can ensure the grammar correctness of the program language, simultaneously does not change the semantics of the original program, and ensures the consistent program output before and after conversion.
The robustness of the model is used for evaluating the defense capability of the model to the imperceptible disturbance, the robustness of the model can be well evaluated, and the model trained by the method has higher defense capability, so that the model has higher reliability under abnormal conditions, and the quality of generated annotations is ensured.
Drawings
FIG. 1 is a schematic diagram of the method of the present invention.
FIG. 2 is a schematic diagram of the adversarial example generation algorithm of the present invention.
Fig. 3 is a diagram of a code-encoded network architecture in accordance with the present invention.
Detailed Description
To facilitate understanding by those skilled in the art, the invention is further described below with reference to the embodiments and drawings, which are not intended to limit it.
Referring to fig. 1, the code annotation generation model robustness improving method based on deep learning according to the present invention includes the following steps:
step 1): select any existing code annotation generation model, i.e., the target model whose robustness is to be improved, and train it while keeping the original environment (data set, model structure, training parameters, and training environment) unchanged; construct a new code encoding network to represent the original code as a vector; wherein:
To encode the code better, when constructing the code encoding network, each word in the original code is first split by camelCase (hump) decomposition; e.g., the original word 'addTo' becomes 'add' and 'to' after decomposition. As shown in fig. 3, a sequence-to-sequence model based on a long short-term memory network is built, divided into an encoder part and a decoder part. Word embeddings are trained with Word2Vec so that each token of the segmented code is represented as a fixed-dimension vector; the Word2Vec-trained embedding table serves as the first layer of the encoder, the code is the encoder input, the output of the encoder's last hidden layer is the decoder input, and the annotation is the decoder output; the model is then trained. After the model is trained, the hidden-layer outputs of the encoder part are extracted, and the code is encoded into a fixed-dimension vector through a Softmax layer.
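The camelCase (hump) decomposition described above can be sketched as follows; this is an illustrative helper, not the patent's actual implementation, and the regex handles the common camelCase patterns only:

```python
import re

def hump_decompose(token: str):
    """Split a camelCase identifier into lowercase word pieces,
    e.g. 'addTo' -> ['add', 'to'], as in the patent's 'hump
    decomposition' preprocessing step (illustrative sketch)."""
    # Match either an optionally-capitalized lowercase run ('add', 'To')
    # or an all-caps run not followed by a lowercase letter ('HTTP').
    parts = re.findall(r'[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])', token)
    return [p.lower() for p in parts]
```

The resulting pieces are the tokens fed to the Word2Vec embedding layer of the encoder.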
step 2): generate adversarial examples from the code annotation generation model and code encoding network trained in step 1). The code annotation generation model is used to obtain the output for a given code, the quality of the generated annotation is evaluated with the BLEU-4 score, and the trained code encoding network represents the information of a specific code. As shown in fig. 2, wherein:
21) extract the code segments of the training data set; first apply camelCase decomposition to each code segment, encode it with the code encoding network obtained in step 1), and represent the code as a 64-dimensional vector: vec(p) = [v_1, v_2, ..., v_64];
22) for each Java code segment in the training data set, parse the code into an abstract syntax tree with the javalang parsing tool, extract the user-defined identifiers (method names and variable names) from the node information of the abstract syntax tree, merge the user-defined identifiers of all codes into a candidate identifier set, and build a code-identifier dictionary; extract the formal parameters of each code from the parsed abstract syntax tree and build a formal-parameter-to-code-segment dictionary;
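The extraction in step 22) can be sketched as below. The patent uses the javalang AST parser (walking MethodDeclaration, VariableDeclarator, and FormalParameter nodes); to keep the sketch self-contained, a toy regex recognizing a few primitive-typed declarations stands in for the AST walk:

```python
import re

# Simplified stand-in for javalang-based extraction: only a handful of
# Java types are recognized, enough to illustrate separating method
# names from variable/parameter names.
DECL = re.compile(r'\b(?:int|long|double|boolean|String)\s+(\w+)')
METHOD = re.compile(r'\b(?:void|int|long|double|boolean|String)\s+(\w+)\s*\(')

def extract_identifiers(java_code: str):
    """Return (method names, variable/parameter names) found in the code.
    In the patent, formal parameters would additionally be recorded in a
    formal-parameter-to-code-segment dictionary and excluded from replacement."""
    methods = set(METHOD.findall(java_code))
    variables = set(DECL.findall(java_code)) - methods
    return methods, variables
```

For example, `extract_identifiers("int add(int a, int b) { int sum = a + b; return sum; }")` separates the method name `add` from the names `a`, `b`, and `sum`.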
23) from each code and its identifiers extracted in step 22), together with the candidate identifier set, a per-code candidate identifier set is established. Word2Vec is used to train an embedded representation of each identifier, and cosine similarity is used to compute the K identifiers most similar to each identifier in a code as candidate replacement identifiers:
sim(w_i, w_j) = cos(embedding(w_i), embedding(w_j))
where w_i is an identifier in the code and w_j ranges over the identifier set after removing the formal parameters and the identifiers already appearing in the code segment; embedding(w_i) and embedding(w_j) denote the word embeddings of w_i and w_j.
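A minimal sketch of the candidate selection in step 23), assuming the Word2Vec embeddings have already been trained (toy hand-made vectors are used here; the function names are illustrative):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

def top_k_candidates(word, embeddings, exclude, k):
    """Rank every other identifier by cos(embedding(w_i), embedding(w_j)),
    skipping the identifier itself and the excluded names (the code's
    formal parameters and identifiers already appearing in the segment);
    return the K most similar as candidate replacements."""
    scored = sorted(
        ((cosine(embeddings[word], emb), w)
         for w, emb in embeddings.items()
         if w != word and w not in exclude),
        reverse=True)
    return [w for _, w in scored[:k]]
```

With toy 2-dimensional embeddings, `top_k_candidates("add", {...}, exclude=set(), k=2)` returns the two identifiers whose vectors point closest to that of `add`.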
24) using the code encoding network trained in step 1), compute the vector representation of each code, and compute the distance between each identifier and the code with the cosine distance:
S(p, w_i) = cos(vec(p), embedding(w_i))
For the K candidate replacement identifiers of each identifier selected in step 23): when an identifier w_i is replaced by a candidate, every position where w_i appears is replaced, and the change in the model result before and after replacement is computed; the candidate causing the largest change is selected as the best replacement identifier:
w_i* = argmax over candidates w_j of Δscore(w_j)
where p is the code before replacement and p* is the code after replacing w_i with w_j; score(·) is the BLEU-4 score of the annotation the model generates, and
Δscore = score(p) - score(p*)
is taken as the optimal change.
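The greedy best-replacement choice of step 24) can be sketched as follows. The `score` argument stands in for the black-box query that runs the model and computes BLEU-4 on its output; `whole_word_sub` is an illustrative helper, not named in the patent:

```python
import re

def whole_word_sub(code: str, old: str, new: str) -> str:
    """Replace every whole-word occurrence of an identifier."""
    return re.sub(r'\b%s\b' % re.escape(old), new, code)

def best_replacement(code, identifier, candidates, score):
    """For each candidate w_j, replace every occurrence of the identifier,
    query the (black-box) model score, and keep the candidate maximizing
    delta = score(p) - score(p*)."""
    base = score(code)
    best, best_delta = None, float('-inf')
    for cand in candidates:
        perturbed = whole_word_sub(code, identifier, cand)
        delta = base - score(perturbed)
        if delta > best_delta:
            best, best_delta = cand, delta
    return best, best_delta
```

In real use, `score` would be one BLEU-4 evaluation of the model's generated annotation per candidate, which is why the per-identifier candidate set is pruned to K beforehand.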
After the optimal change and the text similarity of each identifier are computed, the identifiers in a code are reordered with a ranking score H:
[equation images not preserved: H(w_i) combines the optimal change Δscore(w_i) with the code-identifier similarity S(p, w_i)]
H is sorted in descending order to obtain the identifier replacement order, and the first M identifiers are selected for replacement.
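Step 25) then applies the top-M replacements in ranking order. Because the exact form of H survives only as an equation image in the patent, the sketch below simply takes precomputed H values as input; `whole_word_sub` is an illustrative helper:

```python
import re

def whole_word_sub(code, old, new):
    """Replace every whole-word occurrence of an identifier."""
    return re.sub(r'\b%s\b' % re.escape(old), new, code)

def generate_adversarial(code, best, m):
    """`best` maps identifier -> (best replacement, ranking score H).
    Sort identifiers by H descending and replace the first M of them
    at every position where they appear, yielding the adversarial code."""
    order = sorted(best, key=lambda w: best[w][1], reverse=True)
    adv = code
    for w in order[:m]:
        adv = whole_word_sub(adv, w, best[w][0])
    return adv
```

Replacing whole words only is what keeps the transformed program syntactically valid and semantically unchanged, as the advantages section claims.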
step 3): mix the adversarial examples generated on the training data set in step 2) with the samples of the training data set (the one used to train the model in step 1) at a 1:1 ratio, shuffle the order, and retrain the code annotation generation model under the same training environment and model structure as the original training, obtaining a new model. The samples of the test data set from step 1) and the adversarial examples generated on the test data set are then used to evaluate the robustness of the model before and after adversarial training. The evaluation criterion is the relative decrease:
R = (BLEU(y, refs) - BLEU(y', refs)) / BLEU(y, refs)
where y denotes the annotation the model generates on a sample of the test data set, refs denotes the standard annotation of that sample, and y' denotes the annotation the model generates on the adversarial example generated from the test data set.
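The relative-decrease metric is a one-liner once the two BLEU-4 averages are available; the function name is illustrative:

```python
def relative_decrease(bleu_clean: float, bleu_adv: float) -> float:
    """R = (BLEU(y, refs) - BLEU(y', refs)) / BLEU(y, refs): the fraction
    of annotation quality lost under adversarial input.  A smaller R after
    adversarial retraining indicates a more robust model."""
    if bleu_clean == 0:
        raise ValueError("clean BLEU must be non-zero")
    return (bleu_clean - bleu_adv) / bleu_clean
```

For example, a model scoring BLEU-4 of 0.40 on clean test samples and 0.30 on their adversarial counterparts has lost a quarter of its quality (R = 0.25).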
While the invention has been described in terms of its preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (5)

1. A deep-learning-based method for improving the robustness of a code annotation generation model, characterized by comprising the following steps:
step 1) model training: training a deep-learning-based code annotation generation model and a deep-neural-network-based code encoding network on a public data set;
step 2) adversarial example generation: using the code annotation generation model and code encoding network trained in step 1), generating an adversarial example for each item of data in the training data set of step 1) with a variable-name-replacement-based method;
step 3) training a new model: mixing the adversarial examples generated on the training set in step 2) with the training data set samples at a 1:1 ratio, and retraining the code annotation generation model without changing the model structure or training parameters.
2. The method for improving robustness of a code annotation generation model based on deep learning according to claim 1, wherein the public data set in step 1) comprises: a training data set, a validation data set, and a test data set.
3. The method for improving robustness of a code annotation generation model based on deep learning according to claim 1, wherein step 1) specifically comprises:
11) for any deep-learning-based code annotation generation model, training the model without changing the original parameters, model structure, or data sets;
12) applying camelCase (hump) decomposition to each code in the training data set, building a sequence-to-sequence model based on a long short-term memory network, and, after the sequence-to-sequence model is trained, extracting the encoder output of the model as the code encoding result, thereby constructing the code encoding network.
4. The method for improving robustness of a code annotation generation model based on deep learning according to claim 1, wherein step 2) generates adversarial examples with a black-box attack method, specifically comprising:
21) extracting the code segments in the training data set used in step 1), encoding each code segment with the code encoding network obtained in step 1), and representing each code as a fixed-dimension vector;
22) extracting the identifiers (local variable names and method names) in each code segment as replacement candidates, ensuring code syntax correctness;
23) forming an identifier set from the identifiers of each code segment extracted in step 22); for each identifier Id in a code segment, computing its similarity to all identifiers in the set, excluding the identifier itself and the code's formal parameters, and selecting the K most similar identifiers as candidate identifiers;
24) from the K candidate identifiers of each identifier selected in step 23), choosing the best one as its optimal replacement, and determining the replacement order of all identifiers in each code segment;
25) according to the replacement order and best replacements of all identifiers determined in step 24), selecting the first M identifiers for replacement; during replacement, every position where an identifier appears is replaced, generating an adversarial example.
5. The method for improving robustness of a code annotation generation model based on deep learning according to claim 1, wherein step 3) specifically comprises:
31) mixing the adversarial examples generated on the training data set in step 2) with the training data set samples at a 1:1 ratio, constructing a new mixed training data set;
32) retraining with the mixed training data set, keeping the structure and training parameters used when training the code annotation generation model in step 1) unchanged;
33) evaluating the robustness of the retrained model.
Application CN202011178831.XA, filed 2020-10-29 (priority 2020-10-29): Code annotation generation model robustness improving method based on deep learning, status Pending, publication CN112270358A.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011178831.XA CN112270358A (en) 2020-10-29 2020-10-29 Code annotation generation model robustness improving method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011178831.XA CN112270358A (en) 2020-10-29 2020-10-29 Code annotation generation model robustness improving method based on deep learning

Publications (1)

Publication Number Publication Date
CN112270358A true CN112270358A (en) 2021-01-26

Family

ID=74345269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011178831.XA Pending CN112270358A (en) 2020-10-29 2020-10-29 Code annotation generation model robustness improving method based on deep learning

Country Status (1)

Country Link
CN (1) CN112270358A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113282336A (en) * 2021-06-11 2021-08-20 重庆大学 Code abstract integration method based on quality assurance framework
CN113282336B (en) * 2021-06-11 2023-11-10 重庆大学 Code abstract integration method based on quality assurance framework
CN115905926A (en) * 2022-12-09 2023-04-04 华中科技大学 Code classification deep learning model interpretation method and system based on sample difference
CN115905926B (en) * 2022-12-09 2024-05-28 华中科技大学 Code classification deep learning model interpretation method and system based on sample difference


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination