CN112270358A - Code annotation generation model robustness improving method based on deep learning - Google Patents

Code annotation generation model robustness improving method based on deep learning

Info

Publication number
CN112270358A
CN112270358A (application CN202011178831.XA)
Authority
CN
China
Prior art keywords
code
model
data set
training
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011178831.XA
Other languages
Chinese (zh)
Inventor
周宇 (Zhou Yu)
张晓晴 (Zhang Xiaoqing)
沈娟娟 (Shen Juanjuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202011178831.XA priority Critical patent/CN112270358A/en
Publication of CN112270358A publication Critical patent/CN112270358A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a deep-learning-based method for improving the robustness of a code annotation generation model, comprising the following steps: train a deep-learning-based code annotation generation model and a deep-neural-network-based code encoding network on a public data set; for each item of data in the training data set, generate an adversarial example using the trained code annotation generation model and code encoding network with a variable-name-replacement-based method; mix the adversarial examples generated on the training set with the training data set samples at a 1:1 ratio and retrain the model while keeping the model structure and training parameters unchanged. The method improves the code annotation generation model's ability to defend against adversarial examples, improves its reliability under abnormal conditions, and thereby safeguards the quality of the generated annotations.

Description

Code annotation generation model robustness improving method based on deep learning
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a deep-learning-based method for improving the robustness of a code annotation generation model.
Background
In software development and maintenance, program understanding is a very time-consuming task. Annotations written in natural language can greatly reduce the burden of understanding code segments and improve development efficiency, so annotations play an important role in program understanding. In recent years, deep learning has achieved great success in application fields such as computer vision and natural language processing, and researchers have gradually applied deep neural networks to the task of automatic code annotation generation, making considerable progress.
However, as deep learning technology develops rapidly, the reliability of deep learning models has gradually been exposed as a problem, and their quality has drawn growing attention. Because of their complex structure, deep learning models are extremely vulnerable to adversarial examples: small perturbations of the data, imperceptible to humans, can cause a model to make wrong judgments and output wrong results. Adversarial examples undoubtedly pose a huge threat to practical applications. For example, in payment tools such as Alipay, an adversarial example can cause the face recognition system to misidentify the payer as someone else, even a specific person; high-speed rail ticket gates rely on face recognition, and an adversarial attack can paralyze the ticket-checking system or even let a criminal escape pursuit.
Many researchers at home and abroad have discussed and studied how to improve a model's defense against adversarial examples and thus the robustness of deep learning models, and a large body of results has emerged. However, most current work targets only images, natural language, and the like; research on code-related tasks, especially the code annotation generation task, remains scarce. The main lines of research on improving the robustness of models for code-related tasks are presented below.
To improve a deep learning model's defense against adversarial examples, one must first study how to generate them. Yefet et al. proposed a gradient-based adversarial example generation method for the method name prediction task: gradient ascent is used to perform variable name replacement or dead-code insertion on the original code segment, thereby generating adversarial examples. In addition, Yefet et al. introduced a detection-based defense into their model, checking whether a sample is adversarial before it is fed to the model.
Zhang et al generated countermeasure samples for the code classification task by using Metropolis-Hastings algorithm for identifier renaming based on black box approach. Zhang et al introduced countermeasure training based on data enhancement to improve the robustness of the model, mixing the original samples with the clean samples and retraining the model.
Ramakrishnan et al generate a task for code annotation, adopt a gradient-based method, consider eight code transformations to generate a countermeasure sample, and generate the countermeasure sample by using a robust optimization-based countermeasure training method.
Research on the robustness of code annotation generation models is still lacking: Ramakrishnan et al. only used the method name in the code as the annotation, so the short annotations the model is trained to generate are method names rather than annotations written by programmers.
Disclosure of Invention
Aiming at the deficiencies of the prior art, the invention provides a deep-learning-based method for improving the robustness of a code annotation generation model, so as to solve the problems in the prior art that adversarial examples for code segments are difficult to generate, the search space is large, and many queries are required. The method improves the code annotation generation model's defense against adversarial examples and its reliability under abnormal conditions, thereby safeguarding the quality of the generated annotations.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
the invention discloses a code annotation generation model robustness improving method based on deep learning, which comprises the following steps:
step 1) model training: training a deep learning-based code annotation generation model and a deep neural network-based code coding network by using the disclosed data set;
step 2) challenge sample generation: generating a model and a code coding network by using the code annotation trained in the step 1), and generating a countermeasure sample for each piece of data on the training data set in the data set used in the step 1) by adopting a variable name replacement-based method;
step 3), training a new model: matching the confrontation samples on the training set generated in the step 2) with the training data set samples according to the ratio of 1: 1 mixing to ensure that the generated model is annotated by retraining the code under the condition of not changing the model parameters.
Further, the public data set in step 1) comprises: a training data set, a validation data set, and a test data set.
Further, the step 1) specifically includes:
11) for any deep-learning-based code annotation generation model, train the model without changing the original parameters, model structure, or data sets (training, validation, and test data sets);
12) apply camelCase (hump) decomposition to each code in the training data set, build a sequence-to-sequence model based on a long short-term memory network (LSTM), and, after the sequence-to-sequence model is trained, extract the encoder output of the model as the code encoding result, thereby constructing the code encoding network.
Further, step 2) generates adversarial examples with a black-box attack method, specifically comprising:
21) extract the code segments in the training data set used in step 1), encode each code segment with the code encoding network obtained in step 1), and represent each code as a fixed-dimension vector;
22) extract the identifiers (local variable names and method names) in each code segment as replacement candidates, ensuring code syntax correctness;
23) form an identifier set from the identifiers of each code segment extracted in step 22); for each identifier Id in a code segment, compute its similarity to all identifiers in the set, excluding the identifier itself and the code's formal parameters, and select the K most similar identifiers as candidate identifiers;
24) from the K candidate identifiers of each identifier selected in step 23), choose the best one as its optimal replacement, and determine the replacement order of all identifiers in each code segment;
25) according to the replacement order and best replacements of all identifiers determined in step 24), select the first M identifiers for replacement; during replacement, every position where an identifier appears is replaced, generating an adversarial example.
Further, step 3) specifically comprises:
31) mix the adversarial examples generated on the training data set in step 2) with the training data set samples at a 1:1 ratio, constructing a new mixed training data set;
32) retrain with the mixed training data set, keeping the structure and training parameters used when training the code annotation generation model in step 1) unchanged;
33) evaluate the robustness of the retrained model.
The invention has the beneficial effects that:
the invention generates a confrontation sample based on an identifier replacement method, analyzes the difference of a model between the confrontation sample and an original sample; by utilizing the generated countermeasure sample and combining with a countermeasure training method, the robustness of the code annotation generation model based on deep learning is further improved, and the method has the following advantages:
aiming at the code annotation generation task, the invention adopts a black box attack mode, and the generation of the countermeasure sample does not need to know the internal structure information of the model, and only needs the final output of the interface access model, thereby improving the speed and quality of the generation of the countermeasure sample, and having stronger migration performance.
The countermeasure sample generated by the invention solves the problem of large search space for identifier replacement, and the generated countermeasure sample can ensure the grammar correctness of the program language, simultaneously does not change the semantics of the original program, and ensures the consistent program output before and after conversion.
The robustness of the model is used for evaluating the defense capability of the model to the imperceptible disturbance, the robustness of the model can be well evaluated, and the model trained by the method has higher defense capability, so that the model has higher reliability under abnormal conditions, and the quality of generated annotations is ensured.
Drawings
FIG. 1 is a schematic diagram of the method of the present invention.
FIG. 2 is a schematic diagram of the adversarial example generation algorithm of the present invention.
Fig. 3 is a diagram of a code-encoded network architecture in accordance with the present invention.
Detailed Description
To facilitate understanding by those skilled in the art, the invention is further described below with reference to the embodiments and drawings, which are not intended to limit it.
Referring to fig. 1, the code annotation generation model robustness improving method based on deep learning according to the present invention includes the following steps:
step 1): select any existing code annotation generation model, i.e., the target model whose robustness is to be improved, and train it while keeping the original environment (data set, model structure, training parameters, and training environment) unchanged; construct a new code encoding network to represent the original code as a vector; wherein:
To encode the code better, when constructing the code encoding network, each word in the original code is first split by camelCase (hump) decomposition; e.g., the original word 'addTo' becomes 'add' and 'to' after decomposition. As shown in fig. 3, a sequence-to-sequence model based on a long short-term memory network is built, divided into an encoder part and a decoder part. Word embeddings are trained with Word2Vec so that each token of the segmented code is represented as a fixed-dimension vector; the Word2Vec-trained embedding table serves as the first layer of the encoder, the code is the encoder input, the output of the encoder's last hidden layer is the decoder input, and the annotation is the decoder output; the model is then trained. After the model is trained, the hidden-layer outputs of the encoder part are extracted, and the code is encoded into a fixed-dimension vector through a Softmax layer.
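The camelCase (hump) decomposition described above can be sketched as follows; this is an illustrative helper, not the patent's actual implementation, and the regex handles the common camelCase patterns only:

```python
import re

def hump_decompose(token: str):
    """Split a camelCase identifier into lowercase word pieces,
    e.g. 'addTo' -> ['add', 'to'], as in the patent's 'hump
    decomposition' preprocessing step (illustrative sketch)."""
    # Match either an optionally-capitalized lowercase run ('add', 'To')
    # or an all-caps run not followed by a lowercase letter ('HTTP').
    parts = re.findall(r'[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])', token)
    return [p.lower() for p in parts]
```

The resulting pieces are the tokens fed to the Word2Vec embedding layer of the encoder.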
step 2): generate adversarial examples from the code annotation generation model and code encoding network trained in step 1). The code annotation generation model is used to obtain the output for a given code, the quality of the generated annotation is evaluated with the BLEU-4 score, and the trained code encoding network represents the information of a specific code. As shown in fig. 2, wherein:
21) extract the code segments of the training data set; first apply camelCase decomposition to each code segment, encode it with the code encoding network obtained in step 1), and represent the code as a 64-dimensional vector: vec(p) = [v_1, v_2, ..., v_64];
22) for each Java code segment in the training data set, parse the code into an abstract syntax tree with the javalang parsing tool, extract the user-defined identifiers (method names and variable names) from the node information of the abstract syntax tree, merge the user-defined identifiers of all codes into a candidate identifier set, and build a code-identifier dictionary; extract the formal parameters of each code from the parsed abstract syntax tree and build a formal-parameter-to-code-segment dictionary;
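The extraction in step 22) can be sketched as below. The patent uses the javalang AST parser (walking MethodDeclaration, VariableDeclarator, and FormalParameter nodes); to keep the sketch self-contained, a toy regex recognizing a few primitive-typed declarations stands in for the AST walk:

```python
import re

# Simplified stand-in for javalang-based extraction: only a handful of
# Java types are recognized, enough to illustrate separating method
# names from variable/parameter names.
DECL = re.compile(r'\b(?:int|long|double|boolean|String)\s+(\w+)')
METHOD = re.compile(r'\b(?:void|int|long|double|boolean|String)\s+(\w+)\s*\(')

def extract_identifiers(java_code: str):
    """Return (method names, variable/parameter names) found in the code.
    In the patent, formal parameters would additionally be recorded in a
    formal-parameter-to-code-segment dictionary and excluded from replacement."""
    methods = set(METHOD.findall(java_code))
    variables = set(DECL.findall(java_code)) - methods
    return methods, variables
```

For example, `extract_identifiers("int add(int a, int b) { int sum = a + b; return sum; }")` separates the method name `add` from the names `a`, `b`, and `sum`.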
23) from each code and its identifiers extracted in step 22), together with the candidate identifier set, a per-code candidate identifier set is established. Word2Vec is used to train an embedded representation of each identifier, and cosine similarity is used to compute the K identifiers most similar to each identifier in a code as candidate replacement identifiers:
sim(w_i, w_j) = cos(embedding(w_i), embedding(w_j))
where w_i is an identifier in the code and w_j ranges over the identifier set after removing the formal parameters and the identifiers already appearing in the code segment; embedding(w_i) and embedding(w_j) denote the word embeddings of w_i and w_j.
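A minimal sketch of the candidate selection in step 23), assuming the Word2Vec embeddings have already been trained (toy hand-made vectors are used here; the function names are illustrative):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

def top_k_candidates(word, embeddings, exclude, k):
    """Rank every other identifier by cos(embedding(w_i), embedding(w_j)),
    skipping the identifier itself and the excluded names (the code's
    formal parameters and identifiers already appearing in the segment);
    return the K most similar as candidate replacements."""
    scored = sorted(
        ((cosine(embeddings[word], emb), w)
         for w, emb in embeddings.items()
         if w != word and w not in exclude),
        reverse=True)
    return [w for _, w in scored[:k]]
```

With toy 2-dimensional embeddings, `top_k_candidates("add", {...}, exclude=set(), k=2)` returns the two identifiers whose vectors point closest to that of `add`.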
24) using the code encoding network trained in step 1), compute the vector representation of each code, and compute the distance between each identifier and the code with the cosine distance:
S(p, w_i) = cos(vec(p), embedding(w_i))
For the K candidate replacement identifiers of each identifier selected in step 23): when an identifier w_i is replaced by a candidate, every position where w_i appears is replaced, and the change in the model result before and after replacement is computed; the candidate causing the largest change is selected as the best replacement identifier:
w_i* = argmax over candidates w_j of Δscore(w_j)
where p is the code before replacement and p* is the code after replacing w_i with w_j; score(·) is the BLEU-4 score of the annotation the model generates, and
Δscore = score(p) - score(p*)
is taken as the optimal change.
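The greedy best-replacement choice of step 24) can be sketched as follows. The `score` argument stands in for the black-box query that runs the model and computes BLEU-4 on its output; `whole_word_sub` is an illustrative helper, not named in the patent:

```python
import re

def whole_word_sub(code: str, old: str, new: str) -> str:
    """Replace every whole-word occurrence of an identifier."""
    return re.sub(r'\b%s\b' % re.escape(old), new, code)

def best_replacement(code, identifier, candidates, score):
    """For each candidate w_j, replace every occurrence of the identifier,
    query the (black-box) model score, and keep the candidate maximizing
    delta = score(p) - score(p*)."""
    base = score(code)
    best, best_delta = None, float('-inf')
    for cand in candidates:
        perturbed = whole_word_sub(code, identifier, cand)
        delta = base - score(perturbed)
        if delta > best_delta:
            best, best_delta = cand, delta
    return best, best_delta
```

In real use, `score` would be one BLEU-4 evaluation of the model's generated annotation per candidate, which is why the per-identifier candidate set is pruned to K beforehand.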
After the optimal change and the text similarity of each identifier are computed, the identifiers in a code are reordered with a ranking score H:
[equation images not preserved: H(w_i) combines the optimal change Δscore(w_i) with the code-identifier similarity S(p, w_i)]
H is sorted in descending order to obtain the identifier replacement order, and the first M identifiers are selected for replacement.
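Step 25) then applies the top-M replacements in ranking order. Because the exact form of H survives only as an equation image in the patent, the sketch below simply takes precomputed H values as input; `whole_word_sub` is an illustrative helper:

```python
import re

def whole_word_sub(code, old, new):
    """Replace every whole-word occurrence of an identifier."""
    return re.sub(r'\b%s\b' % re.escape(old), new, code)

def generate_adversarial(code, best, m):
    """`best` maps identifier -> (best replacement, ranking score H).
    Sort identifiers by H descending and replace the first M of them
    at every position where they appear, yielding the adversarial code."""
    order = sorted(best, key=lambda w: best[w][1], reverse=True)
    adv = code
    for w in order[:m]:
        adv = whole_word_sub(adv, w, best[w][0])
    return adv
```

Replacing whole words only is what keeps the transformed program syntactically valid and semantically unchanged, as the advantages section claims.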
step 3): mix the adversarial examples generated on the training data set in step 2) with the samples of the training data set (the one used to train the model in step 1) at a 1:1 ratio, shuffle the order, and retrain the code annotation generation model under the same training environment and model structure as the original training, obtaining a new model. The samples of the test data set from step 1) and the adversarial examples generated on the test data set are then used to evaluate the robustness of the model before and after adversarial training. The evaluation criterion is the relative decrease:
R = (BLEU(y, refs) - BLEU(y', refs)) / BLEU(y, refs)
where y denotes the annotation the model generates on a sample of the test data set, refs denotes the standard annotation of that sample, and y' denotes the annotation the model generates on the adversarial example generated from the test data set.
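The relative-decrease metric is a one-liner once the two BLEU-4 averages are available; the function name is illustrative:

```python
def relative_decrease(bleu_clean: float, bleu_adv: float) -> float:
    """R = (BLEU(y, refs) - BLEU(y', refs)) / BLEU(y, refs): the fraction
    of annotation quality lost under adversarial input.  A smaller R after
    adversarial retraining indicates a more robust model."""
    if bleu_clean == 0:
        raise ValueError("clean BLEU must be non-zero")
    return (bleu_clean - bleu_adv) / bleu_clean
```

For example, a model scoring BLEU-4 of 0.40 on clean test samples and 0.30 on their adversarial counterparts has lost a quarter of its quality (R = 0.25).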
While the invention has been described in terms of its preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (5)

1. A deep-learning-based method for improving the robustness of a code annotation generation model, characterized by comprising the following steps:
step 1) model training: training a deep-learning-based code annotation generation model and a deep-neural-network-based code encoding network on a public data set;
step 2) adversarial example generation: using the code annotation generation model and code encoding network trained in step 1), generating an adversarial example for each item of data in the training data set of step 1) with a variable-name-replacement-based method;
step 3) training a new model: mixing the adversarial examples generated on the training set in step 2) with the training data set samples at a 1:1 ratio, and retraining the code annotation generation model without changing the model structure or training parameters.
2. The method for improving robustness of a code annotation generation model based on deep learning according to claim 1, wherein the public data set in step 1) comprises: a training data set, a validation data set, and a test data set.
3. The method for improving robustness of a code annotation generation model based on deep learning according to claim 1, wherein step 1) specifically comprises:
11) for any deep-learning-based code annotation generation model, training the model without changing the original parameters, model structure, or data sets;
12) applying camelCase (hump) decomposition to each code in the training data set, building a sequence-to-sequence model based on a long short-term memory network, and, after the sequence-to-sequence model is trained, extracting the encoder output of the model as the code encoding result, thereby constructing the code encoding network.
4. The method for improving robustness of a code annotation generation model based on deep learning according to claim 1, wherein step 2) generates adversarial examples with a black-box attack method, specifically comprising:
21) extracting the code segments in the training data set used in step 1), encoding each code segment with the code encoding network obtained in step 1), and representing each code as a fixed-dimension vector;
22) extracting the identifiers (local variable names and method names) in each code segment as replacement candidates, ensuring code syntax correctness;
23) forming an identifier set from the identifiers of each code segment extracted in step 22); for each identifier Id in a code segment, computing its similarity to all identifiers in the set, excluding the identifier itself and the code's formal parameters, and selecting the K most similar identifiers as candidate identifiers;
24) from the K candidate identifiers of each identifier selected in step 23), choosing the best one as its optimal replacement, and determining the replacement order of all identifiers in each code segment;
25) according to the replacement order and best replacements of all identifiers determined in step 24), selecting the first M identifiers for replacement; during replacement, every position where an identifier appears is replaced, generating an adversarial example.
5. The method for improving robustness of a code annotation generation model based on deep learning according to claim 1, wherein step 3) specifically comprises:
31) mixing the adversarial examples generated on the training data set in step 2) with the training data set samples at a 1:1 ratio, constructing a new mixed training data set;
32) retraining with the mixed training data set, keeping the structure and training parameters used when training the code annotation generation model in step 1) unchanged;
33) evaluating the robustness of the retrained model.
Application CN202011178831.XA, filed 2020-10-29 (priority 2020-10-29): Code annotation generation model robustness improving method based on deep learning, status Pending, publication CN112270358A.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011178831.XA CN112270358A (en) 2020-10-29 2020-10-29 Code annotation generation model robustness improving method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011178831.XA CN112270358A (en) 2020-10-29 2020-10-29 Code annotation generation model robustness improving method based on deep learning

Publications (1)

Publication Number Publication Date
CN112270358A true CN112270358A (en) 2021-01-26

Family

ID=74345269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011178831.XA Pending CN112270358A (en) 2020-10-29 2020-10-29 Code annotation generation model robustness improving method based on deep learning

Country Status (1)

Country Link
CN (1) CN112270358A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113282336A (en) * 2021-06-11 2021-08-20 重庆大学 Code abstract integration method based on quality assurance framework
CN113282336B (en) * 2021-06-11 2023-11-10 重庆大学 Code abstract integration method based on quality assurance framework
CN115905926A (en) * 2022-12-09 2023-04-04 华中科技大学 Code classification deep learning model interpretation method and system based on sample difference
CN115905926B (en) * 2022-12-09 2024-05-28 华中科技大学 Code classification deep learning model interpretation method and system based on sample difference


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination