CN112162775A

CN112162775A - Method for automatically generating Java code comments based on Transformer and hybrid code representation

Info

Publication number: CN112162775A
Application number: CN202011129802.4A
Authority: CN
Inventors: 陈翔; 杨光; 刘珂; 田丹; 贾焱鑫; 于池; 胡新宇
Original assignee: Nantong University
Current assignee: Nantong University
Priority date: 2020-10-21
Filing date: 2020-10-21
Publication date: 2021-01-01

Abstract

The invention provides a method for automatically generating Java code comments based on Transformer and a hybrid code representation, comprising the following steps: S1, downloading Java projects and constructing a code corpus; S2, at the serialization layer, converting the code into a code token vector and an SBT vector on the basis of AST traversal; S3, at the encoding layer, using a Code encoder to extract lexical information from the source code and an SBT encoder to obtain the structural information of the code; S4, at the decoding layer, decoding the semantic information to generate a comment. The beneficial effect of the invention is that, for code comment generation, both the code and the AST-based SBT traversal sequence are encoded at the encoding layer, and the semantic information learned from the two is merged to capture the semantics of the source code.

Description

Method for automatically generating Java code comments based on Transformer and hybrid code representation

Technical Field

The invention relates to the technical field of computer applications, and in particular to a method for automatically generating Java code comments based on Transformer and a hybrid code representation.

Background

During software development and maintenance, the comments that accompany code are often missing, insufficient, or inconsistent with what the code actually does. Writing comments by hand costs developers time and effort, and the quality of the result is hard to guarantee, so an effective method for generating code comments automatically is urgently needed.

Code comment generation aims to produce a natural-language description of source code, which helps developers understand the program and thereby reduces the time cost of software maintenance. Most recent techniques use a Seq2Seq model based on an RNN (recurrent neural network) or a CNN (convolutional neural network). These approaches have drawbacks: CNNs cannot be applied directly to variable-length sequence samples, while RNNs cannot be computed in parallel and are therefore inefficient.

Disclosure of Invention

The invention aims to provide a method for automatically generating Java code comments based on Transformer and a hybrid code representation. It addresses problems in the prior art that arise when code comments are missing during software development and maintenance: poor program readability, poor understandability, and increased development and maintenance cost. For code comment generation, the method encodes both the code and the AST-based SBT traversal sequence at the encoding layer and combines the semantic information learned from the two to capture the semantics of the source code. The invention automates code comment generation, produces concise and accurate comments, improves the readability and understandability of code, reduces development and maintenance cost, and improves development and maintenance efficiency.

The invention is realized by the following measures: a method for automatically generating Java code comments based on Transformer and a hybrid code representation, comprising the following steps:

S1, downloading Java projects and constructing a code corpus;

S2, at the serialization layer, converting the code into a code token vector and an SBT vector on the basis of AST traversal;

to address the out-of-vocabulary problem, identifiers in the code tokens and the AST nodes are split into words according to the camel-case naming convention;

S3, at the encoding layer, using a Code encoder and an SBT encoder, wherein the Code encoder extracts lexical information from the source code and the SBT encoder obtains the structural information of the code;

S4, at the decoding layer, decoding the semantic information to generate a comment.

As a further refinement of the method provided by the invention, step S2 generates two input sequences, one for the Code encoder and one for the SBT encoder. Generating these input sequences specifically comprises the following steps:

S201, splitting the identifier names in the source code into several words according to the camel-case naming convention;

S202, converting all the resulting words to lowercase;

S203, the OOV (Out-Of-Vocabulary) problem arises when rare words, derived words, plural forms, or words formed by other compounding rules cannot be represented by the existing word-vector model; to mitigate it, specific numbers and strings are replaced with the tags "<NUM>" and "<STR>" respectively, yielding the input sequence of the Code encoder;

S204, for the Java code data, parsing each Java method into an abstract syntax tree (AST) with the Eclipse JDT compiler, and traversing the tree with the SBT traversal method to obtain the input sequence of the SBT encoder.
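Steps S201 to S203 can be sketched as follows. This is a minimal illustration rather than the inventors' implementation: the function names `split_identifier` and `normalize_tokens`, the regular expressions, and the assumption that the code has already been tokenized into a list of strings are all hypothetical.

```python
import re

# Hypothetical helpers; the patent does not prescribe an implementation.
# Matches acronym runs ("XML" in "XMLFile"), capitalised or lowercase words,
# remaining uppercase runs, and digit runs.
_CAMEL = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(name):
    """S201 + S202: split a camel-case identifier and lowercase the parts."""
    parts = []
    for chunk in name.split("_"):
        parts.extend(_CAMEL.findall(chunk))
    return [p.lower() for p in parts]


def normalize_tokens(tokens):
    """S203: replace literals with <NUM>/<STR>, split identifiers."""
    out = []
    for tok in tokens:
        if re.fullmatch(r"\d+(\.\d+)?[fFdDlL]?", tok):
            out.append("<NUM>")      # decimal numeric literal
        elif len(tok) >= 2 and tok[0] == '"' and tok[-1] == '"':
            out.append("<STR>")      # string literal
        elif re.fullmatch(r"[A-Za-z_$][A-Za-z0-9_$]*", tok):
            out.extend(split_identifier(tok))  # identifier or keyword
        else:
            out.append(tok)          # operators and punctuation unchanged
    return out
```

For example, `normalize_tokens(["int", "maxCount", "=", "42", ";"])` yields `["int", "max", "count", "=", "<NUM>", ";"]`. Hexadecimal literals and character literals are left untouched by this sketch.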

As a further refinement of the method provided by the invention, step S3 uses a Code encoder and an SBT encoder, and the information obtained by the two encoders is merged into the output sequence of the encoding layer.

Compared with the prior art, the invention has the following beneficial effects: it is a new code comment generation method based on a hybrid code representation and Transformer. The Transformer can achieve better performance than the traditional Seq2Seq (Sequence to Sequence) model. The code and the AST-based SBT traversal sequence are both encoded at the encoding layer, and the semantic information learned from the two is merged to obtain the semantics of the source code.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.

FIG. 1 is a general flow diagram of the present invention.

FIG. 2 is a graph illustrating a comparison result of different encoder configurations according to the present invention.

FIG. 3 is a graph illustrating a comparison result curve of different encoder configurations according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. Of course, the specific embodiments described herein are merely illustrative of the invention and are not intended to be limiting.

Example 1

Referring to FIGS. 1 to 3, the technical solution provided by the invention is a method for automatically generating Java code comments based on Transformer and a hybrid code representation, wherein the method includes the following steps:

S1, downloading Java projects and constructing a code corpus;

S202, converting all the resulting words to lowercase;

Interpretation of terms:

Abstract Syntax Tree (AST): also known as a syntax tree, an abstract tree representation of the syntactic structure of code, in which each node represents a construct in the code.

Referring to FIG. 1, the new code comment generation method based on a hybrid code representation and Transformer proceeds as follows:

1. collecting the code comment corpus

1.1 collecting Java methods and their corresponding Javadoc comments from GitHub by means of a crawler;

1.2 removing Java methods that have no Javadoc comment;

1.3 following the Javadoc guidelines, taking the first sentence of the Javadoc comment as the comment corresponding to the Java method;

1.4 removing Java methods whose comment contains only a single word;

1.5 removing setter/getter methods, constructors, and test methods, because comments for these methods are comparatively easy to generate;

1.6 removing methods marked with @SmallTest, @LargeTest, or @MediumTest, as well as methods of the overloaded type;
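The filtering rules of step 1 can be sketched as a single predicate over (method, comment) pairs. This is an illustrative sketch under simplifying assumptions that are not in the patent: getters, setters, and tests are recognised by name prefix only, constructors and overloaded methods are not detected, and the helper names `first_sentence` and `keep_method` are hypothetical.

```python
import re

TEST_SIZE_ANNOTATIONS = {"SmallTest", "MediumTest", "LargeTest"}


def first_sentence(javadoc):
    """Step 1.3: return the first sentence of a Javadoc description."""
    text = " ".join(javadoc.split())        # collapse line breaks and spaces
    match = re.search(r"\.(\s|$)", text)    # a period followed by space or end
    return text[: match.start() + 1] if match else text


def keep_method(name, annotations, comment):
    """Return True if the pair survives filters 1.2 and 1.4 to 1.6.

    Assumption: name-prefix heuristics stand in for real AST-based checks.
    """
    if not comment:                                   # 1.2: no Javadoc
        return False
    if len(first_sentence(comment).split()) <= 1:     # 1.4: one-word comment
        return False
    if re.match(r"(get|set|test)[A-Z_]", name):       # 1.5: getter/setter/test
        return False
    if TEST_SIZE_ANNOTATIONS & set(annotations):      # 1.6: test-size markers
        return False
    return True
```

A method named `computeTotal` with the comment "Computes the total price. Details follow." is kept and paired with "Computes the total price.", whereas `getName` is dropped by rule 1.5.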

2. building the hybrid representation of each Java method in the corpus;

2.1 from the lexical perspective, converting the code into a word sequence Seqcode;

2.1.1 many words in the code are identifiers (class names, method names, variable names, and so on); to learn the information in the code better, the identifiers are further split into several words according to the camel-case naming convention;

2.1.2 converting all words to lowercase;

2.1.3 representing the numeric values and strings in the code with special symbols, for example <NUM> for a numeric value and <STR> for a string, to mitigate the OOV (Out-Of-Vocabulary) problem and obtain the input sequence of the Code encoder;

2.2 from the syntactic perspective, converting the code into the SBT input to obtain the sequence SeqSBT;

2.2.1 using JDT, the Eclipse Java development tooling, the Java code is converted into an abstract syntax tree. An Abstract Syntax Tree (AST), also called a syntax tree, is an abstract tree representation of the syntactic structure of code, in which each node represents a construct in the code.

2.2.2 the sequence is then generated by traversing the abstract syntax tree with the SBT method. SBT (Structure-Based Traversal) is a structure-based method for traversing an AST: the subtree under a given node is enclosed in a pair of brackets, and the brackets indicate the structure of the AST. The traversal therefore preserves the structure of the abstract syntax tree, and a sequence generated with SBT can be converted back unambiguously into the original tree;
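The bracketed traversal of step 2.2.2 and its reversibility can be illustrated with a small sketch. The `Node` class is a hypothetical stand-in for a JDT AST node, and repeating the node label after the closing bracket is one common formulation of SBT that is assumed here rather than taken from the patent text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal AST node; a real pipeline would wrap JDT nodes."""
    label: str
    children: List["Node"] = field(default_factory=list)


def sbt(node):
    """Structure-based traversal: enclose each subtree in a labelled bracket pair."""
    seq = ["(", node.label]
    for child in node.children:
        seq.extend(sbt(child))
    seq.extend([")", node.label])
    return seq


def from_sbt(seq):
    """Rebuild the tree from an SBT sequence of (bracket, label) pairs."""
    stack, root = [], None
    for i in range(0, len(seq), 2):
        bracket, label = seq[i], seq[i + 1]
        if bracket == "(":
            node = Node(label)
            if stack:
                stack[-1].children.append(node)  # attach to the open parent
            stack.append(node)
        else:
            root = stack.pop()                   # subtree is complete
    return root


tree = Node("MethodDeclaration",
            [Node("SimpleName"), Node("Block", [Node("ReturnStatement")])])
sequence = sbt(tree)
```

Here `sequence` begins `( MethodDeclaration ( SimpleName ) SimpleName ( Block ...`, and `from_sbt(sequence)` returns a tree equal to `tree`, which is the property the description relies on.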

3. building the automatic comment generation model with the Transformer on the basis of the hybrid representation of the Java code, and training it with <code, comment> pairs as training input; the special tokens <sos> and <eos> are added to each training sequence as start and end markers;

3.1 the encoding layer contains two encoders: the Code encoder learns the lexical information of the Java code from the sequence Seqcode, and the SBT encoder learns the syntactic information of the Java code from the sequence SeqSBT, so that together the two encoders learn the semantics of the code effectively. The equally sized matrices produced by the two encoders are combined, the result is compressed back to the size of a single source matrix, a tanh (hyperbolic tangent) activation function is applied to add nonlinearity to the network, and the result is fed to the decoder;

3.2 at the decoder layer, positional encodings are combined with the scaled target-token embeddings by element-wise summation, and dropout is then applied (dropout means that, during deep-learning training, units of the neural network are temporarily removed from the network with a certain probability); the combined embeddings, together with the encoded source, the source mask, and the target mask, pass through 2N decoder layers to obtain the predicted tokens of the code comment Ŷ corresponding to the Java code;

3.3 Ŷ is compared with the actual tokens of the target comment Y to compute the loss, which is used to compute the gradients of the parameters; the Adam (adaptive moment estimation) optimizer then updates the weights to improve the performance of the trained model;
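The merging of the two encoder outputs described in step 3.1 can be sketched in plain Python. The sketch assumes that "combining" means concatenating the two outputs along the feature dimension and that "compressing" is a linear projection with a weight matrix; both are assumptions, since the patent does not spell out the operations, and the names `matmul`, `fuse`, and `weight` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def fuse(code_out, sbt_out, weight):
    """Merge two (seq_len x d) encoder outputs into one (seq_len x d) matrix.

    Assumed scheme: concatenate feature-wise to (seq_len x 2d), project
    back to (seq_len x d) with `weight` (2d x d), then apply tanh.
    """
    joined = [c + s for c, s in zip(code_out, sbt_out)]    # seq_len x 2d
    projected = matmul(joined, weight)                      # seq_len x d
    return [[math.tanh(v) for v in row] for row in projected]


# Toy example with seq_len = 1 and d = 2; the weights are arbitrary.
code_out = [[1.0, 0.0]]
sbt_out = [[0.0, 1.0]]
weight = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
fused = fuse(code_out, sbt_out, weight)
```

The result has the same shape as either encoder output, so the decoder can attend over it as if it came from a single encoder. In a trained model `weight` would be a learned parameter.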

4. application of models

4.1 when predicting the comment for new Java code, the code is first processed with the method of step 2 to obtain the initial Seqcode and SeqSBT sequences; the two initial sequences are then padded or truncated: if the length exceeds the preset value, the excess part is cut off, and if it falls short, the sequence is filled with <pad> tags, yielding the final Seqcode and SeqSBT sequences;

4.2 the two sequences are fed into their corresponding encoders, passed through a standard embedding layer, and summed element-wise with the positional embedding, yielding vectors that carry information about each token and its position in the sequence; before the addition, the token embeddings are multiplied by a scaling factor derived from d_model (in the standard Transformer this factor is sqrt(d_model)), where d_model is the hidden-layer dimension; the scaling factor effectively reduces the variance in the embeddings, and dropout is then applied to the summed embeddings to avoid overfitting.

4.3 the combined embeddings, together with the encoded source, the source mask, and the target mask, pass through 2N decoder layers to produce the target comment.
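The input preparation of steps 4.1 and 4.2 can be sketched as follows. The sketch assumes the scaling factor sqrt(d_model) and the sinusoidal positional encoding of the original Transformer; the patent text names neither explicitly, so both are assumptions, as are the function names. Dropout is omitted for brevity.

```python
import math

PAD = "<pad>"


def pad_or_truncate(tokens, max_len):
    """Step 4.1: cut off tokens beyond max_len, or right-pad with <pad>."""
    return tokens[:max_len] + [PAD] * max(0, max_len - len(tokens))


def positional_encoding(position, d_model):
    """Sinusoidal encoding of one position (assumed, not stated in the patent)."""
    return [
        math.sin(position / 10000 ** (i / d_model)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]


def embed(token_vectors):
    """Step 4.2: scale token embeddings by sqrt(d_model), add position info."""
    d_model = len(token_vectors[0])
    scale = math.sqrt(d_model)
    return [
        [scale * v + p for v, p in zip(vec, positional_encoding(pos, d_model))]
        for pos, vec in enumerate(token_vectors)
    ]
```

For instance, `pad_or_truncate(["a"], 3)` gives `["a", "<pad>", "<pad>"]`, and `embed` applied to a single all-zero vector of dimension 4 returns the bare positional encoding of position 0, `[0.0, 1.0, 0.0, 1.0]`.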

The above description merely illustrates preferred embodiments of the invention and is not to be construed as limiting it; any modifications, equivalent substitutions, and improvements made within the spirit and principle of the invention are intended to fall within its scope of protection.

Claims

1. A method for automatically generating Java code comments based on Transformer and a hybrid code representation, characterized by comprising the following steps:

S1, downloading Java projects and constructing a code corpus;

S2, at the serialization layer, converting the code into a code token vector and an SBT vector on the basis of AST traversal; splitting identifiers in the code tokens and the AST nodes into several words according to the camel-case naming convention;

2. The method for automatically generating Java code comments based on Transformer and a hybrid code representation according to claim 1, wherein step S2 generates two input sequences, one for a Code encoder and one for an SBT encoder, and generating the input sequence of the Code encoder specifically comprises the following steps:

S202, converting all the resulting words to lowercase;

3. The method for automatically generating Java code comments based on Transformer and a hybrid code representation according to claim 1 or 2, wherein step S3 uses a Code encoder and an SBT encoder, and the information obtained by the two encoders is merged into the output sequence of the encoding layer.