WO2019223804A1

WO2019223804A1 - Method for generation of code annotations based on program analysis and recurrent neural network

Info

Publication number: WO2019223804A1
Application number: PCT/CN2019/088516
Authority: WO
Inventors: 周宇; 闫鑫; 黄志球
Original assignee: 南京航空航天大学
Priority date: 2018-12-21
Filing date: 2019-05-27
Publication date: 2019-11-28
Also published as: CN109783079A

Abstract

Disclosed is a method for generation of code annotations based on program analysis and a recurrent neural network, comprising the following steps: building a large-scale code library; extracting information included in each Java method within a Java project and dependency information thereof; according to the extracted information and in combination with a heuristic method, filtering and reconstructing the execution code portion of each Java method; obtaining annotations matching the execution code; assembling the filtered code and the corresponding annotations into a code/annotation pair set, and using same as a training set for a code annotation generation model; by means of the obtained training set, using an encoding-decoding model to perform code annotation generation model training; after the model training is complete, performing prediction. The method generates simple and clear annotations, can help developers to understand code functions, accelerates software maintenance processes, and increases software product quality.

Description

Method for generating code comments based on program analysis and recurrent neural network

Technical field

The invention belongs to the field of software engineering technology, and particularly relates to a method for automatically generating code comments for a Java method using program static analysis, natural language processing, and neural network technology.

Background art

With the continuous deepening of computer applications, software has gradually penetrated and integrated into all areas of the national economy, and the software ecosystem has undergone profound changes. New software forms and development models have continuously emerged, and the scale and number of software systems are expanding at an alarming rate. This trend has become increasingly clear. The sharp increase in social demand has brought new challenges and opportunities to software productivity at this stage.

Large-scale empirical research shows that more than 60% of software engineering resources are used for software maintenance. Software maintenance is the process of modifying a software system after delivery to fix errors, improve performance, or adapt to a changing environment. Software maintenance requires code understanding: reading and understanding the source code is a prerequisite for any modification. Program understanding is time-consuming and takes up most of a developer's time. Developers often use integrated development environments, debuggers, and tools for code search, code testing, and program understanding to reduce tedious tasks. If code has no accompanying documentation, potential differences in the context of the code add an extra burden of understanding and may even lead to incorrect use of the code, which reduces development efficiency, consumes software development and maintenance resources, and can even affect later software quality. Documentation, as an element of software, is an important means of assisting code understanding. As developers increasingly use unfamiliar code or application programming interfaces (APIs), accurate documentation has become a key factor affecting the usability of that code or those APIs. Code that lacks accompanying documentation therefore inevitably places an additional burden on the programmer, because understanding code is itself a time-consuming task. Compared with the rapid increase in code size and complexity, developers' ability to understand programs has not increased in parallel, so the importance of documentation has become more prominent. Searching the Internet for relevant documents and recommending them to users can effectively speed up the code understanding process, which indirectly improves software development productivity.

However, not every code snippet has a corresponding summary explanation. Moreover, as with code search and recommendation, documentation retrieval faces a large volume of information, and it is difficult to find a directly related explanation through a general search engine. Tools such as Doxygen and Javadoc can generate structured documents from markup information such as annotations, but their content must be filled in by the user, so they do not fall into the category of automated code annotation generation technology. Researching new code summary generation methods to alleviate information overload in the era of Internet big data has therefore become an urgent need for software development and maintenance personnel. Furthermore, the flagship and specialized conferences in the software engineering field, such as ICSE, ESEC/FSE, ASE, ICSME, SANER, and MSR, publish a large number of papers every year on code (such as API) and supporting documentation generation. In the era of big data, with a prosperous Internet and open source software ecosystem, code annotation generation has received more and more attention and has become a popular research area with important theoretical and practical value.

Summary of the Invention

In view of the above-mentioned shortcomings of the prior art, an object of the present invention is to provide a method for generating code comments based on program analysis and a recurrent neural network, so as to solve the problems of poor program readability, poor understandability, and increased software development and maintenance costs caused by the lack of code comments in the software development and maintenance process in the prior art. The invention automates code comment generation, generates concise and accurate comments for the code, improves the readability and understandability of the code, reduces the cost of code development and maintenance, and improves the efficiency of code development and maintenance.

In order to achieve the above objective, the technical solution adopted by the present invention is as follows:

A method for generating code comments based on program analysis and recurrent neural network according to the present invention includes the following steps:

(1) Download the Java project and build the code base;

(2) Extract, for each Java method in the Java project, the information contained in the method itself and its dependency information;

(3) According to the information extracted in the above step (2), combine the heuristic method to filter and reconstruct the execution code part of each Java method;

(4) For each Java method in the Java project, analyze its Javadoc, set a template for filtering Javadoc, and combine the part-of-speech tagging method to filter the Javadoc to obtain a comment that matches the executed code;

(5) Combine the filtered code and matching comments into a set of <code, comment> pairs, as a training set for the code comment generation model;

(6) Using the training set obtained in step (5), train the code annotation generation model with the encoding-decoding model;

(7) After the model training is completed, perform prediction: given the execution code part of a Java method, generate the corresponding comment.

Further, in the step (2), the information contained in each Java method itself and its dependency information are extracted by parsing the abstract syntax tree. The abstract syntax tree is a tree-like representation of the abstract syntax structure of the source code.

Further, in the step (2), the information contained in the Java method itself includes the method name, local variable name information, local variable type information, constant values (for example, string constants), and method call information; the Java method dependency information includes class member variable information, and the method declaration information and qualified name information corresponding to method calls.

Further, in the step (3), the execution code refers to program code that implements a Java method function.

Further, in the step (3), the heuristic method refers to setting replacement and reconstruction rules for the execution code in the Java method. These rules implement constant value replacement, loop and conditional structure reconstruction, method call information replacement, and variable name and variable type replacement, so as to filter and refactor the Java method.

Further, in the step (4), Javadoc refers to an application programming interface (API) help document corresponding to each Java method, which is a document having a semi-structured feature.

Further, in the step (4), the part-of-speech tagging method is a process of determining the grammatical category of each word in a given sentence, determining its part-of-speech, and tagging.

Further, in the step (5), each <code, comment> pair embodies the correspondence between the code of a Java method and its comment, and the pairs serve as the training set of the code comment generation model.

Further, in the step (6), the encoding-decoding (Encoder-Decoder) model is a model for neural network machine translation. It includes two parts: an encoder, which maps the input sequence to a vector of a fixed dimension, and a decoder, which decodes the vector of a fixed dimension to output a target sequence.

Secondly, the present invention provides a computer-readable storage medium that stores a computer program. When the program is executed by a processor, the following method for generating code comments based on program analysis and recurrent neural network can be implemented:

(1) Download the Java project and build the code base;

Abstract syntax tree analysis is used to extract the information contained in each Java method itself and its dependency information. The abstract syntax tree is a tree-like representation of the abstract syntax structure of the source code;

The information contained in the Java method itself includes the method name, local variable name information, local variable type information, constant values (for example, string constants), and method call information; the Java method dependency information includes class member variable information, and the method declaration information and qualified name information corresponding to method calls;

Execution code refers to program code that implements the function of a Java method; the heuristic method refers to setting replacement and reconstruction rules for the execution code in the Java method, which implement constant value replacement, loop and conditional structure reconstruction, method call information replacement, and variable name and variable type replacement, so as to filter and refactor the Java method;

Javadoc refers to the application programming interface (API) help document corresponding to each Java method, which is a document with semi-structured features; the part-of-speech tagging method is the process of determining the grammatical category of each word in a given sentence, determining its part of speech, and tagging it;

Each <code, comment> pair embodies the correspondence between the code of a Java method and its comment, and the pairs serve as the training set for the code comment generation model;

The encoding-decoding (Encoder-Decoder) model is a model for neural network machine translation. It contains two parts: an encoder, which maps the input sequence to a vector of a fixed dimension, and a decoder, which decodes the vector of a fixed dimension to output a target sequence;

Further, the computer-readable storage medium further includes a class library on which the program runs, a Java project collection, and a pre-trained code comment generation model.

Third, the present invention provides a code comment generating terminal, which includes one or more processors and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the following code comment generation method based on program analysis and recurrent neural network:

(1) Download the Java project and build the code base;

The above code comment generating terminal preferably includes one or more Intel i7-6700 processors, one GeForce GTX 1070 Ti graphics card, two 16G DDR4 main memories, and two 2T random access memories for storing one or more programs. It is configured with a program running environment, preferably including a Linux operating system, an installed and configured JDK (Java Development Kit), a Python 3.6 running environment, and a TensorFlow environment, to support the running of the program.

The beneficial effects of the present invention:

The present invention mainly uses program static analysis, natural language processing, neural networks, and other technologies to implement automatic code annotation generation, which assists developers in understanding the code, enhances the readability and understandability of the code, reduces the burden of manual understanding, and reduces software development and maintenance costs.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of the preprocessing of the execution code and Javadoc in the source code according to the present invention.

Detailed description

In order to facilitate the understanding of those skilled in the art, the present invention is further described below with reference to the embodiments and the accompanying drawings. The content mentioned in the embodiments is not a limitation on the present invention.

Referring to FIG. 1, a method for generating code comments based on program analysis and recurrent neural network of the present invention includes the following contents:

(1) Construction of the code base

Neural network model training is data-driven. The ultimate goal of the present invention is to train a neural network-based code annotation generation model. Therefore, a large-scale code base needs to be built to meet the needs of model training. In the example, 6705 Java projects are downloaded from the open source community GitHub to build the code base. Each Java project is parsed using an abstract syntax tree, the Java methods are extracted from it, and the Javadoc corresponding to each Java method is also extracted.

(2) Code information extraction

Comments are generated for Java methods, so code information refers to information related to Java methods. Since a Java method does not exist in isolation, the code information consists of two parts: the information contained in the Java method itself and its dependency information. The information is extracted by using the abstract syntax tree module of Eclipse JDT to analyze the abstract syntax tree of the Java project. Eclipse JDT is an open source Java development tool; it provides a rich set of application programming interfaces (APIs) designed for code parsing tasks, and these APIs are used to complete the code information extraction.

First, extraction of the information contained in the Java method itself: so that the code filtering and refactoring module can run smoothly, some information contained in the Java method itself must be extracted, including method names, local variable name information, local variable type information, constant values (for example, string constants), and method call information. By calling the relevant APIs in Eclipse JDT, the abstract syntax tree of the Java project is analyzed to extract the above information.

Second, extraction of Java method dependency information: when a Java method lacks its corresponding dependency information, the method may be hard to understand or even meaningless. For example, a Java method may use class member variables, whose declarations do not exist in the method body. Given only the code segment of a single Java method, the type and initial value information of some class member variables used in the method cannot be obtained, which makes the program harder to understand. Class member variable information therefore belongs to dependency information, and without it developers may not fully understand the code. Java method dependency information mainly includes class member variable information, and the method declaration information and qualified name information corresponding to method calls. To extract the dependency information, the Java project is treated as a whole: the APIs provided by Eclipse JDT are called to parse the abstract syntax tree of the Java project, the abstract syntax tree corresponding to each Java method is traversed, and the corresponding dependency information is extracted.

(3) Code filtering and refactoring

With the above extracted code information, the code filtering and reconstruction process can proceed smoothly. Code filtering and refactoring mainly include:

31) Constant value replacement: Many constant values exist in the form of aliases in Java methods, and the use of aliases increases the size of the vocabulary. The constant value information is extracted from the Java method so that each alias can be restored to its corresponding constant value.

In the example, the constant values are divided into the following categories: numeric constants, string constants, and character constants. By using the dependency information between Java files in the Java project and the abstract syntax tree parsing, constant value information can be extracted, and the work of constant value replacement can be smoothly advanced.
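As an illustrative sketch only, constant value replacement can be pictured as a lookup that maps each alias to its constant value. The alias table CONSTANTS, the regular expression, and the function name below are hypothetical; in the described method the alias values come from abstract syntax tree analysis of the Java project, not from a hand-written table, and a token-level regular expression is a simplification that would also touch identifiers inside string literals.

```python
import re

# Hypothetical alias table. In the described method these values would be
# obtained by abstract syntax tree analysis across the Java project.
CONSTANTS = {
    "MAX_RETRY": "3",            # numeric constant
    "DEFAULT_NAME": '"guest"',   # string constant
    "SEPARATOR": "','",          # character constant
}

# Matches Java-style identifiers (simplified).
IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def replace_constant_aliases(code: str, constants: dict) -> str:
    """Replace every identifier that is a known constant alias with its value."""
    return IDENTIFIER.sub(
        lambda match: constants.get(match.group(0), match.group(0)), code
    )


source = "if (count > MAX_RETRY) { name = DEFAULT_NAME; }"
print(replace_constant_aliases(source, CONSTANTS))
```

Identifiers that are not in the alias table, such as count and name above, are left unchanged.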

32) Class member variable information supplement: Because Java is an object-oriented programming language, Java methods usually have dependency information. This part addresses the problem of missing information about class member variables. A given Java method may use class member variables whose declarations do not exist in the method body, which means that the type and initial value information of some class member variables cannot be obtained from the code segment of a single Java method alone. Therefore, the Java file is analyzed to parse the declaration information of the related class member variables, and the missing dependency information is then supplemented. To avoid introducing too much redundant information, only the type information is supplemented, not the initialization values of class member variable declarations. To distinguish class member variable declarations from local variable declarations in the current Java method, a template is defined (see the FieldAccess section in Table 1) to save class member variable information. Table 1 is as follows:

Table 1

33) Fully qualified name substitution: Due to dependencies between Java files in a Java project, fully qualified names are usually introduced. The value referred to by a fully qualified name cannot be obtained by analyzing only the current Java class file; the entire Java project needs to be analyzed for dependency information. It is worth noting that a substitution is made only if the corresponding fully qualified name refers to a constant value (for example, a string constant).

34) Method call refactoring: In the abstract syntax tree, a method call node can be expressed as follows:

[Expression.] Identifier ([Expression {, Expression}])

To make the above form easier to understand, it is simplified as follows:

[VarName / QualifiedClass.]

Identifier

([ParamName {, ParamName}])

The first line in the above form is optional; it is a variable name or a fully qualified class name. The second line is the Java method name. The third line is a list of parameter names. Based on the above form, two templates are set up to reconstruct method calls. If the VarName part is absent from a method call, the method call is matched with a template of the form:

QualifiedClass.Identifier

([ParamType {, ParamType}]) ([ParamName {, ParamName}])

Otherwise, the method call is matched with the second template, which has the form:

(QualifiedClass) VarName.Identifier

([ParamType {, ParamType}]) ([ParamName {, ParamName}]).
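The two templates can be sketched as a small rendering function. This is a minimal illustration, assuming that the qualified class, parameter types, and parameter names have already been extracted; the function name render_method_call and the exact spacing of the output are assumptions, not part of the source.

```python
from typing import List, Optional


def render_method_call(
    qualified_class: str,
    identifier: str,
    param_types: List[str],
    param_names: List[str],
    var_name: Optional[str] = None,
) -> str:
    """Render a method call with the first template (no variable name)
    or the second template (variable name present)."""
    types = ", ".join(param_types)
    names = ", ".join(param_names)
    if var_name is None:
        # Template 1: QualifiedClass.Identifier
        head = f"{qualified_class}.{identifier}"
    else:
        # Template 2: (QualifiedClass) VarName.Identifier
        head = f"({qualified_class}) {var_name}.{identifier}"
    return f"{head} ({types}) ({names})"


# A static-style call without a variable name.
print(render_method_call("java.lang.Math", "max", ["int", "int"], ["a", "b"]))
# A call on a variable whose declared type is known.
print(render_method_call("java.util.List", "add", ["java.lang.Object"], ["item"], "list"))
```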

35) TryCatch filtering: The try statement in a Java method can help catch exceptions without using the keyword throw to exit the current Java method. The exception handler appears after the try statement and is identified by the catch keyword. Considering that the catch clause has nothing to do with code comments, we only focus on the body of the try statement and ignore the catch clause. As a result, the length of the code sequence can be reduced, the size of the vocabulary can be reduced, and redundant information can be reduced in order to build a high-quality data set.

36) Loop and conditional statement refactoring: Some loops and conditional statements play a vital role in Java methods. To emphasize the importance of these statements, the if statements and for statements were selected and reconstructed through the templates in Table 1. The if statement is set to match the template corresponding to IfStatement, and the for statement containing two styles is set to match the templates corresponding to ForStatement and EnhancedForStatement, respectively.

To explain the method of loop and conditional statement reconstruction, a reconstruction algorithm for the for statement is presented in Table 2. The reconstruction algorithms for other loops and conditional statements are similar. Table 2 is as follows:

Table 2

37) Identifier replacement: An identifier replacement mechanism is introduced to reduce the vocabulary of the code. Specifically, identifiers in Java methods are replaced with specific tags. First, all identifiers in the Java methods are sorted by frequency of occurrence, and the 30,000 most frequent identifiers are selected as the code vocabulary. Then, identifiers outside the code vocabulary are replaced. The replacement operations fall into six categories: method name replacement, method call replacement, constant value replacement, variable type replacement, variable name replacement, and method declaration replacement. Accordingly, some special tags are introduced as replacement tags; see Table 3. Since a Java method has only one method name, a single fixed tag <METHODNAME> is used as the replacement tag for the method name. Distinguishing between string constant values is considered meaningless, so the fixed tag <STRINGLITERAL> is used as the replacement tag for string constant values. Similarly, character constants are replaced with the tag <CHARACTERLITERAL>. The other special tags all contain a variable i, a non-negative integer, which distinguishes them from one another. After the identifier replacement operation, identifiers outside the code vocabulary have been replaced by the above tags. In the constructed dataset, the final vocabulary size of the code is 30,351. Table 3 is as follows:

Table 3
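The vocabulary construction and replacement described in 37) can be sketched as follows. This is a simplified illustration under stated assumptions: every token is treated as an identifier, a single tag family <SIMPLENAME_i> stands in for the several tag categories of Table 3, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def build_vocabulary(methods: Iterable[List[str]], size: int) -> Set[str]:
    """Keep the `size` most frequent tokens across all methods."""
    counts = Counter(token for tokens in methods for token in tokens)
    return {token for token, _ in counts.most_common(size)}


def replace_rare_identifiers(
    tokens: List[str], vocabulary: Set[str], tag: str = "SIMPLENAME"
) -> Tuple[List[str], Dict[str, str]]:
    """Replace out-of-vocabulary tokens with indexed tags.

    The same identifier always receives the same tag within one method,
    and the returned mapping records how to restore the original form.
    """
    mapping: Dict[str, str] = {}
    replaced: List[str] = []
    for token in tokens:
        if token in vocabulary:
            replaced.append(token)
            continue
        if token not in mapping:
            # i is a non-negative integer that distinguishes the tags.
            mapping[token] = f"<{tag}_{len(mapping)}>"
        replaced.append(mapping[token])
    return replaced, mapping


corpus = [["a", "b", "a"], ["a", "b", "c"]]
vocabulary = build_vocabulary(corpus, size=2)
print(replace_rare_identifiers(["a", "c", "d", "c"], vocabulary))
```

In the source the vocabulary size is 30,000; the value 2 above only keeps the toy corpus readable.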

38) Method entry filtering: The purpose of a constructor is to create an instance object of a class. Such Java methods are very simple and easy for developers to read and understand, so there is no need to include them in the dataset, and all constructors are removed. Getter methods, setter methods, and test methods are also eliminated. In addition, the length of a Java method's code sequence is limited to between 10 and 400. A Java method with a code sequence shorter than 10 is too simple, and one with a code sequence longer than 400 is too complicated; neither is suitable for the data set, so both are filtered out.
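A minimal sketch of the method entry filter is given below. The record type JavaMethod, the flag names, and the name-prefix test for getters and setters are assumptions for illustration; the source does not state how accessors or test methods are recognized, nor whether the bounds 10 and 400 are inclusive (they are treated as inclusive here).

```python
from typing import List, NamedTuple

MIN_LENGTH, MAX_LENGTH = 10, 400  # code sequence length bounds from the text


class JavaMethod(NamedTuple):
    """Hypothetical record for one extracted Java method."""
    name: str
    tokens: List[str]
    is_constructor: bool = False
    is_test: bool = False


def looks_like_accessor(name: str) -> bool:
    # Assumption: getters and setters are recognized by name prefix only,
    # e.g. getName or setName, but not "settle".
    return any(
        name.startswith(prefix)
        and len(name) > len(prefix)
        and name[len(prefix)].isupper()
        for prefix in ("get", "set")
    )


def keep_method(method: JavaMethod) -> bool:
    """Return True if the method should stay in the dataset."""
    if method.is_constructor or method.is_test or looks_like_accessor(method.name):
        return False
    return MIN_LENGTH <= len(method.tokens) <= MAX_LENGTH


methods = [
    JavaMethod("parseHeader", ["tok"] * 25),
    JavaMethod("getName", ["tok"] * 12),
    JavaMethod("Parser", ["tok"] * 15, is_constructor=True),
    JavaMethod("tiny", ["tok"] * 4),
]
print([m.name for m in methods if keep_method(m)])
```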

(4) Javadoc filtering

In order to get the <code, comment> pair, Javadoc filtering also plays an important role. The dataset is derived from numerous Java projects on GitHub. It is not difficult to imagine that the quality of Javadoc for Java methods is uneven. Considering that the neural network model is data-driven, it is necessary to perform filtering operations for Javadoc and finally build a clean and clear data set.

The first sentence of a Javadoc usually expresses the meaning of the entire Java method. Therefore, the first sentence of the Javadoc is chosen as the comment on the Java method. Methods without Javadoc are not included in the dataset. However, simply using the first sentence of the Javadoc as the comment is not enough; the comments obtained in this way must be filtered further.
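Taking the first sentence of a Javadoc can be sketched as below. The function name and the sentence boundary rule (the first period followed by whitespace or the end of the description, or the first block tag such as @param) are assumptions; the source does not state how the sentence boundary is detected.

```python
import re


def first_sentence(javadoc: str) -> str:
    """Return the first sentence of a Javadoc comment (simplified)."""
    # Strip the /** and */ delimiters, then the leading '*' of each line.
    body = re.sub(r"^\s*/\*\*|\*/\s*$", "", javadoc.strip())
    lines = [re.sub(r"^\s*\*\s?", "", line) for line in body.splitlines()]
    # The description ends where the first block tag begins.
    description = []
    for line in lines:
        if line.strip().startswith("@"):
            break
        description.append(line.strip())
    text = " ".join(part for part in description if part)
    # Assumed rule: a sentence ends at a period followed by whitespace or end.
    match = re.search(r"\.(\s|$)", text)
    return text[: match.start() + 1] if match else text


doc = """/**
 * Returns the user name. The name is never null.
 * @return the name
 */"""
print(first_sentence(doc))
```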

41) Filtering the comments by setting a template: Although some Java methods have comments, these comments do not provide any useful information for understanding the program. First, many comments indicate that the current Java method is used for testing or debugging, or is not implemented. Second, some Java method comments are automatically generated by tools. Beyond that, many comments contain warnings telling developers not to use these comments or Java methods. To handle these situations, a template for comment filtering is defined, as shown in Table 4, with the aim of improving the quality of the code comments finally obtained.

Table 4

42) Filtering the comments using part-of-speech tagging: To further improve the quality of the data set, part-of-speech (POS) tagging is used to filter the comments further. If a comment does not contain a verb, it is not sufficient as a functional description of a Java method, and the corresponding Java method is filtered out of the data set. To obtain part-of-speech tagging information, the Stanford Tagger tool is selected, which is a commonly used English part-of-speech tagging tool. In addition, two thresholds limit the length of a comment, with 3 being the minimum length and 30 being the maximum length. Comments outside this length range are filtered out of the dataset.
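The verb and length checks can be sketched on comments that have already been tagged. The sketch assumes Penn Treebank tags, in which verb tags begin with "VB", and assumes that length is counted in words; the tagging itself, performed in the source with the Stanford Tagger, is outside the sketch, and the function name is hypothetical.

```python
from typing import List, Tuple

MIN_WORDS, MAX_WORDS = 3, 30  # comment length thresholds from the text


def keep_comment(tagged: List[Tuple[str, str]]) -> bool:
    """Keep a comment only if its length is in range and it contains a verb.

    `tagged` is a list of (word, tag) pairs produced by a POS tagger.
    """
    if not MIN_WORDS <= len(tagged) <= MAX_WORDS:
        return False
    # Penn Treebank verb tags: VB, VBD, VBG, VBN, VBP, VBZ.
    return any(tag.startswith("VB") for _, tag in tagged)


with_verb = [("Returns", "VBZ"), ("the", "DT"), ("name", "NN")]
without_verb = [("The", "DT"), ("user", "NN"), ("name", "NN")]
print(keep_comment(with_verb), keep_comment(without_verb))
```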

43) Identifier replacement: Observation of the comments of Java methods shows that many identifiers in the code also appear in the corresponding comments. Given this relationship between code and comment, identifiers in the comments are replaced as well. For the comments, all unique identifiers are sorted by frequency of occurrence, and the 30,000 most frequent are selected as the comment vocabulary. Then, for identifiers outside the comment vocabulary, four substitution operations are performed: method name substitution, method call substitution, variable type substitution, and variable name substitution. An identifier replaced in the comment is given the same tag as the identifier replaced in the code. For example, if an identifier that appears in both the code and the comment is in neither vocabulary and is replaced with <SIMPLENAME_1> in the code, the identifier in the comment is also replaced with <SIMPLENAME_1>. Even after an identifier has been replaced by a special tag, the tag can be converted back to its original form by recording and using the extracted information. The method of the invention can therefore not only reduce the size of the comment vocabulary, but also store and restore the original form behind each special tag.
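Keeping the tags consistent between code and comment, and restoring them afterwards, can be sketched with a single recorded mapping. The identifier userCache, the mapping, and the function names are hypothetical examples; only the tag <SIMPLENAME_1> appears in the source.

```python
from typing import Dict, List


def apply_mapping(tokens: List[str], mapping: Dict[str, str]) -> List[str]:
    """Replace each recorded identifier with its special tag."""
    return [mapping.get(token, token) for token in tokens]


def restore(tokens: List[str], mapping: Dict[str, str]) -> List[str]:
    """Convert special tags back to the identifiers they replaced."""
    inverse = {tag: identifier for identifier, tag in mapping.items()}
    return [inverse.get(token, token) for token in tokens]


# The mapping is recorded once per method and shared by code and comment.
mapping = {"userCache": "<SIMPLENAME_1>"}
code = ["userCache", ".", "clear", "(", ")"]
comment = ["clears", "the", "userCache"]

tagged_code = apply_mapping(code, mapping)
tagged_comment = apply_mapping(comment, mapping)
print(tagged_code, tagged_comment)
print(restore(tagged_comment, mapping))
```

Because the same mapping is applied to both sequences, a tag generated in a predicted comment can be turned back into the identifier used in the input method.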

After all of the filtering operations above are completed, there is no guarantee that the remaining annotations are completely accurate and free of noise. The example only aims to build a high-quality data set.

(5) Code annotation generation model training

In the example, the code annotation generation model is trained by applying the encoding-decoding model. The encoding-decoding model has been widely used in neural network machine translation tasks. It consists of two parts: an encoder that maps the input sequence to a vector of a fixed dimension, and a decoder that decodes the vector of a fixed dimension to output the target sequence.

In this example, a long short-term memory network (LSTM) is selected as the basic unit of the encoding-decoding model.

The <code, comment> pairs are used as input for training the code comment generation model.

The special tags <sos> and <eos> are added to each training sequence as the start and end tags, respectively. The final vocabulary size of the code is 30,351. The model extends the encoder-decoder model using the TensorFlow framework and is implemented in Python. Hyperparameters are determined based on the performance of the model on the validation set. Stochastic gradient descent (SGD) is used to train and update the parameters. The minibatch size is set to 100, and the LSTM hidden state and word embedding dimensions are both set to 512. The learning rate is initially set to 0.99 and the impact factor is set to 0.8. The upper limit of the parameter gradient is 5. To avoid overfitting, dropout is set to 0.3.
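The upper limit on the parameter gradient can be illustrated with a small clipping routine. The source states only the limit of 5; whether it applies to the overall gradient norm or to each component is not stated. The sketch below assumes clipping by global norm, and the function name is hypothetical.

```python
import math
from typing import List


def clip_by_global_norm(gradients: List[float], max_norm: float = 5.0) -> List[float]:
    """Rescale the gradient vector so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)  # already within the limit
    scale = max_norm / norm
    return [g * scale for g in gradients]


print(clip_by_global_norm([3.0, 4.0]))   # norm 5, unchanged
print(clip_by_global_norm([6.0, 8.0]))   # norm 10, scaled down to norm 5
```

Rescaling the whole vector keeps the direction of the update while bounding its size, which is the usual reason for preferring norm clipping over clipping each component separately.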

The model is trained on the GPU for approximately 70 steps. The BLEU score on the validation set is calculated to select the best model. During decoding, the beam search width is set to 5 and the maximum generated length of an annotation is 30.
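The decoding settings above, a search width of 5 and a maximum length of 30, correspond to what is usually called beam search. The following is a minimal sketch of that procedure, not the implementation of the invention: next_log_probs stands in for the trained decoder, toy_model is a hypothetical hand-written table, and no length normalization is applied.

```python
import math
from typing import Callable, Dict, List, Tuple


def beam_search(
    next_log_probs: Callable[[List[str]], Dict[str, float]],
    beam_width: int = 5,
    max_length: int = 30,
) -> List[str]:
    """Return the highest-scoring sequence found with a fixed-width beam."""
    beams: List[Tuple[float, List[str]]] = [(0.0, ["<sos>"])]
    finished: List[Tuple[float, List[str]]] = []
    for _ in range(max_length):
        candidates = [
            (score + log_p, sequence + [token])
            for score, sequence in beams
            for token, log_p in next_log_probs(sequence).items()
        ]
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        # Keep only the best beam_width candidates; set aside finished ones.
        for score, sequence in candidates[:beam_width]:
            (finished if sequence[-1] == "<eos>" else beams).append((score, sequence))
        if not beams:
            break
    best = max(finished + beams, key=lambda c: c[0])
    return [token for token in best[1] if token not in ("<sos>", "<eos>")]


def toy_model(sequence: List[str]) -> Dict[str, float]:
    """Hypothetical next-token distribution keyed on the last token only."""
    table = {
        "<sos>": {"returns": math.log(0.6), "gets": math.log(0.4)},
        "returns": {"name": math.log(0.9), "<eos>": math.log(0.1)},
        "gets": {"name": 0.0},
        "name": {"<eos>": 0.0},
    }
    return table[sequence[-1]]


print(beam_search(toy_model))
```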

In summary, the present invention utilizes techniques such as program static analysis, neural networks, and natural language processing to automatically generate annotations for Java methods.

Two automatic machine translation metrics, BLEU-4 and METEOR, are used to evaluate the performance of the code annotation generation model. BLEU-4 has been widely used for accuracy measurement in machine translation tasks. METEOR is a recall-oriented metric. Both metrics are also used in other code comment generation tasks to measure accuracy.
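For orientation, a sentence-level BLEU-4 score against a single reference can be sketched as below: the geometric mean of the clipped 1-gram to 4-gram precisions, multiplied by a brevity penalty. This is a simplified illustration without smoothing; the source does not state which BLEU implementation or which aggregation (sentence-level or corpus-level) was used, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate: List[str], reference: List[str]) -> float:
    """Unsmoothed sentence-level BLEU-4 with one reference."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference.
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(sum(log_precisions) / 4)


reference = "returns the name of the user".split()
print(bleu4(reference, reference))
print(bleu4("returns the name of".split(), reference))
```

An exact match scores 1, while a correct but truncated candidate is reduced only by the brevity penalty.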

Refer to Table 5, which is a data set for performance verification of the present invention. The statistical results are as follows:

Table 5

Refer to Table 6, which is a data set for code annotation generation model training according to the present invention. The statistical results are as follows:

Table 6

Refer to Table 7, which shows the performance of the invention on the two BLEU-4 and METEOR metrics. The statistical results are as follows:

Table 7

The code comment generation model is named ContextCC. As can be seen from Table 7, compared with the most basic encoding-decoding model method, namely Encoder-Decoder, the method proposed by the present invention shows better performance on both the BLEU-4 and METEOR metrics. The BLEU-4 value reaches 42.01%, which is nearly 10 percentage points higher than the most basic encoding-decoding model method, a relative increase of 30.30%. The METEOR value reaches 29.26%, which is about 7 percentage points higher than the most basic encoding-decoding model method, a relative increase of 30.98%.
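The relationship between the stated scores, the percentage-point gains, and the percentage increases can be checked with a short calculation. The baseline values computed below are derived from the figures quoted in this paragraph under the assumption that the percentage increases are relative to the baseline; they are not taken from Table 7, and the function name is hypothetical.

```python
def implied_baseline(score: float, relative_gain_percent: float) -> float:
    """Baseline score implied by a final score and its relative gain."""
    return score / (1 + relative_gain_percent / 100)


bleu_baseline = implied_baseline(42.01, 30.30)
meteor_baseline = implied_baseline(29.26, 30.98)

# Absolute gains in percentage points implied by the stated figures.
bleu_gain = 42.01 - bleu_baseline
meteor_gain = 29.26 - meteor_baseline

print(round(bleu_baseline, 2), round(bleu_gain, 2))
print(round(meteor_baseline, 2), round(meteor_gain, 2))
```

Under that assumption, the derived gains agree with the description: just under 10 percentage points for BLEU-4 and about 7 percentage points for METEOR.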

There are many specific ways of applying the present invention, and the above are only its preferred embodiments. It should be noted that those of ordinary skill in the art can make several improvements without departing from the principles of the present invention, and these improvements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for generating code comments based on program analysis and a recurrent neural network, characterized in that it comprises the following steps:

(1) downloading Java projects and building a code base;

(2) extracting, for each Java method in the Java projects, the information contained in the method itself and its dependency information;

(3) according to the information extracted in step (2), and in combination with a heuristic method, filtering and reconstructing the execution code part of each Java method;

(4) for each Java method in the Java projects, analyzing its Javadoc, setting templates for filtering the Javadoc, and filtering the Javadoc in combination with a part-of-speech tagging method to obtain comments matching the execution code;

(5) combining the filtered code and the matching comments into a set of <code, comment> pairs, which serves as the training set for the code comment generation model;

(6) using the training set obtained in step (5) to train the code comment generation model with an encoding-decoding model;

(7) after the model training is completed, performing prediction: given the execution code part of a Java method, generating the corresponding comment.
2. The method for generating code comments based on program analysis and a recurrent neural network according to claim 1, characterized in that in step (2), abstract syntax tree analysis is used to extract the information contained in each Java method itself and its dependency information, the abstract syntax tree being a tree-like representation of the abstract syntactic structure of the source code.
3. The method for generating code comments based on program analysis and a recurrent neural network according to claim 1 or 2, characterized in that in step (2), the information contained in the Java method itself includes the method name, local variable name information, local variable type information, constant values, and method call information; the dependency information of the Java method includes class member variable information, method declaration information and qualified name information corresponding to method calls.
4. The method for generating code comments based on program analysis and a recurrent neural network according to claim 3, characterized in that in step (3), the heuristic method refers to rules for replacing and reconstructing the execution code in a Java method, namely replacing constant values, restructuring loop and conditional structures, replacing method call information, and replacing variable names and variable types, so as to filter and reconstruct the Java method.
5. The method for generating code comments based on program analysis and a recurrent neural network according to claim 1, characterized in that in step (3), the execution code refers to the program code that implements the function of a Java method.
6. The method for generating code comments based on program analysis and a recurrent neural network according to claim 1, characterized in that in step (4), Javadoc refers to the application programming interface help document corresponding to each Java method, which is a semi-structured document.
7. The method for generating code comments based on program analysis and a recurrent neural network according to claim 1, characterized in that in step (4), the part-of-speech tagging method is the process of determining the grammatical category of each word in a given sentence, determining its part of speech, and tagging it accordingly.
8. The method for generating code comments based on program analysis and a recurrent neural network according to claim 1, characterized in that in step (5), the <code, comment> pairs embody the correspondence between the code and the comment of a Java method and serve as the training set for the code comment generation model.
9. The method for generating code comments based on program analysis and a recurrent neural network according to claim 1, characterized in that in step (6), the encoding-decoding model is a model used for neural machine translation and consists of two parts: an encoder that maps the input sequence to a vector of fixed dimension, and a decoder that decodes the vector of fixed dimension to output the target sequence.
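
For illustration only, the encoder and decoder roles described in the last claim above can be sketched in Python as follows. This is a toy under explicit assumptions and not the claimed model: the class `ToySeq2Seq`, its plain tanh recurrence, the hidden size, the vocabularies, and the random untrained weights are all hypothetical. The sketch only shows that the encoder yields a vector whose dimension does not depend on the input length, and that the decoder emits target tokens step by step from that vector.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix; a stand-in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToySeq2Seq:
    """Untrained recurrent encoder-decoder with a plain tanh recurrence."""

    def __init__(self, src_vocab, tgt_vocab, hidden=8, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        self.tgt_vocab = list(tgt_vocab)
        self.src_emb = {t: [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
                        for t in src_vocab}
        self.tgt_emb = {t: [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
                        for t in self.tgt_vocab}
        self.w_enc = make_matrix(hidden, hidden, rng)
        self.w_dec = make_matrix(hidden, hidden, rng)
        self.w_out = make_matrix(len(self.tgt_vocab), hidden, rng)

    def encode(self, tokens):
        """Map a token sequence of any length to one fixed-size vector."""
        h = [0.0] * self.hidden
        for tok in tokens:
            rec = matvec(self.w_enc, h)
            h = [math.tanh(e + r) for e, r in zip(self.src_emb[tok], rec)]
        return h

    def decode(self, context, max_len=5):
        """Greedily emit target tokens starting from the context vector."""
        h, prev, out = context, "<s>", []
        for _ in range(max_len):
            rec = matvec(self.w_dec, h)
            h = [math.tanh(e + r) for e, r in zip(self.tgt_emb[prev], rec)]
            scores = matvec(self.w_out, h)
            prev = self.tgt_vocab[scores.index(max(scores))]
            if prev == "</s>":
                break
            out.append(prev)
        return out


# Hypothetical vocabularies: Java code tokens in, comment words out.
code = ["public", "int", "getSize", "(", ")", "{", "return", "size", ";", "}"]
words = ["<s>", "</s>", "returns", "the", "size"]
model = ToySeq2Seq(code, words)
context = model.encode(code)
print(len(context), model.decode(context))
```

Because the weights are random, the decoded words are meaningless; in the claimed method the model would first be trained on the <code, comment> pairs of step (5).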