WO2019223804A1 - Method for generation of code annotations based on program analysis and recurrent neural network - Google Patents

Info

Publication number
WO2019223804A1
WO2019223804A1 PCT/CN2019/088516 CN2019088516W WO2019223804A1 WO 2019223804 A1 WO2019223804 A1 WO 2019223804A1 CN 2019088516 W CN2019088516 W CN 2019088516W WO 2019223804 A1 WO2019223804 A1 WO 2019223804A1
Authority
WO
WIPO (PCT)
Prior art keywords
code
java
information
neural network
recurrent neural
Application number
PCT/CN2019/088516
Other languages
French (fr)
Chinese (zh)
Inventor
周宇
闫鑫
黄志球
Original Assignee
南京航空航天大学
Application filed by 南京航空航天大学
Publication of WO2019223804A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/30Creation or generation of source code
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the invention belongs to the field of software engineering technology, and particularly relates to a method for automatically generating code comments for a Java method using program static analysis, natural language processing, and neural network technology.
  • Software maintenance is the process of modifying a software system after delivery to fix errors, improve performance, or adapt to a changing environment.
  • Software maintenance requires code understanding, and reading and understanding the source code is a prerequisite for any modification.
  • Program understanding is time-consuming and takes up most of a developer's time. Developers often use integrated development environments, debuggers, and tools for code search, code testing, and program understanding to reduce tedious tasks. If code has no accompanying documentation, differences in the context in which the code is used impose an extra burden of understanding and may even lead to incorrect use of the code, which reduces development efficiency, consumes software development and maintenance resources, and can affect later software quality.
  • an object of the present invention is to provide a method for generating code comments based on program analysis and recurrent neural network, so as to solve the problems caused by lack of code comments in the software development and maintenance process in the prior art.
  • the invention realizes the automation of code comment generation, generates concise and accurate comments for the code, improves the readability and understandability of the code, reduces the cost of code development and maintenance, and improves the efficiency of code development and maintenance.
  • a method for generating code comments based on program analysis and recurrent neural network includes the following steps:
  • step (3) According to the information extracted in the above step (2), combine the heuristic method to filter and reconstruct the execution code part of each Java method;
  • step (5) uses the encoding-decoding model to train the code annotation generation model;
  • The information contained in each Java method in the Java project, together with its dependency information, is extracted by parsing the abstract syntax tree.
  • the abstract syntax tree is a tree-like representation of the abstract syntax structure of the source code.
  • The information contained in a Java method itself includes the method name, local variable names, local variable types, constant values (for example, string constants), and method call information; the dependency information of a Java method includes class member variable information, method declaration information, and the qualified name information corresponding to method calls.
  • the execution code refers to program code that implements a Java method function.
  • The heuristic method refers to a set of replacement and reconstruction rules for the execution code of a Java method. The rules implement constant value replacement, loop and conditional structure reconstruction, method call information replacement, and variable name and variable type replacement, and thereby filter and refactor Java methods.
  • Javadoc refers to an application programming interface (API) help document corresponding to each Java method, which is a document having a semi-structured feature.
  • the part-of-speech tagging method is a process of determining the grammatical category of each word in a given sentence, determining its part-of-speech, and tagging.
  • In step (5), the <code, comment> pairs, which embody the correspondence between the code of a Java method and its comment, serve as the training set of the code comment generation model.
  • the Encoder-Decoder model is a model for neural network machine translation.
  • The encoding-decoding model includes two parts: an encoder, which maps the input sequence to a vector of a fixed dimension, and a decoder, which decodes the fixed-dimension vector to output a target sequence.
  • the present invention provides a computer-readable storage medium that stores a computer program.
  • When the program is executed by a processor, the following method for generating code comments based on program analysis and recurrent neural network is implemented:
  • Abstract syntax tree analysis is used to extract the information contained in each Java method itself and its dependency information.
  • the abstract syntax tree is a tree-like representation of the abstract syntax structure of the source code
  • The information contained in a Java method itself includes the method name, local variable names, local variable types, constant values (for example, string constants), and method call information;
  • Java method dependency information includes class member variable information, and the method declaration information and qualified name information corresponding to method calls;
  • step (3) According to the information extracted in the above step (2), combine the heuristic method to filter and reconstruct the execution code part of each Java method;
  • Execution code refers to program code that implements the functions of Java methods
  • Heuristic methods refer to rules for replacing and restructuring the execution code of Java methods; they implement constant value replacement, loop and conditional structure reconstruction, method call information replacement, and variable name and variable type replacement in order to filter and refactor Java methods
  • Javadoc refers to the application programming interface (API) help document corresponding to each Java method, which is a semi-structured document; part-of-speech tagging is the process of determining the grammatical category of each word in a given sentence and tagging the word with its part of speech
  • step (5) uses the encoding-decoding model to train the code annotation generation model;
  • the Encoder-Decoder model is a model for neural network machine translation.
  • The encoding-decoding model contains two parts: an encoder, which maps the input sequence to a vector of a fixed dimension, and a decoder, which decodes the fixed-dimension vector to output a target sequence;
  • the computer-readable storage medium further includes a class library on which the program runs, a Java project collection, and a pre-trained code comment generation model.
  • The present invention provides a code comment generating terminal, which includes one or more processors and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the same method for generating code comments based on program analysis and recurrent neural network as set out above for the computer-readable storage medium.
  • The above code comment generating terminal preferably includes one or more Intel i7-6700 processors, one GeForce GTX 1070 Ti graphics card, two 16G DDR4 main memories, and two 2T random access memories for storing the one or more programs.
  • It is configured with a program running environment, preferably including a Linux operating system, an installed and configured JDK (Java Development Kit), a Python 3.6 runtime, and a TensorFlow environment, to support running the program.
  • The present invention mainly uses program static analysis, natural language processing, neural networks, and related technologies to generate code comments automatically, which assists developers in understanding code, enhances the readability and understandability of the code, reduces the burden of manual understanding, and reduces software development and maintenance costs.
  • FIG. 1 is a schematic diagram of the preprocessing of the execution code and Javadoc in the source code according to the present invention.
  • a method for generating code comments based on program analysis and recurrent neural network of the present invention includes the following contents:
  • Neural network model training is data-driven.
  • the ultimate goal of the present invention is to train a neural network-based code annotation generation model. Therefore, a large-scale code base needs to be built to meet the needs of model training.
  • Code information refers to information related to Java methods. Since the Java method does not exist separately, the code information consists of two parts: the Java method itself contains information and its dependent information.
  • The extraction method is to use the abstract syntax tree module of Eclipse JDT to analyze the abstract syntax tree of the Java project.
  • Eclipse JDT is an open source Java development tool; it provides a rich set of application programming interfaces (APIs) for code parsing tasks. Code information extraction can be completed using these APIs.
  • Extraction of the information contained in the Java method itself: to support the code filtering and refactoring modules, some information contained in the Java method itself must be extracted, including method names, local variable names, local variable types, constant values (for example, string constants), and method call information.
  • Java method dependency information extraction: when a Java method lacks its corresponding dependency information, it may be impossible to understand or may appear meaningless.
  • For example, a Java method may use class member variables, whose declarations do not appear in the method body. Given only the code segment of a single Java method, the types and initial values of some class member variables used in it cannot be obtained, which makes the program harder to understand. Class member variable information belongs to dependency information; without it, developers may not fully understand the code. Java method dependency information mainly includes class member variable information, method declaration information, and the qualified name information corresponding to method calls.
  • Code filtering and refactoring mainly include:
  • Constant value replacement: many constant values appear in Java methods in the form of aliases, and the use of aliases enlarges the vocabulary. The constant value information extracted from the Java method is used to restore each alias to its corresponding constant value.
  • The constant values are divided into the following categories: numeric constants, string constants, and character constants.
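The constant value replacement described above can be sketched as follows. This is only a minimal illustration, assuming the method body has already been split into tokens and that an alias-to-value map was collected during the extraction step; `restore_constants`, `tokens`, and `alias_values` are hypothetical names that do not come from the patent.

```python
def restore_constants(tokens, alias_values):
    """Replace each constant alias with the literal value it stands for."""
    return [alias_values.get(tok, tok) for tok in tokens]


# Hypothetical input: MAX_RETRY and GREETING are aliases declared elsewhere.
tokens = ["if", "(", "count", ">", "MAX_RETRY", ")", "log", "(", "GREETING", ")"]
alias_values = {"MAX_RETRY": "3", "GREETING": '"hello"'}
restored = restore_constants(tokens, alias_values)
```

Tokens that are not aliases pass through unchanged, so the same helper could cover numeric, string, and character constants alike.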
  • Class member variable information supplement: because Java is an object-oriented programming language, Java methods usually have dependency information. This part addresses the problem of missing class member variable information. A given Java method may use class member variables whose declarations do not appear in the method body, so the types and initial values of those variables cannot be obtained from the code segment of the method alone. Therefore, the Java file is analyzed to parse the declarations of the relevant class member variables, and the missing dependency information is supplemented. To avoid introducing too much redundant information, only the type information is supplemented, not the initialization values of the class member variable declarations. To distinguish class member variable declarations from local variable declarations in the current Java method, a template (see the FieldAccess section in Table 1) is defined to save class member variable information; Table 1 is as follows:
  • Fully qualified name substitution: because of dependencies between the Java files in a Java project, fully qualified names are commonly used. Analyzing only the current Java class file is not enough to obtain the value that a fully qualified name refers to; the entire Java project needs to be analyzed for dependency information. Note that a substitution is made only if the fully qualified name refers to a constant value (for example, a string constant).
  • Method call refactoring: in the abstract syntax tree, a method call node can be expressed as follows:
  • The first line in the above form, which may be absent, is a variable name or a fully qualified class name.
  • The second line refers to the Java method name.
  • The third line is a list of parameter names. Based on the above form, two templates are set up to reconstruct method calls. If the VarName part does not exist in a method call, the method call is matched with a template of the form:
  • TryCatch filtering: the try statement in a Java method helps catch exceptions without using the keyword throw to exit the current Java method.
  • The exception handler appears after the try statement and is identified by the catch keyword. Because the catch clause is unrelated to the code comment, only the body of the try statement is kept and the catch clause is ignored. This shortens the code sequence, shrinks the vocabulary, and removes redundant information, which helps build a high-quality data set.
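One way to picture the TryCatch filtering is a pass over the token sequence that skips every catch clause. The sketch below is an assumption-laden illustration rather than the patent's implementation: it works on a flat token list instead of the abstract syntax tree, it keeps the `try` keyword and its body as they are, and `drop_catch_clauses` is a hypothetical name.

```python
def drop_catch_clauses(tokens):
    """Remove every `catch (...) { ... }` clause from a Java token list."""
    kept = []
    i = 0
    while i < len(tokens):
        if tokens[i] != "catch":
            kept.append(tokens[i])
            i += 1
            continue
        # Skip the `catch (Type name)` header up to the opening brace.
        while i < len(tokens) and tokens[i] != "{":
            i += 1
        # Skip the balanced handler block, including nested braces.
        depth = 0
        while i < len(tokens):
            if tokens[i] == "{":
                depth += 1
            elif tokens[i] == "}":
                depth -= 1
            i += 1
            if depth == 0:
                break
    return kept


source = "try { read ( ) ; } catch ( IOException e ) { log ( e ) ; } return x ;"
filtered = drop_catch_clauses(source.split())
```

Working on the syntax tree, as the text describes, would avoid the brace counting entirely; the token version is used here only because it needs no Java parser.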
  • Loop and conditional statement refactoring: some loops and conditional statements play a vital role in Java methods. To emphasize the importance of these statements, if statements and for statements are selected and reconstructed through the templates in Table 1. The if statement is matched with the template corresponding to IfStatement, and the two styles of for statement are matched with the templates corresponding to ForStatement and EnhancedForStatement, respectively.
  • Identifier replacement: an identifier replacement mechanism is introduced to reduce the vocabulary of the code. Specifically, identifiers in Java methods are replaced with specific tags. First, all identifiers in the Java methods are sorted by frequency of occurrence, and the 30,000 most frequent identifiers are selected as the code vocabulary. Then, identifiers outside the code vocabulary are replaced. The replacement operations fall into six categories: method name replacement, method call replacement, constant value replacement, variable type replacement, variable name replacement, and method declaration replacement. Accordingly, some special tags are introduced as replacement tags; see Table 2. Because a Java method has only one method name, a single fixed tag <METHODNAME> is added as the replacement tag for the method name.
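The frequency-based vocabulary and the replacement of out-of-vocabulary identifiers can be sketched as follows. This is an illustrative sketch under stated assumptions: identifiers are already extracted as lists of strings, a single tag family named `SIMPLENAME` with per-method numbering stands in for the tag categories of Table 2 (which is not reproduced here), and `build_vocabulary` and `replace_rare` are hypothetical helper names.

```python
from collections import Counter


def build_vocabulary(methods, size):
    """Keep the `size` most frequent identifiers across all methods."""
    counts = Counter(ident for method in methods for ident in method)
    return {ident for ident, _ in counts.most_common(size)}


def replace_rare(identifiers, vocabulary, tag_prefix="SIMPLENAME"):
    """Replace out-of-vocabulary identifiers with numbered tags.

    Returns the rewritten sequence and a tag-to-identifier map so that the
    original form can be restored later.
    """
    tag_for = {}
    replaced = []
    for ident in identifiers:
        if ident in vocabulary:
            replaced.append(ident)
            continue
        if ident not in tag_for:
            tag_for[ident] = f"<{tag_prefix}_{len(tag_for) + 1}>"
        replaced.append(tag_for[ident])
    restore = {tag: ident for ident, tag in tag_for.items()}
    return replaced, restore


# Toy corpus; the text above uses a vocabulary of 30,000 identifiers.
methods = [["buf", "len", "buf"], ["buf", "len", "tmpIdx"]]
vocab = build_vocabulary(methods, size=2)
replaced, restore = replace_rare(["buf", "tmpIdx", "len", "tmpIdx"], vocab)
```

Returning the `restore` map alongside the rewritten sequence is what would allow a generated comment containing a tag to be mapped back to the original identifier.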
  • Method entry filtering: a constructor only creates an instance object of a class. Such methods are simple and easy for developers to read and understand, so they need not be part of the dataset; all constructors are therefore removed. Getter methods, setter methods, and test methods are also eliminated. In addition, the length of a Java method's code sequence is limited to between 10 and 400: a method with a code sequence shorter than 10 is too simple, and one longer than 400 is too complicated, so both are filtered from the dataset.
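The method entry filter amounts to a predicate over each extracted method. In the sketch below, the name-based detection of getters, setters, and tests is a naive assumption (the text does not say how they are recognized), and `keep_method` and `ACCESSOR` are hypothetical names; only the length bounds of 10 and 400 come from the text.

```python
import re

# Assumed naming convention: getX / setX marks an accessor method.
ACCESSOR = re.compile(r"^(get|set)[A-Z]")


def keep_method(name, class_name, tokens, min_len=10, max_len=400):
    """Return True if the method should stay in the data set."""
    if name == class_name:           # constructor
        return False
    if ACCESSOR.match(name):         # getter or setter (naive name check)
        return False
    if name.startswith("test"):      # test method (naive name check)
        return False
    return min_len <= len(tokens) <= max_len


body = ["tok"] * 50
decisions = {
    "constructor": keep_method("Parser", "Parser", body),
    "getter": keep_method("getName", "Parser", body),
    "regular": keep_method("parse", "Parser", body),
}
```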
  • Javadoc filtering also plays an important role.
  • The dataset is derived from numerous Java projects on GitHub, and the quality of the Javadoc for their Java methods is uneven. Because the neural network model is data-driven, filtering operations must be performed on the Javadoc to build a clean data set.
  • The first sentence of a Javadoc usually expresses the meaning of the entire Java method, so it is chosen as the comment for the method. Methods without Javadoc are not included in the dataset. Simply using the first sentence as the comment is not enough; the acquired comments must be filtered further.
  • Filtering comments with part-of-speech tagging: to further improve the quality of the data set, part-of-speech (POS) tagging is used to filter the comments. If a comment does not contain a verb, it is not sufficient as a functional description of a Java method, and the corresponding Java method is filtered out of the data set.
  • The Stanford Tagger, a commonly used English part-of-speech tagging tool, is selected.
  • Two thresholds limit the length of a comment: 3 is the minimum length and 30 is the maximum length. Comments outside this range are filtered from the dataset.
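The Javadoc filtering steps (first sentence, verb check, length bounds) can be combined into a small pipeline. The sketch below makes several assumptions: length is counted in whitespace-separated words, the tagger is passed in as a function returning Penn Treebank style tags, and `toy_tagger` is a hypothetical stand-in for a real tagger such as the Stanford Tagger named above.

```python
import re


def first_sentence(javadoc):
    """Return the first sentence of a Javadoc description."""
    text = " ".join(javadoc.split())
    match = re.search(r"(.+?\.)(\s|$)", text)
    return match.group(1) if match else text


def keep_comment(comment, tagger, min_len=3, max_len=30):
    """Keep a comment only if its length is in range and it contains a verb."""
    words = comment.rstrip(".").split()
    if not min_len <= len(words) <= max_len:
        return False
    return any(tag.startswith("VB") for _, tag in tagger(words))


def toy_tagger(words):
    """Hypothetical stand-in for a real POS tagger; knows three verbs only."""
    verbs = {"returns", "parses", "creates"}
    return [(w, "VBZ" if w.lower() in verbs else "NN") for w in words]


comment = first_sentence("Returns the parsed value.\n Never returns null.")
kept = keep_comment(comment, toy_tagger)
```

Passing the tagger as a parameter keeps the filter independent of any particular tagging library.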
  • Identifier replacement: observation of the comments of Java methods shows that many identifiers in the code also appear in the corresponding comments. Given this relationship between code and comment, identifiers in comments are replaced as well. For comments, all unique identifiers are sorted by frequency of occurrence and the first 30,000 are selected as the comment vocabulary. Then, for the identifiers in the annotation vocabulary, four substitution operations are performed: method name substitution, method call substitution, variable type substitution, and variable name substitution. An identifier replaced in the comment matches the identifier replaced in the code.
  • The identifier in the comments is also replaced with <SIMPLENAME_1>.
  • A special tag can be converted back to its original form by recording and using the extracted information.
  • The method of the invention can thus not only reduce the size of the annotation vocabulary but also store and restore the original form of each special tag.
  • the encoding-decoding model has been widely used in neural network machine translation tasks.
  • the encoding-decoding model consists of two parts: one is an encoder that maps the input sequence to a vector of a fixed dimension; the other is a decoder that decodes a vector of a fixed dimension to output the target sequence.
  • LSTM long short-term memory network
  • The special tags <sos> and <eos> are added to the training sequences as the start and end tags, respectively.
  • the final vocabulary size of the code is 30351.
  • The model extends the encoder-decoder model using the TensorFlow framework and is implemented in Python. Hyperparameters are determined based on the performance of the model on the validation set. Stochastic gradient descent (SGD) is used to train and update parameters. The minibatch size is set to 100, and the LSTM hidden state and word embedding dimensions are set to 512. The learning rate is initially set to 0.99 and the impact factor is set to 0.8. Parameter gradients are capped at an upper limit of 5. To avoid overfitting, dropout is set to 0.3.
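The gradient limit and the SGD update can be illustrated in plain Python. The text does not say whether the limit of 5 applies to each gradient value or to the overall norm; the sketch below assumes clipping by global norm, and `clip_by_global_norm` and `sgd_step` are hypothetical helper names rather than the TensorFlow calls the authors used.

```python
import math


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale the gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def sgd_step(params, grads, learning_rate):
    """Plain stochastic gradient descent update."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


# Toy values: a gradient of norm 10 is scaled down to norm 5.
clipped = clip_by_global_norm([6.0, 8.0], max_norm=5.0)
updated = sgd_step([1.0, 1.0], clipped, learning_rate=0.5)
```

Clipping by norm preserves the direction of the gradient, which is one common reason to prefer it over clipping each value independently.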
  • Training runs for approximately 70 steps. The BLEU score on the validation set is calculated to select the best model. During decoding, the beam search width is set to 5 and the maximum generated comment length is 30.
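Decoding with a search width of 5 and a maximum length of 30 can be sketched as a standard beam search. This is an illustration under assumptions: the trained decoder is replaced by a hypothetical `toy_step` lookup table that returns next-token probabilities, every probability is assumed to be positive, and `beam_search` is a hypothetical name.

```python
import math


def beam_search(step_fn, beam_width=5, max_len=30, sos="<sos>", eos="<eos>"):
    """step_fn(prefix) returns {token: probability} for the next token."""
    beams = [([sos], 0.0)]           # (tokens, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in step_fn(tokens).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[1])
    return [t for t in best[0] if t not in (sos, eos)]


def toy_step(prefix):
    """Hypothetical next-token distribution that depends on the last token."""
    table = {
        "<sos>": {"returns": 0.6, "gets": 0.4},
        "returns": {"the": 0.9, "<eos>": 0.1},
        "gets": {"the": 1.0},
        "the": {"value": 0.7, "<eos>": 0.3},
        "value": {"<eos>": 1.0},
    }
    return table[prefix[-1]]


generated = beam_search(toy_step, beam_width=5, max_len=30)
```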
  • the present invention utilizes techniques such as program static analysis, neural networks, and natural language processing to automatically generate annotations for Java methods.
  • Two automatic machine translation metrics, BLEU-4 and METEOR, were used to evaluate the performance of the code annotation generation model.
  • BLEU-4 has been widely used to measure accuracy in many machine translation tasks.
  • METEOR is a recall-oriented indicator. These two metrics are also used in other code comment generation tasks to measure accuracy.
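For readers unfamiliar with BLEU-4, the following sketch computes a sentence-level score from modified n-gram precisions (n = 1 to 4) and a brevity penalty. It is a simplified illustration with no smoothing and a single reference; the evaluation script actually used for the reported results is not described in the text, and `bleu4` is a hypothetical name.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Sentence-level BLEU-4 against one reference, without smoothing."""
    log_precisions = []
    for n in range(1, 5):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(log_precisions) / 4)


reference = "returns the value of y".split()
score = bleu4("returns the value of x".split(), reference)
```

In this toy case the four precisions are 4/5, 3/4, 2/3, and 1/2, so the score is the fourth root of their product, 0.2.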
  • The code comment generation model is named ContextCC. Table 7 shows that the method proposed by the present invention performs better than the most basic encoding-decoding model method, namely Encoder-Decoder, on both the BLEU-4 and METEOR metrics. The BLEU-4 value reached 42.01%, nearly 10 percentage points higher than the most basic encoding-decoding model method, a relative increase of 30.30%. The METEOR value reached 29.26%, about 7 percentage points higher than the most basic encoding-decoding model method, a relative increase of 30.98%.
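The percentage-point and percentage figures above can be cross-checked with a little arithmetic. The sketch below reads the stated increases of 30.30% and 30.98% as relative gains over the baseline and derives the baseline scores they imply; these baselines are derived values, not numbers quoted from Table 7, and `implied_baseline` is a hypothetical helper.

```python
def implied_baseline(score, relative_gain_percent):
    """Baseline score implied by a final score and its relative gain."""
    return score / (1 + relative_gain_percent / 100)


bleu_baseline = implied_baseline(42.01, 30.30)      # roughly 32.24
meteor_baseline = implied_baseline(29.26, 30.98)    # roughly 22.34

bleu_gain_points = 42.01 - bleu_baseline            # nearly 10 points
meteor_gain_points = 29.26 - meteor_baseline        # about 7 points
```

Under this reading the two kinds of figures agree with each other: the implied gains are close to 10 and 7 percentage points, matching the text.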

Abstract

Disclosed is a method for generation of code annotations based on program analysis and a recurrent neural network, comprising the following steps: building a large-scale code library; extracting information included in each Java method within a Java project and dependency information thereof; according to the extracted information and in combination with a heuristic method, filtering and reconstructing the execution code portion of each Java method; obtaining annotations matching the execution code; assembling the filtered code and the corresponding annotations into a code/annotation pair set, and using same as a training set for a code annotation generation model; by means of the obtained training set, using an encoding-decoding model to perform code annotation generation model training; after the model training is complete, performing prediction. The method generates simple and clear annotations, can help developers to understand code functions, accelerates software maintenance processes, and increases software product quality.

Description

Method for generating code comments based on program analysis and recurrent neural network
Technical field
The invention belongs to the field of software engineering technology, and particularly relates to a method for automatically generating code comments for a Java method using program static analysis, natural language processing, and neural network technology.
Background
With the continuous deepening of computer applications, software has gradually penetrated and integrated into all areas of the national economy, and the software ecosystem has undergone profound changes. New software forms and development models have continuously emerged, and their scale and number are expanding at an alarming rate. The trend of "software eating the world" has become increasingly clear, and the sharp increase in social demand has brought new challenges and opportunities to software productivity at this stage.
Large-scale empirical research shows that more than 60% of software engineering resources are used for software maintenance. Software maintenance is the process of modifying a software system after delivery to fix errors, improve performance, or adapt to a changing environment. Software maintenance requires code understanding, and reading and understanding the source code is a prerequisite for any modification. Program understanding is time-consuming and takes up most of a developer's time. Developers often use integrated development environments, debuggers, and tools for code search, code testing, and program understanding to reduce tedious tasks. If code has no accompanying documentation, differences in the context in which the code is used impose an extra burden of understanding and may even lead to incorrect use of the code, which reduces development efficiency, consumes software development and maintenance resources, and can affect later software quality.
Documentation, as an element of software, is an important means of assisting code understanding. When program developers use unfamiliar code or application programming interfaces (APIs), accurate documentation is a key factor affecting the usability of that code or those APIs. Therefore, if code lacks accompanying documentation, it inevitably places an additional burden of understanding on the programmer: after all, understanding code is a time-consuming task. While code size and complexity have increased rapidly, developers' ability to understand programs has not grown in step, so the importance of documentation has become more prominent. Searching the Internet for relevant documents and recommending them to users can effectively speed up the code understanding process and thereby indirectly improve software development productivity. However, not every code snippet has a corresponding summary explanation. Moreover, as with code search and recommendation, documentation also faces a massive volume of information, and it is difficult to find a directly related explanation through a general search engine. Although tools such as Doxygen and Javadoc can generate structured documents from markup information such as annotations, their content depends on what the user fills in, so they do not fall into the category of automated code comment generation technology. Therefore, researching new code summary generation methods to alleviate information overload in the era of Internet big data has become an urgent need for software development and maintenance personnel.
Furthermore, judging from papers at the flagship software engineering conferences and the specialized venues in the field, a large number of papers on code (such as APIs) and supporting-documentation generation are published every year at ICSE, ESEC/FSE, ASE, ICSME, SANER, MSR, and others. Therefore, in the era of big data, against the background of a prosperous Internet and open source software ecosystem, code comment generation has received more and more attention and has become a popular research area. In addition, code comment generation technology undoubtedly has important theoretical and practical value.
Summary of the Invention
In view of the above shortcomings of the prior art, the object of the present invention is to provide a method for generating code comments based on program analysis and a recurrent neural network, so as to solve the problems of poor program readability, poor understandability, and increased software development and maintenance costs that arise in the prior art from a lack of code comments during software development and maintenance. The invention automates code comment generation, generates concise and accurate comments for code, improves the readability and understandability of code, reduces the cost of code development and maintenance, and improves their efficiency.
To achieve the above object, the present invention adopts the following technical solution:
A method for generating code comments based on program analysis and a recurrent neural network according to the present invention comprises the following steps:
(1) downloading Java projects and building a code base;
(2) extracting, for each Java method in the Java projects, the information contained in the method itself and its dependency information;
(3) filtering and restructuring the executable code part of each Java method according to the information extracted in step (2), in combination with heuristic rules;
(4) for each Java method in the Java projects, analyzing its Javadoc, defining templates for filtering Javadoc, and filtering the Javadoc in combination with part-of-speech tagging to obtain a comment that matches the executable code;
(5) combining the filtered code and its matching comments into a set of <code, comment> pairs, which serves as the training set of the code comment generation model;
(6) training the code comment generation model on the training set obtained in step (5), using an encoder-decoder model;
(7) after model training is completed, performing prediction: given the executable code part of a Java method, generating the corresponding comment.
Further, in step (2), the information contained in each Java method itself and its dependency information are extracted by abstract syntax tree parsing, the abstract syntax tree being a tree representation of the abstract syntactic structure of the source code.
Further, in step (2), the information contained in the Java method itself includes the method name, local variable names, local variable types, constant values (for example, string constants), and method call information; the dependency information of the Java method includes class member variable information, the method declarations corresponding to method calls, and qualified name information.
Further, in step (3), the executable code is the program code that implements the function of the Java method.
Further, in step (3), the heuristic rules are replacement and restructuring rules defined for the executable code of a Java method. They cover the replacement of constant values, the restructuring of loop and conditional structures, the replacement of method call information, and the replacement of variable names and variable types, and together they implement the filtering and restructuring of Java methods.
Further, in step (4), Javadoc refers to the application programming interface (API) help documentation corresponding to each Java method, which is a semi-structured document.
Further, in step (4), part-of-speech tagging is the process of determining the grammatical category of each word in a given sentence, that is, determining its part of speech and labeling it.
Further, in step (5), a <code, comment> pair expresses the correspondence between the code of a Java method and its comment, and the pairs serve as the training set of the code comment generation model.
Further, in step (6), the encoder-decoder model is a model used for neural machine translation. It consists of two parts: an encoder, which maps the input sequence to a fixed-dimension vector, and a decoder, which decodes the fixed-dimension vector to output the target sequence.
Second, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following method for generating code comments based on program analysis and a recurrent neural network:
(1) downloading Java projects and building a code base;
(2) extracting, for each Java method in the Java projects, the information contained in the method itself and its dependency information;
the information contained in each Java method itself and its dependency information are extracted by abstract syntax tree parsing, the abstract syntax tree being a tree representation of the abstract syntactic structure of the source code;
the information contained in the Java method itself includes the method name, local variable names, local variable types, constant values (for example, string constants), and method call information; the dependency information of the Java method includes class member variable information, the method declarations corresponding to method calls, and qualified name information;
(3) filtering and restructuring the executable code part of each Java method according to the information extracted in step (2), in combination with heuristic rules;
the executable code is the program code that implements the function of the Java method; the heuristic rules are replacement and restructuring rules defined for the executable code of a Java method, covering the replacement of constant values, the restructuring of loop and conditional structures, the replacement of method call information, and the replacement of variable names and variable types, so as to implement the filtering and restructuring of Java methods;
(4) for each Java method in the Java projects, analyzing its Javadoc, defining templates for filtering Javadoc, and filtering the Javadoc in combination with part-of-speech tagging to obtain a comment that matches the executable code;
Javadoc refers to the application programming interface (API) help documentation corresponding to each Java method, which is a semi-structured document; part-of-speech tagging is the process of determining the grammatical category of each word in a given sentence, that is, determining its part of speech and labeling it;
(5) combining the filtered code and its matching comments into a set of <code, comment> pairs, which serves as the training set of the code comment generation model;
a <code, comment> pair expresses the correspondence between the code of a Java method and its comment, and the pairs serve as the training set of the code comment generation model;
(6) training the code comment generation model on the training set obtained in step (5), using an encoder-decoder model;
the encoder-decoder model is a model used for neural machine translation and consists of two parts: an encoder, which maps the input sequence to a fixed-dimension vector, and a decoder, which decodes the fixed-dimension vector to output the target sequence;
(7) after model training is completed, performing prediction: given the executable code part of a Java method, generating the corresponding comment.
Further, the computer-readable storage medium also contains the class libraries on which the program depends, a collection of Java projects, and a pre-trained code comment generation model.
Third, the present invention provides a code comment generation terminal comprising one or more processors and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the following method for generating code comments based on program analysis and a recurrent neural network:
(1) downloading Java projects and building a code base;
(2) extracting, for each Java method in the Java projects, the information contained in the method itself and its dependency information;
the information contained in each Java method itself and its dependency information are extracted by abstract syntax tree parsing, the abstract syntax tree being a tree representation of the abstract syntactic structure of the source code;
the information contained in the Java method itself includes the method name, local variable names, local variable types, constant values (for example, string constants), and method call information; the dependency information of the Java method includes class member variable information, the method declarations corresponding to method calls, and qualified name information;
(3) filtering and restructuring the executable code part of each Java method according to the information extracted in step (2), in combination with heuristic rules;
the executable code is the program code that implements the function of the Java method; the heuristic rules are replacement and restructuring rules defined for the executable code of a Java method, covering the replacement of constant values, the restructuring of loop and conditional structures, the replacement of method call information, and the replacement of variable names and variable types, so as to implement the filtering and restructuring of Java methods;
(4) for each Java method in the Java projects, analyzing its Javadoc, defining templates for filtering Javadoc, and filtering the Javadoc in combination with part-of-speech tagging to obtain a comment that matches the executable code;
Javadoc refers to the application programming interface (API) help documentation corresponding to each Java method, which is a semi-structured document; part-of-speech tagging is the process of determining the grammatical category of each word in a given sentence, that is, determining its part of speech and labeling it;
(5) combining the filtered code and its matching comments into a set of <code, comment> pairs, which serves as the training set of the code comment generation model;
a <code, comment> pair expresses the correspondence between the code of a Java method and its comment, and the pairs serve as the training set of the code comment generation model;
(6) training the code comment generation model on the training set obtained in step (5), using an encoder-decoder model;
the encoder-decoder model is a model used for neural machine translation and consists of two parts: an encoder, which maps the input sequence to a fixed-dimension vector, and a decoder, which decodes the fixed-dimension vector to output the target sequence;
(7) after model training is completed, performing prediction: given the executable code part of a Java method, generating the corresponding comment.
The above code comment generation terminal preferably comprises one or more Intel i7-6700 processors, one GeForce GTX 1070 Ti graphics card, two 16 GB DDR4 main memory modules, and two random access memories of 2 TB capacity each for storing the one or more programs. It is configured with a runtime environment for the program, preferably comprising a Linux operating system, an installed and configured JDK (Java Development Kit), a Python 3.6 runtime, and a configured TensorFlow environment.
Beneficial effects of the present invention:
The present invention mainly uses static program analysis, natural language processing, and neural network techniques to implement automated code comment generation, which helps developers understand code, enhances the readability and understandability of code, reduces the burden of manual comprehension, and lowers software development and maintenance costs.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the preprocessing of the executable code and Javadoc in the source code according to the present invention.
Detailed Description
To facilitate understanding by those skilled in the art, the present invention is further described below with reference to an embodiment and the accompanying drawing. The content of the embodiment does not limit the present invention.
Referring to FIG. 1, a method for generating code comments based on program analysis and a recurrent neural network according to the present invention comprises the following:
(1) Construction of the code base
The training of a neural network model is data-driven. Since the ultimate goal of the present invention is to train a neural-network-based code comment generation model, a large-scale code base must be built to meet the needs of model training. 6,705 Java projects were downloaded from the open-source community GitHub to build the code base. The Java projects are parsed with abstract syntax trees and the Java methods are extracted from them, together with the Javadoc corresponding to each Java method.
(2) Code information extraction
Because comments are generated for Java methods, code information here means information related to a Java method. A Java method does not exist in isolation, so the code information consists of two parts: the information contained in the Java method itself and its dependency information. Extraction is performed by parsing the Java projects into abstract syntax trees with the abstract syntax tree module of Eclipse JDT. Eclipse JDT is an open-source Java development tool that provides a rich set of application programming interfaces (APIs) for code parsing tasks. The code information can be extracted using these APIs.
First, extraction of the information contained in the Java method itself: so that the code filtering and restructuring module can run smoothly, certain information contained in the Java method itself must be extracted, including the method name, local variable names, local variable types, constant values (for example, string constants), and method call information. This information is extracted by calling the relevant Eclipse JDT APIs to parse the Java projects into abstract syntax trees.
Second, extraction of the dependency information of the Java method: a Java method that lacks its dependency information may be incomprehensible or meaningless. For example, a Java method may use class member variables whose declarations are obviously not located inside the method body. Given only the code segment of the Java method in isolation, the types and initial values of those class member variables cannot be obtained, which makes the program harder to understand. Class member variable information is therefore dependency information, and without it developers may not fully understand the code. The dependency information of a Java method mainly comprises class member variable information, the method declarations corresponding to method calls, and qualified name information. To extract it, the Java project is treated as a whole: the APIs provided by Eclipse JDT are called to parse the project into abstract syntax trees, the abstract syntax tree of each Java method is traversed, and the corresponding dependency information is extracted.
(3) Code filtering and restructuring
The code information extracted above makes the code filtering and restructuring process possible. Code filtering and restructuring mainly comprise:
31) Constant value replacement: in Java methods, many constant values appear in the form of aliases, and the use of aliases enlarges the vocabulary. Constant value information is extracted from the Java method so that each alias can be restored to its corresponding constant value.
In the example, constant values are divided into the following categories: numeric constants, string constants, and character constants. Constant value information is extracted by using the dependency information between the Java files in a Java project together with abstract syntax tree parsing, after which the constant values can be replaced.
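By way of illustration only, the alias restoration described in item 31) can be sketched as a text substitution. The sketch assumes that a table of constants has already been collected from the abstract syntax trees of the whole project; the table `CONSTANTS`, the function name, and the sample code string are hypothetical and are not taken from the invention. Operating on source text rather than on syntax tree nodes is a simplification, so an alias that appears inside a string literal would also be replaced.

```python
import re

# Hypothetical constant table (alias -> literal source text), standing in for
# the constant information collected from the project's abstract syntax trees.
CONSTANTS = {
    "MAX_RETRIES": "3",
    "Config.DEFAULT_NAME": '"guest"',
    "SEPARATOR": "','",
}


def replace_constant_aliases(code: str, constants: dict) -> str:
    """Replace every constant alias in `code` with its literal value."""
    if not constants:
        return code
    # Longer names first, so qualified names are tried before shorter ones.
    names = sorted(constants, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w.])("                      # not preceded by a name character
        + "|".join(re.escape(n) for n in names)
        + r")(?![\w(])"                     # not a longer name or a call
    )
    # A function is used as replacement so literals are inserted verbatim.
    return pattern.sub(lambda m: constants[m.group(1)], code)


example = "if (count > MAX_RETRIES) { log(Config.DEFAULT_NAME + SEPARATOR); }"
restored = replace_constant_aliases(example, CONSTANTS)
print(restored)
```

Replacing aliases by their literals lets the later vocabulary step treat all occurrences of the same value alike, which is the vocabulary reduction that item 31) aims at.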
32) Supplementing class member variable information: because Java is an object-oriented programming language, Java methods usually have dependency information. This part addresses missing class member variable information. A given Java method may use class member variables whose declarations are not located inside the method body, which means that the types and initial values of those variables cannot be obtained from the code segment of the method alone. The Java file is therefore analyzed to resolve the declarations of the relevant class member variables, and the missing dependency information is added. To avoid introducing too much redundant information, only the type information is added and the initialization values of the class member variable declarations are not. To distinguish class member variable declarations from local variable declarations in the current Java method, a template (see the FieldAccess part of Table 1) is defined to hold the class member variable information. Table 1 is as follows:
Table 1
Figure PCTCN2019088516-appb-000001
33) Fully qualified name replacement: because of the dependencies between the Java files in a Java project, fully qualified names are often used. The value referred to by a fully qualified name cannot be resolved by analyzing only the current Java class file; the entire Java project must be analyzed to obtain the dependency information. Note that the replacement is made only when the fully qualified name refers to a constant value (for example, a string constant).
34) Method call restructuring: in the abstract syntax tree, a method call node can be expressed in the following form:
[Expression.]Identifier([Expression{,Expression}])
To make this form easier to understand, it is simplified as follows:
[VarName/QualifiedClass.]
Identifier
([ParamName{,ParamName}])
The first line of the above form may be absent; when present, it is a variable name or a fully qualified class name. The second line is the Java method name. The third line is the list of parameter names. Based on this form, two templates are defined for rebuilding method calls. A method call without a VarName part is matched against a template of the following form:
QualifiedClass.Identifier
([ParamType{,ParamType}])([ParamName{,ParamName}])
Otherwise, the method call is matched against the second template, which has the following form:
(QualifiedClass)VarName.Identifier
([ParamType{,ParamType}])([ParamName{,ParamName}])
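The two templates of item 34) can be illustrated with a small formatting routine. This is a sketch only: the record type `MethodCall` and its field names are hypothetical, and it assumes that the qualified class, parameter types, and parameter names of the call have already been resolved from the abstract syntax tree. The two lines of each template are joined into a single string here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MethodCall:
    """Hypothetical record of one resolved method invocation."""
    qualified_class: str             # e.g. "java.util.List"
    identifier: str                  # name of the invoked method
    param_types: List[str]
    param_names: List[str]
    var_name: Optional[str] = None   # receiver variable, absent for static calls


def rebuild_call(call: MethodCall) -> str:
    """Render a call with the first template (no VarName) or the second one."""
    types = ",".join(call.param_types)
    names = ",".join(call.param_names)
    if call.var_name is None:
        head = f"{call.qualified_class}.{call.identifier}"
    else:
        head = f"({call.qualified_class}){call.var_name}.{call.identifier}"
    return f"{head}({types})({names})"


static_call = rebuild_call(
    MethodCall("java.lang.Math", "max", ["int", "int"], ["a", "b"]))
instance_call = rebuild_call(
    MethodCall("java.util.List", "add", ["java.lang.Object"], ["item"], "items"))
print(static_call)
print(instance_call)
```

Both renderings expose the declaring class and the parameter types next to the parameter names, which is the information a method body alone does not carry.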
35) TryCatch filtering: a try statement in a Java method makes it possible to catch exceptions without using the keyword throw to exit the current Java method. The exception handler follows the try block and is identified by the keyword catch. Because the catch clause is considered irrelevant to the code comment, only the body of the try statement is kept and the catch clause is ignored. This shortens the code sequence, reduces the vocabulary size, and removes redundant information, which helps to build a high-quality dataset.
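As a minimal sketch of item 35), catch clauses can be dropped from a token sequence by brace matching. The sketch assumes that the method body has already been tokenized, with each string literal kept as a single token; the function name and the sample tokens are hypothetical, and an implementation working on the abstract syntax tree would remove the catch nodes directly instead.

```python
from typing import List


def drop_catch_clauses(tokens: List[str]) -> List[str]:
    """Return `tokens` without any `catch (...) { ... }` clause."""
    kept: List[str] = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "catch":
            # Skip the parameter list up to the opening brace of the handler.
            while i < len(tokens) and tokens[i] != "{":
                i += 1
            depth = 0
            # Skip the whole handler block, honouring nested braces.
            while i < len(tokens):
                if tokens[i] == "{":
                    depth += 1
                elif tokens[i] == "}":
                    depth -= 1
                i += 1
                if depth == 0:
                    break
            continue
        kept.append(tokens[i])
        i += 1
    return kept


source = ("try { read ( f ) ; } catch ( IOException e ) "
          "{ if ( debug ) { log ( e ) ; } } finally { close ( f ) ; }")
filtered = " ".join(drop_catch_clauses(source.split()))
print(filtered)
```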
36) Loop and conditional statement restructuring: some loop and conditional statements play a vital role in a Java method. To emphasize these statements, if statements and for statements are selected and restructured using the templates in Table 1. An if statement is matched against the template for IfStatement, and a for statement, which has two styles, is matched against the templates for ForStatement and EnhancedForStatement respectively.
To explain how loop and conditional statements are restructured, Table 2 presents a restructuring algorithm for the for statement; the algorithms for the other loop and conditional statements are similar. Table 2 is as follows:
Table 2
Figure PCTCN2019088516-appb-000002
37) Identifier replacement: an identifier replacement mechanism is introduced to reduce the vocabulary of the code. Specifically, identifiers in Java methods are replaced with specific tags. First, all identifiers in the Java methods are sorted by frequency of occurrence, and the 30,000 most frequent identifiers are selected as the code vocabulary. Identifiers outside the code vocabulary are then replaced. The replacement operations fall into six categories: method name replacement, method call replacement, constant value replacement, variable type replacement, variable name replacement, and method declaration replacement. Special tags are introduced accordingly as replacement tags; see Table 3. Because a Java method has only one method name, a single fixed tag <METHODNAME> is used as the replacement tag for the method name. Distinguishing between string constant values is considered meaningless, so the fixed tag STRINGLITERAL is used as the replacement tag for string constant values. Similarly, character constants are replaced with the tag <CHARACTERLITERAL>. Each of the other special tags contains a variable i, a non-negative integer, which distinguishes the tags from one another. After the identifier replacement operation, the identifiers outside the code vocabulary have been replaced by the above tags. In the constructed dataset, the final vocabulary size of the code is 30,351. Table 3 is as follows:
Table 3
Figure PCTCN2019088516-appb-000003
Figure PCTCN2019088516-appb-000004
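The identifier replacement of item 37) may be sketched as follows, purely for illustration. The sketch assumes that every token already carries a kind label obtained from the abstract syntax tree; the kind labels, the function names, and the sample method are hypothetical. Only the tags <METHODNAME>, STRINGLITERAL, <CHARACTERLITERAL>, and <SIMPLENAME_i> that appear in the text are used; the remaining indexed tags of Table 3 are not reproduced, and numbering the indexed tag from 1 is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Token = Tuple[str, str]  # (text, kind); the kind labels are hypothetical


def build_vocabulary(methods: Iterable[List[Token]], size: int) -> Set[str]:
    """Keep the `size` most frequent identifiers and literals."""
    counts = Counter(text for method in methods
                     for text, kind in method if kind != "other")
    return {text for text, _ in counts.most_common(size)}


def replace_rare_tokens(method: List[Token],
                        vocabulary: Set[str]) -> Tuple[List[str], Dict[str, str]]:
    """Replace out-of-vocabulary tokens; return the new tokens and the mapping."""
    mapping: Dict[str, str] = {}   # original identifier -> replacement tag
    out: List[str] = []
    next_index = 1                 # assumption: indexed tags start at 1
    for text, kind in method:
        if kind == "other" or text in vocabulary:
            out.append(text)       # keywords, punctuation, frequent identifiers
        elif kind == "string":
            out.append("STRINGLITERAL")
        elif kind == "char":
            out.append("<CHARACTERLITERAL>")
        elif kind == "method_name":
            mapping[text] = "<METHODNAME>"
            out.append("<METHODNAME>")
        else:
            if text not in mapping:
                mapping[text] = f"<SIMPLENAME_{next_index}>"
                next_index += 1
            out.append(mapping[text])
    return out, mapping


method = [("void", "other"), ("frobnicate", "method_name"), ("(", "other"),
          ("zq", "simple_name"), (")", "other"), ("{", "other"),
          ("count", "simple_name"), ("+=", "other"), ("zq", "simple_name"),
          (";", "other"), ("}", "other")]
other = [("count", "simple_name"), ("++", "other"), (";", "other"),
         ("count", "simple_name"), ("--", "other"), (";", "other")]
vocab = build_vocabulary([method, other], size=1)
replaced, mapping = replace_rare_tokens(method, vocab)
print(replaced, mapping)
```

Keeping the mapping alongside the replaced tokens is what later allows a tag to be converted back to the original identifier.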
38) Method entry filtering: the purpose of a constructor is to create an instance object of a class. Such methods are simple for developers to read and understand, so they need not be part of the dataset, and all constructors are removed from it. Getter methods, setter methods, and test methods are removed in the same way. In addition, the length of the code sequence of a Java method is limited to between 10 and 400. A Java method whose code sequence is shorter than 10 is too simple, and one whose code sequence is longer than 400 is too complex; neither is suitable for the dataset, so both are filtered out.
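A minimal sketch of the method entry filter of item 38) is given below. How the invention recognizes getter, setter, and test methods is not stated, so the name-prefix check used here is an assumption, as are the record type `JavaMethod`, the function name, and the treatment of the bounds 10 and 400 as inclusive.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class JavaMethod:
    """Hypothetical record of one extracted method."""
    name: str
    class_name: str
    tokens: List[str]   # code sequence after filtering and restructuring


# Assumption: accessors and tests are recognized by their name prefix.
_SKIPPED_PREFIX = re.compile(r"(get|set|test)([A-Z0-9_]|$)")


def keep_method(method: JavaMethod, min_len: int = 10, max_len: int = 400) -> bool:
    """Return True if the method should stay in the dataset."""
    if method.name == method.class_name:        # constructor
        return False
    if _SKIPPED_PREFIX.match(method.name):      # getter, setter, or test
        return False
    return min_len <= len(method.tokens) <= max_len


body = ["tok"] * 50
kept = [m.name for m in [
    JavaMethod("Parser", "Parser", body),
    JavaMethod("getName", "User", body),
    JavaMethod("settleAccounts", "Ledger", body),
    JavaMethod("parseHeader", "Parser", ["tok"] * 5),
    JavaMethod("parseBody", "Parser", body),
] if keep_method(m)]
print(kept)
```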
(4) Javadoc filtering
Javadoc filtering also plays an important role in obtaining the <code, comment> pairs. The dataset is drawn from many Java projects on GitHub, and the quality of the Javadoc of their Java methods is uneven. Because the neural network model is data-driven, the Javadoc must be filtered so that a clean dataset is obtained.
The first sentence of a Javadoc usually expresses the meaning of the whole Java method, so the first sentence of the Javadoc is taken as the comment of the Java method. Methods without Javadoc are not included in the dataset. Simply taking the first sentence of the Javadoc as the comment is far from sufficient, however; the comments obtained in this way must be filtered further.
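Taking the first sentence of a Javadoc comment can be sketched with regular expressions. The rule used here (the sentence ends at the first period followed by whitespace, or at the first block tag such as @param) follows the usual Javadoc convention and is an assumption about how the extraction is done; the function name and the sample comment are hypothetical. Abbreviations such as "e.g." would cut the sentence early under this simple rule.

```python
import re


def first_sentence(javadoc: str) -> str:
    """Return the first sentence of a raw Javadoc comment."""
    # Strip the comment delimiters and the leading '*' of each line.
    body = re.sub(r"^\s*/\*\*|\*/\s*$", "", javadoc.strip())
    lines = [re.sub(r"^\s*\*\s?", "", line) for line in body.splitlines()]
    text = " ".join(line.strip() for line in lines if line.strip())
    # Cut at the first block tag such as @param or @return.
    text = re.split(r"(?:^|\s)@\w+", text, maxsplit=1)[0]
    # A sentence ends at the first period followed by whitespace or the end.
    match = re.search(r"\.(?=\s|$)", text)
    sentence = text[: match.end()] if match else text
    return sentence.strip()


doc = """/**
 * Returns the user with the given id. The lookup is cached.
 *
 * @param id the user id
 * @return the user, or null
 */"""
summary = first_sentence(doc)
print(summary)
```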
41) Filtering comments with templates: some Java methods have comments that provide no useful information for program understanding. First, many comments state that the Java method is used for testing or debugging, or is not implemented. Second, the comments of some Java methods are generated automatically by tools. In addition, many comments contain warnings telling the reader not to use the comment or the Java method. To handle these cases, templates for comment filtering are defined (see Table 4), with the aim of improving the quality of the code comments finally obtained.
Table 4
Figure PCTCN2019088516-appb-000005
Figure PCTCN2019088516-appb-000006
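Template-based comment filtering of the kind described in item 41) can be sketched with regular expressions. The patterns below are hypothetical stand-ins, one for each situation named in the text (testing or debugging, unimplemented methods, tool-generated comments, and usage warnings); they are not the templates of Table 4, which are not reproduced here.

```python
import re

# Hypothetical patterns; the actual templates of Table 4 are not reproduced.
REJECT_PATTERNS = [
    re.compile(r"\b(?:used )?for (?:testing|debugging)\b", re.IGNORECASE),
    re.compile(r"\bnot (?:yet )?implemented\b", re.IGNORECASE),
    re.compile(r"\bauto-?generated\b", re.IGNORECASE),
    re.compile(r"\bdo not use\b", re.IGNORECASE),
]


def is_informative(comment: str) -> bool:
    """Return False if the comment matches any rejection template."""
    return not any(p.search(comment) for p in REJECT_PATTERNS)


comments = [
    "Auto-generated method stub.",
    "This method is not yet implemented.",
    "Do not use this method.",
    "Only used for testing.",
    "Parses the header and returns its fields.",
]
usable = [c for c in comments if is_informative(c)]
print(usable)
```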
42) Filtering comments with part-of-speech tagging: to further improve the quality of the dataset, part-of-speech (POS) tagging is used to filter the comments further. A comment that contains no verb is not adequate as a description of the function of a Java method, and such Java methods are filtered out of the dataset. The Stanford Tagger, the most commonly used English part-of-speech tagging tool, is chosen to obtain the part-of-speech information. In addition, two thresholds restrict the length of a comment to a limited range, with 3 as the minimum length and 30 as the maximum length. Comments outside this length range are filtered out of the dataset.
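The verb and length checks of item 42) can be sketched as below. The tagger is passed in as a function so that the sketch stays self-contained; `toy_tagger` is only a stand-in with a tiny hard-coded verb list and does not reproduce the Stanford Tagger. The sketch assumes Penn Treebank style tags, in which verb tags begin with "VB", and it assumes that comment length is counted in whitespace-separated words.

```python
from typing import Callable, List, Tuple

Tagger = Callable[[List[str]], List[Tuple[str, str]]]


def keep_comment(comment: str, tagger: Tagger,
                 min_len: int = 3, max_len: int = 30) -> bool:
    """Keep a comment only if its length is in range and it contains a verb."""
    words = comment.split()
    if not (min_len <= len(words) <= max_len):
        return False
    # Penn Treebank verb tags: VB, VBD, VBG, VBN, VBP, VBZ.
    return any(tag.startswith("VB") for _, tag in tagger(words))


def toy_tagger(words: List[str]) -> List[Tuple[str, str]]:
    """Stand-in for a real POS tagger; it only knows a few verbs."""
    verbs = {"returns", "parses", "creates"}
    return [(w, "VBZ" if w.lower().strip(".,") in verbs else "NN")
            for w in words]


results = [keep_comment(c, toy_tagger) for c in (
    "Returns the parsed configuration object.",   # verb, 5 words
    "The configuration object.",                  # no verb
    "Returns it.",                                # too short
)]
print(results)
```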
43) Identifier replacement: inspection of Java method comments shows that many identifiers in the code also appear in the corresponding comments. Given this association between code and comments, identifier replacement is performed. For the comments, all unique identifiers are sorted by frequency of occurrence, and the top 30,000 are selected as the comment vocabulary. Then, for the identifiers in the comment vocabulary, four replacement operations are performed: method name replacement, method call replacement, variable type replacement, and variable name replacement. An identifier replaced in the comment matches the identifier replaced in the code. For example, if an identifier that appears in both the code and the comment is in neither vocabulary and is replaced with <SIMPLENAME_1> in the code, the same identifier in the comment is also replaced with <SIMPLENAME_1>. In this way, even though a special token replaces an identifier, the token can be converted back to its original form by recording and using the extracted information. The method of the invention therefore both reduces the size of the comment vocabulary and stores and restores the original form behind each special token.
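The consistent replacement and later restoration described in step 43) can be sketched with a simple mapping. This is a simplified illustration: it uses a single placeholder family `<SIMPLENAME_n>` instead of the four separate replacement operations, and the function names `replace_identifiers` and `restore` are hypothetical.

```python
def replace_identifiers(code_tokens, comment_tokens, code_vocab, comment_vocab):
    """Replace identifiers shared by code and comment that are in neither vocabulary.

    Returns the rewritten code, the rewritten comment, and a reverse mapping
    (placeholder -> original identifier) that allows restoration later.
    """
    mapping = {}  # original identifier -> placeholder
    shared = set(code_tokens) & set(comment_tokens)

    def needs_replacement(token):
        return (token in shared
                and token not in code_vocab
                and token not in comment_vocab)

    def placeholder(token):
        # Number the placeholders in order of first appearance in the code.
        if token not in mapping:
            mapping[token] = f"<SIMPLENAME_{len(mapping) + 1}>"
        return mapping[token]

    new_code = [placeholder(t) if needs_replacement(t) else t for t in code_tokens]
    # Every token replaced in the comment also occurs in the code, so it is
    # already in the mapping and receives the same placeholder.
    new_comment = [mapping[t] if needs_replacement(t) else t for t in comment_tokens]
    reverse = {ph: original for original, ph in mapping.items()}
    return new_code, new_comment, reverse


def restore(tokens, reverse):
    """Convert placeholders in a token sequence back to their original identifiers."""
    return [reverse.get(t, t) for t in tokens]
```

Because the reverse mapping is recorded per method, a generated comment that contains a placeholder can be rewritten with the original identifier after decoding.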
After all of the filtering operations above, there is still no guarantee that the remaining comments are completely accurate and free of noise; the example only aims to build a high-quality dataset.
(5) Training the code comment generation model
In this example, the code comment generation model is trained by applying an encoder-decoder model. Encoder-decoder models have been widely used for neural machine translation. The model consists of two parts: an encoder, which maps the input sequence to a fixed-dimensional vector, and a decoder, which decodes that vector to output the target sequence.
In this example, the long short-term memory (LSTM) network is chosen as the basic unit of the encoder-decoder model.
<code, comment> pairs are used as the training input for the code comment generation model.
The special tokens <sos> and <eos> are added to each training sequence as start and end markers, respectively. The final code vocabulary contains 30,351 entries. The model extends the encoder-decoder model using the TensorFlow framework and is implemented in Python. Hyperparameters are chosen according to the model's performance on the validation set. Stochastic gradient descent (SGD) is used to train and update the parameters. The minibatch size is set to 100, and the dimension of both the LSTM hidden state and the word embeddings is set to 512. The learning rate is initially set to 0.99, and the impact factor is set to 0.8. The upper limit on the parameter gradients is 5. To avoid overfitting, dropout is set to 0.3.
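The training setup above can be collected into a configuration and a small data preparation step, sketched here in plain Python without the TensorFlow model itself. The key names in `HPARAMS` and the helpers `wrap_sequence` and `make_batches` are hypothetical. Adding <sos> and <eos> only to the comment (the target sequence) is an assumption, and the text does not specify the role of the impact factor, so it is recorded without interpretation.

```python
# Hyperparameter values as stated in the text; the key names are illustrative.
HPARAMS = {
    "optimizer": "sgd",
    "batch_size": 100,
    "hidden_size": 512,             # LSTM hidden state dimension
    "embedding_size": 512,          # word embedding dimension
    "initial_learning_rate": 0.99,
    "impact_factor": 0.8,           # role not specified in the text
    "max_gradient": 5.0,            # upper limit on the parameter gradients
    "dropout": 0.3,
}

SOS, EOS = "<sos>", "<eos>"


def wrap_sequence(tokens):
    """Add the start and end markers around a token sequence."""
    return [SOS, *tokens, EOS]


def make_batches(pairs, batch_size=HPARAMS["batch_size"]):
    """Group (code_tokens, comment_tokens) pairs into minibatches.

    Assumption: only the comment, i.e. the target sequence, is wrapped.
    """
    prepared = [(code, wrap_sequence(comment)) for code, comment in pairs]
    return [prepared[i:i + batch_size]
            for i in range(0, len(prepared), batch_size)]
```

With 250 training pairs and the stated minibatch size of 100, `make_batches` yields two full batches and one batch of 50.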
The model is trained on a GPU, and training runs for about 70 steps. The BLEU score is computed on the validation set to select the best model. During decoding, the beam search width is set to 5, and the maximum length of a generated comment is 30.
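The decoding step can be illustrated with a minimal beam search using the stated width of 5 and maximum length of 30. In this sketch the trained decoder is replaced by a hypothetical `step_fn` that maps a token prefix to a dictionary of next-token log-probabilities; `toy_step` is a made-up stand-in, not the model of the invention, and no length normalization is applied.

```python
import math


def beam_search(step_fn, beam_width=5, max_len=30, sos="<sos>", eos="<eos>"):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(prefix) must return {token: log_probability} for the next token.
    """
    beams = [([sos], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in step_fn(tokens).items():
                candidates.append((tokens + [token], score + logp))
        if not candidates:
            break
        # Keep only the beam_width best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    best_tokens, _ = max(finished, key=lambda c: c[1])
    return [t for t in best_tokens if t not in (sos, eos)]


def toy_step(tokens):
    """Made-up next-token distribution that depends only on the last token."""
    table = {
        "<sos>": {"returns": math.log(0.6), "gets": math.log(0.4)},
        "returns": {"the": math.log(0.9), "<eos>": math.log(0.1)},
        "gets": {"the": math.log(1.0)},
        "the": {"size": math.log(0.7), "<eos>": math.log(0.3)},
        "size": {"<eos>": math.log(1.0)},
    }
    return table[tokens[-1]]
```

Calling `beam_search(toy_step)` explores several prefixes in parallel rather than committing to the single most probable token at each step, which is the point of using a width larger than 1.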
In summary, the present invention uses program static analysis, neural networks, and natural language processing to automatically generate comments for Java methods.
Two automatic machine translation metrics, BLEU-4 and METEOR, are used to evaluate the performance of the code comment generation model. BLEU-4 is widely used to measure accuracy in many machine translation tasks. METEOR is a recall-oriented metric. Both metrics are also used in other code comment generation work to measure accuracy.
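As a rough illustration of what BLEU-4 measures, the sketch below computes a sentence-level score against a single reference: the geometric mean of the clipped 1-gram to 4-gram precisions, multiplied by a brevity penalty. It applies no smoothing and is not the evaluation script used for Table 7; reported scores are normally computed at corpus level. The function names `ngrams` and `bleu4` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Sentence-level BLEU-4 with one reference, uniform weights, no smoothing."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, a zero precision gives a zero score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(sum(log_precisions) / 4)
```

A generated comment identical to its reference scores 1.0, while a correct but truncated comment is penalized only through the brevity penalty.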
Table 5 gives the statistics of the dataset used to validate the performance of the invention:
Table 5
Figure PCTCN2019088516-appb-000007
Table 6 gives the statistics of the dataset used to train the code comment generation model:
Table 6
Figure PCTCN2019088516-appb-000008
Table 7 reports the performance of the invention on the BLEU-4 and METEOR metrics:
Table 7
Figure PCTCN2019088516-appb-000009
The code comment generation model is named ContextCC. As Table 7 shows, the proposed method outperforms the most basic encoder-decoder approach, Encoder-Decoder, on both BLEU-4 and METEOR. BLEU-4 reaches 42.01%, nearly 10 percentage points higher than the basic encoder-decoder approach, a relative improvement of 30.30%. METEOR reaches 29.26%, about 7 percentage points higher than the basic encoder-decoder approach, a relative improvement of 30.98%.
The present invention can be applied in many specific ways, and the above describes only preferred embodiments. It should be noted that those of ordinary skill in the art can make further improvements without departing from the principles of the invention, and such improvements shall also be regarded as falling within the scope of protection of the invention.

Claims (9)

  1. A method for generating code comments based on program analysis and a recurrent neural network, characterized by comprising the following steps:
    (1) downloading Java projects and building a code repository;
    (2) extracting, for each Java method in the Java projects, the information contained in the method itself and its dependency information;
    (3) filtering and restructuring the executable code of each Java method according to the information extracted in step (2), in combination with a heuristic method;
    (4) for each Java method in the Java projects, analyzing its Javadoc, defining templates for filtering the Javadoc, and filtering the Javadoc in combination with a part-of-speech tagging method to obtain comments that match the executable code;
    (5) combining the filtered code and the matching comments into a set of <code, comment> pairs, which serves as the training set of the code comment generation model;
    (6) training the code comment generation model on the training set obtained in step (5) using an encoder-decoder model;
    (7) after model training is complete, performing prediction: given the executable code of a Java method, generating the corresponding comment.
  2. The method for generating code comments based on program analysis and a recurrent neural network according to claim 1, characterized in that in step (2), the information contained in each Java method itself and its dependency information are extracted by parsing the abstract syntax tree, the abstract syntax tree being a tree representation of the abstract syntactic structure of the source code.
  3. The method for generating code comments based on program analysis and a recurrent neural network according to claim 1 or 2, characterized in that in step (2), the information contained in the Java method itself comprises the method name, local variable names, local variable types, constant values, and method call information; and the dependency information of the Java method comprises class member variable information, the method declaration information corresponding to each method call, and qualified name information.
  4. The method for generating code comments based on program analysis and a recurrent neural network according to claim 3, characterized in that in step (3), the heuristic method refers to a set of replacement and restructuring rules defined for the executable code of a Java method, which are used to replace constant values, restructure loop and conditional structures, replace method call information, and replace variable names and variable types, thereby filtering and restructuring the Java method.
  5. The method for generating code comments based on program analysis and a recurrent neural network according to claim 1, characterized in that in step (3), the executable code refers to the program code that implements the function of the Java method.
  6. The method for generating code comments based on program analysis and a recurrent neural network according to claim 1, characterized in that in step (4), Javadoc refers to the application programming interface help documentation corresponding to each Java method, which is a semi-structured document.
  7. The method for generating code comments based on program analysis and a recurrent neural network according to claim 1, characterized in that in step (4), the part-of-speech tagging method is the process of determining the grammatical category of each word in a given sentence, identifying its part of speech, and labeling it accordingly.
  8. The method for generating code comments based on program analysis and a recurrent neural network according to claim 1, characterized in that in step (5), a <code, comment> pair expresses the correspondence between the code and the comment of a Java method, and the pairs serve as the training set of the code comment generation model.
  9. The method for generating code comments based on program analysis and a recurrent neural network according to claim 1, characterized in that in step (6), the encoder-decoder model is a model used for neural machine translation and consists of two parts: an encoder, which maps the input sequence to a fixed-dimensional vector, and a decoder, which decodes the fixed-dimensional vector to output the target sequence.
PCT/CN2019/088516 2018-12-21 2019-05-27 Method for generation of code annotations based on program analysis and recurrent neural network WO2019223804A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811568185.0A CN109783079A (en) 2018-12-21 2018-12-21 A kind of code annotation generation method based on program analysis and Recognition with Recurrent Neural Network
CN201811568185.0 2018-12-21

Publications (1)

Publication Number Publication Date
WO2019223804A1 true WO2019223804A1 (en) 2019-11-28

Family

ID=66497551

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/088516 WO2019223804A1 (en) 2018-12-21 2019-05-27 Method for generation of code annotations based on program analysis and recurrent neural network

Country Status (2)

Country Link
CN (1) CN109783079A (en)
WO (1) WO2019223804A1 (en)

Also Published As

Publication number Publication date
CN109783079A (en) 2019-05-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19807540

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19807540

Country of ref document: EP

Kind code of ref document: A1
