CN111651198B - Automatic code abstract generation method and device - Google Patents


Publication number
CN111651198B
Authority
CN
China
Prior art keywords
code
abstract
sequence
function name
function
Prior art date
Legal status
Active
Application number
CN202010312534.3A
Other languages
Chinese (zh)
Other versions
CN111651198A (en)
Inventor
叶蔚
谢睿
张世琨
马森
高庆
孙基男
Current Assignee
Peking University
Original Assignee
Peking University
Priority date
Filing date
Publication date
Application filed by Peking University
Priority to CN202010312534.3A
Publication of CN111651198A
Application granted
Publication of CN111651198B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/43 Checking; Contextual analysis
    • G06F8/436 Semantic checking
    • G06F8/70 Software maintenance or management
    • G06F8/72 Code refactoring
    • G06F8/73 Program documentation

Abstract

Embodiments of the invention provide a method and device for automatically generating code abstracts. The method comprises: encoding the input sequences with the encoders in a code abstract generation model to obtain semantic vectors of the input sequences; decoding the semantic vectors of the input sequences with the decoders in the model to generate a function name and a code abstract for a code sample; calculating the value of the model's loss function from the generated function name and code abstract together with a pre-acquired target function name and target code abstract of the code sample, and training the model according to that value; and inputting the input sequences of a target code into the trained model to generate the code abstract of the target code. By training the code abstract generation model with a multi-task learning mechanism based on a function name prediction task and an automatic code abstract generation task, the embodiments of the invention improve the quality of the automatically generated code abstracts.

Description

Automatic code abstract generation method and device
Technical Field
The invention belongs to the technical field of software maintenance, and particularly relates to a code abstract automatic generation method and device.
Background
A code digest is a short natural-language description of a piece of source code; a high-quality code digest effectively aids code understanding and software maintenance.
Statistics show that program understanding and related activities occupy a large amount of time in the software development life cycle. Writing code digests by hand is tedious and time-consuming, so in actual software development the work of writing code digests is usually neglected. In most software projects, manually written code digests are often mismatched with the code, out of date, or missing altogether.
Automatic code digest generation assists program understanding and software maintenance by generating code digests automatically. It can free software developers from the tedious work of writing code digests, effectively reduce software development and maintenance costs, and greatly improve software development efficiency. Automatically generating high-quality code digests has therefore long been one of the goals of software engineering research.
Mainstream code digest generation methods fall into four categories: information-retrieval-based, code-keyword-based, statistical-language-model-based, and deep-learning-based code digest generation. Examples include generating a code digest by retrieving similar code, generating annotations from keywords extracted from the code, predicting annotations in Java source files using a topic model and an n-gram model, and aligning words in the digest directly with individual code characters through an attention-mechanism-based Recurrent Neural Network (RNN) model to improve the quality of the generated digest. In addition, the code digest task can be modeled as a machine translation problem, so sequence-to-sequence (Seq2Seq) deep learning models are widely applied to code digest generation.
However, in existing deep-learning-based code digest generation methods, the model learns only a single task during training, namely code digest generation, so the learned code representation is limited and the generated digests are of poor quality.
Disclosure of Invention
In order to overcome, or at least partially solve, the problem of the poor quality of code digests obtained by existing automatic code digest generation methods, embodiments of the present invention provide a method and device for automatically generating code digests.
According to a first aspect of the embodiments of the present invention, there is provided a method for automatically generating a code summary, including:
respectively coding a code text sequence, an SBT sequence and a function name sequence of a code sample based on a coder in a code abstract generation model to obtain a semantic vector of the code text sequence, a semantic vector of the SBT sequence and a semantic vector of the function name sequence;
decoding semantic vectors of the code text sequence and semantic vectors of the SBT sequence based on a decoder in the code abstract generation model to generate function names of the code samples, and decoding the semantic vectors of the code text sequence, the semantic vectors of the SBT sequence and the semantic vectors of the function name sequence to generate code abstracts of the code samples;
calculating a loss function value of the code abstract generation model according to the generated function name, the generated code abstract and a target function name and a target code abstract of the code sample acquired in advance, and training the code abstract generation model according to the loss function value;
and inputting a code text sequence, an SBT sequence and a function name sequence of the target code into the trained code abstract generation model, and outputting the code abstract of the target code.
Specifically, before the steps of encoding the code text sequence, SBT sequence and function name sequence of a code sample with the encoders in the code abstract generation model and obtaining the semantic vector of the code text sequence, the semantic vector of the SBT sequence and the semantic vector of the function name sequence, the method further includes:
collecting a source code file from an open source software code hosting website by using a web crawler method;
analyzing the source code file, and extracting a code abstract of each code function from the analyzed source code file;
forming a code-abstract pair of each code function according to the code and the code abstract of each code function, and screening the code-abstract pairs of all the code functions; wherein the code of each code function is taken as a code sample;
preprocessing the screened code-abstract pairs to obtain a code text sequence, an SBT sequence, a function name sequence and a code abstract sequence for each code sample; and taking the code abstract sequence as the sequence of the target code abstract to supervise the training of the code abstract generation model, so that the code abstract generated by the model becomes increasingly similar to the target code abstract.
Specifically, the step of screening the code-summary pairs of all the code functions includes:
acquiring the length of the code abstract in each code-abstract pair, and deleting the code-abstract pairs of which the length is greater than a first preset threshold value or less than a second preset threshold value; wherein the first preset threshold is greater than the second preset threshold;
acquiring the manner in which the code abstract in each code-abstract pair was generated, and deleting the code-abstract pairs whose abstracts were produced by an automatic generation method; the automatic generation methods include the Eclipse auto-generation facility, which automatically generates code abstracts for getter functions;
and judging whether the code abstract in each code-abstract pair is identical to the function name of the code function, and deleting the code-abstract pairs in which the abstract and the function name are the same.
Specifically, the step of preprocessing the screened code-abstract pairs to obtain the code text sequence, SBT sequence, function name sequence and code abstract sequence of each code sample includes:
extracting the text of the code of the code function from each screened code-abstract pair, replacing the function name in the text with a preset character string, dividing the replaced text into a plurality of symbols using separators, splitting each symbol into a plurality of sub-characters according to the camel-case naming convention, and taking the sub-characters of the text symbols in each code-abstract pair as the code text sequence corresponding to that pair;
parsing the code of the code function in each screened code-abstract pair with an abstract syntax tree parsing tool to obtain the abstract syntax tree of the code function, and traversing the abstract syntax tree with the structure-based traversal algorithm to obtain the SBT sequence corresponding to each code-abstract pair;
extracting the function name of the code function from each screened code-abstract pair, taking the extracted function name as the target function name of the corresponding code sample, splitting the target function name according to the camel-case naming convention, and taking the result as the function name sequence;
and performing word segmentation and preprocessing on the code abstract in each screened code-abstract pair to obtain the code abstract sequence corresponding to each pair.
Specifically, the encoder comprises a function name encoder, an SBT encoder and a code text encoder;
the input of the function name encoder is the function name sequence of the code sample, and the output is the semantic vector of the function name sequence;
the input of the SBT encoder is the SBT sequence of the code sample, and the output is the semantic vector of the SBT sequence;
the input of the code text encoder is the code text sequence of the code sample, and the output is the semantic vector of the code text sequence;
the function name encoder, the SBT encoder and the code text encoder all use LSTMs as encoders.
Specifically, the decoder comprises a function name decoder and a code digest decoder;
the input of the function name decoder is a semantic vector of the code text sequence and a semantic vector of the SBT sequence, and the output is a function name of the code sample;
the input of the code abstract decoder is a semantic vector of a code text sequence, a semantic vector of an SBT sequence and a semantic vector of a function name sequence, and the output is a code abstract of the code sample;
both the function name decoder and the code digest decoder use LSTM as decoders.
Specifically, the formula of the loss function of the code digest generation model is as follows:

loss = α·loss_cs + β·loss_mnp

loss_cs = −Σ_{i∈S} Σ_j log P(y_j^(cs,i) | y_{<j}^(cs,i), x^(i); θ)

loss_mnp = −Σ_{i∈S} Σ_j log P(y_j^(mnp,i) | y_{<j}^(mnp,i), x^(i); θ)

where loss denotes the loss function of the code digest generation model, loss_cs denotes the code digest prediction loss, loss_mnp denotes the function name prediction loss, and α and β are the weights of the code digest prediction loss and the function name prediction loss respectively; S denotes all code samples in the training set, y^(cs,i) denotes the code digest in the i-th code-digest pair, x^(i) denotes the code in the i-th code-digest pair, and θ denotes all parameters of the code digest generation model; P(y^(cs,i) | x^(i); θ), of which the factors above are the chain-rule decomposition, denotes the probability that the generation result of the model is y^(cs,i) given x^(i) and θ; y_{<j}^(cs,i) denotes the first j characters of the code digest in the i-th code-digest pair, y^(mnp,i) denotes the function name in the i-th code-function name pair, and y_{<j}^(mnp,i) denotes the first j characters of that function name.
According to a second aspect of the embodiments of the present invention, there is provided an apparatus for automatically generating a code summary, including:
the encoding module is used for respectively encoding a code text sequence, an SBT sequence and a function name sequence of a code sample based on an encoder in a code abstract generation model to obtain a semantic vector of the code text sequence, a semantic vector of the SBT sequence and a semantic vector of the function name sequence;
the decoding module is used for decoding the semantic vector of the code text sequence and the semantic vector of the SBT sequence based on a decoder in the code abstract generation model, generating a function name of the code sample, decoding the semantic vector of the code text sequence, the semantic vector of the SBT sequence and the semantic vector of the function name sequence, and generating a code abstract of the code sample;
the calculation module is used for calculating a loss function value of the code abstract generating model according to the generated function name, the generated code abstract and a pre-acquired target function name and a target code abstract of the code sample, and training the code abstract generating model according to the value of the loss function;
and the generating module is used for inputting the code text sequence, the SBT sequence and the function name sequence of the target code into the trained code abstract generating model and outputting the code abstract of the target code.
According to a third aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor calls the program instructions to perform the code summary automatic generation method provided in any one of the various possible implementation manners of the first aspect.
According to a fourth aspect of the embodiments of the present invention, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the code summary automatic generation method provided in any one of the various possible implementations of the first aspect.
Embodiments of the present invention provide a method and device for automatically generating code abstracts. The method encodes the input sequences of the code abstract generation model with dedicated encoders, which enhances the semantic information of the input sequences and makes the obtained semantic information richer, and trains the code abstract generation model with a multi-task learning mechanism based on a function name prediction task and an automatic code abstract generation task, which both promotes the encoding capability of the encoders and focuses them on extracting key information, thereby improving the quality of the automatically generated code abstracts.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an overall flow of an automatic code summary generation method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a training process and a generating process in the automatic code digest generation method according to the embodiment of the present invention;
FIG. 3 is a schematic diagram of an architecture of a code generation model in an automated code summarization generation method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the overall structure of an automatic code summary generation apparatus according to an embodiment of the present invention;
fig. 5 is a schematic view of an overall structure of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the described embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1 is a schematic overall flow chart of an automatic code digest generation method provided in an embodiment of the present invention, where the method includes: s101, respectively coding a code text sequence, an SBT sequence and a function name sequence of a code sample based on a coder in a code abstract generation model, and acquiring a semantic vector of the code text sequence, a semantic vector of the SBT sequence and a semantic vector of the function name sequence;
the code digest generation model is a model for automatically generating a code digest, and includes an encoder and a decoder. The encoder is used for encoding a code text sequence, an SBT (Structure-Based Traversal algorithm) sequence and a function name sequence of a code sample input into the encoder in a training process, and encoding a code text sequence, an SBT sequence and a function name sequence of a target code input into the encoder in a generating process. The target code is the code which needs to be subjected to code abstract generation. The code text sequence is a sequence generated according to a text in a code, the function name sequence is a sequence generated according to a function name, the SBT sequence is a sequence form converted from an Abstract Syntax Tree (AST) of a code sample by an SBT algorithm, and association relation between code characters and Abstract Syntax Tree nodes is reserved while Abstract Syntax Tree structure information of the code sample is not lost.
S102, decoding the semantic vector of the code text sequence and the semantic vector of the SBT sequence based on a decoder in the code abstract generation model to generate a function name of the code sample, and decoding the semantic vector of the code text sequence, the semantic vector of the SBT sequence and the semantic vector of the function name sequence to generate a code abstract of the code sample;
the decoder is used for decoding the code text sequence of the code sample input into the decoder and the semantic vector of the SBT sequence to generate a function name of the code sample in the training process, and decoding the code text sequence of the code sample input into the decoder, the SBT sequence and the semantic vector of the function name sequence to generate a code abstract of the code sample. The function name prediction task in this embodiment may be regarded as a special code summarization task, and a code of a given code function predicts a corresponding function name sequence. And decoding semantic vectors of a code text sequence, an SBT sequence and a function name sequence of the target code input into the decoder in the generation process to generate a code abstract of the target code.
In this embodiment, the code representation learning task and the code digest generation task are two closely related tasks. Code representation learning generates a vectorized representation for each code character, or a vectorized representation for an entire piece of code, that encodes the semantic information of the corresponding code. The code digest task, as a downstream task, generates the code digest from the vectorized representation of the code: the higher the quality of the vectorized representation, the higher the quality of the generated digest.
S103, calculating a loss function value of a code abstract generation model according to the generated function name, the generated code abstract and a target function name and a target code abstract of the code sample acquired in advance, and training the code abstract generation model according to the loss function value;
the target function name is an actual function name of the code sample, and the target code abstract can be a code abstract of a code sample labeled manually. And acquiring function name prediction loss according to the generated function name and the target function name, and acquiring code digest prediction loss according to the generated code digest and the target code digest. And training the code abstract generating model by taking the function name prediction loss and the code abstract prediction loss as loss functions of the code abstract generating model. In the embodiment, multi-task learning is adopted, and function name prediction loss and code summary prediction loss are calculated at the same time, so that a decoder can complete own tasks and pay more attention to key information in a code sample.
And S104, inputting the code text sequence, the SBT sequence and the function name sequence of the target code into the trained code abstract generation model, and outputting the code abstract of the target code.
The target codes can be codes lacking in abstracts in the local code base, the codes are processed, the trained code abstract model is used for generating corresponding code abstracts, and the generated code abstracts are written back to the local code base, so that developers can better understand the local codes. The specific flow of the training process and the generation process is shown in fig. 2. The detailed architecture of the code digest generation model is shown in fig. 3.
This embodiment encodes the input sequences of the code abstract generation model with the encoders, enhancing the semantic information of the input sequences and making it richer, and trains the code abstract generation model with a multi-task learning mechanism based on a function name prediction task and an automatic code abstract generation task; this promotes the encoding capability of the encoders and also focuses them on extracting key information, thereby improving the quality of the automatically generated code abstracts.
On the basis of the foregoing embodiment, in this embodiment, before the steps of encoding the code text sequence, SBT sequence and function name sequence of a code sample with the encoders in the code digest generation model and acquiring the semantic vector of each sequence, the method further includes: collecting source code files from open-source software code hosting websites using a web crawler;
in the process of training the code abstract generation model, firstly, the open source code is crawled as a training corpus, and then the corpus is processed and trained to obtain a final code abstract model. The trained code abstract model can be reused. Data collection is carried out from an open source software code hosting website, evaluation indexes of the hosting website, such as the number of fork and star, are used for screening a crawled code warehouse, and finally, a Java file with Java as the tail end is selected from all files to serve as a source code file.
Analyzing the source code file, and extracting a code abstract of each code function from the analyzed source code file; forming a code-abstract pair of each code function according to the code and the code abstract of each code function, and screening the code-abstract pairs of all the code functions; wherein, the code of each code function is taken as a code sample;
the source code file is parsed to generate < code-summary > pairs. For a high quality source code file, each code function typically has a corresponding code digest, which is usually written over the code function in annotated form. The first sentence is extracted from the annotation of the code function as a code digest, forming a < code-digest > pair. The function name is extracted from each code function, and the function name and the code of the code function are formed into a < code-function name > pair. The low quality < code-summary > pairs are filtered resulting in a high quality < code-summary > corpus.
For example, the < code-digest > pair of code functions generated is:
public void addTableListener(Listener listener){
tableListeners.add(listener);
}
the code digest extracted from the annotation of the code function is:
adds a listener to the current table.
Comparing the function name addTableListener with the corresponding code digest shows that the code digest is typically an extension of the function name, which can assist the generation of a high-quality digest.
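The patent does not name the tool used to parse source files into <code-digest> pairs. Purely as an illustration, the following Python sketch shows one way a pair like the one above could be pulled from a Javadoc-style comment; the regular expression, the helper names, and the choice to capture only the method signature are assumptions made for brevity, not part of the patented method.

import re

# Rough, illustrative pattern: a Javadoc block followed by a method
# declaration. A real implementation would use a proper Java parser.
METHOD_WITH_DOC = re.compile(
    r"/\*\*(?P<doc>.*?)\*/\s*"
    r"(?P<sig>(?:public|protected|private)[^;{]*\{)",
    re.DOTALL,
)

def first_sentence(javadoc: str) -> str:
    # Strip the leading '*' decoration of each line, join, and keep the
    # first sentence, cutting off any @param/@return tag section.
    text = " ".join(line.strip().lstrip("*").strip()
                    for line in javadoc.splitlines())
    return text.split("@")[0].split(". ")[0].strip()

def extract_pairs(source: str):
    # Yield (method signature, code digest) pairs from one .java file.
    for m in METHOD_WITH_DOC.finditer(source):
        digest = first_sentence(m.group("doc"))
        if digest:
            yield m.group("sig"), digest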
The screened code-digest pairs are preprocessed to obtain the input sequences required for training the code digest generation model, namely the code text sequence, SBT sequence, function name sequence and code digest sequence of each code sample. The code digest sequence serves as the target code digest sequence and supervises the training of the model, so that the code digest generated by the model becomes increasingly similar to the target code digest.
On the basis of the above embodiment, the step of screening the code-digest pairs of all code functions in this embodiment includes: acquiring the length of the code digest in each code-digest pair, and deleting the pairs whose digest length is greater than a first preset threshold or less than a second preset threshold, the first preset threshold being greater than the second; for example, the first preset threshold may be 100 and the second preset threshold 5;
acquiring the manner in which the code digest in each code-digest pair was generated, and deleting the pairs whose digests were produced by an automatic generation method; the automatic generation methods include the Eclipse auto-generation facility, which automatically generates code digests for getter functions;
and judging whether the code digest in each code-digest pair is identical to the function name of the code function, and deleting the pairs in which the digest and the function name are the same.
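A minimal Python sketch of this screening step follows. The thresholds use the example values given in the text (100 and 5); counting digest length in words, and the string heuristic for spotting Eclipse-style auto-generated getter digests, are assumptions of this sketch rather than details fixed by the patent.

MAX_LEN = 100  # first preset threshold (example value from the text)
MIN_LEN = 5    # second preset threshold (example value from the text)

def keep_pair(digest: str, function_name: str) -> bool:
    # Length screening; counting words here is an assumption.
    n = len(digest.split())
    if n > MAX_LEN or n < MIN_LEN:
        return False
    # Illustrative heuristic for auto-generated getter digests of the
    # kind Eclipse produces, e.g. "@return the name".
    if digest.lower().startswith(("gets the", "returns the", "@return the")):
        return False
    # Drop pairs whose digest is just the function name.
    if digest.replace(" ", "").lower() == function_name.lower():
        return False
    return True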
On the basis of the foregoing embodiment, in this embodiment, the step of preprocessing the screened code-digest pairs to obtain the code text sequence, SBT sequence, function name sequence and code digest sequence of each code sample specifically includes: extracting the text of the code of the code function from each screened code-digest pair, replacing the function name in the text with a preset character string, dividing the replaced text into a plurality of symbols using separators, splitting each symbol into a plurality of sub-characters according to the camel-case naming convention, and taking the sub-characters of the text symbols in each code-digest pair as the code text sequence corresponding to that pair;
for example, the function name is replaced with the preset character string <METHOD_NAME> to prevent information leakage when the function name is generated; the text of the code is divided into a code symbol sequence using spaces and special symbols as separators; finally, each symbol is split into sub-characters according to the camel-case naming convention to obtain the final code text sequence (a sketch of this splitting is given below, after the preprocessing steps);
parsing the code of the code function in each screened code-digest pair with an abstract syntax tree parsing tool to obtain the abstract syntax tree of the code function, and traversing the abstract syntax tree with the structure-based traversal algorithm to obtain the SBT sequence corresponding to each code-digest pair (a sketch of this traversal is also given below);
extracting the function name of the code function from each screened code-digest pair, taking the extracted function name as the target function name of the corresponding code sample, splitting the target function name according to the camel-case naming convention, and taking the result as the function name sequence;
and performing word segmentation and preprocessing on the code digest in each screened code-digest pair to obtain the code digest sequence corresponding to each pair.
The preprocessing here includes removing all special characters and numeric characters from the code digest and converting all words to lowercase. The code text sequence, SBT sequence and function name sequence serve as the input of the code digest generation model, which generates the corresponding code digest from this input; the code digest sequence serves as the model's target code digest sequence and supervises model training, so that the code digest generated by the model comes ever closer to the target code digest.
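The splitting described above (mask the function name, cut on separators, then split each symbol at camel-case boundaries) can be sketched in Python as follows; the exact separator set and the dropping of punctuation-only chunks are assumptions of this sketch.

import re

def split_camel(symbol: str):
    # Split one symbol into sub-characters at camel-case boundaries,
    # e.g. "addTableListener" -> ["add", "table", "listener"].
    parts = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", symbol)
    return [p.lower() for p in parts]

def code_text_sequence(code: str, function_name: str):
    masked = code.replace(function_name, "<METHOD_NAME>")
    seq = []
    # Spaces and special symbols act as separators; chunks without
    # word characters are dropped in this sketch.
    for sym in re.split(r"[^\w<>]+", masked):
        if sym == "<METHOD_NAME>":
            seq.append(sym)
        elif re.search(r"\w", sym):
            seq.extend(split_camel(sym))
    return seq

Applied to the addTableListener example above, this yields a sequence beginning: public, void, <METHOD_NAME>, listener, listener, table, listeners, add, listener.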
In the generation process, the same preprocessing method is used to acquire the code text sequence, SBT sequence and function name sequence of the target code. First, data are collected from the local code base, and files ending in .java are selected as source code files. The source code files are then parsed to extract the code corpus lacking digests, which serves as the target code: all source code files are parsed, and any code function found to lack a code digest is added to the digest-less code corpus. The digest-less corpus is preprocessed to obtain the input sequences required for code digest generation, namely the code text sequence, SBT sequence and function name sequence, which are fed to the code digest generation model to generate the code digests.
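The structure-based traversal mentioned above can be sketched as the short recursive procedure below, following the published SBT formulation; the node interface (.label and .children) is an assumption, since the patent does not name the AST parsing library.

def sbt(node):
    # Structure-Based Traversal: flatten an AST into a bracketed token
    # sequence from which the tree structure can be recovered.
    if not node.children:                      # leaf node
        return ["(", node.label, ")", node.label]
    seq = ["(", node.label]
    for child in node.children:                # recurse in order
        seq += sbt(child)
    seq += [")", node.label]
    return seq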
On the basis of the above embodiments, in this embodiment the encoders comprise a function name encoder, an SBT encoder and a code text encoder. The input of the function name encoder is the function name sequence of the code sample, and its output is the semantic vector of the function name sequence; the input of the SBT encoder is the SBT sequence of the code sample, and its output is the semantic vector of the SBT sequence; the input of the code text encoder is the code text sequence of the code sample, and its output is the semantic vector of the code text sequence. The function name encoder, SBT encoder and code text encoder all use an LSTM (Long Short-Term Memory) network as the encoder.
Specifically, the encoding process adopts a multi-encoder architecture that encodes each input sequence into a corresponding semantic vector containing the semantic information of that sequence. Compared with the traditional single-encoder architecture, the multi-encoder architecture obtains not only the textual semantic information of the function's code but also its structural semantic information, while the function name encoder reinforces the semantic information of the function name, so the obtained semantic vectors are richer. The semantic vectors of the code text sequence and the function name sequence contain the textual semantic information of the code and the function name, and the semantic vector of the SBT sequence contains the structural semantic information of the code.
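A minimal PyTorch sketch of this multi-encoder arrangement is given below; the embedding size, hidden size, and vocabulary sizes are illustrative values, not taken from the patent.

import torch
import torch.nn as nn

class SeqEncoder(nn.Module):
    # One LSTM encoder; the model instantiates three of these, one each
    # for the code text, SBT, and function name sequences.
    def __init__(self, vocab_size, emb_dim=256, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)

    def forward(self, ids):
        # ids: (batch, seq_len) token indices.
        out, (h, _) = self.lstm(self.embed(ids))
        return out, h.squeeze(0)   # all hidden states, final semantic vector

code_text_encoder = SeqEncoder(vocab_size=30000)
sbt_encoder = SeqEncoder(vocab_size=1000)
function_name_encoder = SeqEncoder(vocab_size=10000)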
On the basis of the above embodiments, the decoder in this embodiment includes a function name decoder and a code digest decoder; the input of the function name decoder is a semantic vector of the code text sequence and a semantic vector of the SBT sequence, and the output is a function name of the code sample; the input of the code abstract decoder is a semantic vector of a code text sequence, a semantic vector of an SBT sequence and a semantic vector of a function name sequence, and the output is a code abstract of the code sample; both the function name decoder and the code digest decoder use LSTM as decoders.
Specifically, the decoding process employs a multi-decoder architecture and trains both decoders simultaneously using a multi-task learning mechanism based on a function name prediction task and a code digest prediction task. The addition of the function name prediction task not only promotes the improvement of the encoding capacity of the encoder, but also enables the encoder to be more concentrated on the extraction of key information, thereby further improving the quality of the automatically generated code abstract.
The function name decoder uses LSTM as a decoder thereof, and uses an attention mechanism in the decoding process to enable the decoder to pay more attention to key information in an input sequence, and decodes semantic vectors of the SBT sequence and semantic vectors of the code text sequence into corresponding function names. The addition of the function name decoder improves the encoding capability of the SBT encoder and the code text encoder, so that the semantic information of the generated semantic vector of the SBT sequence and the semantic vector of the code text sequence is richer.
The code digest decoder uses an LSTM as its decoder and applies an attention mechanism during decoding so that it pays more attention to the key information in the input sequences; it decodes the semantic vector of the function name sequence, the semantic vector of the SBT sequence and the semantic vector of the code text sequence into the corresponding code digest.
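The decoding step can be sketched in PyTorch as follows. Dot-product attention over the concatenated encoder states is an assumption of this sketch; the patent does not specify the attention variant or how the encoder outputs are combined.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderStep(nn.Module):
    # One LSTM decoding step with dot-product attention over the encoder
    # states; in this sketch the same module shape serves both the
    # function name decoder and the code digest decoder.
    def __init__(self, vocab_size, emb_dim=256, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.LSTMCell(emb_dim + hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_token, state, enc_out):
        # prev_token: (batch,) previous output token; state: (h, c);
        # enc_out: (batch, src_len, hid) concatenated encoder states.
        h, c = state
        scores = torch.bmm(enc_out, h.unsqueeze(2)).squeeze(2)
        attn = F.softmax(scores, dim=1)                   # attention weights
        context = torch.bmm(attn.unsqueeze(1), enc_out).squeeze(1)
        h, c = self.cell(torch.cat([self.embed(prev_token), context], 1), (h, c))
        return self.out(h), (h, c)                        # vocab logits, new state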
On the basis of the above embodiments, the formula of the loss function of the code digest generation model is as follows:

loss = α·loss_cs + β·loss_mnp

loss_cs = −Σ_{i∈S} Σ_j log P(y_j^(cs,i) | y_{<j}^(cs,i), x^(i); θ)

loss_mnp = −Σ_{i∈S} Σ_j log P(y_j^(mnp,i) | y_{<j}^(mnp,i), x^(i); θ)

where loss denotes the loss function of the code digest generation model, loss_cs denotes the code digest prediction loss, loss_mnp denotes the function name prediction loss, and α and β are the weights of the two losses; by adjusting α and β, the influence of the function name prediction loss on code digest generation during training can be controlled. S denotes all code samples in the training set, y^(cs,i) denotes the code digest in the i-th code-digest pair, x^(i) denotes the code in the i-th code-digest pair, and θ denotes all parameters of the code digest generation model; P(y^(cs,i) | x^(i); θ), of which the factors above are the chain-rule decomposition, denotes the probability that the generation result of the model is y^(cs,i) given x^(i) and θ; y_{<j}^(cs,i) denotes the first j characters of the code digest in the i-th code-digest pair, y^(mnp,i) denotes the function name in the i-th code-function name pair, and y_{<j}^(mnp,i) denotes the first j characters of that function name.
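A sketch of computing this joint loss in PyTorch is shown below; each term is the summed cross-entropy (negative log-likelihood) over the target characters, and the values of ALPHA and BETA are placeholders to be tuned, since the patent does not fix α and β.

import torch.nn.functional as F

ALPHA, BETA = 1.0, 0.5   # illustrative weights for loss_cs and loss_mnp

def joint_loss(cs_logits, cs_targets, mnp_logits, mnp_targets):
    # *_logits: (batch, steps, vocab); *_targets: (batch, steps),
    # with index 0 assumed to be padding.
    loss_cs = F.cross_entropy(cs_logits.reshape(-1, cs_logits.size(-1)),
                              cs_targets.reshape(-1),
                              ignore_index=0, reduction="sum")
    loss_mnp = F.cross_entropy(mnp_logits.reshape(-1, mnp_logits.size(-1)),
                               mnp_targets.reshape(-1),
                               ignore_index=0, reduction="sum")
    return ALPHA * loss_cs + BETA * loss_mnp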
To test the effectiveness of this embodiment, all Java projects on GitHub with a star count exceeding 100 were collected; 95% of the projects were used as the training set and 5% as the test set. Training followed the training process described above, and after training finished, evaluation was carried out on the test set.
In order to better verify the effect of code digest generation, BLEU and METEOR are selected as evaluation indexes. BLEU is an accuracy-based similarity metric that analyzes the degree of n-gram co-occurrence between two texts and is often used to evaluate machine translation results. METEOR is an evaluation index based on both precision and recall: it takes into account precision and recall over the whole corpus, is another common index for evaluating corpus similarity, and is designed to remedy some inherent shortcomings of BLEU. METEOR also considers synonym matching and the like: based on the WordNet synonym database, an alignment m is first computed, chosen to minimize the number ch of chunks of contiguous, ordered words in the corresponding sentence, and the weighted harmonic mean of precision and recall between the best candidate translation and the reference translation is then calculated.
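Scores of this kind can be computed, for example, with NLTK, as in the sketch below; note that, depending on the NLTK version, meteor_score expects pre-tokenized input and requires the WordNet corpus to be downloaded.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

reference = ["adds", "a", "listener", "to", "the", "current", "table"]
hypothesis = ["adds", "a", "listener", "to", "the", "table"]

bleu = corpus_bleu([[reference]], [hypothesis],
                   smoothing_function=SmoothingFunction().method4)
meteor = meteor_score([reference], hypothesis)
print(f"BLEU-4: {bleu:.3f}  METEOR: {meteor:.3f}")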
To verify the effectiveness of the method provided by the invention, the existing models CodeNN, SBT, Seq2Seq and ast-attendgru were selected as baselines, where ast-attendgru is the current best code digest generation model. CodeNN is an end-to-end code digest model that uses a long short-term memory network to automatically generate a code digest for a piece of code; at each generation step, CodeNN uses an attention mechanism to make the model focus on the important information in the code. SBT is a sequence-to-sequence machine translation model based on an attention mechanism; it proposes a new structure-based traversal algorithm that converts an abstract syntax tree into sequence form, so that the model can better extract the structural information of the code. Seq2Seq is a classic implementation of an attention-based sequence-to-sequence model that takes the code text sequence as input and uses the textual semantic information of the code to automatically generate its digest. ast-attendgru takes the code text sequence and the SBT sequence as inputs simultaneously and encodes the two sequences with a multi-encoder architecture.
Code digests were generated on the test set with the baseline models and the code digest generation model of this embodiment, and the generated digests were evaluated with BLEU and METEOR respectively; the results are shown in Table 1.
TABLE 1 comparison of results for code digest generation model
(Table 1 appears as an image in the original publication; the per-model BLEU and METEOR scores are not recoverable from the text.)
The evaluation results show that this embodiment can automatically generate high-quality code digests; compared with the existing methods, the quality of the generated digests is greatly improved.
To further examine the effect of this embodiment, an actually developed local project was selected and run through the generation process described above, reusing the code digest model trained in the evaluation above.
The following are 4 examples from this project, where the manual digest is a digest written by a programmer whose native language is English after reading the source code, the generated digest is the code digest automatically generated by the code digest decoder of the code digest generation model in this embodiment, and the generated function name is the function name automatically generated by the function name decoder.
Example 1, source code: public void addTableListener(Listener listener){
tableListeners.add(listener);
};
Manual digest: adds a listener to the current table;
Generated digest: adds a listener to the table;
Generated function name: add table listener.
Example 2, source code: public void add(ELResolver elResolver){
elResolver.add(elResolver);
};
Manual digest: adds the given resolver to the list of component resolvers;
Generated digest: adds the given resolver to the list of resolvers;
Generated function name: add resolver.
Example 3, source code:
(source code shown as an image in the original publication)
Manual digest: returns true if the list contains the current selected item;
Generated digest: returns whether the list contains the selected item;
Generated function name: list contains selected item.
Example 4, source code:
(source code shown as an image in the original publication)
Manual digest: increases the capacity of and internally reorganizes this hash table;
Generated digest: return the map to the new capacity;
Generated function name: rehash capacity.
It can be seen that the generated digests contain essentially all the key information in the source code and are essentially consistent with the semantics expressed by the manual digests, which shows that this embodiment can effectively extract the key information in code and automatically generate high-quality code digests for it.
In another embodiment of the present invention, an automatic code summarization generation device is provided, which is used for implementing the methods in the foregoing embodiments. Therefore, the description and definition in the embodiments of the code abstract automatic generation method can be used for understanding the execution modules in the embodiments of the present invention. Fig. 4 is a schematic diagram of an overall structure of an automatic code summary generation apparatus provided in an embodiment of the present invention, where the apparatus includes an encoding module 401, a decoding module 402, a calculating module 403, and a generating module 404, where:
the encoding module 401 is configured to encode a code text sequence, an SBT sequence, and a function name sequence of a code sample respectively based on an encoder in a code digest generation model, and obtain a semantic vector of the code text sequence, a semantic vector of the SBT sequence, and a semantic vector of the function name sequence;
the decoding module 402 is configured to decode a semantic vector of the code text sequence and a semantic vector of the SBT sequence based on a decoder in the code digest generation model, generate a function name of the code sample, and decode the semantic vector of the code text sequence, the semantic vector of the SBT sequence, and the semantic vector of the function name sequence, and generate a code digest of the code sample;
the calculation module 403 is configured to calculate a loss function value of the code digest generation model according to the generated function name, the generated code digest, and a target function name and a target code digest of the code sample, which are obtained in advance, and train the code digest generation model according to the value of the loss function;
the generating module 404 is configured to input a code text sequence, an SBT sequence, and a function name sequence of a target code into the trained code digest generation model, and output a code digest of the target code.
This embodiment encodes the input sequences of the code abstract generation model with the encoders, enhancing the semantic information of the input sequences and making it richer, and trains the code abstract generation model with a multi-task learning mechanism based on a function name prediction task and an automatic code abstract generation task; this promotes the encoding capability of the encoders and also focuses them on extracting key information, thereby improving the quality of the automatically generated code abstracts.
On the basis of the above embodiment, the present embodiment further includes an obtaining module configured to: collect source code files from open-source software code hosting websites using a web crawler; parse the source code files and extract the code digest of each code function from the parsed files; form a code-digest pair for each code function from its code and code digest, and screen the code-digest pairs of all code functions, the code of each code function serving as a code sample; and preprocess the screened code-digest pairs to obtain the code text sequence, SBT sequence, function name sequence and code digest sequence of each code sample, the code digest sequence serving as the target code digest sequence that supervises the training of the code digest generation model, so that the code digest generated by the model becomes increasingly similar to the target code digest.
On the basis of the foregoing embodiment, the obtaining module in this embodiment is specifically configured to: acquire the length of the code digest in each code-digest pair, and delete the pairs whose digest length is greater than a first preset threshold or less than a second preset threshold, the first preset threshold being greater than the second; acquire the manner in which the code digest in each code-digest pair was generated, and delete the pairs whose digests were produced by an automatic generation method, the automatic generation methods including the Eclipse auto-generation facility, which automatically generates code digests for getter functions; and judge whether the code digest in each code-digest pair is identical to the function name of the code function, and delete the pairs in which the digest and the function name are the same.
On the basis of the foregoing embodiment, the obtaining module in this embodiment is specifically configured to: extract the text of the code of the code function from each screened code-digest pair, replace the function name in the text with a preset character string, divide the replaced text into a plurality of symbols using separators, split each symbol into a plurality of sub-characters according to the camel-case naming convention, and take the sub-characters of the text symbols in each code-digest pair as the code text sequence corresponding to that pair; parse the code of the code function in each screened code-digest pair with an abstract syntax tree parsing tool to obtain the abstract syntax tree of the code function, and traverse the tree with the structure-based traversal algorithm to obtain the SBT sequence corresponding to each pair; extract the function name of the code function from each screened pair, take it as the target function name of the corresponding code sample, split the target function name according to the camel-case naming convention, and take the result as the function name sequence; and perform word segmentation and preprocessing on the code digest in each screened pair to obtain the code digest sequence corresponding to each pair.
On the basis of the above embodiments, the encoder in this embodiment includes a function name encoder, an SBT encoder and a code text encoder; the input of the function name encoder is the function name sequence of the code sample, and the output is the semantic vector of the function name sequence; the input of the SBT encoder is the SBT sequence of the code sample, and the output is the semantic vector of the SBT sequence; the input of the code text encoder is the code text sequence of the code sample, and the output is the semantic vector of the code text sequence; the function name encoder, SBT encoder and code text encoder all use LSTMs as encoders.
On the basis of the above embodiments, the decoder in this embodiment includes a function name decoder and a code digest decoder; the input of the function name decoder is a semantic vector of the code text sequence and a semantic vector of the SBT sequence, and the output is a function name of the code sample; the input of the code abstract decoder is a semantic vector of a code text sequence, a semantic vector of an SBT sequence and a semantic vector of a function name sequence, and the output is a code abstract of the code sample; both the function name decoder and the code digest decoder use LSTM as decoders.
On the basis of the foregoing embodiments, the formula of the loss function of the code digest generation model in this embodiment is as follows:

loss = α·loss_cs + β·loss_mnp

loss_cs = −Σ_{i∈S} Σ_j log P(y_j^(cs,i) | y_{<j}^(cs,i), x^(i); θ)

loss_mnp = −Σ_{i∈S} Σ_j log P(y_j^(mnp,i) | y_{<j}^(mnp,i), x^(i); θ)

where loss denotes the loss function of the code digest generation model, loss_cs denotes the code digest prediction loss, loss_mnp denotes the function name prediction loss, and α and β are the weights of the code digest prediction loss and the function name prediction loss respectively; S denotes all code samples in the training set, y^(cs,i) denotes the code digest in the i-th code-digest pair, x^(i) denotes the code in the i-th code-digest pair, and θ denotes all parameters of the code digest generation model; P(y^(cs,i) | x^(i); θ), of which the factors above are the chain-rule decomposition, denotes the probability that the generation result of the model is y^(cs,i) given x^(i) and θ; y_{<j}^(cs,i) denotes the first j characters of the code digest in the i-th code-digest pair, y^(mnp,i) denotes the function name in the i-th code-function name pair, and y_{<j}^(mnp,i) denotes the first j characters of that function name.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor)501, a communication Interface (Communications Interface)502, a memory (memory)503, and a communication bus 504, wherein the processor 501, the communication Interface 502, and the memory 503 are configured to communicate with each other via the communication bus 504. The processor 501 may call logic instructions in the memory 503 to perform the following method: respectively encoding the input sequence based on an encoder in a code abstract generation model to obtain a semantic vector of the input sequence; decoding a semantic vector of an input sequence based on a decoder in a code abstract generation model to generate a function name and a code abstract of a code sample; calculating a loss function value of the code abstract generation model according to the generated function name, the generated code abstract, and a target function name and a target code abstract of a pre-acquired code sample, and training the code abstract generation model according to the loss function value; and inputting the input sequence of the target code into a trained code abstract generation model to generate the code abstract of the target code.
Furthermore, the logic instructions in the memory 503 may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The present embodiments provide a non-transitory computer readable storage medium storing computer instructions, the computer instructions causing a computer to perform the methods provided by the above method embodiments, for example, including: respectively coding the input sequence based on a coder in a code abstract generation model to obtain a semantic vector of the input sequence; decoding a semantic vector of an input sequence by a decoder in a code abstract generation model to generate a function name and a code abstract of a code sample; calculating the value of a loss function of the code abstract generation model according to the generated function name, the generated code abstract and a target function name and a target code abstract of a pre-acquired code sample, and training the code abstract generation model according to the value of the loss function; and inputting the input sequence of the target code into a trained code abstract generation model to generate the code abstract of the target code.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement the present invention without any inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may be modified or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. An automatic code abstract generation method, characterized by comprising the following steps:
respectively encoding a code text sequence, a structure-based traversal algorithm sequence and a function name sequence of a code sample by an encoder in a code abstract generation model to obtain a semantic vector of the code text sequence, a semantic vector of the structure-based traversal algorithm sequence and a semantic vector of the function name sequence;
decoding the semantic vector of the code text sequence and the semantic vector of the structure-based traversal algorithm sequence by a decoder in the code abstract generation model to generate a function name of the code sample, and decoding the semantic vector of the code text sequence, the semantic vector of the structure-based traversal algorithm sequence and the semantic vector of the function name sequence to generate a code abstract of the code sample;
calculating a loss function value of the code abstract generation model according to the generated function name, the generated code abstract and a target function name and a target code abstract of the code sample acquired in advance, and training the code abstract generation model according to the loss function value;
and inputting a code text sequence, a structure-based traversal algorithm sequence and a function name sequence of the target code into the trained code abstract generation model, and outputting the code abstract of the target code.
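For illustration only (not part of the claim language): a minimal PyTorch-style sketch of the architecture claim 1 describes — three LSTM encoders whose final hidden states serve as the three semantic vectors, a function name decoder that consumes two of them, and a code abstract decoder that consumes all three. All names here (CodeSummaryModel, emb, hid) and the simple concatenate-the-context decoding are assumptions of this sketch, not details fixed by the claims.

```python
import torch
import torch.nn as nn

class CodeSummaryModel(nn.Module):
    """Hypothetical sketch of claim 1: three encoders, two decoders."""
    def __init__(self, vocab_size, emb=256, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        # One LSTM encoder per input sequence (see claim 5).
        self.code_enc = nn.LSTM(emb, hid, batch_first=True)
        self.sbt_enc = nn.LSTM(emb, hid, batch_first=True)
        self.name_enc = nn.LSTM(emb, hid, batch_first=True)
        # Two LSTM decoders (see claim 6); in this sketch the context
        # vector is simply concatenated to every decoder input token.
        self.name_dec = nn.LSTM(emb + 2 * hid, hid, batch_first=True)
        self.abs_dec = nn.LSTM(emb + 3 * hid, hid, batch_first=True)
        self.name_out = nn.Linear(hid, vocab_size)
        self.abs_out = nn.Linear(hid, vocab_size)

    def forward(self, code, sbt, name, name_in, abs_in):
        # Each encoder's final hidden state is its semantic vector.
        _, (h_code, _) = self.code_enc(self.embed(code))   # (1, B, hid)
        _, (h_sbt, _) = self.sbt_enc(self.embed(sbt))
        _, (h_name, _) = self.name_enc(self.embed(name))
        # Function name decoder: code text + SBT vectors only.
        ctx2 = torch.cat([h_code, h_sbt], -1).transpose(0, 1)
        # Code abstract decoder: additionally the function name vector.
        ctx3 = torch.cat([h_code, h_sbt, h_name], -1).transpose(0, 1)
        name_x = torch.cat(
            [self.embed(name_in), ctx2.expand(-1, name_in.size(1), -1)], -1)
        abs_x = torch.cat(
            [self.embed(abs_in), ctx3.expand(-1, abs_in.size(1), -1)], -1)
        name_h, _ = self.name_dec(name_x)
        abs_h, _ = self.abs_dec(abs_x)
        return self.name_out(name_h), self.abs_out(abs_h)
```

At inference time (the last step of claim 1), greedy or beam-search decoding over the code abstract outputs would replace the teacher-forced decoder inputs used during training.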
2. The method of claim 1, wherein, before the step of obtaining the semantic vector of the code text sequence, the semantic vector of the structure-based traversal algorithm sequence and the semantic vector of the function name sequence, the method further comprises:
collecting a source code file from an open source software code hosting website by using a web crawler method;
analyzing the source code file, and extracting a code abstract of each code function from the analyzed source code file;
forming a code-abstract pair of each code function according to the code and the code abstract of each code function, and screening the code-abstract pairs of all the code functions; wherein the code of each code function is taken as a code sample;
preprocessing the code-abstract pairs of the screened code functions to obtain a code text sequence, a structure-based traversal algorithm sequence, a function name sequence and a code abstract sequence of each code sample; and taking the code abstract sequence as the sequence of the target code abstract to supervise the training of the code abstract generation model, so that the code abstract generated by the model becomes increasingly similar to the target code abstract.
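The pair-construction step of claim 2 can be pictured with a simplified extractor. The regex-based sketch below only recognizes the common "Javadoc block immediately before a method header" pattern and takes the first sentence of the Javadoc as the abstract; a production pipeline would use a real Java parser, and extract_pairs and JAVADOC_METHOD are hypothetical names.

```python
import re

# Simplified: a Javadoc block followed by a method header.
JAVADOC_METHOD = re.compile(
    r"/\*\*(?P<doc>.*?)\*/\s*"
    r"(?:public|protected|private)[^;{}]*\)\s*\{",
    re.DOTALL)

def extract_pairs(java_source: str):
    """Yield (code, abstract) pairs; the abstract is the first
    sentence of the Javadoc, as is conventional in this literature."""
    for m in JAVADOC_METHOD.finditer(java_source):
        # Strip the leading '*' decoration from each Javadoc line.
        doc = re.sub(r"^\s*\*\s?", "", m.group("doc"), flags=re.MULTILINE)
        abstract = doc.strip().split(".")[0].strip()
        if abstract:
            # m.group(0) covers only the matched header; a real
            # extractor would recover the body by brace matching.
            yield m.group(0), abstract
```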
3. The method of claim 2, wherein the step of screening the code-abstract pairs of all the code functions specifically comprises:
acquiring the length of the code abstract in each code-abstract pair, and deleting code-abstract pairs whose code abstract length is greater than a first preset threshold or less than a second preset threshold; wherein the first preset threshold is greater than the second preset threshold;
determining how the code abstract in each code-abstract pair was generated, and deleting code-abstract pairs whose code abstract was produced by an automatic generation method; the automatic generation method comprises the Eclipse automatic generation method, which automatically generates code abstracts for getter code functions;
and judging whether the code abstract in each code-abstract pair is identical to the function name of the code function, and deleting code-abstract pairs in which the code abstract and the function name are the same.
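The three screening rules of claim 3 can be summarized in one predicate. In the sketch below the thresholds (30 and 3 tokens) and the auto-generated-comment markers are illustrative assumptions; the claim fixes only the rules, not these values.

```python
AUTO_MARKERS = ("created by eclipse", "@return the")  # assumed markers

def keep_pair(abstract: str, func_name: str,
              max_len: int = 30, min_len: int = 3) -> bool:
    """Apply the three screening rules of claim 3 to one pair."""
    tokens = abstract.split()
    # Rule 1: drop abstracts longer than the first threshold or
    # shorter than the second (illustratively 30 and 3 tokens).
    if len(tokens) > max_len or len(tokens) < min_len:
        return False
    # Rule 2: drop abstracts produced by an automatic generator,
    # e.g. Eclipse's boilerplate comments on getter functions.
    if any(marker in abstract.lower() for marker in AUTO_MARKERS):
        return False
    # Rule 3: drop pairs whose abstract merely repeats the function name.
    if abstract.strip() == func_name:
        return False
    return True
```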
4. The method according to claim 2, wherein the step of preprocessing the code-abstract pairs of the screened code functions to obtain the code text sequence, the structure-based traversal algorithm sequence, the function name sequence and the code abstract sequence of each code sample specifically comprises:
extracting the text of the code of the code function from each screened code-abstract pair, replacing the function name in the text with a preset placeholder string, splitting the replaced text into symbols at separators, further splitting each symbol into sub-tokens according to the camel-case naming convention, and taking the sub-tokens of the symbols of the text in each code-abstract pair as the code text sequence corresponding to that code-abstract pair;
parsing the code of the code function in each screened code-abstract pair with an abstract syntax tree (AST) parsing tool to obtain the AST of the code function, and traversing the AST with the structure-based traversal algorithm to obtain the structure-based traversal algorithm sequence corresponding to each code-abstract pair;
extracting the function name of the code function from each screened code-abstract pair, taking the extracted function name as the target function name of the corresponding code sample, splitting the target function name according to the camel-case naming convention, and taking the split result as the function name sequence;
and tokenizing and preprocessing the code abstract in each screened code-abstract pair to obtain the code abstract sequence corresponding to each code-abstract pair.
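Two of the preprocessing steps in claim 4 benefit from a concrete sketch: camel-case splitting of symbols, and the structure-based traversal (SBT) that flattens the AST while preserving its structure in brackets. The Node class below is a hypothetical stand-in for a real AST node, and the SBT form follows the commonly cited bracketed formulation.

```python
import re
from dataclasses import dataclass, field

def camel_split(symbol: str):
    """Split a symbol by the camel-case convention,
    e.g. 'parseHttpRequest' -> ['parse', 'Http', 'Request']."""
    return re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", symbol)

@dataclass
class Node:
    """Stand-in for a real AST node produced by a parser."""
    label: str
    children: list = field(default_factory=list)

def sbt(node: Node):
    """Structure-based traversal: bracket each subtree so that the
    flattened token sequence still encodes the tree structure."""
    seq = ["(", node.label]
    for child in node.children:
        seq += sbt(child)
    seq += [")", node.label]
    return seq
```

For a tiny tree Node("MethodDeclaration", [Node("FormalParameter")]), the traversal yields ( MethodDeclaration ( FormalParameter ) FormalParameter ) MethodDeclaration — the bracketed shape that the SBT encoder consumes.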
5. The automatic code abstract generation method according to any one of claims 1 to 4, wherein the encoder comprises a function name encoder, an SBT (structure-based traversal) encoder and a code text encoder;
the input of the function name encoder is the function name sequence of the code sample, and the output is the semantic vector of the function name sequence;
the input of the SBT encoder is the structure-based traversal algorithm sequence of the code sample, and the output is the semantic vector of the structure-based traversal algorithm sequence;
the input of the code text encoder is the code text sequence of the code sample, and the output is the semantic vector of the code text sequence;
the function name encoder, the SBT encoder and the code text encoder all use an LSTM network as the encoder.
6. The method of any one of claims 1 to 4, wherein the decoder comprises a function name decoder and a code abstract decoder;
the input of the function name decoder is the semantic vector of the code text sequence and the semantic vector of the structure-based traversal algorithm sequence, and the output is the function name of the code sample;
the input of the code abstract decoder is the semantic vector of the code text sequence, the semantic vector of the structure-based traversal algorithm sequence and the semantic vector of the function name sequence, and the output is the code abstract of the code sample;
the function name decoder and the code abstract decoder both use an LSTM network as the decoder.
7. The method of any one of claims 1 to 4, wherein the loss function of the code abstract generation model is:

$$loss = \alpha \cdot loss_{cs} + \beta \cdot loss_{mnp}$$

$$loss_{cs} = -\sum_{i \in S} \sum_{j} \log P\!\left(y_j^{(cs,i)} \,\middle|\, y_{<j}^{(cs,i)}, x^{(i)}; \theta\right)$$

$$loss_{mnp} = -\sum_{i \in S} \sum_{j} \log P\!\left(y_j^{(mnp,i)} \,\middle|\, y_{<j}^{(mnp,i)}, x^{(i)}; \theta\right)$$

wherein $loss$ denotes the loss function of the code abstract generation model, $loss_{cs}$ the code abstract prediction loss, and $loss_{mnp}$ the function name prediction loss; $\alpha$ and $\beta$ are the weights of the code abstract prediction loss and the function name prediction loss, respectively; $S$ denotes all code samples in the training set; $y^{(cs,i)}$ denotes the code abstract in the $i$-th code-abstract pair and $x^{(i)}$ the code in the $i$-th code-abstract pair; $\theta$ denotes all parameters of the code abstract generation model; $P(y^{(cs,i)} \mid x^{(i)}; \theta)$ denotes the probability that the model generates $y^{(cs,i)}$ given $x^{(i)}$ and $\theta$; $y_{<j}^{(cs,i)}$ denotes the first $j$ characters of the code abstract in the $i$-th code-abstract pair; $y^{(mnp,i)}$ denotes the function name in the $i$-th code-function name pair, and $y_{<j}^{(mnp,i)}$ denotes its first $j$ characters.
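To make the weighting in claim 7 concrete, a sketch of the joint objective under teacher forcing, reusing the hypothetical CodeSummaryModel sketched after claim 1; α = 0.8 and β = 0.2 are illustrative values only, not values fixed by the claim.

```python
import torch.nn.functional as F

def joint_loss(model, batch, alpha=0.8, beta=0.2, pad_id=0):
    """loss = alpha * loss_cs + beta * loss_mnp (claim 7). Each term
    is the negative log-likelihood of the target sequence."""
    name_logits, abs_logits = model(
        batch["code"], batch["sbt"], batch["name"],
        batch["name_in"], batch["abs_in"])   # shifted decoder inputs
    loss_mnp = F.cross_entropy(              # function name prediction
        name_logits.flatten(0, 1), batch["name_out"].flatten(),
        ignore_index=pad_id)
    loss_cs = F.cross_entropy(               # code abstract prediction
        abs_logits.flatten(0, 1), batch["abs_out"].flatten(),
        ignore_index=pad_id)
    return alpha * loss_cs + beta * loss_mnp
```

Minimizing this summed cross-entropy maximizes the log-probabilities in the formulas above; for example, with α = 0.8 and β = 0.2, a batch with loss_cs = 2.5 and loss_mnp = 4.0 gives loss = 0.8·2.5 + 0.2·4.0 = 2.8.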
8. An automatic code abstract generation apparatus, characterized by comprising:
the encoding module is used for respectively encoding a code text sequence, a structure-based traversal algorithm sequence and a function name sequence of a code sample based on an encoder in a code abstract generation model to obtain a semantic vector of the code text sequence, a semantic vector of the structure-based traversal algorithm sequence and a semantic vector of the function name sequence;
the decoding module is used for decoding the semantic vector of the code text sequence and the semantic vector of the structure-based traversal algorithm sequence based on a decoder in the code abstract generation model to generate a function name of the code sample, and decoding the semantic vector of the code text sequence, the semantic vector of the structure-based traversal algorithm sequence and the semantic vector of the function name sequence to generate a code abstract of the code sample;
the calculation module is used for calculating the value of a loss function of the code abstract generation model according to the generated function name, the generated code abstract and a pre-acquired target function name and target code abstract of the code sample, and training the code abstract generation model according to the value of the loss function;
and the generating module is used for inputting the code text sequence, the structure-based traversal algorithm sequence and the function name sequence of the target code into the trained code abstract generating model and outputting the code abstract of the target code.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the automatic code abstract generation method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the automatic code abstract generation method according to any one of claims 1 to 7.
CN202010312534.3A 2020-04-20 2020-04-20 Automatic code abstract generation method and device Active CN111651198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010312534.3A CN111651198B (en) 2020-04-20 2020-04-20 Automatic code abstract generation method and device


Publications (2)

Publication Number Publication Date
CN111651198A CN111651198A (en) 2020-09-11
CN111651198B (en) 2021-04-13

Family

ID=72342663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010312534.3A Active CN111651198B (en) 2020-04-20 2020-04-20 Automatic code abstract generation method and device

Country Status (1)

Country Link
CN (1) CN111651198B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112162775A (en) * 2020-10-21 2021-01-01 南通大学 Java code annotation automatic generation method based on Transformer and mixed code expression
CN112417139B (en) * 2020-11-19 2023-07-25 深圳大学 Abstract generation method based on pre-training language model
CN113434136B (en) * 2021-06-30 2024-03-05 平安科技(深圳)有限公司 Code generation method, device, electronic equipment and storage medium
CN114185595B (en) * 2021-11-02 2024-03-29 武汉大学 Code structure guidance-based method name generation method
CN115712760B (en) * 2022-11-29 2023-04-21 哈尔滨理工大学 Binary code abstract generation method and system based on BERT model and deep equal-length convolutional neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549538A (en) * 2018-04-11 2018-09-18 深圳市腾讯网络信息技术有限公司 A kind of code detection method, device, storage medium and test terminal
US10416972B1 (en) * 2018-03-16 2019-09-17 Capital One Services, Llc Generating closures from abstract representation of source code
CN110879708A (en) * 2019-11-19 2020-03-13 安徽中科国创高可信软件有限公司 Abstract syntax tree and theorem proving-based local sensitive program analysis method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10339585B2 (en) * 2016-10-06 2019-07-02 Walmart Apollo, Llc Combined bootstrap distribution and mixture sequential probability ratio test applications to online ecommerce websites
US10332001B2 (en) * 2016-12-15 2019-06-25 WaveOne Inc. Enhanced coding efficiency with progressive representation
CN108427771B (en) * 2018-04-09 2020-11-10 腾讯科技(深圳)有限公司 Abstract text generation method and device and computer equipment
CN109977205B (en) * 2019-03-08 2021-06-22 中南大学 Method for computer to independently learn source code
CN110399162B (en) * 2019-07-09 2021-02-26 北京航空航天大学 Automatic generation method of source code annotation


Also Published As

Publication number Publication date
CN111651198A (en) 2020-09-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant