CN111857728A

CN111857728A - Code abstract generation method and device

Info

Publication number: CN111857728A
Application number: CN202010710215.8A
Authority: CN
Inventors: 陈湘萍; 黄少豪; 周晓聪; 郑子彬
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2020-07-22
Filing date: 2020-07-22
Publication date: 2020-10-30
Anticipated expiration: 2040-07-22
Also published as: CN111857728B

Abstract

The invention discloses a code abstract generating method and a device, wherein the method comprises the following steps: coding the extracted code features to obtain a plurality of state vectors; aggregating a plurality of state vectors into one aggregated vector using an attention mechanism; decoding the aggregation vector and the last output vector to obtain the current output vector and all the output vectors; using a bidirectional model to mutually link all output vectors to obtain an output vector with optimized sequence; and sequentially obtaining output words according to the output vectors optimized in the sequence, and combining all the output words according to the sequence of the output words to obtain the code abstract. And the decoder based on the bidirectional model is used for converting all output vectors into the output vectors with optimized sequence, so that all output words can be directly combined according to the sequence of the output words when the code abstract is generated to obtain the code abstract, the accuracy of the code abstract is improved, and the model effect generated by the code abstract is also improved.

Description

Code abstract generation method and device

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a code abstract generating method and device.

Background

Code summarization generation is intended to utilize natural language processing techniques in the field of existing artificial intelligence so that a computer can understand the functions of code and generate a summary describing the code functions. This technique can help programmers to read and understand code more efficiently, so that programmers can maintain and modify programs more efficiently.

The mainstream model structure for generating the code abstract at present is encoder-decoder based authentication, an encoder is used for encoding extracted code features into state vectors, an attention mechanism is used for aggregating a plurality of state vectors output by the encoder into one state vector, a decoder is used for decoding the aggregated state vector into words to be output, and finally a sentence of abstract is obtained to describe the function of an input code.

The features of the code mainly include three types, the first is a text (play text) feature of the code, the second is an Abstract Syntax Tree (AST) feature of the code, and the third is a logical execution feature of the code.

The text characteristics of the code, as the name implies, are that the text of the code is directly used as the characteristics, in the translation task of natural language processing, when translating english into chinese, the text of english is directly used as the characteristics, and in the code abstract generation task, the text of the code is also directly used as the characteristics.

The abstract syntax tree is used as the characteristic of the code, namely the structure information (middle node) of the code can be reserved, and information (leaf nodes) such as variable names, numerical values, attributes and the like of the code can be reserved.

When the code is compiled into assembly language or byte code, the computer executes the instructions in sequence according to a line, when jump instructions such as 'goto', 'jump' and the like are encountered, the computer jumps to a certain line of instructions to continue to execute the instructions in sequence, and a logic execution diagram can be constructed according to any possible sequence of instruction execution, and the diagram is the logic execution characteristic extracted from the code.

In summary, three characteristics of the code have three expressions respectively: linear sequences, trees, graphs.

The most mainstream and the best method at present are all based on the code characteristics of an abstract syntax tree.

Deep Code Comment Generation proposes a Deep Code model that converts abstract syntax trees into their Traversal sequences using a Traversal method known as Structure-based Traversal (SBT) and then converts the Code into its digest using the Seq2Seq Attention model.

The Automatic Source Code multiplication with Extended Tree-LSTM provides a Tree-based LSTM model, coding is carried out from bottom to top from leaf nodes, for each intermediate node, the output vector of the node is obtained by utilizing the output vector of the child node and the input of the node, and finally each node on an abstract syntax Tree has one output vector to finish the coding work of a coder.

A code2seq model is provided, a plurality of pairs of leaf nodes are randomly selected on an abstract syntax tree, each pair of leaf nodes can form a path, a plurality of paths are obtained through the leaf nodes, each path is coded by using an RNN model, and the coding of the paths is obtained, namely the work of a coder is finished.

The above methods are identical except for the way in which the encoders operate, and their decoders operate as follows: (1) aggregating a plurality of vectors output by an encoder into one vector by using an Attention mechanism; (2) inputting "< START >" into the decoder initially, and inputting the output word of the decoder at the previous moment at the rest time; (3) the decoder uses the combined action of the input words and the aggregation vectors to obtain output words; (4) the above process is repeated until the decoder outputs "< END >". From the above description we can find that such a decoder is generated unidirectionally, i.e. the words of a sentence are output one after the other in sequence.

Therefore, in the prior art, attention has been mostly paid to improving the quality of information obtained from codes by improving the working mode of the encoder, and attention has been rarely paid to the improvement of the decoder.

Disclosure of Invention

The invention aims to provide a code abstract generating method and a code abstract generating device, which improve the model effect and improve the accuracy of the generated code abstract through a bidirectional decoder.

In order to achieve the purpose, the invention adopts the following technical scheme:

in a first aspect, the present invention provides a code summary generating method, including:

coding the extracted code features to obtain a plurality of state vectors;

aggregating the number of state vectors into one aggregated vector using an attention mechanism;

decoding the aggregation vector and the last output vector to obtain the current output vector and all the output vectors;

using a bidirectional model to mutually link all the output vectors to obtain an output vector with optimized sequence;

and sequentially obtaining output words according to the output vectors optimized in the sequence, and combining all the output words according to the sequence of the output words to obtain the code abstract.

As an optional technical solution of the present invention, before the encoding the extracted code features to obtain a plurality of state vectors, the method further includes:

And extracting the characteristics of the code, wherein the characteristics are text characteristics, abstract syntax tree characteristics or logic execution characteristics of the code.

As an optional technical solution of the present invention, when the feature of the code is a text feature, the encoding the extracted code feature to obtain a plurality of state vectors includes:

encoding features of the code using a model capable of processing sequences, satisfying the following equation: z is a radical of₁,z₂,...,z_mEncoder (x) whereCoding model of physical sequence, x representing the characteristics of the code as input to the encoder, z₁,z₂,...,z_mM state vectors output by the model respectively.

As an optional technical solution of the present invention, the aggregating the plurality of state vectors into one aggregated vector using an attention mechanism includes:

using an attention mechanism to aggregate the m state vectors to obtain an aggregate vector context_tThe following formula is satisfied:

where the output of the v function is a vector, h_t-1Representing the t-1 th output vector, the output of the a-function is a constant, and the a-function satisfies the following equation:

the v function and the a function belong to the attention mechanism.

As an optional technical solution of the present invention, the decoding the aggregation vector and the previous output vector to obtain a current output vector, and thus obtaining all output vectors includes:

For the context_tAnd h is said_t-1Decoding to obtain the t output vector h_tThe following formula is satisfied: h is_t＝f(h_t-1,context_t) And f is a decoding function.

for the context_tAnd h is said_t-1Decoding to obtain the t output vector h_tThe following formula is satisfied: h is_t＝f(h_t-1,context_t,u(h_t-1) U) is a transformation function for transforming the t-1 th output vector h_t-1Transformed into another vector.

As an optional technical solution of the present invention, the using a bidirectional model to interact all the output vectors to obtain an output vector optimized in sequence includes:

judging whether the current output vector is the last one;

if yes, all the output vectors h are mutually linked and acted by using a bidirectional model to obtain an output vector o with optimized sequence.

As an optional technical solution of the present invention, the sequentially obtaining output words according to the sequentially optimized output vectors, and combining all the output words according to the sequence of the output words to obtain a code abstract includes:

and transforming the output vector o with the optimized sequence to obtain the word distribution d of the output words, wherein the word distribution d satisfies the following expression: d _t＝g(o_t)，o_tFor the t-th order optimized output vector, g is the transformation function model, d_tA word distribution for the t-th output word;

distribution of subordinate words d_tSelecting the word with the highest probability as the tth output word, and obtaining all the output words according to the tth output word;

and combining all the output words according to the sequence of the output words to obtain the code abstract.

In a second aspect, the present invention provides a code summary generating apparatus, including:

an extractor for extracting features of the code;

the coder is used for coding the extracted code characteristics to obtain a plurality of state vectors;

an attention mechanism module for aggregating the plurality of state vectors into one aggregated vector;

the decoder is used for decoding the aggregation vector and the last output vector to obtain the current output vector and all the output vectors; using a bidirectional model to mutually link all the output vectors to obtain an output vector with optimized sequence;

and the generator is used for sequentially obtaining output words according to the output vectors optimized in the sequence, and combining all the output words according to the sequence of the output words to obtain the code abstract.

As an optional technical solution of the present invention, the generator includes:

The transformation unit is used for transforming the output vector optimized in sequence to obtain word distribution of output words;

the screening unit is used for selecting the words with the highest probability from the word distribution as the output words and obtaining all the output words according to the output words;

and the combination unit is used for combining all the output words according to the sequence of the output words to obtain the code abstract.

Compared with the prior art, the embodiment of the invention has the following beneficial effects:

according to the code abstract generating method and device provided by the embodiment of the invention, after the code features are sequentially encoded and aggregated, all output vectors are converted into the sequentially optimized output vectors in a decoding stage by using a decoding mode based on a bidirectional model, so that all output words can be directly combined according to the sequence of the output words when the code abstract is generated, the code abstract is obtained, the accuracy of the code abstract is improved, and the model effect of code abstract generation is also improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

The structures, ratios, sizes, and the like shown in the present specification are only used for matching with the contents disclosed in the specification, so that those skilled in the art can understand and read the present invention, and do not limit the conditions for implementing the present invention, so that the present invention has no technical significance, and any structural modifications, changes in the ratio relationship, or adjustments of the sizes, without affecting the functions and purposes of the present invention, should still fall within the scope covered by the contents disclosed in the present invention.

Fig. 1 is a schematic diagram of a code digest generation method according to an embodiment of the present invention.

Detailed Description

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, the present embodiment provides a code digest generation method based on an encoder-decoder + attribute model structure.

Specifically, the input inputs the characteristics of the code, the code characteristics are three (text characteristics, abstract syntax tree characteristics or logic execution characteristics), the coding modes corresponding to different characteristics are different, and the sequence characteristics (text characteristics) are taken as an example, and are represented by x. The encoder is responsible for encoding code features into state vectors, satisfying the following equation: z is a radical of₁,z₂,...,z_mAn encoder is a coding model that can process a sequence, x represents a feature of a code as an input of the encoder, and z is₁,z₂,...,z_mM state vectors output by the model respectively.

Specifically, the encoder model may be a model for processing a sequence such as RNN, LSTM, GRU, transform, or the like.

In fig. 1, attention mechanism attention is responsible for each input to decoder. The attention mechanism attention aggregates a plurality of output vectors of the encoder to obtain an aggregation vector context, and the aggregation vector context satisfies the following expression:

in which v is a functionThe output is a vector, h_t-1Representing the t-1 th output vector.

The output of the a-function is a constant and the a-function output can be regarded as a weighting coefficient. The a function satisfies the following equation:

The above-mentioned v function and a function belong to the attention mechanism.

Further, in the uni-directional decoder, the output vector of the decoder satisfies the following equation: h is_t＝f(h_t-1,context_t,y_t-1)。

The output of the f function is a vector, h_t-1Is decoder t-1 th output vector, context_tIs the context vector, y, from attention mechanism attention_t-1Is the word vector for the t-1 output word of decoder.

Since the present embodiment is based on a bi-directional decoder, the final output vector is obtained in an optimized order. Therefore, the word vector of the t-1 th output word of decoder cannot be directly obtained.

To this end, the present embodiment provides two innovative approaches to be applied to the code summarization method based on a bi-directional decoder. The method specifically comprises the following steps:

first, h_t＝f(h_t-1,context_t) I.e. directly removing the parameter y in the f-function_t-1So that the decoder does not need to know the last word vector when generating the next output vector;

second, h_t＝f(h_t-1,context_t,u(h_t-1) U) is a transformation function for transforming the t-1 th output vector h_t-1Conversion into another vector, i.e. using u (h)_t-1) In place of y_t-1U (h) of_t-1) Machine learning mechanism optimization may be used.

Therefore, by using the two innovative methods, the output vectors can be obtained in sequence, and all the output vectors can be obtained.

It should be noted that, when the 1 st output vector h needs to be obtained₁In the f function, the first parameter h_t-1A specific vector is used instead. I.e. initially, to the decoder "<START>"(which may also be represented by 0 in fig. 1) so that the decoder uses a particular vector instead, depending on the instruction.

Is the output vector the last determined in real time?

If so, i.e. when the last output vector (h in FIG. 1) is present_n) Then all output vectors can be interacted with each other using a bi-directional model to obtain a sequentially optimized output vector.

Specifically, bi-directional models include, but are not limited to: bidirectional RNN, bidirectional LSTM, bidirectional GRU, transformer. And (4) mutually linking all the output vectors h to obtain an output vector o with optimized sequence. The output order of the order-optimized output vector o is identical to the word order of the final generated sentence, i.e. the tth output vector o of the order-optimized output vector o_tRelated to the tth word of the sentence, so the vector o can be used_tAnd obtaining the word distribution of the t-th output word, and obtaining the t-th output word of the decoder from the word distribution.

Specifically, the word distribution d of the output word can be obtained by transforming the output vector o with optimized sequence, and the expression is: d _t＝g(o_t)，o_tFor the t-th order optimized output vector, g is the transformation function model, d_tWord distribution for the t-th output word.

Then distribute d from the word_tAnd selecting the word with the highest probability as the tth output word, and obtaining all the output words according to the tth output word.

And finally, combining all the output words according to the sequence of the output words to obtain the code abstract.

In summary, with the code abstract generating method provided in the embodiments of the present invention, after the code features are sequentially encoded and aggregated, in the decoding stage, all output vectors are converted into the sequentially optimized output vectors by using a decoding manner based on a bidirectional model, so that when the code abstract is generated, all output words can be directly combined according to the sequence of the output words to obtain the code abstract, which not only improves the accuracy of the code abstract, but also improves the model effect of code abstract generation.

For example, as a specific application scenario of the embodiment:

according to our human thinking, many times we want to say a sentence, which is not generated in sequence in the brain, we will first think of several keywords, and then get the word we want to say by recombining some conjunctions and word sequences. For example, the phrase "i feel that apple is better than banana" may be thought of by first the phrases "apple", "banana", "good", "than", "feel", "i" and then through recombination. That is, more critical words have a higher probability of being thought of first.

Therefore, compared with the prior art, the code abstract generating method provided by the embodiment of the invention has the advantages of better code abstract generating effect and higher accuracy.

In another embodiment of the present application, a code summary generating apparatus is further provided, which is used to implement the code summary generating method. Specifically, the code digest generation apparatus includes:

an extractor for extracting features of the code;

the attention mechanism module is used for aggregating a plurality of state vectors into an aggregation vector;

the decoder is used for decoding the aggregation vector and the last output vector to obtain the current output vector and all the output vectors; all output vectors are mutually linked and acted by using a bidirectional model to obtain an output vector with optimized sequence;

and the generator is used for sequentially obtaining output words according to the output vectors optimized in sequence, and combining all the output words according to the sequence of the output words to obtain the code abstract.

Further, the generator includes:

the transformation unit is used for transforming the output vector optimized in sequence to obtain word distribution of the output words;

It should be noted that the specific implementation principle of the code summary generation apparatus has been explained in the above method embodiment, and is not described herein again.

According to the code abstract generating device provided by the embodiment of the invention, after the code features are sequentially encoded and aggregated, in the decoding stage, all output vectors are converted into the sequentially optimized output vectors by using the decoder based on the bidirectional model, so that when the code abstract is generated, all output words can be directly combined according to the sequence of the output words to obtain the code abstract, the accuracy of the code abstract is improved, and the model effect of code abstract generation is also improved.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for generating a code digest, comprising:

coding the extracted code features to obtain a plurality of state vectors;

2. The method of claim 1, wherein before the encoding the extracted code features to obtain a plurality of state vectors, the method further comprises:

3. The method of claim 1, wherein when the feature of the code is a text feature, the encoding the extracted feature of the code to obtain a plurality of state vectors comprises:

Encoding features of the code using a model capable of processing sequences, satisfying the following equation: z is a radical of₁,z₂,...,z_mAn encoder is a coding model that can process a sequence, x represents a feature of a code as an input of the encoder, and z is₁,z₂,...,z_mM state vectors output by the model respectively.

4. The method of generating a code summary according to claim 3, wherein the aggregating the plurality of state vectors into one aggregated vector using an attention mechanism comprises:

the v function and the a function belong to the attention mechanism.

5. The method of claim 4, wherein the decoding the aggregate vector and the last output vector to obtain a current output vector and all output vectors therefrom comprises:

for the context_tAnd h is said_t-1Decoding to obtain the t output vector h_tThe following formula is satisfied: h is _t＝f(h_t-1,context_t) And f is a decoding function.

6. The method of claim 4, wherein the decoding the aggregate vector and the last output vector to obtain a current output vector and all output vectors therefrom comprises:

7. The method according to claim 5 or 6, wherein the using a bi-directional model to relate all the output vectors to each other to obtain a sequentially optimized output vector comprises:

judging whether the current output vector is the last one;

8. The method of claim 7, wherein the obtaining output words in sequence according to the output vectors optimized in sequence, and combining all the output words according to the sequence of the output words to obtain the code abstract comprises:

9. A code digest generation apparatus, comprising:

an extractor for extracting features of the code;

10. The code digest generation apparatus of claim 9, wherein the generator includes: