CN109582296B - Program representation method based on stack enhanced LSTM - Google Patents

Program representation method based on stack enhanced LSTM

Info

Publication number
CN109582296B
Authority
CN
China
Prior art keywords
stack
lstm
hidden state
program
begin
Prior art date
Legal status
Active
Application number
CN201811220607.5A
Other languages
Chinese (zh)
Other versions
CN109582296A (en)
Inventor
李戈 (Li Ge)
金芝 (Jin Zhi)
Current Assignee
Peking University
Original Assignee
Peking University
Priority date
Filing date
Publication date
Application filed by Peking University
Priority to CN201811220607.5A
Publication of CN109582296A
Application granted
Publication of CN109582296B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/30 Creation or generation of source code
    • G06F 8/31 Programming languages or programming paradigms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Abstract

The invention provides a program representation method based on a stack-enhanced LSTM. The stack-enhanced LSTM contains a stack: when the stack-enhanced LSTM begins to access the program, its hidden state is pushed onto the stack; all characters in a code block of the program are read; the hidden state at the top of the stack is returned; the hidden state at the top of the stack and the hidden state of the previous time step are combined to obtain the context information of the program; and the program is represented based on this context information. The model outperforms the conventional standard LSTM in three program analysis tasks, namely code completion, program classification and code summary generation, which shows that capturing the hierarchical structure information of a program through the stack helps the model represent the programming language more accurately.

Description

Program representation method based on stack enhanced LSTM
Technical Field
The invention relates to the technical field of computer software engineering, and in particular to a program representation method based on a stack-enhanced LSTM.
Background
In recent years, learning representations of programming languages has become a popular research topic. Statistical language models, originally designed for natural language, are now also widely applied to programming languages. However, unlike natural languages, programming languages contain explicit, hierarchical structural information, which makes them difficult to learn with statistical language models.
Disclosure of Invention
To solve the above problem and represent such structural information, the present invention strengthens the long short-term memory network (LSTM) used to model the programming language by adding a stack as a memory component to a standard LSTM, so as to extract the hierarchical features of a program. The effectiveness of the model is verified on three program analysis tasks: code completion, program classification, and code summary generation.
Specifically, the invention provides a program representation method based on a stack-enhanced LSTM, in which:
the stack-enhanced LSTM comprises a stack; when the stack-enhanced LSTM begins to access the program, its hidden state is pushed onto the stack;
all characters in a code block of the program are read;
the hidden state at the top of the stack is returned;
the hidden state at the top of the stack and the hidden state of the previous time step are combined to obtain the context information of the program;
the program is represented based on the context information.
Preferably, the hidden state is updated by the cell function of the stack-enhanced LSTM.
Preferably, the hidden state at the top of the stack represents the information of the code that precedes the code block.
Preferably, the input vector of the current time step and the context information are used as the inputs of the cell function to update the hidden state.
Preferably, the manner of combining the hidden state at the top of the stack and the hidden state of the previous time step to obtain the context information of the program is as follows:
h_context = fc(concat(h_begin, h_{t-1}))
where h_context is the context information of the program, h_begin is the hidden state at the top of the stack, and h_{t-1} is the hidden state of the previous time step; that is, h_begin and h_{t-1} are concatenated into a new vector, which is then fed into a fully connected layer.
Preferably, the manner of combining the hidden state at the top of the stack and the hidden state of the previous time step to obtain the context information of the program is to apply max pooling over h_begin and h_{t-1}:
h_context = maxpooling(h_begin, h_{t-1})
where h_context is the context information of the program, h_begin is the hidden state at the top of the stack, and h_{t-1} is the hidden state of the previous time step.
Preferably, the manner of combining the hidden state at the top of the stack and the hidden state of the previous time step to obtain the context information of the program is as follows:
h_context = cell(h_{t-1}, h_begin)
where h_context is the context information of the program, h_begin is the hidden state at the top of the stack, and h_{t-1} is the hidden state of the previous time step; that is, the hidden state of the previous time step is used as a new input vector, the current state is set to h_begin, and the cell function is then executed to obtain h_context, where cell is the hidden state update function of the LSTM.
According to another aspect of the present invention, there is also provided a code completion method, which applies the above stack-enhanced LSTM-based program representation method to perform code completion.
According to another aspect of the present invention, there is also provided a program classification method, which applies the above stack-enhanced LSTM-based program representation method to perform program classification.
According to another aspect of the present invention, there is also provided a code summary generation method, which applies the above stack-enhanced LSTM-based program representation method to perform code summary generation.
The invention has the advantage that the model performs better than the conventional standard LSTM in all three program analysis tasks, which shows that capturing the hierarchical structure information of a program through the stack helps the model represent the programming language more accurately.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a diagram of an abstract syntax tree model of a program.
FIG. 2 is a schematic diagram of a program representation method based on stack enhanced LSTM according to the present invention.
FIG. 3 is a flow chart of the program representation algorithm of the present invention based on stack enhanced LSTM.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The invention provides a language model, and in particular a stack-enhanced LSTM for learning programs. A recurrent neural network (RNN) that uses a stack as a memory unit can capture more information in a data sequence and thereby improve network performance. The present invention strengthens the long short-term memory network (LSTM) used to model the programming language by adding a stack as a memory component to a standard LSTM, so as to extract the hierarchical features of a program. The stack is used to dynamically store and restore important information, which helps the network focus on important contextual information and discover the hierarchical structural properties of the source code.
In an LSTM, it is difficult for the hidden state to remember all context-related information, especially for very long sequences. Important contextual information may therefore be overwritten and lost while the hidden state is updated. Furthermore, the LSTM lacks the ability to capture the structural information of the data. To overcome these shortcomings of the LSTM, the invention provides a program representation method based on a stack-augmented LSTM (SA-LSTM).
Fig. 1 is a diagram of the abstract syntax tree model of a program. The arrows in the figure indicate the order in which the nodes are visited; the tree is traversed along the arrows to obtain the input characters.
FIG. 2 is a schematic diagram of the program representation method based on the stack-enhanced LSTM of the present invention. In the figure, circles indicate hidden units at different time steps. The model reads one input character at each time step. The background pattern of a hidden unit corresponds to a node in the abstract syntax tree of FIG. 1, meaning that the node in the abstract syntax tree is the input of the hidden unit with the same background pattern. A stack is used to record important information about the input. The stack memory has two main operations:
PUSH: push the current hidden state h_t onto the stack.
POP: return and remove the element at the top of the stack.
Since the hidden state represents information of the input data, the push and pop operations on the hidden state enable the SA-LSTM to dynamically store and retrieve context information during the learning process. In this manner, the model of the present invention is able to discover structural attributes of a program.
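For illustration only, this stack memory can be sketched as a small Python helper. This is a minimal, hypothetical sketch rather than the patented implementation; in particular, storing the cell state c alongside the hidden state h is an assumption.

```python
class HiddenStateStack:
    """Stack memory for SA-LSTM: one saved state per currently open code block."""

    def __init__(self):
        self._states = []  # list of (h, c) hidden/cell state tensors (keeping both is an assumption)

    def push(self, h, c):
        # PUSH: remember the state produced just before a new code block starts.
        self._states.append((h, c))

    def pop(self):
        # POP: return and remove the state at the top of the stack,
        # i.e. the context of the code preceding the block that just closed.
        return self._states.pop()
```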
In fig. 1, each non-leaf node of the abstract syntax tree represents a type of a code block.
FIG. 3 shows a flowchart of the SA-LSTM algorithm of the present invention. We use V_start and V_end to denote the start and end symbols of a code block, e.g. "{" and "}". The following three cases are included:
(1) Lines 3-5 of the algorithm: the input character is "{", that is, the SA-LSTM model accesses a non-leaf node and prepares to learn a new code block, so we push the hidden state onto the stack. The hidden state is then updated with the cell function of the LSTM.
(2) Lines 6-9 of the algorithm: the input character is "}", that is, the model has read all characters in the code block and reached its end. We pop the stack to obtain the hidden state h_begin, which represents the information of the code preceding the code block. Then, as shown in FIG. 2, h_begin and h_{t-1} are combined by the alpha function to obtain the key information. h_begin is the hidden state generated at the beginning of the code block, and h_{t-1} is the hidden state of the previous time step, i.e. the hidden state generated at the end of the code block. We store this valuable information in h_context. Then the input vector x_t at time step t and h_context are used as the inputs of the cell function to update the hidden state.
(3) Lines 10-11 of the algorithm: for other cases we still take the same steps as the standard LSTM network to update the hidden state.
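For concreteness, the three cases can be sketched as a single per-time-step update in Python. This is an illustrative sketch under stated assumptions, not the patented code: it assumes PyTorch's LSTMCell as the cell function, the HiddenStateStack sketch above, and an alpha combiner passed in; how the cell state c is handled at a block boundary is an assumption, since the text only specifies the hidden states.

```python
import torch.nn as nn

class SALSTM(nn.Module):
    """Illustrative sketch of the stack-augmented LSTM update of FIG. 3."""

    def __init__(self, input_size, hidden_size, alpha):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)  # the LSTM cell function
        self.alpha = alpha                                 # trainable combiner of h_begin and h_{t-1}
        self.stack = HiddenStateStack()                    # stack memory from the sketch above

    def step(self, token, x_t, h_prev, c_prev):
        if token == "{":                                   # case (1): a new code block starts
            self.stack.push(h_prev, c_prev)                # remember the pre-block context
            return self.cell(x_t, (h_prev, c_prev))
        if token == "}":                                   # case (2): the code block ends
            h_begin, _c_begin = self.stack.pop()
            h_context = self.alpha(h_begin, h_prev)        # keep only the key information
            return self.cell(x_t, (h_context, c_prev))     # reusing c_prev here is an assumption
        return self.cell(x_t, (h_prev, c_prev))            # case (3): standard LSTM update
```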
In the present invention, the alpha function is any trainable function designed to extract important context information once the network has processed a complete code block. Different methods can be used to implement the alpha function. With the aim of extracting valuable context information from h_begin and h_{t-1}, the invention uses the following alpha functions:
(i) h_context = fc(concat(h_begin, h_{t-1}))
that is, h_begin and h_{t-1} are concatenated into a new vector, which is then fed into a fully connected layer.
(ii) h_context = maxpooling(h_begin, h_{t-1})
that is, max pooling is applied between h_begin and h_{t-1}.
(iii) h_context = cell(h_{t-1}, h_begin)
that is, the hidden state h_{t-1} of the previous time step can serve as a "summary" of the context; it is taken as a new input vector, the current state is set to h_begin, and the cell function is then executed to obtain h_context, where cell is the hidden state update function of the LSTM.
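The three alpha functions could be written, for illustration, as the following Python sketches. The hidden size of 100 and the use of a zero cell state in variant (iii) are assumptions made only for this sketch; the text fixes only the formulas above.

```python
import torch
import torch.nn as nn

hidden_size = 100  # assumed size for this sketch

# (i) concatenate h_begin and h_{t-1}, then apply a fully connected layer
fc = nn.Linear(2 * hidden_size, hidden_size)
def alpha_concat_fc(h_begin, h_prev):
    return fc(torch.cat([h_begin, h_prev], dim=-1))

# (ii) element-wise max pooling over the two hidden states
def alpha_maxpool(h_begin, h_prev):
    return torch.max(h_begin, h_prev)

# (iii) run the cell function once more: h_{t-1} is the "summary" input,
# and h_begin is restored as the current hidden state before the update
cell = nn.LSTMCell(hidden_size, hidden_size)
def alpha_cell(h_begin, h_prev):
    # the formula involves only hidden states, so using a zero cell state is an assumption
    h_context, _ = cell(h_prev, (h_begin, torch.zeros_like(h_begin)))
    return h_context
```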
Experiments show that the third alpha function works better than the first two. Its operation can be explained as follows: when the model has read all the characters in a code block, it forgets the details of that block as if it had never seen them; someone then tells the model the key information it has forgotten, the model stores this information in its hidden state, and the learning process continues.
The invention clears only the irrelevant memory at the end of a code block. The implementation details of the code blocks can still be learned by the model, since learning proceeds by optimizing the network parameters, and these parameters are shared throughout the learning process. In this way, the SA-LSTM is able to learn all the characters in the program while the hidden state remembers only the important context information.
In a preferred embodiment of the present invention, the SA-LSTM network dynamically remembers the relevant context as the input characters are fed into the network in sequence. In this manner, the model of the present invention captures the hierarchical structure information of the source code. Furthermore, the memory burden on the hidden state is reduced to a certain extent, which alleviates the long-term dependency problem.
Experiment and results
To demonstrate the technical effect of the invention, controlled experiments were also conducted. The effectiveness of the model is verified on three program analysis tasks: code completion, program classification, and code summary generation.
(1) Code completion
Code completion can be regarded as a prediction task: given a partial context, predict the next character. The features of the partial context can be extracted by the SA-LSTM model of the present invention. The experiments use C language code and Python code from open-source databases and systems, where the Python code comes from the GitHub open source platform, selecting Python projects with more than five stars. The data are then randomly split for training, validation, and testing. The results are as follows:
TABLE 1
Table 1 shows the accuracy of the standard LSTM and of the SA-LSTM model of the present invention in predicting the next terminal or non-terminal character of the C language code and Python code described above. Non-terminal characters carry the structural information of the program, while terminal characters carry its semantic information. As shown in Table 1, the prediction accuracy of the SA-LSTM model of the present invention is higher than that of the conventional LSTM model. The prediction results on non-terminal characters show that the model can learn the structural information of the program source code, which is very important for the code completion task. The prediction results on terminal characters show that the model can learn the semantic information of the program source code. The results in Table 1 show that the present invention captures the hierarchical structure information of programs better than the prior art.
(2) Program classification
Classifying programs according to their function is very important in software engineering, and the model of the invention can be applied to program classification tasks. The hidden state at the last time step contains the information of the complete program, so the present invention uses it as the representation and feature for program classification. In this task, the experiments use C language code from open-source databases and systems. The data are then randomly split for training, validation, and testing. The results are as follows:
TABLE 2
The results in Table 2 show that the accuracy of the SA-LSTM model of the present invention is higher than that of the conventional LSTM model. By adding a stack as a memory component, the SA-LSTM model of the present invention can extract hierarchical structural features, which helps the model understand the functionality of the program and generate a more accurate representation of it. The model of the present invention therefore performs better.
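As a purely illustrative sketch of this setup, the final hidden state produced by the SA-LSTM can be fed to a linear classification head; the number of classes and the surrounding training loop are assumptions, not details given in the text.

```python
import torch.nn as nn

class ProgramClassifier(nn.Module):
    """Illustrative head: the final SA-LSTM hidden state is used as the program representation."""

    def __init__(self, hidden_size, num_classes):
        super().__init__()
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, h_last):
        # h_last: hidden state obtained after the last character of the program has been read
        return self.head(h_last)  # class logits describing the program's functionality
```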
(3) Code summary generation
Generating a natural language summary from source code is of great value in many software applications, such as code search, and this task requires understanding the structure of the program. The present invention applies the proposed model to this task and evaluates its performance. The invention adopts a sequence-to-sequence architecture and uses the SA-LSTM model as the encoder, which provides a better code representation. The invention uses three sparse data sets: JOBS, GEO, and ATIS. These data sets contain natural language queries and their logical representations; the logical representations are very similar to programming languages in that both have an explicit hierarchical structure. In the model of the invention, the SA-LSTM is used as the encoder and a standard LSTM as the decoder, while the baseline LSTM model uses a standard LSTM as both encoder and decoder; both models use an attention mechanism. We use a two-layer network with a hidden unit size of 100 and an embedding size of 100, and separate vocabularies are employed for the encoder and the decoder. The models are trained with Adam optimization at a base learning rate of 1e-3. BLEU is an automatic metric for evaluating the quality of generated text, and BLEU-4 is used to evaluate the generated summaries. The results are as follows:
TABLE 3
Table 3 shows that the BLEU scores of the SA-LSTM model are better on all three data sets than those of the plain sequence-to-sequence model. This shows that, by capturing the hierarchical structure information of programs, the SA-LSTM model of the present invention understands programs better and extracts better program summaries.
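For reference, the training setup described above can be collected into a small configuration sketch; only the numbers stated in the text are taken from it, while the key names and structure are illustrative assumptions.

```python
# Illustrative summary of the code summary generation experiment setup described above.
config = {
    "encoder": "SA-LSTM",        # stack-augmented LSTM over the source tokens
    "decoder": "LSTM",           # standard LSTM decoder with an attention mechanism
    "num_layers": 2,
    "hidden_size": 100,
    "embedding_size": 100,
    "shared_vocab": False,       # separate vocabularies for encoder and decoder
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "metric": "BLEU-4",
    "datasets": ["JOBS", "GEO", "ATIS"],
}
```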
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the creation apparatus of a virtual machine according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (5)

1. A program representation method based on a stack-enhanced LSTM, characterized in that:
the stack-enhanced LSTM comprises a stack;
V_start and V_end are used to represent the start and end symbols of a code block, and the following three cases are included:
(1) if the input character is "{", the hidden state is pushed onto the stack, and the hidden state is then updated by the cell function of the LSTM;
(2) if the input character is "}", the stack is popped to obtain the hidden state h_begin, which represents the information of the code preceding the code block; h_begin and h_{t-1} are then combined by the alpha function to obtain the key information, where h_begin is the hidden state generated at the beginning of the code block and h_{t-1} is the hidden state of the previous time step, i.e. the hidden state generated at the end of the code block; the key information is stored in h_context, and the input vector x_t at time step t and h_context are then used as the inputs of the cell function to update the hidden state;
(3) in all other cases, the hidden state is updated by the same steps as a standard LSTM network;
with the aim of extracting valuable context information from h_begin and h_{t-1}, the following alpha functions are used to extract the context information:
(i) h_context = fc(concat(h_begin, h_{t-1}))
that is, h_begin and h_{t-1} are concatenated into a new vector, which is then fed into a fully connected layer;
(ii) h_context = maxpooling(h_begin, h_{t-1})
that is, max pooling is applied between h_begin and h_{t-1};
(iii) h_context = cell(h_{t-1}, h_begin)
that is, the hidden state h_{t-1} of the previous time step serves as a "summary" of the context; it is taken as a new input vector, the current state is set to h_begin, and the cell function is then executed to obtain h_context, where cell is the hidden state update function of the LSTM;
the program is represented based on the context information.
2. The stack-enhanced LSTM-based program representation method of claim 1, wherein:
the input vector for the current time step and the context information are used as inputs to the unit function to update the hidden state.
3. A code completion method, characterized in that: the stack-enhanced LSTM-based program representation method as claimed in claim 1 or 2 is applied to perform code completion.
4. A program classification method, characterized in that: the stack-enhanced LSTM-based program representation method as claimed in claim 1 or 2 is applied to perform program classification.
5. A code summary generation method, characterized in that: the stack-enhanced LSTM-based program representation method as claimed in claim 1 or 2 is applied to perform code summary generation.
CN201811220607.5A 2018-10-19 2018-10-19 Program representation method based on stack enhanced LSTM Active CN109582296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811220607.5A CN109582296B (en) 2018-10-19 2018-10-19 Program representation method based on stack enhanced LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811220607.5A CN109582296B (en) 2018-10-19 2018-10-19 Program representation method based on stack enhanced LSTM

Publications (2)

Publication Number Publication Date
CN109582296A CN109582296A (en) 2019-04-05
CN109582296B true CN109582296B (en) 2020-12-18

Family

ID=65920527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811220607.5A Active CN109582296B (en) 2018-10-19 2018-10-19 Program representation method based on stack enhanced LSTM

Country Status (1)

Country Link
CN (1) CN109582296B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330032A (en) * 2017-06-26 2017-11-07 北京理工大学 A kind of implicit chapter relationship analysis method based on recurrent neural network
CN108388425A (en) * 2018-03-20 2018-08-10 北京大学 A method for automatic code completion based on LSTM

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10275704B2 (en) * 2014-06-06 2019-04-30 Google Llc Generating representations of input sequences using neural networks
US11222253B2 (en) * 2016-11-03 2022-01-11 Salesforce.Com, Inc. Deep neural network model for processing data through multiple linguistic task hierarchies

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330032A (en) * 2017-06-26 2017-11-07 北京理工大学 A kind of implicit chapter relationship analysis method based on recurrent neural network
CN108388425A (en) * 2018-03-20 2018-08-10 北京大学 A method for automatic code completion based on LSTM

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets; Armand Joulin et al.; https://arxiv.org/abs/1503.01007; 2015-06-01; pp. 2-13 *
Application of deep learning methods in software analysis (深度学习方法在软件分析中的应用); Zhang Xian et al.; Computer Engineering & Science (计算机工程与科学); 2017-12-31; Vol. 39, No. 12; pp. 2260-2268 *

Also Published As

Publication number Publication date
CN109582296A (en) 2019-04-05

Similar Documents

Publication Publication Date Title
JP7193252B2 (en) Captioning image regions
CN111027327A (en) Machine reading understanding method, device, storage medium and device
CN110033018B (en) Graph similarity judging method and device and computer readable storage medium
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
CN112347284B (en) Combined trademark image retrieval method
CN111274797A (en) Intention recognition method, device and equipment for terminal and storage medium
CN112528637A (en) Text processing model training method and device, computer equipment and storage medium
CN112084435A (en) Search ranking model training method and device and search ranking method and device
CN110990563A (en) Artificial intelligence-based traditional culture material library construction method and system
CN113283336A (en) Text recognition method and system
CN114647713A (en) Knowledge graph question-answering method, device and storage medium based on virtual confrontation
CN115344699A (en) Training method and device of text classification model, computer equipment and medium
CN116881470A (en) Method and device for generating question-answer pairs
CN116909435A (en) Data processing method and device, electronic equipment and storage medium
CN109582296B (en) Program representation method based on stack enhanced LSTM
CN116304014A (en) Method for training entity type recognition model, entity type recognition method and device
CN114840680A (en) Entity relationship joint extraction method, device, storage medium and terminal
CN115454423A (en) Static webpage generation method and device, electronic equipment and storage medium
CN114995729A (en) Voice drawing method and device and computer equipment
CN114970467A (en) Composition initial draft generation method, device, equipment and medium based on artificial intelligence
CN111045836B (en) Search method, search device, electronic equipment and computer readable storage medium
Bajpai et al. Custom dataset creation with tensorflow framework and image processing for google t-rex
CN105808522A (en) Method and apparatus for semantic association
CN117573839B (en) Document retrieval method, man-machine interaction method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant