WO2021017025A1 - Method for automatically generating python codes from natural language - Google Patents

Method for automatically generating python codes from natural language

Info

Publication number
WO2021017025A1
Authority
WO
WIPO (PCT)
Prior art keywords
natural language
abstract syntax
syntax tree
generator
discriminator
Prior art date
Application number
PCT/CN2019/099733
Other languages
French (fr)
Chinese (zh)
Inventor
祝亚兵 (Zhu Yabing)
张岩峰 (Zhang Yanfeng)
Original Assignee
东北大学 (Northeastern University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 东北大学 (Northeastern University)
Publication of WO2021017025A1 publication Critical patent/WO2021017025A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/30 - Creation or generation of source code
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/40 - Transformation of program code
    • G06F 8/41 - Compilation
    • G06F 8/43 - Checking; Contextual analysis
    • G06F 8/436 - Semantic checking

Definitions

  • The invention belongs to the technical field of natural language processing and specifically relates to a method for automatically generating Python code from natural language.
  • Semantic parsing is a class of tasks in natural language processing that studies how to convert a given natural language description into a logical representation that a computer can understand and execute, such as SQL, Python, or Java.
  • The traditional approach designs fixed templates based on the characteristics of the programming language and then uses pattern matching to parse the natural language description into instances of those templates.
  • In view of these problems, the present invention proposes a method for automatically generating Python code from natural language.
  • The invention aims to use a discriminator to improve how well the generator produces program fragments from a natural language description, and to learn the relationship between the distributions of natural language and programming language.
  • The steps of the method are as follows:
  • Step 1: Use the generator of the GAN to generate the abstract syntax tree of a program fragment from the natural language description.
  • The generator is an Encoder-Decoder deep learning framework.
  • The Encoder encodes the natural language description sequence.
  • The Decoder decodes the semantics of the natural language description into the abstract syntax tree of a program fragment, based on the Encoder's encoding.
  • Step 1.1: Use a bidirectional LSTM network as the Encoder to encode the natural language description sequence.
  • Step 1.1.1: Encode the natural language description sequence from left to right and from right to left to obtain the forward and backward hidden vectors h_i→ and h_i← of each character.
  • Step 1.1.2: Concatenate the two hidden vectors into h_i = [h_i→; h_i←], the encoding vector of each natural language character, and save each character's encoding vector for later use by the Decoder.
  • Step 1.1.3: Use the hidden vector of the last character as the initial state h_end of the Decoder.
  • Step 1.2: Use a unidirectional LSTM network as the Decoder, and decode the natural language semantics encoded by the Encoder into the abstract syntax tree of the program.
  • This step introduces the grammar rules of the programming language into the generation process.
  • The abstract syntax tree is generated in depth-first traversal order.
  • Each generation step is the application of a context-free grammar production.
  • The grammar rules provide prior knowledge for generating the abstract syntax tree and shrink the search space.
  • Step 1.2.1: Use h_end from 1.1.3 as the initial state of the Decoder, use the attention mechanism to compute the content vector of h_end, and feed that content vector to the LSTM as input.
  • Step 1.2.2: Apply Softmax to the LSTM output of 1.2.1 for multi-class classification; the classes correspond to actions that build the abstract syntax tree.
  • Step 1.2.3: The actions produced in 1.2.2 fall into two types: generating leaf nodes and generating non-leaf nodes.
  • An action that generates a non-leaf node is a context-free grammar expansion, while an action that generates a leaf node emits a concrete character, i.e. a sequence character of the program fragment, which can either be copied from the natural language description sequence or generated by the model.
  • Step 1.2.4: Apply the actions of 1.2.3 in depth-first traversal order to construct the abstract syntax tree.
  • Step 1.2.5: Feed the output of 1.2.4 back as the input of 1.2.1 and repeat 1.2.1 through 1.2.4 until a complete abstract syntax tree is obtained, i.e. the abstract syntax tree of the program fragment whose semantics correspond to the natural language description.
  • Step 1.2.6: Parse the abstract syntax tree into a program fragment.
  • Step 2: Use the GAN's discriminator to judge whether the semantics of the abstract syntax tree produced by the generator are consistent with the semantics of the given natural language description; this also imposes a strong semantic constraint on the generator.
  • The data for training the discriminator falls into three types: (A) a natural language description from the training data and the abstract syntax tree of its corresponding program;
  • (B) a given natural language description and the abstract syntax tree produced by the generator;
  • (C) a natural language description sequence and the abstract syntax tree of an unrelated program.
  • Training data of type A is labeled consistent, while training data of types B and C is labeled inconsistent.
  • Step 2.1: Encode the natural language description sequence with the Encoder of the GAN generator; this step only needs the final semantic vector.
  • Step 2.2: Use a tree LSTM network to encode the abstract syntax tree bottom-up, all the way to its root node, whose vector is the semantic vector of the abstract syntax tree.
  • Step 2.3: Multiply the natural language semantic vector of 2.1 with the semantic vector of the abstract syntax tree of 2.2.
  • Step 2.4: Repeat 2.1 through 2.3 for training data B and training data C of step 2.
  • Step 2.5: Run binary classification on the training data pairs of 2.4 to judge, in all three cases, whether the semantics of the natural language and of the program's abstract syntax tree are consistent.
  • Step 3: Train GANCoder, i.e. train the generator and the discriminator of the GAN together. During optimization the generator and the discriminator are optimized alternately. Before joint training, the generator and the discriminator are pre-trained separately, and then trained together adversarially.
  • The model GANCoder produced by this method contains two parts, a generator and a discriminator, where the generator is responsible for generating programming language program fragments from natural language, and the discriminator identifies the program fragments produced by the generator.
  • During training, the generator and the discriminator play an adversarial game and improve each other; in the end the discriminator cannot tell whether a program fragment comes from the original training set or was produced by the generator.
  • The present invention trains a code generation system through generative adversarial optimization; given a user's natural language description of a function, the system can generate a piece of program code with that function.
  • Compared with traditional optimization methods, adversarial game training lets the generator learn the language models of natural language and programming language more effectively.
  • Figure 1 is a semantic parser based on the Encoder-Decoder model.
  • Figure 2 is the abstract syntax tree corresponding to a Python program.
  • Figure 3 is the overall framework of the GANCoder of the present invention.
  • Figure 4 is a diagram of the framework of the GANCoder generator.
  • Figure 5 shows the use of a tree LSTM network to encode the abstract syntax tree.
  • The proposed GANCoder system is, overall, a generative adversarial network comprising two parts, a generator and a discriminator, as shown in Figure 3.
  • The generator is an Encoder-Decoder model.
  • The Encoder, a bidirectional LSTM network, encodes the natural language description sequence, while the Decoder, a unidirectional LSTM network, decodes the semantics encoded by the Encoder into the program's abstract syntax tree; the discriminator is mainly responsible for judging whether the semantics of the natural language description and of the abstract syntax tree are consistent.
  • The natural language description is encoded with the generator's Encoder, while the abstract syntax tree is encoded with a tree LSTM network.
  • The program's abstract syntax tree is encoded bottom-up, and the encoding vector of its root node is the semantic vector of the abstract syntax tree.
  • Step 1: Use the generator of the GAN to generate the abstract syntax tree of a program fragment from the natural language description.
  • The generator is an Encoder-Decoder deep learning model, as shown in Figure 4.
  • The left side of the figure is the Encoder, a bidirectional LSTM network responsible for encoding the natural language description sequence; the right side is the Decoder, a unidirectional LSTM network that decodes the semantics of the natural language description into the abstract syntax tree of a program fragment, based on the Encoder's encoding.
  • Step 1.1: Use a bidirectional LSTM network as the Encoder to encode the natural language description sequence.
  • The two directions in the Encoder of Figure 4 represent the encoding order of the LSTM network.
  • Step 1.1.1: Encode the natural language description sequence from left to right and from right to left to obtain the forward and backward hidden vectors h_i→ and h_i← of each character, as shown by the two encoding directions of the LSTM network in the Encoder of Figure 4.
  • Step 1.1.2: Concatenate the two hidden vectors of 1.1.1 into h_i = [h_i→; h_i←], the encoding vector of each natural language character, and save each character's encoding vector for later use by the Decoder.
  • Step 1.1.3: Use the hidden vector of the last character as the initial state h_end of the Decoder.
  • Step 1.2: Use a unidirectional LSTM network as the Decoder, and decode the natural language semantics encoded by the Encoder into the abstract syntax tree of the program.
  • This step introduces the grammar rules of the programming language into the code generation process.
  • The abstract syntax tree is generated in depth-first traversal order.
  • Each generation step is the application of a context-free grammar production.
  • The grammar rules provide prior knowledge for generating the abstract syntax tree and shrink the search space.
  • Step 1.2.1: As shown in Figure 4, the Decoder takes h_end from 1.1.3 as its initial state, uses the attention mechanism to compute the content vector C1 of h_end, and feeds that content vector to the LSTM as input.
  • Step 1.2.2: Apply Softmax to the LSTM output for multi-class classification; the classes correspond to actions that build the abstract syntax tree, matching the nodes of the abstract syntax tree on the right of Figure 2.
  • Step 1.2.3: The actions predicted in 1.2.2 fall into two types, generating leaf nodes and generating non-leaf nodes, i.e. the leaf and non-leaf nodes of the abstract syntax tree in Figure 2.
  • An action that generates a non-leaf node is a context-free grammar expansion, each corresponding to one context-free grammar rule; an action that generates a leaf node emits a concrete character, i.e. a sequence character of the program fragment, which can either be copied from the natural language description sequence or generated by the model.
  • Step 1.2.4: Apply the actions predicted in 1.2.3 in depth-first traversal order to construct the abstract syntax tree.
  • The order indicated by the solid arrows over the nodes of the abstract syntax tree in Figure 2 is the order in which each node of the tree is built.
  • Step 1.2.5: Feed the output of 1.2.4 back as the input of 1.2.1. As shown in Figure 2, the information of the previous node is passed to the next node; this information includes the state of the previous step, indicated by the solid arrows, and the information of the parent node, conveyed by the dotted arrows. Then repeat 1.2.1 through 1.2.4 until a complete abstract syntax tree is obtained, i.e. the abstract syntax tree of the program fragment whose semantics correspond to the natural language description.
  • Step 1.2.6: Parse the complete abstract syntax tree into a program fragment.
  • Step 2: Use the GAN's discriminator to judge whether the semantics of the abstract syntax tree produced by the generator are consistent with the semantics of the given natural language description; this also imposes a strong semantic constraint on the generator.
  • The data for training the discriminator falls into three types: (1) a natural language description from the training data and the abstract syntax tree of its corresponding program; (2) a given natural language description and the abstract syntax tree produced by the generator; (3) a natural language description sequence and the abstract syntax tree of an unrelated program. Type 1 is labeled consistent, while types 2 and 3 are labeled inconsistent.
  • Step 2.1: Encode the natural language description sequence with the Encoder of the GAN generator; this step only needs the final semantic vector. The Encoder structure is shown in Figure 4.
  • Step 2.2: Use a tree LSTM network, as shown in Figure 5, to encode the abstract syntax tree bottom-up.
  • The child nodes of the abstract syntax tree are the input to the encoding of their parent node, and encoding proceeds up to the root node of the abstract syntax tree, whose vector is the semantic vector corresponding to the tree.
  • Step 2.3: Multiply the natural language semantic vector of 2.1 with the semantic vector of the abstract syntax tree of 2.2.
  • Step 2.4: Repeat 2.1 through 2.3 for training data 2 and training data 3 of step 2.
  • Step 2.5: Run binary classification on the training data pairs of 2.4 to judge, in all three cases, whether the semantics of the natural language and of the program's abstract syntax tree are consistent.
  • Step 3: Train GANCoder, i.e. train the generator and the discriminator of the GAN together. During optimization they are optimized alternately. Before joint training, the generator and the discriminator are pre-trained, and then trained together adversarially; as shown in Figure 3, the discriminator's signal is fed back to the generator.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

A method for automatically generating Python code from a natural language, belonging to the technical field of natural language processing. The steps of the method are as follows: step 1: generating, by means of the generator of a GAN, the abstract syntax tree of a program fragment from a natural language description; step 2: judging, by means of the discriminator of the GAN, whether the semantics of the abstract syntax tree generated by the generator are consistent with the semantics of the given natural language description; and step 3: training the generator and the discriminator of the GAN together. The method trains a code generation system through generative adversarial optimization; given a user's natural language description of a function, the system can generate program code with that function. Compared with traditional optimization methods, adversarial game training with a generative adversarial network lets the generator learn the language models of natural languages and programming languages more effectively.

Description

A Method for Automatically Generating Python Code from Natural Language
Technical Field
The invention belongs to the technical field of natural language processing and specifically relates to a method for automatically generating Python code from natural language.
Background Art
Semantic parsing is a class of tasks in natural language processing that studies how to convert a given natural language description into a logical representation that a computer can understand and execute, such as SQL, Python, or Java. The traditional approach designs fixed templates based on the characteristics of the programming language and then uses pattern matching to parse the natural language description into instances of those templates. With the development of deep learning, frameworks such as Encoder-Decoder have also been introduced into semantic parsing, for example translating the natural language description sequence directly into a programming language sequence with machine translation methods, or introducing the grammar of the programming language during code generation, first generating the program's abstract syntax tree and then converting that tree into program code. However, when such an Encoder-Decoder model handles the conversion from natural language to programming language, the Encoder and the Decoder each process a different language; because they use different neural networks, and because of the depth of the networks, the semantics of the natural language description are gradually lost during program code generation, so a training model with a strong semantic constraint is lacking.
Summary of the Invention
In view of the above problems, the present invention proposes a method for automatically generating Python code from natural language. The invention aims to use a discriminator to improve how well the generator produces program fragments from a natural language description, and to learn the relationship between the distributions of natural language and programming language.
The technical solution of the present invention is as follows:
A method for automatically generating Python code from natural language, with the following steps:
Step 1: Use the generator of the GAN to generate the abstract syntax tree of a program fragment from the natural language description.
The generator is an Encoder-Decoder deep learning framework. The Encoder encodes the natural language description sequence, and the Decoder decodes the semantics of the natural language description into the abstract syntax tree of a program fragment, based on the Encoder's encoding.
Step 1.1: Use a bidirectional LSTM network as the Encoder to encode the natural language description sequence.
Step 1.1.1: Encode the natural language description sequence from left to right and from right to left to obtain the forward and backward hidden vectors h_i→ and h_i← of each character.
Step 1.1.2: Concatenate the two hidden vectors into h_i = [h_i→; h_i←], the encoding vector of each natural language character, and save each character's encoding vector for later use by the Decoder.
Step 1.1.3: Use the hidden vector of the last character as the initial state h_end of the Decoder.
Step 1.2: Use a unidirectional LSTM network as the Decoder, and decode the natural language semantics encoded by the Encoder into the abstract syntax tree of the program.
This step introduces the grammar rules of the programming language into the generation process. The abstract syntax tree is generated in depth-first traversal order, and each generation step is the application of a context-free grammar production. The grammar rules provide prior knowledge for generating the abstract syntax tree and shrink the search space.
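For illustration (not part of the claimed method), a few productions from Python's abstract grammar (ASDL) show the kind of prior knowledge involved; each decoding step selects one such production for a non-leaf node or emits a terminal token for a leaf:

```python
# Illustrative context-free productions from Python's abstract grammar
# (ASDL); the decoder's action space is built from rules of this kind.
PRODUCTIONS = [
    "stmt -> Assign(expr* targets, expr value)",
    "stmt -> Return(expr? value)",
    "expr -> Call(expr func, expr* args, keyword* keywords)",
    "expr -> Name(identifier id, expr_context ctx)",
]
```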
Step 1.2.1: Use h_end from 1.1.3 as the initial state of the Decoder, use the attention mechanism to compute the content vector of h_end, and feed that content vector to the LSTM as input.
Step 1.2.2: Apply Softmax to the LSTM output of 1.2.1 for multi-class classification; the classes correspond to actions that build the abstract syntax tree.
Step 1.2.3: The actions produced in 1.2.2 fall into two types: generating leaf nodes and generating non-leaf nodes.
An action that generates a non-leaf node is a context-free grammar expansion, while an action that generates a leaf node emits a concrete character, i.e. a sequence character of the program fragment, which can either be copied from the natural language description sequence or generated by the model.
Step 1.2.4: Apply the actions of 1.2.3 in depth-first traversal order to construct the abstract syntax tree.
Step 1.2.5: Feed the output of 1.2.4 back as the input of 1.2.1 and repeat 1.2.1 through 1.2.4 until a complete abstract syntax tree is obtained, i.e. the abstract syntax tree of the program fragment whose semantics correspond to the natural language description.
Step 1.2.6: Parse the abstract syntax tree into a program fragment.
Step 2: Use the GAN's discriminator to judge whether the semantics of the abstract syntax tree produced by the generator are consistent with the semantics of the given natural language description; this also imposes a strong semantic constraint on the generator. The data for training the discriminator falls into three types: (A) a natural language description from the training data and the abstract syntax tree of its corresponding program; (B) a given natural language description and the abstract syntax tree produced by the generator; (C) a natural language description sequence and the abstract syntax tree of an unrelated program. Training data of type A is labeled consistent, while training data of types B and C is labeled inconsistent.
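A minimal sketch of assembling these three pair types, assuming a dataset of (description, AST) pairs and a trained generator (function and variable names are illustrative):

```python
import random

def make_discriminator_pairs(dataset, generator):
    """dataset: list of (nl_description, gold_ast) pairs."""
    pairs = []
    for nl, gold_ast in dataset:
        pairs.append((nl, gold_ast, 1))           # type A: consistent
        pairs.append((nl, generator(nl), 0))      # type B: generated
        _, unrelated = random.choice(dataset)     # type C: unrelated program
        pairs.append((nl, unrelated, 0))          # (in practice, exclude the
    return pairs                                  #  matching program itself)
```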
Step 2.1: Encode the natural language description sequence with the Encoder of the GAN generator; this step only needs the final semantic vector.
Step 2.2: Use a tree LSTM network to encode the abstract syntax tree bottom-up, all the way to its root node, whose vector is the semantic vector corresponding to the abstract syntax tree.
Step 2.3: Multiply the natural language semantic vector of 2.1 with the semantic vector of the abstract syntax tree of 2.2.
Step 2.4: Repeat 2.1 through 2.3 for training data B and training data C of step 2.
Step 2.5: Run binary classification on the training data pairs of 2.4 to judge, in all three cases, whether the semantics of the natural language and of the program's abstract syntax tree are consistent.
Step 3: Train GANCoder, i.e. train the generator and the discriminator of the GAN together. During optimization the generator and the discriminator are optimized alternately. Before joint training, the generator and the discriminator are pre-trained separately, and then trained together adversarially.
Further, the model GANCoder produced by this method contains two parts, a generator and a discriminator, where the generator is responsible for generating programming language program fragments from natural language, and the discriminator identifies the program fragments produced by the generator. During training, the generator and the discriminator play an adversarial game and improve each other; in the end the discriminator cannot tell whether a program fragment comes from the original training set or was produced by the generator.
The beneficial effects of the present invention:
The present invention trains a code generation system through generative adversarial optimization; given a user's natural language description of a function, the system can generate a piece of program code with that function. Compared with traditional optimization methods, adversarial game training with a generative adversarial network lets the generator learn the language models of natural language and programming language more effectively.
Description of the Drawings
Figure 1 is a semantic parser based on the Encoder-Decoder model.
Figure 2 is the abstract syntax tree corresponding to a Python program.
Figure 3 is the overall framework of the GANCoder of the present invention.
Figure 4 is a diagram of the framework of the GANCoder generator.
Figure 5 shows the use of a tree LSTM network to encode the abstract syntax tree.
Detailed Description of the Embodiments
The specific embodiments of the present invention are described in detail below in conjunction with the technical solution and the drawings.
In the method for automatically generating Python code from natural language, the proposed GANCoder system is, overall, a generative adversarial network comprising two parts, a generator and a discriminator, as shown in Figure 3. The generator is an Encoder-Decoder model, as shown in Figure 4: the Encoder, a bidirectional LSTM network, encodes the natural language description sequence, while the Decoder, a unidirectional LSTM network, decodes the semantics encoded by the Encoder into the program's abstract syntax tree. The discriminator is mainly responsible for judging whether the semantics of the natural language description and of the abstract syntax tree are consistent; the natural language description is encoded with the generator's Encoder, while the abstract syntax tree is encoded with a tree LSTM network, shown in Figure 5, which encodes the program's abstract syntax tree bottom-up, the encoding vector of the root node being the semantic vector of the abstract syntax tree.
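A minimal structural sketch of this composition in Python (class and method names are assumptions for illustration; the patent does not prescribe an API):

```python
class GANCoder:
    """Composition sketch: generator = BiLSTM Encoder + LSTM Decoder,
    discriminator = NL encoder + tree LSTM + consistency score."""
    def __init__(self, encoder, decoder, tree_lstm, score_head):
        self.encoder = encoder          # bidirectional LSTM (step 1.1)
        self.decoder = decoder          # unidirectional LSTM (step 1.2)
        self.tree_lstm = tree_lstm      # bottom-up AST encoder (step 2.2)
        self.score_head = score_head    # consistency prediction (steps 2.3-2.5)

    def generate(self, nl_tokens):
        enc_outputs, h_end = self.encoder(nl_tokens)
        return self.decoder(enc_outputs, h_end)      # an abstract syntax tree

    def discriminate(self, nl_tokens, ast):
        _, nl_vec = self.encoder(nl_tokens)          # step 2.1: NL semantic vector
        ast_vec = self.tree_lstm(ast)                # root vector of the AST
        return self.score_head(nl_vec, ast_vec)     # P(semantics consistent)
```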
Step 1: Use the generator of the GAN to generate the abstract syntax tree of a program fragment from the natural language description.
The generator is an Encoder-Decoder deep learning model, as shown in Figure 4. The left side of the figure is the Encoder, a bidirectional LSTM network responsible for encoding the natural language description sequence; the right side is the Decoder, a unidirectional LSTM network that decodes the semantics of the natural language description into the abstract syntax tree of a program fragment, based on the Encoder's encoding.
Step 1.1: Use a bidirectional LSTM network as the Encoder to encode the natural language description sequence. The two directions in the Encoder of Figure 4 represent the encoding order of the LSTM network.
Step 1.1.1: Encode the natural language description sequence from left to right and from right to left to obtain the forward and backward hidden vectors h_i→ and h_i← of each character, as shown by the two encoding directions of the LSTM network in the Encoder of Figure 4.
Step 1.1.2: Concatenate the two hidden vectors of 1.1.1 into h_i = [h_i→; h_i←], the encoding vector of each natural language character, and save each character's encoding vector for later use by the Decoder.
Step 1.1.3: Use the hidden vector of the last character as the initial state h_end of the Decoder.
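Steps 1.1.1 through 1.1.3 map directly onto a standard bidirectional LSTM; a minimal PyTorch sketch, assuming token IDs as input and illustrative dimensions:

```python
import torch
import torch.nn as nn

class NLEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # One pass left-to-right and one right-to-left (step 1.1.1).
        self.lstm = nn.LSTM(emb_dim, hid_dim, bidirectional=True,
                            batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        emb = self.embed(token_ids)
        # outputs[:, i, :] is the concatenation h_i = [h_i->; h_i<-]
        # of step 1.1.2, saved for the Decoder's attention.
        outputs, _ = self.lstm(emb)
        h_end = outputs[:, -1, :]       # initial Decoder state (step 1.1.3)
        return outputs, h_end
```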
Step 1.2: Use a unidirectional LSTM network as the Decoder, and decode the natural language semantics encoded by the Encoder into the abstract syntax tree of the program.
This step introduces the grammar rules of the programming language into the code generation process. The abstract syntax tree is generated in depth-first traversal order, and each generation step is the application of a context-free grammar production. The grammar rules provide prior knowledge for generating the abstract syntax tree and shrink the search space.
Step 1.2.1: As shown in Figure 4, the Decoder takes h_end from 1.1.3 as its initial state, uses the attention mechanism to compute the content vector C1 of h_end, and feeds that content vector to the LSTM as input.
Step 1.2.2: Apply Softmax to the LSTM output for multi-class classification; the classes correspond to actions that build the abstract syntax tree, matching the nodes of the abstract syntax tree on the right of Figure 2.
Step 1.2.3: The actions predicted in 1.2.2 fall into two types, generating leaf nodes and generating non-leaf nodes, i.e. the leaf and non-leaf nodes of the abstract syntax tree in Figure 2. An action that generates a non-leaf node is a context-free grammar expansion, each corresponding to one context-free grammar rule; an action that generates a leaf node emits a concrete character, i.e. a sequence character of the program fragment, which can either be copied from the natural language description sequence or generated by the model.
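A sketch of such an action head (an assumed implementation: one class per grammar rule, plus generate-from-vocabulary and pointer-style copy scores for leaf tokens; it assumes matching encoder/decoder hidden sizes, and all dimensions are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionHead(nn.Module):
    def __init__(self, hid_dim, n_rules, vocab_size):
        super().__init__()
        self.rule_head = nn.Linear(hid_dim, n_rules)    # non-leaf: apply rule
        self.gen_head = nn.Linear(hid_dim, vocab_size)  # leaf: generate token
        self.copy_proj = nn.Linear(hid_dim, hid_dim)    # leaf: copy from NL

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, hid_dim); enc_outputs: (batch, seq_len, hid_dim)
        rule_logp = F.log_softmax(self.rule_head(dec_state), dim=-1)
        gen_logp = F.log_softmax(self.gen_head(dec_state), dim=-1)
        # Score each NL position for copying its character/token.
        copy_scores = torch.bmm(self.copy_proj(dec_state).unsqueeze(1),
                                enc_outputs.transpose(1, 2)).squeeze(1)
        copy_logp = F.log_softmax(copy_scores, dim=-1)
        return rule_logp, gen_logp, copy_logp
```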
Step 1.2.4: Apply the actions predicted in 1.2.3 in depth-first traversal order to construct the abstract syntax tree. The order indicated by the solid arrows over the nodes of the abstract syntax tree in Figure 2 is the order in which each node of the tree is built.
Step 1.2.5: Feed the output of 1.2.4 back as the input of 1.2.1. As shown in Figure 2, the information of the previous node is passed to the next node; this information includes the state of the previous step, indicated by the solid arrows, and the information of the parent node, conveyed by the dotted arrows. Then repeat 1.2.1 through 1.2.4 until a complete abstract syntax tree is obtained, i.e. the abstract syntax tree of the program fragment whose semantics correspond to the natural language description.
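The depth-first construction order can be illustrated with a self-contained toy (the real system chooses productions with the LSTM and Softmax of 1.2.1-1.2.2 rather than this fixed table, and the grammar here is an invented miniature):

```python
# Tiny assumed grammar; symbols without productions are leaf tokens.
GRAMMAR = {
    "stmt":   [["assign"]],
    "assign": [["name", "expr"]],
    "expr":   [["name"], ["call"]],
}

def expand(symbol, depth=0):
    """Depth-first expansion: non-leaf symbols apply a production
    (step 1.2.4), leaf symbols emit a token placeholder."""
    if symbol not in GRAMMAR:
        print("  " * depth + f"token<{symbol}>")   # leaf: token action
        return
    print("  " * depth + symbol)
    for child in GRAMMAR[symbol][0]:               # model would pick the rule
        expand(child, depth + 1)                   # children left to right

expand("stmt")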
Step 1.2.6: Parse the complete abstract syntax tree into a program fragment.
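In isolation, this last step corresponds to unparsing. Python's standard library demonstrates the AST-to-source mapping on a toy tree (ast.unparse requires Python 3.9 or later; the model would of course build the tree itself rather than parse existing source):

```python
import ast

tree = ast.parse("x = sorted(my_list)")   # a one-line program fragment
print(ast.dump(tree, indent=2))           # the kind of tree the decoder builds
print(ast.unparse(tree))                  # step 1.2.6: tree -> program fragment
```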
Step 2: Use the GAN's discriminator to judge whether the semantics of the abstract syntax tree produced by the generator are consistent with the semantics of the given natural language description; this also imposes a strong semantic constraint on the generator. The data for training the discriminator falls into three types: (1) a natural language description from the training data and the abstract syntax tree of its corresponding program; (2) a given natural language description and the abstract syntax tree produced by the generator; (3) a natural language description sequence and the abstract syntax tree of an unrelated program. Type 1 is labeled consistent, while types 2 and 3 are labeled inconsistent.
Step 2.1: Encode the natural language description sequence with the Encoder of the GAN generator; this step only needs the final semantic vector. The Encoder structure is shown in Figure 4.
Step 2.2: Use a tree LSTM network, as shown in Figure 5, to encode the abstract syntax tree bottom-up. The child nodes of the abstract syntax tree are the input to the encoding of their parent node, and encoding proceeds up to the root node of the abstract syntax tree, whose vector is the semantic vector corresponding to the tree.
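A child-sum tree-LSTM cell is one common way to realize this bottom-up encoding; the sketch below is an assumption, since the patent specifies a tree LSTM but not a particular variant:

```python
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.iou = nn.Linear(in_dim + hid_dim, 3 * hid_dim)  # input/output/update
        self.f = nn.Linear(in_dim + hid_dim, hid_dim)        # per-child forget gate

    def forward(self, x, child_h, child_c):
        # x: (in_dim,) node embedding; child_h, child_c: (n_children, hid_dim)
        h_sum = child_h.sum(dim=0)
        i, o, u = torch.chunk(self.iou(torch.cat([x, h_sum])), 3)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.f(torch.cat(
            [x.expand(child_h.size(0), -1), child_h], dim=1)))
        c = i * u + (f * child_c).sum(dim=0)
        h = o * torch.tanh(c)
        return h, c   # applied bottom-up; h at the root is the AST's semantic vector
```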
Step 2.3: Multiply the natural language semantic vector of 2.1 with the semantic vector of the abstract syntax tree of 2.2.
Step 2.4: Repeat 2.1 through 2.3 for training data 2 and training data 3 of step 2.
Step 2.5: Run binary classification on the training data pairs of 2.4 to judge, in all three cases, whether the semantics of the natural language and of the program's abstract syntax tree are consistent.
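A minimal sketch of this scoring and prediction (reading the "vector multiplication" of step 2.3 as an element-wise product is an assumption, and the linear-plus-sigmoid head is illustrative):

```python
import torch
import torch.nn as nn

class ConsistencyHead(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.classify = nn.Linear(dim, 1)

    def forward(self, nl_vec, ast_vec):
        fused = nl_vec * ast_vec                     # step 2.3: vector multiplication
        return torch.sigmoid(self.classify(fused))  # step 2.5: P(consistent)

# Binary labels: type-1 pairs -> 1 (consistent), types 2 and 3 -> 0.
loss_fn = nn.BCELoss()
```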
Step 3: Train GANCoder, i.e. train the generator and the discriminator of the GAN together. During optimization the generator and the discriminator are optimized alternately. Before joint training, the generator and the discriminator are pre-trained, and then trained together adversarially; as shown in Figure 3, the discriminator's signal is fed back to the generator.
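An assumed outline of this schedule (the pre-training routines are hypothetical helpers, and the differentiability of the discrete AST generation is glossed over; in practice the generator update would need a policy-gradient style estimator):

```python
import torch
import torch.nn.functional as F

def bce(pred, target_value):
    return F.binary_cross_entropy(pred, torch.full_like(pred, target_value))

def train_gancoder(gen, disc, data, pretrain_gen, pretrain_disc, epochs=10):
    pretrain_gen(gen, data)       # supervised pre-training (hypothetical helpers)
    pretrain_disc(disc, data)
    g_opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
    d_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
    for _ in range(epochs):
        # data is assumed to yield (description, gold AST, unrelated AST) triples.
        for nl, gold_ast, unrelated_ast in data:
            # Discriminator step: type 1 consistent, types 2 and 3 not.
            fake_ast = gen.generate(nl)
            d_loss = (bce(disc(nl, gold_ast), 1.0)
                      + bce(disc(nl, fake_ast), 0.0)
                      + bce(disc(nl, unrelated_ast), 0.0))
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()
            # Generator step: the discriminator's score is fed back (Figure 3).
            g_loss = bce(disc(nl, gen.generate(nl)), 1.0)
            g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```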

Claims (4)

  1. A method for automatically generating Python code from natural language, characterized in that the steps are as follows:
    Step 1: use the generator of the GAN to generate the abstract syntax tree of a program fragment from the natural language description;
    Step 1.1: use a bidirectional LSTM network as the Encoder to encode the natural language description sequence;
    Step 1.1.1: encode the natural language description sequence from left to right and from right to left to obtain the forward and backward hidden vectors h_i→ and h_i← of each character;
    Step 1.1.2: concatenate the two hidden vectors into h_i = [h_i→; h_i←], the encoding vector of each natural language character, and save each character's encoding vector for later use by the Decoder;
    Step 1.1.3: use the hidden vector of the last character as the initial state h_end of the Decoder;
    Step 1.2: use a unidirectional LSTM network as the Decoder, and decode the natural language semantics encoded by the Encoder into the abstract syntax tree of the program;
    Step 2: divide the data for training the discriminator into three types: (A) a natural language description from the training data and the abstract syntax tree of its corresponding program; (B) a given natural language description and the abstract syntax tree produced by the generator; (C) a natural language description sequence and the abstract syntax tree of an unrelated program;
    label training data of type A consistent, and training data of types B and C inconsistent;
    Step 2.1: encode the natural language description sequence with the Encoder of the GAN generator;
    Step 2.2: use a tree LSTM network to encode the abstract syntax tree bottom-up, all the way to the root node of the abstract syntax tree;
    Step 2.3: multiply the natural language semantic vector of 2.1 with the semantic vector of the abstract syntax tree of 2.2;
    Step 2.4: repeat 2.1 through 2.3 for training data B and training data C of step 2;
    Step 2.5: run binary classification on the training data pairs of 2.4 to judge, in all three cases, whether the semantics of the natural language and of the program's abstract syntax tree are consistent;
    Step 3: train the generator and the discriminator of the GAN together, optimizing the generator and the discriminator alternately.
  2. The method for automatically generating Python code from natural language according to claim 1, characterized in that step 1.2 is specifically as follows:
    Step 1.2.1: use the initial state h_end from 1.1.3 as the initial state of the Decoder, use the attention mechanism to compute the content vector of h_end, and feed that content vector to the LSTM as input;
    Step 1.2.2: apply Softmax to the LSTM output of 1.2.1 for multi-class classification, the classes corresponding to actions that build the abstract syntax tree;
    Step 1.2.3: the actions produced in 1.2.2 fall into two types, generating leaf nodes and generating non-leaf nodes;
    Step 1.2.4: apply the actions of 1.2.3 in depth-first traversal order to construct the abstract syntax tree;
    Step 1.2.5: feed the output of 1.2.4 back as the input of 1.2.1 and repeat 1.2.1 through 1.2.4 until a complete abstract syntax tree is obtained, i.e. the abstract syntax tree of the program fragment whose semantics correspond to the natural language description;
    Step 1.2.6: parse the abstract syntax tree into a program fragment.
  3. The method for automatically generating Python code from natural language according to claim 1 or 2, characterized in that, in step 3, before the generator and the discriminator are trained together, they are each pre-trained separately and then trained together adversarially.
  4. The model generated by the method for automatically generating Python code from natural language according to any one of claims 1 to 3, comprising two parts, a generator and a discriminator, where the generator is responsible for generating programming language program fragments from natural language, and the discriminator identifies the program fragments produced by the generator; during training, the generator and the discriminator play an adversarial game and improve each other, until finally the discriminator cannot tell whether a program fragment comes from the original training set or was produced by the generator.
PCT/CN2019/099733 2019-07-29 2019-08-08 Method for automatically generating python codes from natural language WO2021017025A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910689490.3 2019-07-29
CN201910689490.3A CN110489102B (en) 2019-07-29 2019-07-29 Method for automatically generating Python code from natural language

Publications (1)

Publication Number Publication Date
WO2021017025A1 true WO2021017025A1 (en) 2021-02-04

Family

ID=68548396

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/099733 WO2021017025A1 (en) 2019-07-29 2019-08-08 Method for automatically generating python codes from natural language

Country Status (2)

Country Link
CN (1) CN110489102B (en)
WO (1) WO2021017025A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112987653B (en) * 2019-12-17 2022-04-15 深圳市恒控科技有限公司 Method and device for converting Chinese program into G code
CN111443904B (en) * 2020-03-12 2023-04-07 清华大学深圳国际研究生院 Method for generating executable code and computer readable storage medium
CN111639153A (en) * 2020-04-24 2020-09-08 平安国际智慧城市科技股份有限公司 Query method and device based on legal knowledge graph, electronic equipment and medium
CN112255962A (en) * 2020-10-30 2021-01-22 浙江佳乐科仪股份有限公司 PLC programming system based on artificial intelligence
CN112905188A (en) * 2021-02-05 2021-06-04 中国海洋大学 Code translation method and system based on generation type countermeasure GAN network
CN113126973A (en) * 2021-04-30 2021-07-16 南京工业大学 Code generation method based on gated attention and interactive LSTM
CN113849162B (en) * 2021-09-28 2024-04-02 哈尔滨工业大学 Code generation method combining model driving and deep neural network
CN114860241B (en) * 2022-07-07 2022-09-23 中国海洋大学 Code abstract syntax tree generation method based on generation countermeasure network
CN116400901A (en) * 2023-04-12 2023-07-07 上海计算机软件技术开发中心 Python code automatic generation method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170323636A1 (en) * 2016-05-05 2017-11-09 Conduent Business Services, Llc Semantic parsing using deep neural networks for predicting canonical forms
CN109359293A (en) * 2018-09-13 2019-02-19 内蒙古大学 Mongolian name entity recognition method neural network based and its identifying system
CN109783809A (en) * 2018-12-22 2019-05-21 昆明理工大学 A method of alignment sentence is extracted from Laos-Chinese chapter grade alignment corpus
CN109799990A (en) * 2017-11-16 2019-05-24 中标软件有限公司 Source code annotates automatic generation method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108446540B (en) * 2018-03-19 2022-02-25 中山大学 Program code plagiarism type detection method and system based on source code multi-label graph neural network
CN108388425B (en) * 2018-03-20 2021-02-19 北京大学 Method for automatically completing codes based on LSTM
CN108733359B (en) * 2018-06-14 2020-12-25 北京航空航天大学 Automatic generation method of software program
RU2697648C2 (en) * 2018-10-05 2019-08-15 Общество с ограниченной ответственностью "Алгоритм" Traffic classification system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170323636A1 (en) * 2016-05-05 2017-11-09 Conduent Business Services, Llc Semantic parsing using deep neural networks for predicting canonical forms
CN109799990A (en) * 2017-11-16 2019-05-24 中标软件有限公司 Source code annotates automatic generation method and system
CN109359293A (en) * 2018-09-13 2019-02-19 内蒙古大学 Mongolian name entity recognition method neural network based and its identifying system
CN109783809A (en) * 2018-12-22 2019-05-21 昆明理工大学 A method of alignment sentence is extracted from Laos-Chinese chapter grade alignment corpus

Also Published As

Publication number Publication date
CN110489102A (en) 2019-11-22
CN110489102B (en) 2021-06-18

Similar Documents

Publication Publication Date Title
WO2021017025A1 (en) Method for automatically generating python codes from natural language
CN112613303B (en) Knowledge distillation-based cross-modal image aesthetic quality evaluation method
CN107632981B (en) Neural machine translation method introducing source language chunk information coding
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
CN109359297B (en) Relationship extraction method and system
CN108563433B (en) Device based on LSTM automatic completion code
CN109508459A (en) A method of extracting theme and key message from news
CN111382574B (en) Semantic parsing system combining syntax under virtual reality and augmented reality scenes
CN113657123A (en) Mongolian aspect level emotion analysis method based on target template guidance and relation head coding
CN110084323A (en) End-to-end semanteme resolution system and training method
CN115543437A (en) Code annotation generation method and system
CN114489669A (en) Python language code fragment generation method based on graph learning
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN108363685B (en) Self-media data text representation method based on recursive variation self-coding model
CN112732264A (en) Automatic code conversion method between high-level programming languages
CN115906857A (en) Chinese medicine text named entity recognition method based on vocabulary enhancement
CN108733359B (en) Automatic generation method of software program
CN115826988A (en) Java method annotation instant automatic updating method based on data flow analysis and attention mechanism
CN116362265A (en) Text translation method, device, equipment and storage medium
CN116483314A (en) Automatic intelligent activity diagram generation method
CN114757181B (en) Method and device for training and extracting event of end-to-end event extraction model based on prior knowledge
CN112434143B (en) Dialog method, storage medium and system based on hidden state constraint of GRU (generalized regression Unit)
CN113536741B (en) Method and device for converting Chinese natural language into database language
CN114881010A (en) Chinese grammar error correction method based on Transformer and multitask learning
CN113486647A (en) Semantic parsing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19939997; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19939997; Country of ref document: EP; Kind code of ref document: A1)