CN115202640A - Code generation method and system based on natural semantic understanding - Google Patents


Info

Publication number
CN115202640A
Authority
CN
China
Prior art keywords: code generation, natural language, code, generation model, data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210886402.0A
Other languages
Chinese (zh)
Inventor
乐心怡
王骥泽
陈彩莲
关新平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202210886402.0A priority Critical patent/CN115202640A/en
Publication of CN115202640A publication Critical patent/CN115202640A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/30 Creation or generation of source code
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/42 Syntactic analysis
    • G06F 8/423 Preprocessors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/42 Syntactic analysis
    • G06F 8/425 Lexical analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/42 Syntactic analysis
    • G06F 8/427 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/43 Checking; Contextual analysis
    • G06F 8/436 Semantic checking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a code generation method and system based on natural semantic understanding, which comprises the following steps: step S1: acquiring a natural language sequence, and preprocessing the acquired natural language sequence to obtain a preprocessed natural language sequence; step S2: constructing a code generation model, and training the constructed code generation model to obtain a trained code generation model; and step S3: generating a target code by the trained code generation model according to the input natural language sequence. The code generation model is based on an encoder-decoder neural network structure and realizes automatic generation of the target code from the natural language sequence.

Description

Code generation method and system based on natural semantic understanding
Technical Field
The invention relates to the technical field of natural language processing, in particular to a code generation method and system based on natural semantic understanding.
Background
Semantic parsing is a class of tasks in the field of natural language processing that studies how to convert a given natural language description into a logical representation that a computer can understand and execute. The traditional approach is to design fixed templates according to the characteristics of the programming language and then parse the natural language description into instances of those templates by pattern matching. With the development of deep learning, encoder-decoder frameworks have also been introduced into semantic parsing, for example by directly translating the natural language description into a programming language sequence with machine translation methods, or by introducing the syntax of the programming language during code generation, first generating the abstract syntax tree of the program and then converting the abstract syntax tree into program code. However, these methods have drawbacks: the direct machine translation approach places high demands on the scale of accurately labeled data, while the abstract-syntax-tree approach requires more manual design and fails to introduce knowledge from the natural language domain.
Yin P, Neubig G. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation [J]. arXiv preprint arXiv:1810.02720, 2018.
Lewis M, Liu Y, Goyal N, et al. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension [J]. arXiv preprint arXiv:1910.13461, 2019.
Norouzi S, Tang K, Cao Y. Code Generation from Natural Language with Less Prior Knowledge and More Monolingual Data [C] // Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 2021: 776-785.
The above prior art is code generation technology based on deep learning, but it has the following disadvantages: designing the neural network model structure around the abstract syntax tree requires additional manual design in data processing and in the model structure, makes external unlabeled data difficult to introduce, and limits the code generation capability; pre-training a BART model from scratch on a large amount of code and corresponding documentation data places high demands on data scale and computing resources, and model training is time-consuming; and performing data augmentation with only the code portion of the inaccurately labeled data in the fine-tuning stage, without the corresponding natural language annotations, is unfavorable for the model to learn the semantic association between natural language and programming language.
Patent document CN110489102B (application No. 201910689490.3) discloses a method for automatically generating Python code from natural language. The method comprises the following steps: step 1: a generator of a GAN network generates an abstract syntax tree of a program segment from the natural language description; step 2: a discriminator of the GAN determines whether the semantics of the abstract syntax tree generated by the generator are consistent with the semantics of the given natural language description; and step 3: the generator and the discriminator of the GAN network are trained together.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a code generation method and system based on natural semantic understanding.
The invention provides a code generation method based on natural semantic understanding, which comprises the following steps:
step S1: acquiring a natural language sequence, and preprocessing the acquired natural language sequence to obtain a preprocessed natural language sequence;
step S2: constructing a code generation model, and training the constructed code generation model to obtain a trained code generation model;
and step S3: the trained code generation model generates a target code according to the input natural language sequence;
the code generation model is based on an encoder-decoder neural network structure and realizes automatic generation of the target code from the natural language sequence.
Preferably, the step S1 employs: and acquiring a natural language sequence, and performing word segmentation operation and vectorization processing on the natural language sequence in sequence by using a word list and a word segmentation device of the BERT to obtain a word vector.
Preferably, the encoder in the code generation model adopts a BERT model to form a BERT encoder;
a decoder in the code generation model adopts a plurality of layers of transformers to form a Transformer decoder;
the BERT encoder performs semantic feature extraction on input word vectors through a multi-head self-attention mechanism and a feedforward neural network to obtain deep semantic representation of natural language;
the Transformer decoder performs reasoning circularly through a multi-head self-attention mechanism with a mask, a cross-attention mechanism and a feedforward neural network according to the depth semantic representation of the natural language extracted by the BERT encoder to obtain the decoder output.
Preferably, the step S2 employs:
step S2.1: expanding a training set based on the data with inaccurate labeling, and performing bidirectional pre-training on a code generation model to be trained by using the expanded training set;
the annotated inaccurate data is represented as:
D_N = {(x_i, y_i)}, i = 1, …, N
wherein x_i represents a natural language annotation; y_i represents a target code; N represents the total number of inaccurately labeled samples;
The training set is expanded based on the inaccurately labeled data by swapping the input and the output of each data pair:
D_pre = {(x_i, y_i)} ∪ {(y_i, x_i)}, i = 1, …, N
The two-way (bidirectional) pre-training loss function adopts a negative log-likelihood (cross-entropy) form over the expanded training set:
L_pre(θ) = - Σ_{(x, y) ∈ D_pre} log P(y | x; θ)
wherein P(y | x; θ) is the probability assigned to the target sequence y by F(x; θ), and F(x; θ) represents the probability output obtained after the input x passes through the encoder, the decoder and the softmax operation of the code generation model;
step S2.2: carrying out self-coding fine adjustment on the code generation model subjected to bidirectional pre-training by using label-free data and label accurate data to obtain a trained code generation model;
the annotation accurate data is represented as:
D_C = {(x_i, y_i)}, i = 1, …, C
wherein x_i represents a natural language annotation, y_i represents a target code, and C represents the total number of accurately labeled samples;
The label-free (unlabeled) data is represented as:
D_C' = {y_i'}, i = 1, …, C'
wherein y_i' is unlabeled code data and C' is the total number of unlabeled samples;
The loss function of the fine-tuning stage combines supervised generation on the accurately labeled pairs with self-encoding reconstruction of the unlabeled code:
L_ft(θ) = - Σ_{i=1}^{C} log P(y_i | x_i; θ) - Σ_{i=1}^{C'} log P(y_i' | y_i'; θ)
preferably, the step S3 employs:
step S3.1: performing word segmentation and vectorization processing on an input natural language sequence to obtain a word vector;
step S3.2: inputting the word vector into the trained BERT encoder to extract semantic features, obtaining the deep semantics of the natural language;
step S3.3: the trained Transformer decoder performs iterative inference according to the extracted deep semantics of the natural language, obtains the occurrence probability of each word in the word list at each time step, and obtains the code sequence with the maximum occurrence probability as the target code through code search.
According to the invention, the code generation system based on natural semantic understanding comprises:
a module M1: acquiring a natural language sequence, and preprocessing the acquired natural language sequence to obtain a preprocessed natural language sequence;
a module M2: constructing a code generation model, and training the constructed code generation model to obtain a trained code generation model;
a module M3: the trained code generation model generates a target code according to the input natural language sequence;
the code generation model is based on an encoder-decoder neural network structure and realizes automatic generation of the target code from the natural language sequence.
Preferably, the module M1 employs: and acquiring a natural language sequence, and performing word segmentation operation and vectorization processing on the natural language sequence in sequence by using a word list and a word segmentation device of the BERT to obtain a word vector.
Preferably, the encoder in the code generation model adopts a BERT model to form a BERT encoder;
a decoder in the code generation model adopts a plurality of layers of transformers to form a Transformer decoder;
the BERT encoder performs semantic feature extraction on input word vectors through a multi-head self-attention mechanism and a feedforward neural network to obtain deep semantic representation of natural language;
the Transformer decoder performs reasoning circularly through a multi-head self-attention mechanism with a mask, a cross-attention mechanism and a feedforward neural network according to the depth semantic representation of the natural language extracted by the BERT encoder to obtain the decoder output.
Preferably, the module M2 employs:
module M2.1: expanding a training set based on the inaccurate data of the label, and performing bidirectional pre-training on the code generation model to be trained by utilizing the expanded training set;
the annotated inaccuracy data is represented as:
D_N = {(x_i, y_i)}, i = 1, …, N
wherein x_i represents a natural language annotation; y_i represents a target code; N represents the total number of inaccurately labeled samples;
The training set is expanded based on the inaccurately labeled data by swapping the input and the output of each data pair:
D_pre = {(x_i, y_i)} ∪ {(y_i, x_i)}, i = 1, …, N
The two-way (bidirectional) pre-training loss function adopts a negative log-likelihood (cross-entropy) form over the expanded training set:
L_pre(θ) = - Σ_{(x, y) ∈ D_pre} log P(y | x; θ)
wherein P(y | x; θ) is the probability assigned to the target sequence y by F(x; θ), and F(x; θ) represents the probability output obtained after the input x passes through the encoder, the decoder and the softmax operation of the code generation model;
module M2.2: carrying out self-coding fine adjustment on the code generation model subjected to bidirectional pre-training by using label-free data and label accurate data to obtain a trained code generation model;
the annotated accurate data is represented as:
D_C = {(x_i, y_i)}, i = 1, …, C
wherein x_i represents a natural language annotation, y_i represents a target code, and C represents the total number of accurately labeled samples;
The label-free (unlabeled) data is represented as:
D_C' = {y_i'}, i = 1, …, C'
wherein y_i' is unlabeled code data and C' is the total number of unlabeled samples;
The loss function of the fine-tuning stage combines supervised generation on the accurately labeled pairs with self-encoding reconstruction of the unlabeled code:
L_ft(θ) = - Σ_{i=1}^{C} log P(y_i | x_i; θ) - Σ_{i=1}^{C'} log P(y_i' | y_i'; θ)
preferably, the module M3 employs:
module M3.1: performing word segmentation and vectorization processing on an input natural language sequence to obtain a word vector;
module M3.2: inputting the word vector into the trained BERT encoder to extract semantic features, obtaining the deep semantics of the natural language;
module M3.3: the trained Transformer decoder performs iterative inference according to the extracted deep semantics of the natural language, obtains the occurrence probability of each word in the word list at each time step, and obtains the code sequence with the maximum occurrence probability as the target code through code search.
Compared with the prior art, the invention has the following beneficial effects:
1. by adopting the neural network model structure of the encoder-decoder, the problem of code generation based on natural semantic understanding is solved, the software development efficiency is improved, and the labor cost is reduced;
2. the invention extracts the deep semantic representation of the natural language by adopting the pre-training model BERT as the coder, thereby solving the problem of introduction of natural language field knowledge in the code generation task.
3. The invention adopts a model training method based on a pre-training-fine-tuning paradigm, fully utilizes the supervision information in the limited data, and solves the problems of limited and expensive labeled accurate data in the field of code generation.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 is a block diagram of the encoder-decoder network based on the BERT model.
Fig. 2 is a flow diagram of model training.
Fig. 3 is a schematic diagram of bidirectional pre-training.
Fig. 4 is a schematic diagram of self-encoding fine-tuning.
Fig. 5 is a diagram of the decoder inference process.
Fig. 6 is a schematic diagram of the iterative (cyclic) inference process.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the concept of the invention; all such variations and modifications fall within the scope of the invention.
Example 1
According to the code generation method based on natural semantic understanding provided by the invention, as shown in fig. 1-6, the method comprises the following steps:
step S1: acquiring a natural language sequence, and preprocessing the acquired natural language sequence to obtain a preprocessed natural language sequence;
specifically, the step S1 employs: and acquiring a natural language sequence, and performing word segmentation operation and vectorization processing on the natural language sequence in sequence by using a word list and a word segmentation device of the BERT to obtain a word vector.
Step S2: constructing a code generation model, and training the constructed code generation model to obtain a trained code generation model;
and step S3: the trained code generation model generates a target code according to the input natural language sequence;
the code generation model is based on an encoder-decoder neural network structure and realizes automatic generation of the target code from the natural language sequence.
Specifically, an encoder in the code generation model adopts a BERT model to form a BERT encoder;
a decoder in the code generation model adopts a plurality of layers of transformers to form a Transformer decoder;
the BERT encoder performs semantic feature extraction on input word vectors through a multi-head self-attention mechanism and a feedforward neural network to obtain deep semantic representation of natural language;
the Transformer decoder performs reasoning circularly through a multi-head self-attention mechanism with a mask, a cross-attention mechanism and a feedforward neural network according to the depth semantic representation of the natural language extracted by the BERT encoder to obtain the decoder output.
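The following is a minimal sketch of such an encoder-decoder model in PyTorch; the number of decoder layers, attention heads and code vocabulary size are illustrative assumptions rather than values given in the patent.

    # Sketch of the code generation model: a pretrained BERT encoder plus a
    # multi-layer Transformer decoder with masked self-attention and cross-attention.
    import torch
    import torch.nn as nn
    from transformers import BertModel

    class CodeGenerator(nn.Module):
        def __init__(self, code_vocab_size, d_model=768, n_layers=6, n_heads=8):
            super().__init__()
            self.encoder = BertModel.from_pretrained("bert-base-uncased")
            layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
            self.code_embedding = nn.Embedding(code_vocab_size, d_model)
            self.output_proj = nn.Linear(d_model, code_vocab_size)  # softmax applied outside

        def forward(self, input_ids, attention_mask, code_ids):
            # Deep semantic representation of the natural language input
            memory = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
            tgt = self.code_embedding(code_ids)
            # Mask so each position only attends to earlier code tokens
            tgt_mask = nn.Transformer.generate_square_subsequent_mask(code_ids.size(1))
            out = self.decoder(tgt, memory, tgt_mask=tgt_mask)
            return self.output_proj(out)  # logits over the code word list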
Specifically, the step S2 employs:
step S2.1: expanding a training set based on the inaccurate data of the label, and performing bidirectional pre-training on the code generation model to be trained by utilizing the expanded training set;
specifically, the step S2.1 employs:
the annotated inaccurate data is represented as:
D_N = {(x_i, y_i)}, i = 1, …, N
wherein x_i represents a natural language annotation; y_i represents a target code; N represents the total number of inaccurately labeled samples;
The training set is expanded based on the inaccurately labeled data by swapping the input and the output of each data pair:
D_pre = {(x_i, y_i)} ∪ {(y_i, x_i)}, i = 1, …, N
The two-way (bidirectional) pre-training loss function adopts a negative log-likelihood (cross-entropy) form over the expanded training set:
L_pre(θ) = - Σ_{(x, y) ∈ D_pre} log P(y | x; θ)
wherein P(y | x; θ) is the probability assigned to the target sequence y by F(x; θ), F(x; θ) represents the probability output obtained after the input x passes through the encoder, the decoder and the softmax operation of the code generation model, and the softmax operation converts the decoder output into probabilities;
Specifically, the softmax operation over a decoder output vector z is:
softmax(z)_j = exp(z_j) / Σ_{k=1}^{S} exp(z_k), j = 1, …, S
wherein S is the length of the word list.
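The bidirectional pre-training step can be sketched as follows; the cross-entropy training loop and the use of a single shared tokenizer for natural language and code are simplifying assumptions consistent with the description above.

    # Sketch of step S2.1: expand the noisy training set by swapping the input and
    # output of each "natural language - code" pair, then train with cross-entropy.
    import torch
    import torch.nn.functional as F

    def expand_bidirectional(noisy_pairs):
        # noisy_pairs: list of (natural_language, code) tuples crawled from the web
        return noisy_pairs + [(code, text) for text, code in noisy_pairs]

    def pretrain_step(model, tokenizer, src_text, tgt_text, optimizer):
        src = tokenizer(src_text, return_tensors="pt")
        tgt = tokenizer(tgt_text, return_tensors="pt")["input_ids"]
        logits = model(src["input_ids"], src["attention_mask"], tgt[:, :-1])
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tgt[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()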
Step S2.2: and carrying out self-coding fine adjustment on the code generation model subjected to bidirectional pre-training by using the label-free data and the label accurate data to obtain the trained code generation model.
Specifically, the step S2.2 employs:
the annotated accurate data is represented as:
D_C = {(x_i, y_i)}, i = 1, …, C
wherein x_i represents a natural language annotation, y_i represents a target code, and C represents the total number of accurately labeled samples;
The label-free (unlabeled) data is represented as:
D_C' = {y_i'}, i = 1, …, C'
wherein y_i' is unlabeled code data and C' is the total number of unlabeled samples;
The loss function of the fine-tuning stage combines supervised generation on the accurately labeled pairs with self-encoding reconstruction of the unlabeled code:
L_ft(θ) = - Σ_{i=1}^{C} log P(y_i | x_i; θ) - Σ_{i=1}^{C'} log P(y_i' | y_i'; θ)
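A sketch of the fine-tuning loss follows; treating the self-encoding term as code-to-code reconstruction of the unlabeled data is an assumption based on the description above.

    # Sketch of step S2.2: supervised loss on accurately labeled pairs plus a
    # self-encoding loss in which unlabeled code is reconstructed from itself.
    import torch
    import torch.nn.functional as F

    def sequence_nll(model, tokenizer, src_text, tgt_text):
        # Negative log-likelihood of tgt_text given src_text under the model
        src = tokenizer(src_text, return_tensors="pt")
        tgt = tokenizer(tgt_text, return_tensors="pt")["input_ids"]
        logits = model(src["input_ids"], src["attention_mask"], tgt[:, :-1])
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tgt[:, 1:].reshape(-1), reduction="sum")

    def finetune_loss(model, tokenizer, labeled_pairs, unlabeled_code):
        loss = sum(sequence_nll(model, tokenizer, text, code)
                   for text, code in labeled_pairs)        # accurately labeled data
        loss = loss + sum(sequence_nll(model, tokenizer, code, code)
                          for code in unlabeled_code)      # self-encoding on unlabeled code
        return loss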
specifically, the step S3 employs:
step S3.1: performing word segmentation and vectorization processing on an input natural language sequence to obtain a word vector;
step S3.2: inputting the word vector into the trained BERT encoder to extract semantic features, obtaining the deep semantics of the natural language;
step S3.3: the trained Transformer decoder performs iterative inference according to the extracted deep semantics of the natural language, obtains the occurrence probability of each word in the word list at each time step, and obtains the code sequence with the maximum occurrence probability as the target code through code search. The code search method is a greedy search or a beam search. A concrete example is as follows:
input natural language sequence: "comment is an empty list"
output target code: "comment = [ ]"
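For example, a greedy decoding loop over the trained model could be sketched as follows; the use of the BERT [CLS]/[SEP] token ids as start and end markers is an illustrative assumption.

    # Sketch of step S3.3 with greedy search: at each time step take the most
    # probable word from the decoder output until an end token is produced.
    import torch

    @torch.no_grad()
    def greedy_decode(model, tokenizer, text, max_len=64):
        src = tokenizer(text, return_tensors="pt")
        generated = [tokenizer.cls_token_id]                # start-of-sequence marker
        for _ in range(max_len):
            tgt = torch.tensor([generated])
            logits = model(src["input_ids"], src["attention_mask"], tgt)
            next_id = int(logits[0, -1].argmax())           # highest-probability word
            if next_id == tokenizer.sep_token_id:           # end-of-sequence marker
                break
            generated.append(next_id)
        return tokenizer.decode(generated[1:])

    # e.g. greedy_decode(model, tokenizer, "comment is an empty list") would be
    # expected, for a trained model, to return code such as "comment = [ ]"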
According to the code generation system based on natural semantic understanding provided by the invention, as shown in fig. 1-6, the code generation system comprises:
a module M1: acquiring a natural language sequence, and preprocessing the acquired natural language sequence to obtain a preprocessed natural language sequence;
specifically, the module M1 employs: and acquiring a natural language sequence, and performing word segmentation operation and vectorization processing on the natural language sequence in sequence by using a word list and a word segmentation device of the BERT to obtain a word vector.
A module M2: constructing a code generation model, and training the constructed code generation model to obtain a trained code generation model;
a module M3: the trained code generation model generates a target code according to the input natural language sequence;
the code generation model is based on an encoder-decoder neural network structure and realizes automatic generation of the target code from the natural language sequence.
Specifically, an encoder in the code generation model adopts a BERT model to form a BERT encoder;
a decoder in the code generation model adopts a plurality of layers of transformers to form a Transformer decoder;
the BERT encoder performs semantic feature extraction on input word vectors through a multi-head self-attention mechanism and a feedforward neural network to obtain deep semantic representation of natural language;
the Transformer decoder performs inference iteratively through a multi-head self-attention mechanism with a mask, a cross-attention mechanism and a feedforward neural network according to the deep semantic representation of the natural language extracted by the BERT encoder to obtain the decoder output.
Specifically, the module M2 employs:
module M2.1: expanding a training set based on the inaccurate data of the label, and performing bidirectional pre-training on the code generation model to be trained by utilizing the expanded training set;
specifically, the module M2.1 employs:
the annotated inaccuracy data is represented as:
D_N = {(x_i, y_i)}, i = 1, …, N
wherein x_i represents a natural language annotation; y_i represents a target code; N represents the total number of inaccurately labeled samples;
The training set is expanded based on the inaccurately labeled data by swapping the input and the output of each data pair:
D_pre = {(x_i, y_i)} ∪ {(y_i, x_i)}, i = 1, …, N
The two-way (bidirectional) pre-training loss function adopts a negative log-likelihood (cross-entropy) form over the expanded training set:
L_pre(θ) = - Σ_{(x, y) ∈ D_pre} log P(y | x; θ)
wherein P(y | x; θ) is the probability assigned to the target sequence y by F(x; θ), F(x; θ) represents the probability output obtained after the input x passes through the encoder, the decoder and the softmax operation of the code generation model, and the softmax operation converts the decoder output into probabilities;
Specifically, the softmax operation over a decoder output vector z is:
softmax(z)_j = exp(z_j) / Σ_{k=1}^{S} exp(z_k), j = 1, …, S
wherein S is the length of the word list.
Module M2.2: and carrying out self-coding fine adjustment on the code generation model subjected to bidirectional pre-training by using the label-free data and the label accurate data to obtain the trained code generation model.
In particular, the module M2.2 employs:
the annotated accurate data is represented as:
D_C = {(x_i, y_i)}, i = 1, …, C
wherein x_i represents a natural language annotation, y_i represents a target code, and C represents the total number of accurately labeled samples;
The label-free (unlabeled) data is represented as:
D_C' = {y_i'}, i = 1, …, C'
wherein y_i' is unlabeled code data and C' is the total number of unlabeled samples;
The loss function of the fine-tuning stage combines supervised generation on the accurately labeled pairs with self-encoding reconstruction of the unlabeled code:
L_ft(θ) = - Σ_{i=1}^{C} log P(y_i | x_i; θ) - Σ_{i=1}^{C'} log P(y_i' | y_i'; θ)
specifically, the module M3 employs:
module M3.1: performing word segmentation and vectorization processing on an input natural language sequence to obtain a word vector;
module M3.2: inputting the word vector into the trained BERT encoder to extract semantic features, obtaining the deep semantics of the natural language;
module M3.3: the trained Transformer decoder performs iterative inference according to the extracted deep semantics of the natural language, obtains the occurrence probability of each word in the word list at each time step, and obtains the code sequence with the maximum occurrence probability as the target code through code search. The code search method is a greedy search or a beam search. A concrete example is as follows:
input natural language sequence: "comment is an empty list"
output target code: "comment = [ ]"
Example 2
Example 2 is a preferred implementation of Example 1.
The invention provides a code generation method based on natural semantic understanding, which realizes automatic generation from natural language requirement description to target codes, improves software development efficiency and reduces labor cost. The deep semantic representation of the natural language is extracted through the BERT encoder, and the problem of introduction of domain knowledge is solved. A novel model training method based on a pre-training-fine-tuning paradigm is provided, supervision information in limited data is fully utilized, requirements for data scale and computing resources are reduced, and training time is shortened.
The invention provides a code generation system based on natural semantic understanding, which comprises:
a data processing module: the method comprises the steps of inputting natural language requirement description as a code generation model, outputting corresponding codes as the code generation model, and adopting a training data form as a 'natural language-program language' data pair. The training data is divided into two parts, one part is accurate marking data, code segments are selected and manually marked according to natural language requirements, and therefore marking accuracy is guaranteed; the other part is the inaccurate data of the label. A small amount of accurately labeled data and a large amount of inaccurately labeled data jointly form training data, so that the data volume is sufficient, and the acquisition cost is not too high. The inaccurate data of the label crawls codes and corresponding labels through the Internet so as to reduce the labor cost; replacing codes by crawling the unmarked data through the Internet; and the accurate data is marked manually by crawling codes through the Internet.
A network structure module: to handle the problem of unequal length of input and output sequences, the code generation model employs an overall structure of encoder-decoder, where the encoder employs a pre-trained BERT model and the decoder employs multiple layers of transformers. The input of the whole code generation model is a natural language sequence, and in order to enable the code generation model to understand the input sequence, word segmentation operation needs to be carried out on the input sequence firstly, and then vectorization is carried out, and a series of obtained word vectors are used as the input of the encoder.
The BERT encoder performs semantic feature extraction on the input word vectors through a multi-head self-attention mechanism and a feedforward neural network to obtain a deep semantic representation of the natural language. The Transformer decoder performs inference iteratively through a multi-head self-attention mechanism with a mask, a cross-attention mechanism and a feedforward neural network according to the deep semantic representation of the natural language extracted by the encoder to obtain the decoder output. This yields the code generation model to be trained.
A training module: firstly, bidirectional pre-training is carried out on the code generation model to be trained by using the inaccurately labeled data, so that the code generation model can better understand the relationship between natural language and programming language. Because the deep learning model itself maps inputs to outputs in only one direction, the bidirectional pre-training method exchanges the input and output of each "natural language-programming language" data pair, so that the model learns the conversion in both directions simultaneously.
The annotated inaccuracy data in the training data may be expressed as:
D_N = {(x_i, y_i)}, i = 1, …, N
wherein x_i is a natural language annotation, y_i is the target code, and N is the total number of inaccurately labeled samples.
In bidirectional pre-training, we extend the training set by swapping the input and the output of each pair:
D_pre = {(x_i, y_i)} ∪ {(y_i, x_i)}, i = 1, …, N
The loss function of the two-way pre-training phase then takes a negative log-likelihood (cross-entropy) form over the extended training set:
L_pre(θ) = - Σ_{(x, y) ∈ D_pre} log P(y | x; θ)
wherein P(y | x; θ) is the probability assigned to the target sequence y by F(x; θ), and F(x; θ) represents the probability output obtained after the input x passes through the encoder, the decoder and the softmax operation of the model.
Then, self-encoding fine-tuning is carried out on the bidirectionally pre-trained model by using the unlabeled data and the accurately labeled data, wherein the code part of the inaccurately labeled data can be directly used as unlabeled data.
The annotated accuracy data used for fine tuning can be expressed as:
D_C = {(x_i, y_i)}, i = 1, …, C
wherein x_i is a natural language annotation, y_i is the target code, and C is the total number of accurately labeled samples.
The label-free (unlabeled) data can be expressed as:
D_C' = {y_i'}, i = 1, …, C'
wherein y_i' is unlabeled code data and C' is the total number of unlabeled samples.
The loss function of the fine-tuning phase then combines supervised generation on the accurately labeled pairs with self-encoding reconstruction of the unlabeled code:
L_ft(θ) = - Σ_{i=1}^{C} log P(y_i | x_i; θ) - Σ_{i=1}^{C'} log P(y_i' | y_i'; θ)
and obtaining a trained code generation model through a pre-training stage and a fine-tuning stage.
An inference module: the trained code generation model is used in the code generation task based on natural semantic understanding. Given a natural language requirement description, the natural language sequence is segmented and vectorized, the obtained word vectors are input into the BERT encoder, the BERT encoder performs semantic feature extraction on the input word vectors, and the Transformer decoder performs iterative inference according to the extracted semantic features to obtain the decoder output. The decoder output gives, at each time step, the probability of each word in the word list; greedy search or beam search over this output yields the sequence with the maximum occurrence probability, which is converted into the corresponding word sequence according to the word list, finally giving the target code.
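As an alternative to greedy search, a simple beam search over the same model interface can be sketched as follows; the beam width and the [CLS]/[SEP] start and end markers are illustrative assumptions.

    # Sketch of beam search: keep the beam_width most probable partial code
    # sequences at each time step and return the overall most probable one.
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def beam_search(model, tokenizer, text, beam_width=3, max_len=64):
        src = tokenizer(text, return_tensors="pt")
        beams = [([tokenizer.cls_token_id], 0.0)]           # (token ids, log-probability)
        for _ in range(max_len):
            candidates = []
            for ids, score in beams:
                if ids[-1] == tokenizer.sep_token_id:        # finished hypothesis
                    candidates.append((ids, score))
                    continue
                logits = model(src["input_ids"], src["attention_mask"],
                               torch.tensor([ids]))
                log_probs = F.log_softmax(logits[0, -1], dim=-1)
                top = torch.topk(log_probs, beam_width)
                for lp, idx in zip(top.values, top.indices):
                    candidates.append((ids + [int(idx)], score + float(lp)))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
            if all(ids[-1] == tokenizer.sep_token_id for ids, _ in beams):
                break
        best_ids, _ = beams[0]
        if best_ids[-1] == tokenizer.sep_token_id:
            best_ids = best_ids[:-1]                         # drop end marker
        return tokenizer.decode(best_ids[1:])                # drop start marker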
According to the invention, the domain knowledge of natural language and programming language can be respectively introduced through the BERT model and external data, so that more manual design is avoided; the model is trained by a training method based on a pre-training-fine-tuning paradigm, limited marking data are fully utilized, and the BERT model is pre-trained on a large English corpus, so that the requirements for data scale and computing resources are reduced, and the model training time is shortened. Meanwhile, the model is subjected to bidirectional pre-training before the fine-tuning stage, and the supervision information in the inaccurate data of the label is fully utilized, so that the model learns the semantic association between the natural language and the program language.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A code generation method based on natural semantic understanding is characterized by comprising the following steps:
step S1: acquiring a natural language sequence, and preprocessing the acquired natural language sequence to obtain a preprocessed natural language sequence;
step S2: constructing a code generation model, and training the constructed code generation model to obtain a trained code generation model;
and step S3: the trained code generation model generates a target code according to an input natural language sequence;
the code generation model is based on a neural network structure of a coder-decoder, and realizes automatic generation from a natural language sequence to target codes.
2. The code generation method based on natural semantic understanding according to claim 1, wherein the step S1 employs: and acquiring a natural language sequence, and performing word segmentation operation and vectorization processing on the natural language sequence in sequence by using a word list and a word segmentation device of the BERT to obtain a word vector.
3. The code generation method based on natural semantic understanding according to claim 1, wherein the encoder in the code generation model adopts a BERT model to form a BERT encoder;
a decoder in the code generation model adopts a plurality of layers of transformers to form a Transformer decoder;
the BERT encoder performs semantic feature extraction on input word vectors through a multi-head self-attention mechanism and a feedforward neural network to obtain deep semantic representation of natural language;
the Transformer decoder performs reasoning circularly through a multi-head self-attention mechanism with a mask, a cross-attention mechanism and a feedforward neural network according to the depth semantic representation of the natural language extracted by the BERT encoder to obtain the decoder output.
4. The code generation method based on natural semantic understanding according to claim 1, wherein the step S2 employs:
step S2.1: expanding a training set based on the data with inaccurate labeling, and performing bidirectional pre-training on a code generation model to be trained by using the expanded training set;
the annotated inaccurate data is represented as:
D_N = {(x_i, y_i)}, i = 1, …, N
wherein x_i represents a natural language annotation; y_i represents a target code; N represents the total number of inaccurately labeled samples;
The training set is expanded based on the inaccurately labeled data by swapping the input and the output of each data pair:
D_pre = {(x_i, y_i)} ∪ {(y_i, x_i)}, i = 1, …, N
The two-way (bidirectional) pre-training loss function adopts a negative log-likelihood (cross-entropy) form over the expanded training set:
L_pre(θ) = - Σ_{(x, y) ∈ D_pre} log P(y | x; θ)
wherein P(y | x; θ) is the probability assigned to the target sequence y by F(x; θ), and F(x; θ) represents the probability output obtained after the input x passes through the encoder, the decoder and the softmax operation of the code generation model;
step S2.2: carrying out self-coding fine adjustment on the code generation model subjected to bidirectional pre-training by using label-free data and label accurate data to obtain a trained code generation model;
the annotated accurate data is represented as:
D_C = {(x_i, y_i)}, i = 1, …, C
wherein x_i represents a natural language annotation, y_i represents a target code, and C represents the total number of accurately labeled samples;
The label-free (unlabeled) data is represented as:
D_C' = {y_i'}, i = 1, …, C'
wherein y_i' is unlabeled code data and C' is the total number of unlabeled samples;
The loss function of the fine-tuning stage combines supervised generation on the accurately labeled pairs with self-encoding reconstruction of the unlabeled code:
L_ft(θ) = - Σ_{i=1}^{C} log P(y_i | x_i; θ) - Σ_{i=1}^{C'} log P(y_i' | y_i'; θ).
5. the code generation method based on natural semantic understanding according to claim 1, wherein the step S3 employs:
step S3.1: performing word segmentation and vectorization processing on an input natural language sequence to obtain a word vector;
step S3.2: inputting the word vector into the trained BERT encoder to extract semantic features to obtain the deep semantics of the natural language;
step S3.3: and the trained Transformer decoder performs cycle reasoning according to the extracted depth semantics of the natural language, obtains the occurrence probability of each word in the word list at each time step, and obtains a code sequence with the maximum occurrence probability as a target code through code search.
6. A code generation system based on natural semantic understanding, comprising:
a module M1: acquiring a natural language sequence, and preprocessing the acquired natural language sequence to obtain a preprocessed natural language sequence;
a module M2: constructing a code generation model, and training the constructed code generation model to obtain a trained code generation model;
a module M3: generating a target code by the trained code generation model according to the input natural language sequence;
the code generation model is based on a neural network structure of a coder-decoder, and automatic generation from a natural language sequence to target codes is realized.
7. The natural semantic understanding-based code generation system according to claim 6, wherein the module M1 adopts: and acquiring a natural language sequence, and performing word segmentation operation and vectorization processing on the natural language sequence in sequence by using a word list and a word segmentation device of the BERT to obtain a word vector.
8. The code generation system based on natural semantic understanding according to claim 6, wherein the coder in the code generation model adopts a BERT model to form a BERT coder;
a decoder in the code generation model adopts a plurality of layers of transformers to form a Transformer decoder;
the BERT encoder performs semantic feature extraction on input word vectors through a multi-head self-attention mechanism and a feedforward neural network to obtain deep semantic representation of natural language;
the Transformer decoder performs reasoning circularly through a multi-head self-attention mechanism with a mask, a cross-attention mechanism and a feedforward neural network according to the depth semantic representation of the natural language extracted by the BERT encoder to obtain the decoder output.
9. A natural semantic understanding-based code generation system according to claim 6, wherein the module M2 employs:
module M2.1: expanding a training set based on the inaccurate data of the label, and performing bidirectional pre-training on the code generation model to be trained by utilizing the expanded training set;
the annotated inaccurate data is represented as:
D_N = {(x_i, y_i)}, i = 1, …, N
wherein x_i represents a natural language annotation; y_i represents a target code; N represents the total number of inaccurately labeled samples;
The training set is expanded based on the inaccurately labeled data by swapping the input and the output of each data pair:
D_pre = {(x_i, y_i)} ∪ {(y_i, x_i)}, i = 1, …, N
The two-way (bidirectional) pre-training loss function adopts a negative log-likelihood (cross-entropy) form over the expanded training set:
L_pre(θ) = - Σ_{(x, y) ∈ D_pre} log P(y | x; θ)
wherein P(y | x; θ) is the probability assigned to the target sequence y by F(x; θ), and F(x; θ) represents the probability output obtained after the input x passes through the encoder, the decoder and the softmax operation of the code generation model;
module M2.2: carrying out self-coding fine adjustment on the code generation model subjected to bidirectional pre-training by using the label-free data and the label accurate data to obtain a trained code generation model;
the annotated accurate data is represented as:
D_C = {(x_i, y_i)}, i = 1, …, C
wherein x_i represents a natural language annotation, y_i represents a target code, and C represents the total number of accurately labeled samples;
The label-free (unlabeled) data is represented as:
D_C' = {y_i'}, i = 1, …, C'
wherein y_i' is unlabeled code data and C' is the total number of unlabeled samples;
The loss function of the fine-tuning stage combines supervised generation on the accurately labeled pairs with self-encoding reconstruction of the unlabeled code:
L_ft(θ) = - Σ_{i=1}^{C} log P(y_i | x_i; θ) - Σ_{i=1}^{C'} log P(y_i' | y_i'; θ).
10. a natural semantic understanding-based code generation system according to claim 6, wherein the module M3 employs:
module M3.1: performing word segmentation and vectorization processing on an input natural language sequence to obtain a word vector;
module M3.2: inputting the word vector into the trained BERT encoder to extract semantic features to obtain the deep semantics of the natural language;
module M3.3: and the trained Transformer decoder performs cycle reasoning according to the extracted depth semantics of the natural language, obtains the occurrence probability of each word in the word list at each time step, and obtains a code sequence with the maximum occurrence probability as a target code through code search.
CN202210886402.0A 2022-07-26 2022-07-26 Code generation method and system based on natural semantic understanding Pending CN115202640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210886402.0A CN115202640A (en) 2022-07-26 2022-07-26 Code generation method and system based on natural semantic understanding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210886402.0A CN115202640A (en) 2022-07-26 2022-07-26 Code generation method and system based on natural semantic understanding

Publications (1)

Publication Number Publication Date
CN115202640A true CN115202640A (en) 2022-10-18

Family

ID=83583506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210886402.0A Pending CN115202640A (en) 2022-07-26 2022-07-26 Code generation method and system based on natural semantic understanding

Country Status (1)

Country Link
CN (1) CN115202640A (en)


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116149631A (en) * 2023-01-05 2023-05-23 三峡高科信息技术有限责任公司 Method for generating Web intelligent form based on natural language
CN116149631B (en) * 2023-01-05 2023-10-03 三峡高科信息技术有限责任公司 Method for generating Web intelligent form based on natural language
CN116364195B (en) * 2023-05-10 2023-10-13 浙大城市学院 Pre-training model-based microorganism genetic sequence phenotype prediction method
CN116364195A (en) * 2023-05-10 2023-06-30 浙大城市学院 Pre-training model-based microorganism genetic sequence phenotype prediction method
CN116501306A (en) * 2023-06-29 2023-07-28 深圳市银云信息技术有限公司 Method for generating interface document code based on natural language description
CN116501306B (en) * 2023-06-29 2024-03-26 深圳市银云信息技术有限公司 Method for generating interface document code based on natural language description
CN116719514A (en) * 2023-08-08 2023-09-08 安徽思高智能科技有限公司 Automatic RPA code generation method and device based on BERT
CN116719514B (en) * 2023-08-08 2023-10-20 安徽思高智能科技有限公司 Automatic RPA code generation method and device based on BERT
CN116820429B (en) * 2023-08-28 2023-11-17 腾讯科技(深圳)有限公司 Training method and device of code processing model, electronic equipment and storage medium
CN116820429A (en) * 2023-08-28 2023-09-29 腾讯科技(深圳)有限公司 Training method and device of code processing model, electronic equipment and storage medium
CN117093196A (en) * 2023-09-04 2023-11-21 广东工业大学 Knowledge graph-based programming language generation method and system
CN117093196B (en) * 2023-09-04 2024-03-01 广东工业大学 Knowledge graph-based programming language generation method and system
CN117492736A (en) * 2023-10-31 2024-02-02 慧之安信息技术股份有限公司 Low-code platform construction method and system based on large model
CN117576248A (en) * 2024-01-17 2024-02-20 腾讯科技(深圳)有限公司 Image generation method and device based on gesture guidance
CN117576248B (en) * 2024-01-17 2024-05-24 腾讯科技(深圳)有限公司 Image generation method and device based on gesture guidance

Similar Documents

Publication Publication Date Title
CN115202640A (en) Code generation method and system based on natural semantic understanding
CN113177124B (en) Method and system for constructing knowledge graph in vertical field
CN109543181B (en) Named entity model and system based on combination of active learning and deep learning
CN107632981B (en) Neural machine translation method introducing source language chunk information coding
CN110489102B (en) Method for automatically generating Python code from natural language
CN110442880B (en) Translation method, device and storage medium for machine translation
US20220129450A1 (en) System and method for transferable natural language interface
CN107526717B (en) Method for automatically generating natural language text by structured process model
CN115238029A (en) Construction method and device of power failure knowledge graph
CN116089576A (en) Pre-training model-based fully-generated knowledge question-answer pair generation method
CN115935957A (en) Sentence grammar error correction method and system based on syntactic analysis
CN115238115A (en) Image retrieval method, device and equipment based on Chinese data and storage medium
CN113076718B (en) Commodity attribute extraction method and system
CN117292146A (en) Industrial scene-oriented method, system and application method for constructing multi-mode large language model
Anju et al. Malayalam to English machine translation: An EBMT system
CN112148879B (en) Computer readable storage medium for automatically labeling code with data structure
CN116955644A (en) Knowledge fusion method, system and storage medium based on knowledge graph
CN112036179A (en) Electric power plan information extraction method based on text classification and semantic framework
CN114757181B (en) Method and device for training and extracting event of end-to-end event extraction model based on prior knowledge
CN114528459A (en) Semantic-based webpage information extraction method and system
CN114638227A (en) Named entity identification method, device and storage medium
Jiang et al. A Structure and Content Prompt-based Method for Knowledge Graph Question Answering over Scholarly Data
CN116595992B (en) Single-step extraction method for terms and types of binary groups and model thereof
CN117252201B (en) Knowledge-graph-oriented discrete manufacturing industry process data extraction method and system
CN112651243B (en) Abbreviated project name identification method based on integrated structured entity information and electronic device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination