CN115202640A - Code generation method and system based on natural semantic understanding - Google Patents


Info

Publication number
CN115202640A
Authority
CN
China
Prior art keywords: code generation, natural language, code, generation model, data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210886402.0A
Other languages
Chinese (zh)
Inventor
乐心怡
王骥泽
陈彩莲
关新平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202210886402.0A priority Critical patent/CN115202640A/en
Publication of CN115202640A publication Critical patent/CN115202640A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/30 Creation or generation of source code
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/42 Syntactic analysis
    • G06F 8/423 Preprocessors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/42 Syntactic analysis
    • G06F 8/425 Lexical analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/42 Syntactic analysis
    • G06F 8/427 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/43 Checking; Contextual analysis
    • G06F 8/436 Semantic checking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a code generation method and system based on natural semantic understanding, which comprises the following steps: step S1: acquiring a natural language sequence, and preprocessing the acquired natural language sequence to obtain a preprocessed natural language sequence; step S2: constructing a code generation model, and training the constructed code generation model to obtain a trained code generation model; and step S3: generating a target code by the trained code generation model according to the input natural language sequence. The code generation model is based on an encoder-decoder neural network structure and realizes automatic generation of the target code from the natural language sequence.

Description

Code generation method and system based on natural semantic understanding
Technical Field
The invention relates to the technical field of natural language processing, in particular to a code generation method and system based on natural semantic understanding.
Background
Semantic parsing is a class of tasks in the field of natural language processing that studies how to convert a given natural language description into a logical representation that a computer can understand and execute. The traditional approach is to design fixed templates according to the characteristics of the programming language and then parse the natural language description into instances of those templates by pattern matching. With the development of deep learning, encoder-decoder frameworks have also been introduced into semantic parsing, for example by directly translating the natural language description into a programming language sequence with machine translation methods, or by introducing the syntax of the programming language during code generation, first generating the abstract syntax tree of the program and then converting the abstract syntax tree into program code. However, these methods have drawbacks: the direct machine translation approach places high demands on the scale of accurately labeled data, while the abstract-syntax-tree approach requires more manual design and fails to introduce knowledge from the natural language domain.
Yin P, Neubig G. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation [J]. arXiv preprint arXiv:1810.02720, 2018.
Lewis M, Liu Y, Goyal N, et al. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension [J]. arXiv preprint arXiv:1910.13461, 2019.
Norouzi S, Tang K, Cao Y. Code Generation from Natural Language with Less Prior Knowledge and More Monolingual Data [C] // Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 2021: 776-785.
The above prior art is code generation technology based on deep learning, but it has the following disadvantages: designing the neural network model structure around the abstract syntax tree requires additional manual design in data processing and in the model structure, makes external unlabeled data difficult to introduce, and limits the code generation capability; pre-training a BART model from scratch on a large amount of code and corresponding documentation data places high demands on data scale and computing resources, and model training is time-consuming; and performing data augmentation with only the code portion of the inaccurately labeled data in the fine-tuning stage, without the corresponding natural language annotations, is unfavorable for the model to learn the semantic association between natural language and programming language.
Patent document CN110489102B (application No. 201910689490.3) discloses a method for automatically generating Python code from natural language. The method comprises the following steps: step 1: a generator of a GAN network generates an abstract syntax tree of a program segment from the natural language description; step 2: a discriminator of the GAN determines whether the semantics of the abstract syntax tree generated by the generator are consistent with the semantics of the given natural language description; and step 3: the generator and the discriminator of the GAN network are trained together.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a code generation method and system based on natural semantic understanding.
The invention provides a code generation method based on natural semantic understanding, which comprises the following steps:
step S1: acquiring a natural language sequence, and preprocessing the acquired natural language sequence to obtain a preprocessed natural language sequence;
step S2: constructing a code generation model, and training the constructed code generation model to obtain a trained code generation model;
and step S3: the trained code generation model generates a target code according to the input natural language sequence;
the code generation model is based on an encoder-decoder neural network structure and realizes automatic generation of the target code from the natural language sequence.
Preferably, the step S1 employs: and acquiring a natural language sequence, and performing word segmentation operation and vectorization processing on the natural language sequence in sequence by using a word list and a word segmentation device of the BERT to obtain a word vector.
Preferably, the encoder in the code generation model adopts a BERT model to form a BERT encoder;
a decoder in the code generation model adopts a plurality of layers of transformers to form a Transformer decoder;
the BERT encoder performs semantic feature extraction on input word vectors through a multi-head self-attention mechanism and a feedforward neural network to obtain deep semantic representation of natural language;
the Transformer decoder performs reasoning circularly through a multi-head self-attention mechanism with a mask, a cross-attention mechanism and a feedforward neural network according to the depth semantic representation of the natural language extracted by the BERT encoder to obtain the decoder output.
Preferably, the step S2 employs:
step S2.1: expanding a training set based on the data with inaccurate labeling, and performing bidirectional pre-training on a code generation model to be trained by using the expanded training set;
the annotated inaccurate data is represented as:
D_N = {(x_i, y_i)}, i = 1, …, N
wherein x_i represents a natural language annotation; y_i represents a target code; N represents the total number of inaccurately labeled samples;
The training set is expanded based on the inaccurately labeled data by swapping the input and the output of each data pair:
D_pre = {(x_i, y_i)} ∪ {(y_i, x_i)}, i = 1, …, N
The two-way (bidirectional) pre-training loss function adopts a negative log-likelihood (cross-entropy) form over the expanded training set:
L_pre(θ) = - Σ_{(x, y) ∈ D_pre} log P(y | x; θ)
wherein P(y | x; θ) is the probability assigned to the target sequence y by F(x; θ), and F(x; θ) represents the probability output obtained after the input x passes through the encoder, the decoder and the softmax operation of the code generation model;
step S2.2: carrying out self-coding fine adjustment on the code generation model subjected to bidirectional pre-training by using label-free data and label accurate data to obtain a trained code generation model;
the annotation accurate data is represented as:
D_C = {(x_i, y_i)}, i = 1, …, C
wherein x_i represents a natural language annotation, y_i represents a target code, and C represents the total number of accurately labeled samples;
The label-free (unlabeled) data is represented as:
D_C' = {y_i'}, i = 1, …, C'
wherein y_i' is unlabeled code data and C' is the total number of unlabeled samples;
The loss function of the fine-tuning stage combines supervised generation on the accurately labeled pairs with self-encoding reconstruction of the unlabeled code:
L_ft(θ) = - Σ_{i=1}^{C} log P(y_i | x_i; θ) - Σ_{i=1}^{C'} log P(y_i' | y_i'; θ)
preferably, the step S3 employs:
step S3.1: performing word segmentation and vectorization processing on an input natural language sequence to obtain a word vector;
step S3.2: inputting the word vector into the trained BERT encoder to extract semantic features, obtaining the deep semantics of the natural language;
step S3.3: the trained Transformer decoder performs iterative inference according to the extracted deep semantics of the natural language, obtains the occurrence probability of each word in the word list at each time step, and obtains the code sequence with the maximum occurrence probability as the target code through code search.
According to the invention, the code generation system based on natural semantic understanding comprises:
a module M1: acquiring a natural language sequence, and preprocessing the acquired natural language sequence to obtain a preprocessed natural language sequence;
a module M2: constructing a code generation model, and training the constructed code generation model to obtain a trained code generation model;
a module M3: the trained code generation model generates a target code according to the input natural language sequence;
the code generation model is based on an encoder-decoder neural network structure and realizes automatic generation of the target code from the natural language sequence.
Preferably, the module M1 employs: and acquiring a natural language sequence, and performing word segmentation operation and vectorization processing on the natural language sequence in sequence by using a word list and a word segmentation device of the BERT to obtain a word vector.
Preferably, the encoder in the code generation model adopts a BERT model to form a BERT encoder;
a decoder in the code generation model adopts a plurality of layers of transformers to form a Transformer decoder;
the BERT encoder performs semantic feature extraction on input word vectors through a multi-head self-attention mechanism and a feedforward neural network to obtain deep semantic representation of natural language;
the Transformer decoder performs reasoning circularly through a multi-head self-attention mechanism with a mask, a cross-attention mechanism and a feedforward neural network according to the depth semantic representation of the natural language extracted by the BERT encoder to obtain the decoder output.
Preferably, the module M2 employs:
module M2.1: expanding a training set based on the inaccurate data of the label, and performing bidirectional pre-training on the code generation model to be trained by utilizing the expanded training set;
the annotated inaccuracy data is represented as:
D_N = {(x_i, y_i)}, i = 1, …, N
wherein x_i represents a natural language annotation; y_i represents a target code; N represents the total number of inaccurately labeled samples;
The training set is expanded based on the inaccurately labeled data by swapping the input and the output of each data pair:
D_pre = {(x_i, y_i)} ∪ {(y_i, x_i)}, i = 1, …, N
The two-way (bidirectional) pre-training loss function adopts a negative log-likelihood (cross-entropy) form over the expanded training set:
L_pre(θ) = - Σ_{(x, y) ∈ D_pre} log P(y | x; θ)
wherein P(y | x; θ) is the probability assigned to the target sequence y by F(x; θ), and F(x; θ) represents the probability output obtained after the input x passes through the encoder, the decoder and the softmax operation of the code generation model;
module M2.2: carrying out self-coding fine adjustment on the code generation model subjected to bidirectional pre-training by using label-free data and label accurate data to obtain a trained code generation model;
the annotated accurate data is represented as:
D_C = {(x_i, y_i)}, i = 1, …, C
wherein x_i represents a natural language annotation, y_i represents a target code, and C represents the total number of accurately labeled samples;
The label-free (unlabeled) data is represented as:
D_C' = {y_i'}, i = 1, …, C'
wherein y_i' is unlabeled code data and C' is the total number of unlabeled samples;
The loss function of the fine-tuning stage combines supervised generation on the accurately labeled pairs with self-encoding reconstruction of the unlabeled code:
L_ft(θ) = - Σ_{i=1}^{C} log P(y_i | x_i; θ) - Σ_{i=1}^{C'} log P(y_i' | y_i'; θ)
preferably, the module M3 employs:
module M3.1: performing word segmentation and vectorization processing on an input natural language sequence to obtain a word vector;
module M3.2: inputting the word vector into the trained BERT encoder to extract semantic features, obtaining the deep semantics of the natural language;
module M3.3: the trained Transformer decoder performs iterative inference according to the extracted deep semantics of the natural language, obtains the occurrence probability of each word in the word list at each time step, and obtains the code sequence with the maximum occurrence probability as the target code through code search.
Compared with the prior art, the invention has the following beneficial effects:
1. by adopting the neural network model structure of the encoder-decoder, the problem of code generation based on natural semantic understanding is solved, the software development efficiency is improved, and the labor cost is reduced;
2. the invention extracts the deep semantic representation of the natural language by adopting the pre-training model BERT as the coder, thereby solving the problem of introduction of natural language field knowledge in the code generation task.
3. The invention adopts a model training method based on a pre-training-fine-tuning paradigm, fully utilizes the supervision information in the limited data, and solves the problems of limited and expensive labeled accurate data in the field of code generation.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 is a block diagram of the encoder-decoder network based on the BERT model.
Fig. 2 is a flow diagram of model training.
Fig. 3 is a schematic diagram of bidirectional pre-training.
Fig. 4 is a schematic diagram of self-encoding fine-tuning.
Fig. 5 is a diagram of the decoder inference process.
Fig. 6 is a schematic diagram of the iterative (cyclic) inference process.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the concept of the invention; all such variations and modifications fall within the scope of the invention.
Example 1
According to the code generation method based on natural semantic understanding provided by the invention, as shown in fig. 1-6, the method comprises the following steps:
step S1: acquiring a natural language sequence, and preprocessing the acquired natural language sequence to obtain a preprocessed natural language sequence;
specifically, the step S1 employs: and acquiring a natural language sequence, and performing word segmentation operation and vectorization processing on the natural language sequence in sequence by using a word list and a word segmentation device of the BERT to obtain a word vector.
Step S2: constructing a code generation model, and training the constructed code generation model to obtain a trained code generation model;
and step S3: the trained code generation model generates a target code according to the input natural language sequence;
the code generation model is based on an encoder-decoder neural network structure and realizes automatic generation of the target code from the natural language sequence.
Specifically, an encoder in the code generation model adopts a BERT model to form a BERT encoder;
a decoder in the code generation model adopts a plurality of layers of transformers to form a Transformer decoder;
the BERT encoder performs semantic feature extraction on input word vectors through a multi-head self-attention mechanism and a feedforward neural network to obtain deep semantic representation of natural language;
the Transformer decoder performs reasoning circularly through a multi-head self-attention mechanism with a mask, a cross-attention mechanism and a feedforward neural network according to the depth semantic representation of the natural language extracted by the BERT encoder to obtain the decoder output.
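The following is a minimal sketch of such an encoder-decoder model in PyTorch; the number of decoder layers, attention heads and code vocabulary size are illustrative assumptions rather than values given in the patent.

    # Sketch of the code generation model: a pretrained BERT encoder plus a
    # multi-layer Transformer decoder with masked self-attention and cross-attention.
    import torch
    import torch.nn as nn
    from transformers import BertModel

    class CodeGenerator(nn.Module):
        def __init__(self, code_vocab_size, d_model=768, n_layers=6, n_heads=8):
            super().__init__()
            self.encoder = BertModel.from_pretrained("bert-base-uncased")
            layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
            self.code_embedding = nn.Embedding(code_vocab_size, d_model)
            self.output_proj = nn.Linear(d_model, code_vocab_size)  # softmax applied outside

        def forward(self, input_ids, attention_mask, code_ids):
            # Deep semantic representation of the natural language input
            memory = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
            tgt = self.code_embedding(code_ids)
            # Mask so each position only attends to earlier code tokens
            tgt_mask = nn.Transformer.generate_square_subsequent_mask(code_ids.size(1))
            out = self.decoder(tgt, memory, tgt_mask=tgt_mask)
            return self.output_proj(out)  # logits over the code word list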
Specifically, the step S2 employs:
step S2.1: expanding a training set based on the inaccurate data of the label, and performing bidirectional pre-training on the code generation model to be trained by utilizing the expanded training set;
specifically, the step S2.1 employs:
the annotated inaccurate data is represented as:
D_N = {(x_i, y_i)}, i = 1, …, N
wherein x_i represents a natural language annotation; y_i represents a target code; N represents the total number of inaccurately labeled samples;
The training set is expanded based on the inaccurately labeled data by swapping the input and the output of each data pair:
D_pre = {(x_i, y_i)} ∪ {(y_i, x_i)}, i = 1, …, N
The two-way (bidirectional) pre-training loss function adopts a negative log-likelihood (cross-entropy) form over the expanded training set:
L_pre(θ) = - Σ_{(x, y) ∈ D_pre} log P(y | x; θ)
wherein P(y | x; θ) is the probability assigned to the target sequence y by F(x; θ), F(x; θ) represents the probability output obtained after the input x passes through the encoder, the decoder and the softmax operation of the code generation model, and the softmax operation converts the decoder output into probabilities;
Specifically, the softmax operation over a decoder output vector z is:
softmax(z)_j = exp(z_j) / Σ_{k=1}^{S} exp(z_k), j = 1, …, S
wherein S is the length of the word list.
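The bidirectional pre-training step can be sketched as follows; the cross-entropy training loop and the use of a single shared tokenizer for natural language and code are simplifying assumptions consistent with the description above.

    # Sketch of step S2.1: expand the noisy training set by swapping the input and
    # output of each "natural language - code" pair, then train with cross-entropy.
    import torch
    import torch.nn.functional as F

    def expand_bidirectional(noisy_pairs):
        # noisy_pairs: list of (natural_language, code) tuples crawled from the web
        return noisy_pairs + [(code, text) for text, code in noisy_pairs]

    def pretrain_step(model, tokenizer, src_text, tgt_text, optimizer):
        src = tokenizer(src_text, return_tensors="pt")
        tgt = tokenizer(tgt_text, return_tensors="pt")["input_ids"]
        logits = model(src["input_ids"], src["attention_mask"], tgt[:, :-1])
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tgt[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()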
Step S2.2: and carrying out self-coding fine adjustment on the code generation model subjected to bidirectional pre-training by using the label-free data and the label accurate data to obtain the trained code generation model.
Specifically, the step S2.2 employs:
the annotated accurate data is represented as:
D_C = {(x_i, y_i)}, i = 1, …, C
wherein x_i represents a natural language annotation, y_i represents a target code, and C represents the total number of accurately labeled samples;
The label-free (unlabeled) data is represented as:
D_C' = {y_i'}, i = 1, …, C'
wherein y_i' is unlabeled code data and C' is the total number of unlabeled samples;
The loss function of the fine-tuning stage combines supervised generation on the accurately labeled pairs with self-encoding reconstruction of the unlabeled code:
L_ft(θ) = - Σ_{i=1}^{C} log P(y_i | x_i; θ) - Σ_{i=1}^{C'} log P(y_i' | y_i'; θ)
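A sketch of the fine-tuning loss follows; treating the self-encoding term as code-to-code reconstruction of the unlabeled data is an assumption based on the description above.

    # Sketch of step S2.2: supervised loss on accurately labeled pairs plus a
    # self-encoding loss in which unlabeled code is reconstructed from itself.
    import torch
    import torch.nn.functional as F

    def sequence_nll(model, tokenizer, src_text, tgt_text):
        # Negative log-likelihood of tgt_text given src_text under the model
        src = tokenizer(src_text, return_tensors="pt")
        tgt = tokenizer(tgt_text, return_tensors="pt")["input_ids"]
        logits = model(src["input_ids"], src["attention_mask"], tgt[:, :-1])
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tgt[:, 1:].reshape(-1), reduction="sum")

    def finetune_loss(model, tokenizer, labeled_pairs, unlabeled_code):
        loss = sum(sequence_nll(model, tokenizer, text, code)
                   for text, code in labeled_pairs)        # accurately labeled data
        loss = loss + sum(sequence_nll(model, tokenizer, code, code)
                          for code in unlabeled_code)      # self-encoding on unlabeled code
        return loss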
specifically, the step S3 employs:
step S3.1: performing word segmentation and vectorization processing on an input natural language sequence to obtain a word vector;
step S3.2: inputting the word vector into the trained BERT encoder to extract semantic features, obtaining the deep semantics of the natural language;
step S3.3: the trained Transformer decoder performs iterative inference according to the extracted deep semantics of the natural language, obtains the occurrence probability of each word in the word list at each time step, and obtains the code sequence with the maximum occurrence probability as the target code through code search. The code search method is a greedy search or a beam search. A concrete example is as follows:
input natural language sequence: "comment is an empty list"
output target code: "comment = [ ]"
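For example, a greedy decoding loop over the trained model could be sketched as follows; the use of the BERT [CLS]/[SEP] token ids as start and end markers is an illustrative assumption.

    # Sketch of step S3.3 with greedy search: at each time step take the most
    # probable word from the decoder output until an end token is produced.
    import torch

    @torch.no_grad()
    def greedy_decode(model, tokenizer, text, max_len=64):
        src = tokenizer(text, return_tensors="pt")
        generated = [tokenizer.cls_token_id]                # start-of-sequence marker
        for _ in range(max_len):
            tgt = torch.tensor([generated])
            logits = model(src["input_ids"], src["attention_mask"], tgt)
            next_id = int(logits[0, -1].argmax())           # highest-probability word
            if next_id == tokenizer.sep_token_id:           # end-of-sequence marker
                break
            generated.append(next_id)
        return tokenizer.decode(generated[1:])

    # e.g. greedy_decode(model, tokenizer, "comment is an empty list") would be
    # expected, for a trained model, to return code such as "comment = [ ]"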
According to the code generation system based on natural semantic understanding provided by the invention, as shown in fig. 1-6, the code generation system comprises:
a module M1: acquiring a natural language sequence, and preprocessing the acquired natural language sequence to obtain a preprocessed natural language sequence;
specifically, the module M1 employs: and acquiring a natural language sequence, and performing word segmentation operation and vectorization processing on the natural language sequence in sequence by using a word list and a word segmentation device of the BERT to obtain a word vector.
A module M2: constructing a code generation model, and training the constructed code generation model to obtain a trained code generation model;
a module M3: the trained code generation model generates a target code according to the input natural language sequence;
the code generation model is based on an encoder-decoder neural network structure and realizes automatic generation of the target code from the natural language sequence.
Specifically, an encoder in the code generation model adopts a BERT model to form a BERT encoder;
a decoder in the code generation model adopts a plurality of layers of transformers to form a Transformer decoder;
the BERT encoder performs semantic feature extraction on input word vectors through a multi-head self-attention mechanism and a feedforward neural network to obtain deep semantic representation of natural language;
the Transformer decoder performs inference iteratively through a multi-head self-attention mechanism with a mask, a cross-attention mechanism and a feedforward neural network according to the deep semantic representation of the natural language extracted by the BERT encoder to obtain the decoder output.
Specifically, the module M2 employs:
module M2.1: expanding a training set based on the inaccurate data of the label, and performing bidirectional pre-training on the code generation model to be trained by utilizing the expanded training set;
specifically, the module M2.1 employs:
the annotated inaccuracy data is represented as:
D_N = {(x_i, y_i)}, i = 1, …, N
wherein x_i represents a natural language annotation; y_i represents a target code; N represents the total number of inaccurately labeled samples;
The training set is expanded based on the inaccurately labeled data by swapping the input and the output of each data pair:
D_pre = {(x_i, y_i)} ∪ {(y_i, x_i)}, i = 1, …, N
The two-way (bidirectional) pre-training loss function adopts a negative log-likelihood (cross-entropy) form over the expanded training set:
L_pre(θ) = - Σ_{(x, y) ∈ D_pre} log P(y | x; θ)
wherein P(y | x; θ) is the probability assigned to the target sequence y by F(x; θ), F(x; θ) represents the probability output obtained after the input x passes through the encoder, the decoder and the softmax operation of the code generation model, and the softmax operation converts the decoder output into probabilities;
Specifically, the softmax operation over a decoder output vector z is:
softmax(z)_j = exp(z_j) / Σ_{k=1}^{S} exp(z_k), j = 1, …, S
wherein S is the length of the word list.
Module M2.2: and carrying out self-coding fine adjustment on the code generation model subjected to bidirectional pre-training by using the label-free data and the label accurate data to obtain the trained code generation model.
In particular, the module M2.2 employs:
the annotated accurate data is represented as:
D_C = {(x_i, y_i)}, i = 1, …, C
wherein x_i represents a natural language annotation, y_i represents a target code, and C represents the total number of accurately labeled samples;
The label-free (unlabeled) data is represented as:
D_C' = {y_i'}, i = 1, …, C'
wherein y_i' is unlabeled code data and C' is the total number of unlabeled samples;
The loss function of the fine-tuning stage combines supervised generation on the accurately labeled pairs with self-encoding reconstruction of the unlabeled code:
L_ft(θ) = - Σ_{i=1}^{C} log P(y_i | x_i; θ) - Σ_{i=1}^{C'} log P(y_i' | y_i'; θ)
specifically, the module M3 employs:
module M3.1: performing word segmentation and vectorization processing on an input natural language sequence to obtain a word vector;
module M3.2: inputting the word vector into the trained BERT encoder to extract semantic features, obtaining the deep semantics of the natural language;
module M3.3: the trained Transformer decoder performs iterative inference according to the extracted deep semantics of the natural language, obtains the occurrence probability of each word in the word list at each time step, and obtains the code sequence with the maximum occurrence probability as the target code through code search. The code search method is a greedy search or a beam search. A concrete example is as follows:
input natural language sequence: "comment is an empty list"
output target code: "comment = [ ]"
Example 2
Example 2 is a preferred implementation of Example 1.
The invention provides a code generation method based on natural semantic understanding, which realizes automatic generation from natural language requirement description to target codes, improves software development efficiency and reduces labor cost. The deep semantic representation of the natural language is extracted through the BERT encoder, and the problem of introduction of domain knowledge is solved. A novel model training method based on a pre-training-fine-tuning paradigm is provided, supervision information in limited data is fully utilized, requirements for data scale and computing resources are reduced, and training time is shortened.
The invention provides a code generation system based on natural semantic understanding, which comprises:
a data processing module: the method comprises the steps of inputting natural language requirement description as a code generation model, outputting corresponding codes as the code generation model, and adopting a training data form as a 'natural language-program language' data pair. The training data is divided into two parts, one part is accurate marking data, code segments are selected and manually marked according to natural language requirements, and therefore marking accuracy is guaranteed; the other part is the inaccurate data of the label. A small amount of accurately labeled data and a large amount of inaccurately labeled data jointly form training data, so that the data volume is sufficient, and the acquisition cost is not too high. The inaccurate data of the label crawls codes and corresponding labels through the Internet so as to reduce the labor cost; replacing codes by crawling the unmarked data through the Internet; and the accurate data is marked manually by crawling codes through the Internet.
A network structure module: to handle the problem of unequal length of input and output sequences, the code generation model employs an overall structure of encoder-decoder, where the encoder employs a pre-trained BERT model and the decoder employs multiple layers of transformers. The input of the whole code generation model is a natural language sequence, and in order to enable the code generation model to understand the input sequence, word segmentation operation needs to be carried out on the input sequence firstly, and then vectorization is carried out, and a series of obtained word vectors are used as the input of the encoder.
The BERT encoder performs semantic feature extraction on the input word vectors through a multi-head self-attention mechanism and a feedforward neural network to obtain a deep semantic representation of the natural language. The Transformer decoder performs inference iteratively through a multi-head self-attention mechanism with a mask, a cross-attention mechanism and a feedforward neural network according to the deep semantic representation of the natural language extracted by the encoder to obtain the decoder output. This yields the code generation model to be trained.
A training module: firstly, bidirectional pre-training is carried out on the code generation model to be trained by using the inaccurately labeled data, so that the code generation model can better understand the relationship between natural language and programming language. Because the deep learning model itself maps inputs to outputs in only one direction, the bidirectional pre-training method exchanges the input and output of each "natural language-programming language" data pair, so that the model learns the conversion in both directions simultaneously.
The annotated inaccuracy data in the training data may be expressed as:
D_N = {(x_i, y_i)}, i = 1, …, N
wherein x_i is a natural language annotation, y_i is the target code, and N is the total number of inaccurately labeled samples.
In bidirectional pre-training, we extend the training set by swapping the input and the output of each pair:
D_pre = {(x_i, y_i)} ∪ {(y_i, x_i)}, i = 1, …, N
The loss function of the two-way pre-training phase then takes a negative log-likelihood (cross-entropy) form over the extended training set:
L_pre(θ) = - Σ_{(x, y) ∈ D_pre} log P(y | x; θ)
wherein P(y | x; θ) is the probability assigned to the target sequence y by F(x; θ), and F(x; θ) represents the probability output obtained after the input x passes through the encoder, the decoder and the softmax operation of the model.
Then, self-encoding fine-tuning is carried out on the bidirectionally pre-trained model by using the unlabeled data and the accurately labeled data, wherein the code part of the inaccurately labeled data can be directly used as unlabeled data.
The annotated accuracy data used for fine tuning can be expressed as:
D_C = {(x_i, y_i)}, i = 1, …, C
wherein x_i is a natural language annotation, y_i is the target code, and C is the total number of accurately labeled samples.
The label-free (unlabeled) data can be expressed as:
D_C' = {y_i'}, i = 1, …, C'
wherein y_i' is unlabeled code data and C' is the total number of unlabeled samples.
The loss function of the fine-tuning phase then combines supervised generation on the accurately labeled pairs with self-encoding reconstruction of the unlabeled code:
L_ft(θ) = - Σ_{i=1}^{C} log P(y_i | x_i; θ) - Σ_{i=1}^{C'} log P(y_i' | y_i'; θ)
and obtaining a trained code generation model through a pre-training stage and a fine-tuning stage.
An inference module: the trained code generation model is used in the code generation task based on natural semantic understanding. Given a natural language requirement description, the natural language sequence is segmented and vectorized, the obtained word vectors are input into the BERT encoder, the BERT encoder performs semantic feature extraction on the input word vectors, and the Transformer decoder performs iterative inference according to the extracted semantic features to obtain the decoder output. The decoder output gives, at each time step, the probability of each word in the word list; greedy search or beam search over this output yields the sequence with the maximum occurrence probability, which is converted into the corresponding word sequence according to the word list, finally giving the target code.
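As an alternative to greedy search, a simple beam search over the same model interface can be sketched as follows; the beam width and the [CLS]/[SEP] start and end markers are illustrative assumptions.

    # Sketch of beam search: keep the beam_width most probable partial code
    # sequences at each time step and return the overall most probable one.
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def beam_search(model, tokenizer, text, beam_width=3, max_len=64):
        src = tokenizer(text, return_tensors="pt")
        beams = [([tokenizer.cls_token_id], 0.0)]           # (token ids, log-probability)
        for _ in range(max_len):
            candidates = []
            for ids, score in beams:
                if ids[-1] == tokenizer.sep_token_id:        # finished hypothesis
                    candidates.append((ids, score))
                    continue
                logits = model(src["input_ids"], src["attention_mask"],
                               torch.tensor([ids]))
                log_probs = F.log_softmax(logits[0, -1], dim=-1)
                top = torch.topk(log_probs, beam_width)
                for lp, idx in zip(top.values, top.indices):
                    candidates.append((ids + [int(idx)], score + float(lp)))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
            if all(ids[-1] == tokenizer.sep_token_id for ids, _ in beams):
                break
        best_ids, _ = beams[0]
        if best_ids[-1] == tokenizer.sep_token_id:
            best_ids = best_ids[:-1]                         # drop end marker
        return tokenizer.decode(best_ids[1:])                # drop start marker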
According to the invention, the domain knowledge of natural language and programming language can be respectively introduced through the BERT model and external data, so that more manual design is avoided; the model is trained by a training method based on a pre-training-fine-tuning paradigm, limited marking data are fully utilized, and the BERT model is pre-trained on a large English corpus, so that the requirements for data scale and computing resources are reduced, and the model training time is shortened. Meanwhile, the model is subjected to bidirectional pre-training before the fine-tuning stage, and the supervision information in the inaccurate data of the label is fully utilized, so that the model learns the semantic association between the natural language and the program language.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A code generation method based on natural semantic understanding is characterized by comprising the following steps:
step S1: acquiring a natural language sequence, and preprocessing the acquired natural language sequence to obtain a preprocessed natural language sequence;
step S2: constructing a code generation model, and training the constructed code generation model to obtain a trained code generation model;
and step S3: the trained code generation model generates a target code according to an input natural language sequence;
the code generation model is based on a neural network structure of a coder-decoder, and realizes automatic generation from a natural language sequence to target codes.
2. The code generation method based on natural semantic understanding according to claim 1, wherein the step S1 employs: and acquiring a natural language sequence, and performing word segmentation operation and vectorization processing on the natural language sequence in sequence by using a word list and a word segmentation device of the BERT to obtain a word vector.
3. The code generation method based on natural semantic understanding according to claim 1, wherein the encoder in the code generation model adopts a BERT model to form a BERT encoder;
a decoder in the code generation model adopts a plurality of layers of transformers to form a Transformer decoder;
the BERT encoder performs semantic feature extraction on input word vectors through a multi-head self-attention mechanism and a feedforward neural network to obtain deep semantic representation of natural language;
the Transformer decoder performs reasoning circularly through a multi-head self-attention mechanism with a mask, a cross-attention mechanism and a feedforward neural network according to the depth semantic representation of the natural language extracted by the BERT encoder to obtain the decoder output.
4. The code generation method based on natural semantic understanding according to claim 1, wherein the step S2 employs:
step S2.1: expanding a training set based on the data with inaccurate labeling, and performing bidirectional pre-training on a code generation model to be trained by using the expanded training set;
the annotated inaccurate data is represented as:
D_N = {(x_i, y_i)}, i = 1, …, N
wherein x_i represents a natural language annotation; y_i represents a target code; N represents the total number of inaccurately labeled samples;
The training set is expanded based on the inaccurately labeled data by swapping the input and the output of each data pair:
D_pre = {(x_i, y_i)} ∪ {(y_i, x_i)}, i = 1, …, N
The two-way (bidirectional) pre-training loss function adopts a negative log-likelihood (cross-entropy) form over the expanded training set:
L_pre(θ) = - Σ_{(x, y) ∈ D_pre} log P(y | x; θ)
wherein P(y | x; θ) is the probability assigned to the target sequence y by F(x; θ), and F(x; θ) represents the probability output obtained after the input x passes through the encoder, the decoder and the softmax operation of the code generation model;
step S2.2: carrying out self-coding fine adjustment on the code generation model subjected to bidirectional pre-training by using label-free data and label accurate data to obtain a trained code generation model;
the annotated accurate data is represented as:
D_C = {(x_i, y_i)}, i = 1, …, C
wherein x_i represents a natural language annotation, y_i represents a target code, and C represents the total number of accurately labeled samples;
The label-free (unlabeled) data is represented as:
D_C' = {y_i'}, i = 1, …, C'
wherein y_i' is unlabeled code data and C' is the total number of unlabeled samples;
The loss function of the fine-tuning stage combines supervised generation on the accurately labeled pairs with self-encoding reconstruction of the unlabeled code:
L_ft(θ) = - Σ_{i=1}^{C} log P(y_i | x_i; θ) - Σ_{i=1}^{C'} log P(y_i' | y_i'; θ).
5. the code generation method based on natural semantic understanding according to claim 1, wherein the step S3 employs:
step S3.1: performing word segmentation and vectorization processing on an input natural language sequence to obtain a word vector;
step S3.2: inputting the word vector into the trained BERT encoder to extract semantic features to obtain the deep semantics of the natural language;
step S3.3: and the trained Transformer decoder performs cycle reasoning according to the extracted depth semantics of the natural language, obtains the occurrence probability of each word in the word list at each time step, and obtains a code sequence with the maximum occurrence probability as a target code through code search.
6. A code generation system based on natural semantic understanding, comprising:
a module M1: acquiring a natural language sequence, and preprocessing the acquired natural language sequence to obtain a preprocessed natural language sequence;
a module M2: constructing a code generation model, and training the constructed code generation model to obtain a trained code generation model;
a module M3: generating a target code by the trained code generation model according to the input natural language sequence;
the code generation model is based on a neural network structure of a coder-decoder, and automatic generation from a natural language sequence to target codes is realized.
7. The natural semantic understanding-based code generation system according to claim 6, wherein the module M1 adopts: and acquiring a natural language sequence, and performing word segmentation operation and vectorization processing on the natural language sequence in sequence by using a word list and a word segmentation device of the BERT to obtain a word vector.
8. The code generation system based on natural semantic understanding according to claim 6, wherein the coder in the code generation model adopts a BERT model to form a BERT coder;
a decoder in the code generation model adopts a plurality of layers of transformers to form a Transformer decoder;
the BERT encoder performs semantic feature extraction on input word vectors through a multi-head self-attention mechanism and a feedforward neural network to obtain deep semantic representation of natural language;
the Transformer decoder performs reasoning circularly through a multi-head self-attention mechanism with a mask, a cross-attention mechanism and a feedforward neural network according to the depth semantic representation of the natural language extracted by the BERT encoder to obtain the decoder output.
9. A natural semantic understanding-based code generation system according to claim 6, wherein the module M2 employs:
module M2.1: expanding a training set based on the inaccurate data of the label, and performing bidirectional pre-training on the code generation model to be trained by utilizing the expanded training set;
the annotated inaccurate data is represented as:
D_N = {(x_i, y_i)}, i = 1, …, N
wherein x_i represents a natural language annotation; y_i represents a target code; N represents the total number of inaccurately labeled samples;
The training set is expanded based on the inaccurately labeled data by swapping the input and the output of each data pair:
D_pre = {(x_i, y_i)} ∪ {(y_i, x_i)}, i = 1, …, N
The two-way (bidirectional) pre-training loss function adopts a negative log-likelihood (cross-entropy) form over the expanded training set:
L_pre(θ) = - Σ_{(x, y) ∈ D_pre} log P(y | x; θ)
wherein P(y | x; θ) is the probability assigned to the target sequence y by F(x; θ), and F(x; θ) represents the probability output obtained after the input x passes through the encoder, the decoder and the softmax operation of the code generation model;
module M2.2: carrying out self-coding fine adjustment on the code generation model subjected to bidirectional pre-training by using the label-free data and the label accurate data to obtain a trained code generation model;
the annotated accurate data is represented as:
D_C = {(x_i, y_i)}, i = 1, …, C
wherein x_i represents a natural language annotation, y_i represents a target code, and C represents the total number of accurately labeled samples;
The label-free (unlabeled) data is represented as:
D_C' = {y_i'}, i = 1, …, C'
wherein y_i' is unlabeled code data and C' is the total number of unlabeled samples;
The loss function of the fine-tuning stage combines supervised generation on the accurately labeled pairs with self-encoding reconstruction of the unlabeled code:
L_ft(θ) = - Σ_{i=1}^{C} log P(y_i | x_i; θ) - Σ_{i=1}^{C'} log P(y_i' | y_i'; θ).
10. a natural semantic understanding-based code generation system according to claim 6, wherein the module M3 employs:
module M3.1: performing word segmentation and vectorization processing on an input natural language sequence to obtain a word vector;
module M3.2: inputting the word vector into the trained BERT encoder to extract semantic features to obtain the deep semantics of the natural language;
module M3.3: and the trained Transformer decoder performs cycle reasoning according to the extracted depth semantics of the natural language, obtains the occurrence probability of each word in the word list at each time step, and obtains a code sequence with the maximum occurrence probability as a target code through code search.
CN202210886402.0A 2022-07-26 2022-07-26 Code generation method and system based on natural semantic understanding Pending CN115202640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210886402.0A CN115202640A (en) 2022-07-26 2022-07-26 Code generation method and system based on natural semantic understanding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210886402.0A CN115202640A (en) 2022-07-26 2022-07-26 Code generation method and system based on natural semantic understanding

Publications (1)

Publication Number Publication Date
CN115202640A true CN115202640A (en) 2022-10-18

Family

ID=83583506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210886402.0A Pending CN115202640A (en) 2022-07-26 2022-07-26 Code generation method and system based on natural semantic understanding

Country Status (1)

Country Link
CN (1) CN115202640A (en)


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116149631A (en) * 2023-01-05 2023-05-23 三峡高科信息技术有限责任公司 Method for generating Web intelligent form based on natural language
CN116149631B (en) * 2023-01-05 2023-10-03 三峡高科信息技术有限责任公司 Method for generating Web intelligent form based on natural language
CN116364195B (en) * 2023-05-10 2023-10-13 浙大城市学院 Pre-training model-based microorganism genetic sequence phenotype prediction method
CN116364195A (en) * 2023-05-10 2023-06-30 浙大城市学院 Pre-training model-based microorganism genetic sequence phenotype prediction method
CN116501306A (en) * 2023-06-29 2023-07-28 深圳市银云信息技术有限公司 Method for generating interface document code based on natural language description
CN116501306B (en) * 2023-06-29 2024-03-26 深圳市银云信息技术有限公司 Method for generating interface document code based on natural language description
CN116719514A (en) * 2023-08-08 2023-09-08 安徽思高智能科技有限公司 Automatic RPA code generation method and device based on BERT
CN116719514B (en) * 2023-08-08 2023-10-20 安徽思高智能科技有限公司 Automatic RPA code generation method and device based on BERT
CN116820429B (en) * 2023-08-28 2023-11-17 腾讯科技(深圳)有限公司 Training method and device of code processing model, electronic equipment and storage medium
CN116820429A (en) * 2023-08-28 2023-09-29 腾讯科技(深圳)有限公司 Training method and device of code processing model, electronic equipment and storage medium
CN117093196A (en) * 2023-09-04 2023-11-21 广东工业大学 Knowledge graph-based programming language generation method and system
CN117093196B (en) * 2023-09-04 2024-03-01 广东工业大学 Knowledge graph-based programming language generation method and system
CN117492736A (en) * 2023-10-31 2024-02-02 慧之安信息技术股份有限公司 Low-code platform construction method and system based on large model
CN117576248A (en) * 2024-01-17 2024-02-20 腾讯科技(深圳)有限公司 Image generation method and device based on gesture guidance
CN117576248B (en) * 2024-01-17 2024-05-24 腾讯科技(深圳)有限公司 Image generation method and device based on gesture guidance

Similar Documents

Publication Publication Date Title
CN115202640A (en) Code generation method and system based on natural semantic understanding
CN113177124B (en) Method and system for constructing knowledge graph in vertical field
CN109543181B (en) Named entity model and system based on combination of active learning and deep learning
CN107632981B (en) Neural machine translation method introducing source language chunk information coding
CN110489102B (en) Method for automatically generating Python code from natural language
CN110442880B (en) Translation method, device and storage medium for machine translation
US20220129450A1 (en) System and method for transferable natural language interface
CN107526717B (en) Method for automatically generating natural language text by structured process model
CN115238029A (en) Construction method and device of power failure knowledge graph
CN116089576A (en) Pre-training model-based fully-generated knowledge question-answer pair generation method
CN115935957A (en) Sentence grammar error correction method and system based on syntactic analysis
CN115238115A (en) Image retrieval method, device and equipment based on Chinese data and storage medium
CN113076718B (en) Commodity attribute extraction method and system
CN117292146A (en) Industrial scene-oriented method, system and application method for constructing multi-mode large language model
Anju et al. Malayalam to English machine translation: An EBMT system
CN112148879B (en) Computer readable storage medium for automatically labeling code with data structure
CN116955644A (en) Knowledge fusion method, system and storage medium based on knowledge graph
CN112036179A (en) Electric power plan information extraction method based on text classification and semantic framework
CN114757181B (en) Method and device for training and extracting event of end-to-end event extraction model based on prior knowledge
CN114528459A (en) Semantic-based webpage information extraction method and system
CN114638227A (en) Named entity identification method, device and storage medium
Jiang et al. A Structure and Content Prompt-based Method for Knowledge Graph Question Answering over Scholarly Data
CN116595992B (en) Single-step extraction method for terms and types of binary groups and model thereof
CN117252201B (en) Knowledge-graph-oriented discrete manufacturing industry process data extraction method and system
CN112651243B (en) Abbreviated project name identification method based on integrated structured entity information and electronic device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination