CN116301893A - Lightweight code generation method based on prompt learning - Google Patents

Lightweight code generation method based on prompt learning

Info

Publication number
CN116301893A
CN116301893A
Authority
CN
China
Prior art keywords
model
prompt
training
code
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310237856.XA
Other languages
Chinese (zh)
Inventor
周宇
徐一然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202310237856.XA priority Critical patent/CN116301893A/en
Publication of CN116301893A publication Critical patent/CN116301893A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • G06F8/425Lexical analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a lightweight code generation method based on prompt learning, which comprises the following steps: retrieving, for a given natural language query, the 'natural language-code' pairs in the corpus that are most similar to it; reordering the retrieved results and using the highest-scoring result as prompt information; splicing the retrieved result with the original natural language input and training with the pre-trained language model CodeT5-base; applying a lightweight scheme during training, in which most of the model parameters are frozen and do not participate in fine-tuning; and finally testing the model to achieve the goal of code generation. From the perspectives of prompt learning and lightweight pre-trained language models, the invention studies the effectiveness of template prompts on the model's generation results and the feasibility of the lightweight scheme, and combines them with deep learning so as to generate code in the target programming language from natural language.

Description

Lightweight code generation method based on prompt learning
Technical Field
The invention belongs to the technical field of automatic code generation in intelligent software engineering, and in particular relates to a code generation method that integrates prompt learning and lightweight deep learning.
Background
With the development of modern society, computer software plays an indispensable role in human life. Throughout the evolution of software engineering, researchers have continually pursued automatic program generation, because coding is a vital link in the software life cycle: it is the direct realization of the earlier requirements and design, and it is also a difficult and tedious task. Automatic code generation can improve programmers' productivity and reduce repetitive labor. How to make a computer automatically generate the code that programmers need is therefore one of the current research topics.
Academic research on code generation can be traced back to early formal logical deduction methods. As open-source code has continued to grow, many high-quality open-source projects now provide rich program information such as code comments, source code, and test cases, so researchers turned to machine learning methods that exploit the statistical properties of these data. However, machine learning depends on feature extraction, and manually constructed feature engineering is often unreliable and fails to reflect the true characteristics of the data; for highly abstract code data, hand-crafted features are clearly inadequate. Research has since shown that deep learning performs well on automatic code generation and has become the mainstream approach.
Prompt learning has likewise attracted extensive research in the field of natural language processing (NLP). Its main idea is to guide the model through templates constructed in a fixed format. As a form of knowledge enhancement, such prompts can improve the quality of model generation, and the effect is especially pronounced when the pre-trained language model faces only a small amount of data or must generate a programming language that was not covered in its pre-training phase. Because deep learning depends on high-quality data sets, prompt learning can select suitable prompt information from limited data resources to guide the model's generation.
For lightweight tuning of model parameters, the main approach is to freeze most of the parameters of the pre-trained language model so that they do not participate in the subsequent fine-tuning updates. As data scales keep growing, ever larger models are used for code generation research, which raises the entry barrier for researchers, since such models usually require substantial computing resources to train. Freezing most parameters can therefore effectively reduce the computational overhead of fine-tuning without noticeably harming model performance.
Disclosure of Invention
In view of the above-described drawbacks of the related art, an object of the present invention is to provide a lightweight code generation method based on prompt learning.
In order to achieve the above purpose, the invention adopts the following technical scheme:
according to the lightweight code generation method based on prompt learning, based on the pre-training language model with the CodeT5-base as a main body, the performance of the model can be effectively guaranteed and even improved on the premise of reducing calculation cost by using two technical means of retrieval knowledge prompt and lightweight parameter fine adjustment. The method comprises the following steps:
(1) Searching the corpus according to the natural language to be input into the model, and returning the natural language-code pairs most similar to the current natural language under a given retrieval scheme;
(2) Reordering the returned results according to a given rule, and recording the highest-scoring result after reordering for later use as an input prompt;
(3) Splicing the retrieved result and the natural language to be input into the model according to a given template, and training with a deep learning Transformer model (specifically the pre-trained model CodeT5-base); during training, the input sequence is fed to the encoder, which encodes the text and extracts its features, capturing the relations between tokens through its multi-head attention mechanism;
(4) When training the decoder, combining the hidden states output by the multi-layer encoder with the ground truth (i.e., the code the model is expected to output) in a teacher-forcing procedure; the masked self-attention mechanism in the decoder produces the probability output of the final prediction, the cross-entropy loss function computes the training loss, and the subsequent model parameter update is carried out;
(5) Applying a lightweight code generation scheme simultaneously during training; specifically, most parameters of the pre-trained language model are frozen and do not participate in fine-tuning, and the fine-tuned model is finally tested.
In order to optimize the technical scheme, the specific measures adopted further comprise:
the above-mentioned search operation in step (1) does not adopt conventional information search modes such as BM25 and TFIDF, because these search modes only consider word frequency and repetition degree and fail to solve the problem of semantic relevance of terms, and meanwhile, some pre-training language models based on deep learning exist, such as BERT, when information search is performed, that is, semantic similarity matching, there is an anisotropic (anisotropic) problem between the tokens, so when final cosine similarity is performed, the result obtained by the model cannot truly reflect the similarity, that is, there is a deviation, so that the most direct cause of the problem arises is that the calculation of cosine similarity is established on the premise of standard orthogonality, but the sentence vectors provided by the pre-training language model are not necessarily on the basis of the standard orthogonality, so there is a problem.
To solve this problem, the sentence vectors produced by the pre-trained language model are transformed with a whitening method so that they become isotropic. The whitening transform is given by the following formula:
$\tilde{x}_i = (x_i - \mu)W$
where x_i is a sentence vector generated by the language model, μ is the mean of the set of sentence vectors, and W is the whitening transformation matrix that maps the covariance matrix of the sentence vectors to the identity matrix.
After the transformation, the new sentence vectors satisfy the orthonormal-basis assumption, and the semantic similarity between any two of them can then be computed with the cosine similarity formula:
$\cos(x, y) = \frac{\sum_{i=1}^{d} x_i y_i}{\sqrt{\sum_{i=1}^{d} x_i^2}\,\sqrt{\sum_{i=1}^{d} y_i^2}}$
where x and y are two sentence vectors and x_i, y_i are their components.
When the retrieval results are reordered in step (2), the top-K similar results retrieved after whitening (with K = 5) are re-ranked, and the single most similar result is kept. Jaccard similarity is used as the ranking criterion, calculated as follows:
$J(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$
where S1 and S2 are the token sets of the current query statement and the candidate statement to be ranked, respectively; the value reflects the similarity of the two sets.
The code generator used in step (3) is a pre-trained language model with the Encoder-Decoder CodeT5-base as its backbone; this architecture is naturally suited to code generation and is widely recognized as one of the models best suited to the task. The model input requires preprocessing in advance: before the similar natural language-code pair obtained after retrieval and reordering in step (2) is spliced to the original input as prompt information, the input is reconstructed according to the following template:
‘Generate<LAN>:<NL`>,Code:<PL`>;Generate<LAN>:<NL>,Code:’
where <LAN> is the type of language to be generated, and <NL`> and <PL`> are, respectively, the natural language requirement most similar to the current input <NL> found in the retrieval phase and its corresponding code implementation.
In step (3), the multi-head attention mechanism of the standard Transformer is used when fine-tuning the generation model. In the encoder stage, the input text is converted into vectors by the embedding layer, and positional encoding attaches position information to each word vector so that token positions are available when the self-attention values are computed. Computing self-attention extracts the features of the text sequence and the relations between tokens; the self-attention value is calculated as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
where Q, K, and V are new vectors obtained from linear transformations of the input vectors, d_k is the dimension of the key vectors used in the QK^T product, and softmax(·) normalizes the scores with the softmax operation.
When the decoder in step (4) produces the output probabilities during training, the invention adopts the cross-entropy loss function, which measures the difference between the true probability distribution and the predicted probability distribution, to compute the loss value for the subsequent model parameter update. The cross-entropy loss is calculated as follows:
$H(p, q) = -\sum_{x} p(x)\log q(x)$
where p(x) represents the true distribution of the samples and q(x) represents the distribution predicted by the model.
The lightweight code generation described in step (5) means that most parameters are frozen when the model is fine-tuned, and the frozen parameters do not participate in the parameter update. This reduces the computational cost of training while achieving an effect close to that of full-parameter fine-tuning, which makes lightweight code generation feasible. Specifically, the template of step (3) is further improved: in addition to the fixed prompt (hard prompt), a segment of learnable vectors is prepended to it as a soft prompt. This prompt operates directly in the model's embedding space and has its own learnable parameters, which are adjusted and optimized according to the training data of the downstream task. The combination serves as a hybrid prompt, with the following format:
Hybrid Prompt:=CONCAT(<SP>,<HP>)
where <SP> denotes the soft prompt, <HP> denotes the fixed (hard) prompt already constructed in step (3), and CONCAT(·) denotes direct concatenation.
Drawings
Fig. 1 is a schematic diagram of a lightweight code generation method based on prompt learning
Detailed Description
The invention is further described below with reference to the examples and drawings, which are provided by way of illustration, not limitation, to aid the understanding of those skilled in the art.
Referring to fig. 1, the lightweight code generation method based on prompt learning includes the steps of:
(1) Searching the corpus according to the natural language to be input into the model, and returning the natural language-code pairs most similar to the current natural language under a given retrieval scheme. The invention does not use conventional information retrieval methods such as BM25 or TF-IDF, because those methods only consider word frequency and term overlap and cannot capture semantic relevance. Moreover, when a deep-learning pre-trained language model is used for information retrieval, i.e., semantic similarity matching, its token representations suffer from an anisotropy problem, so the cosine similarities computed from the model do not truly reflect the similarity between sentences; there is a deviation. The most direct cause is that cosine similarity assumes an orthonormal basis, whereas the sentence vectors produced by the pre-trained language model do not necessarily satisfy this assumption.
To solve this problem, the sentence vectors produced by the pre-trained language model are transformed with a whitening method so that they become isotropic. The whitening transform is given by the following formula:
$\tilde{x}_i = (x_i - \mu)W$
where x_i is a sentence vector generated by the language model, μ is the mean of the set of sentence vectors, and W is the whitening transformation matrix that maps the covariance matrix of the sentence vectors to the identity matrix.
After the transformation, the new sentence vectors satisfy the orthonormal-basis assumption, and the semantic similarity between any two of them can then be computed with the cosine similarity formula:
$\cos(x, y) = \frac{\sum_{i=1}^{d} x_i y_i}{\sqrt{\sum_{i=1}^{d} x_i^2}\,\sqrt{\sum_{i=1}^{d} y_i^2}}$
where x and y are two sentence vectors and x_i, y_i are their components.
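As a concrete illustration of this retrieval step, the following Python sketch shows one way to whiten a set of sentence vectors and then compare them with cosine similarity. It is a minimal example under assumptions not stated in the patent: the sentence vectors are taken as already produced by some sentence encoder, and the SVD-based construction of the whitening matrix W is one common choice, not necessarily the exact implementation used by the invention.

```python
import numpy as np

def whitening_transform(embeddings: np.ndarray):
    """Compute whitening parameters (mu, W) so that (x - mu) @ W has an
    identity covariance matrix, i.e. the vectors become isotropic."""
    mu = embeddings.mean(axis=0, keepdims=True)      # mean of the sentence-vector set
    cov = np.cov((embeddings - mu).T)                # covariance matrix of the vectors
    u, s, _ = np.linalg.svd(cov)                     # SVD of the symmetric covariance
    W = u @ np.diag(1.0 / np.sqrt(s))                # maps the covariance to the identity
    return mu, W

def whiten(embeddings: np.ndarray, mu: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Apply x_tilde = (x - mu) W and L2-normalize each row."""
    x = (embeddings - mu) @ W
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine similarity between two sentence vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```

After whitening and L2 normalization, the cosine similarity between two vectors reduces to their dot product, which keeps the corpus search inexpensive.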
(2) Reordering the returned results according to a given rule, and recording the highest-scoring result after reordering for later use as an input prompt. The top-K similar results retrieved after whitening (with K = 5) are re-ranked, and the single most similar result is kept. Jaccard similarity is used as the ranking criterion, calculated as follows:
$J(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$
where S1 and S2 are the token sets of the current query statement and the candidate statement to be ranked, respectively; the value reflects the similarity of the two sets.
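The retrieval-then-re-ranking logic of steps (1) and (2) can be sketched as follows. This is an assumed, simplified implementation: corpus_vecs is taken to be the whitened, unit-normalized matrix of corpus sentence vectors from the previous sketch, corpus_tokens holds the token lists of the corresponding natural language statements, and the function names are illustrative.

```python
import numpy as np

def jaccard_similarity(tokens_a, tokens_b) -> float:
    """J(S1, S2) = |S1 n S2| / |S1 u S2| over token sets."""
    s1, s2 = set(tokens_a), set(tokens_b)
    return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 0.0

def retrieve_and_rerank(query_vec, query_tokens, corpus_vecs, corpus_tokens, k=5):
    """Retrieve the top-K candidates by cosine similarity over whitened,
    unit-norm vectors, then keep the single best one by Jaccard similarity."""
    scores = corpus_vecs @ query_vec                 # cosine similarity (unit-norm vectors)
    topk_idx = np.argsort(-scores)[:k]               # K most similar candidates
    best = max(topk_idx, key=lambda i: jaccard_similarity(query_tokens, corpus_tokens[i]))
    return int(best)                                 # index of the retrieved "NL-code" pair
```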
(3) Splicing the retrieved result and the natural language to be input into the model according to a given template, and training with a deep learning Transformer model; during training, the input sequence is fed to the encoder, which encodes the text and extracts its features, capturing the relations between tokens through its multi-head attention mechanism. The code generator used is a pre-trained language model with the Encoder-Decoder CodeT5-base as its backbone; this architecture is naturally suited to code generation and is widely recognized as one of the models best suited to the task. The model input requires preprocessing in advance: before the similar natural language-code pair obtained after retrieval and reordering in step (2) is spliced to the original input as prompt information, the input is reconstructed according to the following template:
‘Generate<LAN>:<NL`>,Code:<PL`>;Generate<LAN>:<NL>,Code:’
where <LAN> is the type of language to be generated, and <NL`> and <PL`> are, respectively, the natural language requirement most similar to the current input <NL> found in the retrieval phase and its corresponding code implementation.
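A minimal helper that assembles the model input according to this template might look as follows; the function and argument names are illustrative and not part of the patent.

```python
def build_prompt(lan: str, retrieved_nl: str, retrieved_code: str, current_nl: str) -> str:
    """Splice the retrieved pair and the current requirement following
    'Generate <LAN>: <NL`>, Code: <PL`>; Generate <LAN>: <NL>, Code:'."""
    return (f"Generate {lan}: {retrieved_nl}, Code: {retrieved_code}; "
            f"Generate {lan}: {current_nl}, Code:")

# Hypothetical usage:
# build_prompt("Java", "return the larger of two integers",
#              "int max(int a, int b) { return a > b ? a : b; }",
#              "return the smaller of two integers")
```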
When fine-tuning the generation model, the multi-head attention mechanism of the standard Transformer is used. In the encoder stage, the input text is converted into vectors by the embedding layer, and positional encoding attaches position information to each word vector so that token positions are available when the self-attention values are computed. Computing self-attention extracts the features of the text sequence and the relations between tokens; the self-attention value is calculated as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
where Q, K, and V are new vectors obtained from linear transformations of the input vectors, d_k is the dimension of the key vectors used in the QK^T product, and softmax(·) normalizes the scores with the softmax operation.
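The formula above is the standard scaled dot-product attention; for reference, a minimal PyTorch sketch (not the patent's own code) is:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5    # (..., seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)              # attention weights over tokens
    return weights @ V
```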
(4) When training the decoder, the hidden states output by the multi-layer encoder are combined with the ground truth (i.e., the code the model is expected to output) in a teacher-forcing procedure, and the masked self-attention mechanism in the decoder produces the probability output of the final prediction. The invention adopts the cross-entropy loss function, which measures the difference between the true probability distribution and the predicted probability distribution, to compute the training loss used in the subsequent model parameter update. The cross-entropy loss is calculated as follows:
$H(p, q) = -\sum_{x} p(x)\log q(x)$
where p(x) represents the true distribution of the samples and q(x) represents the distribution predicted by the model.
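A minimal fine-tuning step with the publicly available CodeT5-base checkpoint could look like the sketch below. The checkpoint name, optimizer, and hyperparameters are illustrative assumptions; when labels are supplied, the Hugging Face implementation computes exactly the cross-entropy loss described above over the decoder's predicted distribution.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def training_step(prompt: str, target_code: str) -> float:
    """One teacher-forced step: the decoder is trained against the ground-truth
    code, and the loss is the cross entropy H(p, q) over the vocabulary."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    labels = tokenizer(target_code, return_tensors="pt", truncation=True).input_ids
    outputs = model(**inputs, labels=labels)         # loss = cross-entropy over tokens
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```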
(5) A lightweight code generation scheme is applied simultaneously during training; specifically, most parameters of the pre-trained language model are frozen and do not participate in fine-tuning, and the fine-tuned model is finally tested. This lightweight scheme reduces the computational cost of training while achieving an effect close to that of full-parameter fine-tuning, which makes lightweight code generation feasible. Specifically, the template of step (3) is further improved: in addition to the fixed prompt (hard prompt), a segment of learnable vectors is prepended to it as a soft prompt. This prompt operates directly in the model's embedding space and has its own learnable parameters, which are adjusted and optimized according to the training data of the downstream task. The combination serves as a hybrid prompt, with the following format:
Hybrid Prompt:=CONCAT(<SP>,<HP>)
where <SP> denotes the soft prompt, <HP> denotes the fixed (hard) prompt already constructed in step (3), and CONCAT(·) denotes direct concatenation.
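One possible way to realize this hybrid prompt with a frozen backbone is sketched below: the CodeT5-base parameters are frozen, a learnable soft-prompt matrix (the <SP> part) is prepended to the embedded hard prompt (the <HP> part), and only the soft prompt is optimized. The soft-prompt length, initialization, and learning rate are assumptions for illustration, not values specified by the invention.

```python
import torch
from torch import nn
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

for p in model.parameters():           # freeze the backbone: no gradient updates
    p.requires_grad = False

N_SOFT = 20                            # assumed length of the learnable soft prompt
soft_prompt = nn.Parameter(torch.randn(N_SOFT, model.config.d_model) * 0.02)
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)

def hybrid_step(hard_prompt: str, target_code: str) -> float:
    """Hybrid Prompt := CONCAT(<SP>, <HP>): prepend the soft-prompt embeddings
    to the embedded hard prompt and update only the soft prompt."""
    enc = tokenizer(hard_prompt, return_tensors="pt", truncation=True)
    labels = tokenizer(target_code, return_tensors="pt", truncation=True).input_ids
    hp_embeds = model.get_input_embeddings()(enc.input_ids)            # (1, seq, d_model)
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), hp_embeds], dim=1)
    attention_mask = torch.cat(
        [torch.ones(1, N_SOFT, dtype=enc.attention_mask.dtype), enc.attention_mask], dim=1)
    out = model(inputs_embeds=inputs_embeds, attention_mask=attention_mask, labels=labels)
    out.loss.backward()                # gradients flow only into the soft prompt
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```

Because the backbone is frozen, only the small soft-prompt matrix carries gradients and optimizer state, which is what keeps the fine-tuning memory footprint low.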
The performance of the method according to the invention is demonstrated experimentally below.
The main content of the experiment is as follows: the 'natural language-code' pair most similar to the current natural language is retrieved from the corpus (training set) and spliced with the original input as prompt information; the code generated by the model is recorded and compared with the ground truth to demonstrate the effectiveness of the method. At the same time, the memory overhead of model fine-tuning is recorded to demonstrate the feasibility of lightweight parameter fine-tuning.
The experiment uses two data sets: Solidity4CG, for the domain-specific language Solidity, and CONCODE, for the general-purpose language Java.
Model performance is evaluated with the BLEU, CodeBLEU, and EM metrics, which are widely used in code generation tasks; higher values indicate better results. The MEM value reports the memory actually consumed during model fine-tuning, and the PAR value reports the proportion of parameters actually participating in fine-tuning, relative to the parameter count of the CodeT5-base model. The specific results are shown in Tables 1 and 2 (a computation sketch for the BLEU and EM metrics is given after the tables):
Table 1. Solidity4CG experimental results (presented as an image in the original publication)
Table 2. CONCODE experimental results (presented as an image in the original publication)
In Tables 1 and 2, bold data mark the best results and underlined data mark the second-best results. The experiments show that while effectively reducing memory overhead during fine-tuning, the method keeps the model's performance unaffected and can even exceed the original generation results of the pre-trained language model; the model involved in the method therefore has good code generation capability.
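For reference, a minimal sketch of how the BLEU and EM values reported above could be computed is given below, using NLTK's corpus BLEU with simple whitespace tokenization; the tokenization and smoothing choices are illustrative assumptions, and CodeBLEU requires its dedicated implementation, which is not reproduced here.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def evaluate(predictions, references):
    """Corpus BLEU and exact-match (EM) over generated code strings."""
    refs = [[r.split()] for r in references]         # one reference per sample
    hyps = [p.split() for p in predictions]
    bleu = corpus_bleu(refs, hyps, smoothing_function=SmoothingFunction().method1)
    em = sum(p.strip() == r.strip() for p, r in zip(predictions, references)) / len(references)
    return bleu, em
```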
The present invention has been described in terms of the preferred embodiments thereof, and it should be understood by those skilled in the art that various modifications can be made without departing from the principles of the invention, and such modifications should also be considered as being within the scope of the invention.

Claims (7)

1. A lightweight code generation method based on prompt learning is characterized by comprising the following steps:
(1) Searching the corpus according to the natural language to be input into the model, and returning the natural language-code pairs most similar to the current natural language under a given retrieval scheme;
(2) Reordering the returned results according to a given rule, and recording the highest-scoring result after reordering for later use as an input prompt;
(3) Splicing the retrieved result and the natural language to be input into the model according to a given template, and training with a deep learning Transformer model (specifically the pre-trained model CodeT5-base); during training, the input sequence is fed to the encoder, which encodes the text and extracts its features, capturing the relations between tokens through its multi-head attention mechanism;
(4) When training the decoder, combining the hidden states output by the multi-layer encoder with the ground truth (i.e., the code the model is expected to output) in a teacher-forcing procedure; the masked self-attention mechanism in the decoder produces the probability output of the final prediction, the cross-entropy loss function computes the training loss, and the subsequent model parameter update is carried out;
(5) Applying a lightweight code generation scheme simultaneously during training; specifically, most parameters of the pre-trained language model are frozen and do not participate in fine-tuning, and the fine-tuned model is finally tested.
2. The lightweight code generation method based on prompt learning according to claim 1, wherein the search operation in step (1) does not adopt conventional information retrieval methods such as BM25 or TF-IDF, because those methods only consider word frequency and term overlap and cannot capture semantic relevance; meanwhile, when deep-learning pre-trained language models such as BERT are used for information retrieval, i.e., semantic similarity matching, their token representations suffer from an anisotropy problem, so the cosine similarities computed from the model do not truly reflect the similarity, i.e., there is a deviation; the most direct cause is that cosine similarity assumes an orthonormal basis, whereas the sentence vectors provided by the pre-trained language model are not necessarily expressed in such a basis.
To solve this problem, the sentence vectors produced by the pre-trained language model are transformed with a whitening method so that they become isotropic. The whitening transform is given by the following formula:
$\tilde{x}_i = (x_i - \mu)W$
where x_i is a sentence vector generated by the language model, μ is the mean of the set of sentence vectors, and W is the whitening transformation matrix that maps the covariance matrix of the sentence vectors to the identity matrix.
After the transformation, the new sentence vectors satisfy the orthonormal-basis assumption, and the semantic similarity between any two of them can then be computed with the cosine similarity formula:
$\cos(x, y) = \frac{\sum_{i=1}^{d} x_i y_i}{\sqrt{\sum_{i=1}^{d} x_i^2}\,\sqrt{\sum_{i=1}^{d} y_i^2}}$
where x and y are two sentence vectors and x_i, y_i are their components.
3. The lightweight code generation method based on prompt learning according to claim 1, wherein when the retrieval results are reordered in step (2), the top-K similar results retrieved after whitening (with K = 5) are re-ranked to obtain the single most similar result; Jaccard similarity is used as the ranking criterion, calculated as follows:
$J(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$
where S1 and S2 are the token sets of the current query statement and the candidate statement to be ranked, respectively; the value reflects the similarity of the two sets.
4. The method of claim 1, wherein the code generator used in step (3) is a pre-trained language model with the Encoder-Decoder CodeT5-base as its backbone; this architecture is naturally suited to the task of code generation and is widely recognized as one of the models best suited to it. The model input requires preprocessing in advance: before the similar natural language-code pair obtained after retrieval and reordering in step (2) is spliced to the original input as prompt information, the input is reconstructed according to the following template:
‘Generate<LAN>:<NL`>,Code:<PL`>;Generate<LAN>:<NL>,Code:’
where <LAN> is the type of language to be generated, and <NL`> and <PL`> are, respectively, the natural language requirement most similar to the current input <NL> found in the retrieval phase and its corresponding code implementation.
5. The lightweight code generation method based on prompt learning according to claim 1, wherein in step (3), the multi-head attention mechanism of the standard Transformer is used when fine-tuning the generation model; in the encoder stage, the input text is converted into vectors by the embedding layer, and positional encoding attaches position information to each word vector so that token positions are available when the self-attention values are computed; computing self-attention extracts the features of the text sequence and the relations between tokens, and the self-attention value is calculated as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
where Q, K, and V are new vectors obtained from linear transformations of the input vectors, d_k is the dimension of the key vectors used in the QK^T product, and softmax(·) normalizes the scores with the softmax operation.
6. The lightweight code generation method based on prompt learning according to claim 1, wherein when the decoder in step (4) produces the output probabilities during training, a cross-entropy loss function, which measures the difference between the true probability distribution and the predicted probability distribution, is used to compute the loss value for the subsequent model parameter update; the cross-entropy loss is calculated as follows:
$H(p, q) = -\sum_{x} p(x)\log q(x)$
where p(x) represents the true distribution of the samples and q(x) represents the distribution predicted by the model.
7. The lightweight code generation method based on prompt learning according to claim 1, wherein the lightweight scheme described in step (5) freezes most of the parameters when the model is fine-tuned, and the frozen parameters do not participate in the parameter update; this reduces the computational cost of model training while achieving an effect close to that of full-parameter fine-tuning, which makes lightweight code generation feasible. Specifically, the template of step (3) is further improved: in addition to the fixed prompt (hard prompt), a segment of learnable vectors is prepended to it as a soft prompt; this prompt operates directly in the model's embedding space and has its own learnable parameters, which are adjusted and optimized according to the training data of the downstream task; the combination serves as a hybrid prompt, with the following format:
Hybrid Prompt:=CONCAT(<SP>,<HP>)
where <SP> denotes the soft prompt, <HP> denotes the fixed (hard) prompt already constructed in step (3), and CONCAT(·) denotes direct concatenation.
CN202310237856.XA 2023-03-13 2023-03-13 Lightweight code generation method based on prompt learning Pending CN116301893A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310237856.XA CN116301893A (en) 2023-03-13 2023-03-13 Lightweight code generation method based on prompt learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310237856.XA CN116301893A (en) 2023-03-13 2023-03-13 Lightweight code generation method based on prompt learning

Publications (1)

Publication Number Publication Date
CN116301893A true CN116301893A (en) 2023-06-23

Family

ID=86820186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310237856.XA Pending CN116301893A (en) 2023-03-13 2023-03-13 Lightweight code generation method based on prompt learning

Country Status (1)

Country Link
CN (1) CN116301893A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116644196A (en) * 2023-07-26 2023-08-25 北京智谱华章科技有限公司 Parameter-based efficient general retrieval method and device
CN116719520A (en) * 2023-08-07 2023-09-08 支付宝(杭州)信息技术有限公司 Code generation method and device
CN116719520B (en) * 2023-08-07 2023-11-17 支付宝(杭州)信息技术有限公司 Code generation method and device
CN117724695A (en) * 2024-02-18 2024-03-19 浙江同花顺智能科技有限公司 Code generation optimization method, device, equipment and medium for large language model
CN117724695B (en) * 2024-02-18 2024-04-30 浙江同花顺智能科技有限公司 Code generation optimization method, device, equipment and medium for large language model

Similar Documents

Publication Publication Date Title
CN116301893A (en) Lightweight code generation method based on prompt learning
CN110020438A (en) Enterprise or tissue Chinese entity disambiguation method and device based on recognition sequence
CN117236337B (en) Method for generating natural language based on mixed prompt learning completion history knowledge graph
CN110765264A (en) Text abstract generation method for enhancing semantic relevance
CN115048447B (en) Database natural language interface system based on intelligent semantic completion
CN118170894B (en) Knowledge graph question-answering method, knowledge graph question-answering device and storage medium
CN117827886B (en) Method for converting natural sentence into SQL sentence based on large language model
CN115858750A (en) Power grid technical standard intelligent question-answering method and system based on natural language processing
CN117807202A (en) Knowledge graph enhanced large language model reasoning trademark law intelligent question-answering method
CN117909458A (en) Construction method of mould specialized question-answering system based on LLM model
CN117828050B (en) Traditional Chinese medicine question-answering method, equipment and medium based on long-document retrieval enhancement generation
CN117851445A (en) Large language model Text2SQL chart generation method and device
CN117492825A (en) Method for generating stability annotation based on context learning and large language model
CN114676708B (en) Low-resource neural machine translation method based on multi-strategy prototype generation
CN115858736A (en) Emotion text generation method based on emotion prompt fine adjustment
CN115906879A (en) Translation model training method for vertical domain and storage medium
Chen et al. Eliciting knowledge from language models with automatically generated continuous prompts
Gu et al. Extension-Compression Learning: A deep learning code search method that simulates reading habits
Ribeiro et al. Domain adaptation in dialogue systems using transfer and meta-learning
CN118331152B (en) Industrial control system logic optimization method and system based on natural language big model
CN117035064B (en) Combined training method for retrieving enhanced language model and storage medium
CN117575026B (en) Large model reasoning analysis method, system and product based on external knowledge enhancement
CN117272979B (en) Unsupervised sentence representation method, device, computer equipment and storage medium
PV et al. An approach to customization of pre-trained neural network language model to specific domain
Xia et al. SPEADO: Segmentation and Punctuation for Ancient Chinese Texts via Example Augmentation and Decoding Optimization

Legal Events

Date Code Title Description
PB01 Publication