CN116991391B - Code generation and deficiency supplementing method based on large language model - Google Patents

Code generation and deficiency supplementing method based on large language model

Info

Publication number
CN116991391B
CN116991391B (application CN202311243279.1A)
Authority
CN
China
Prior art keywords
model
language model
training
data
large language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311243279.1A
Other languages
Chinese (zh)
Other versions
CN116991391A (en)
Inventor
刘春江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yifang Technology Co ltd
Original Assignee
Beijing Yifang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yifang Technology Co ltd
Priority to CN202311243279.1A
Publication of CN116991391A
Application granted
Publication of CN116991391B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/35 Creation or generation of source code model driven
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/31 Programming languages or programming paradigms
    • G06F8/315 Object-oriented languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a code generation and deficiency supplementing method based on a large language model, which comprises the following steps: collecting code data with crawler technology; preprocessing the data; extracting semantics and grammar from the data through feature engineering to obtain feature vectors; inputting the feature vectors into a model, performing model design and training, and outputting a model result; during this process, evaluating the model, performing post-processing and optimization, adjusting the model parameters according to the evaluation result, and feeding the adjusted parameters back into the model; then deploying and applying the model, obtaining application feedback, inputting the feedback into the model, and performing post-processing and optimization on the model again; once deep learning is finished, the model returns to adaptive pre-training, and adaptive pre-training is performed on the returned results; the adaptively pre-trained model is combined with an MOE model, and the combined model is subjected to model design and training.

Description

Code generation and deficiency supplementing method based on large language model
Technical Field
The invention relates to a code generation and deficiency supplementing method based on a large language model.
Background
Currently, small models (e.g., recurrent neural networks, RNNs for short) are often used to perform simple code generation tasks, such as generating basic code segments, functions, or simple code sequences. However, such small models tend to struggle with complex code generation tasks and perform poorly.
Meanwhile, small models have some application in code completion: given part of the code or a prompt, a small model can infer and complete the remaining code, but the accuracy and diversity of the completion are poor. In addition, small models have some application in simple code error correction.
However, small models have weak language modeling capability. This means that a small model has difficulty capturing complex grammatical structures and code context, which limits the quality and accuracy of the generated code.
Moreover, training an accurate small model usually requires a sufficient number of code samples. Because pre-trained models for a particular task are relatively scarce, accuracy in a specific domain or for a specific programming language may be poor.
Further, the code generated by a small model may be relatively conservative and monotonous, lacking diversity. This makes it perform poorly on complex or creative code generation tasks.
At the same time, a small model may ignore the structure and format of the code, producing code that is not clean and is hard to read. Owing to its limited language modeling capability, it may also generate unreasonable or erroneous code and may even mislead developers. In addition, small models have limited accuracy in code error correction: they may detect some common programming errors, but they struggle to capture more complex code defects.
Disclosure of Invention
The invention provides a code generation and deficiency supplementing method based on a large language model, which effectively solves the technical problems existing in the prior art.
Specifically, the invention provides a code generation and deficiency supplementing method based on a large language model, which comprises the following steps: collecting language data by means of crawler technology; preprocessing the collected language data; extracting semantics and grammar from the preprocessed data through feature engineering, thereby obtaining feature vectors; inputting the feature vectors into a large language model, so that the large language model is designed and trained and a model result is output; during the design and training of the large language model, evaluating the large language model, then performing post-processing and optimization on the model, adjusting the parameters of the large language model according to the evaluation results of the post-processing and optimization, and inputting the adjusted parameters into the large language model to perfect its construction, this perfecting of the construction of the large language model forming a first iteration, whose number of occurrences is counted as N1 and whose weight is 3; after the post-processing and optimization of the large language model, actually deploying and applying the large language model, obtaining application feedback during application, inputting the application feedback into the large language model, performing post-processing and optimization on the large language model again, adjusting the parameters of the large language model according to the evaluation results of this post-processing and optimization, and inputting the adjusted parameters into the large language model to perfect its construction, this perfecting process being called a second iteration, whose number of occurrences is N2 and whose weight is 4; once the feature vectors have finished deep learning in the large language model, returning the large language model to adaptive pre-training, and during actual deployment and application, returning the adaptive pre-training results for further adaptive pre-training; combining the adaptively pre-trained large language model with an MOE model and then performing model design and training on the combined model, the combination with the MOE model and the subsequent model design and training forming a third iteration, whose number of occurrences is N3 and whose weight is 6; and, during execution of the method, counting at any time the total number of iterations M = 3N1 + 4N2 + 6N3, where once M exceeds a preset times threshold this indicates that the quality of the preprocessed data is still poor, so that the collection of language data must be performed again.
Preferably, the feature vectors are optimized on the basis of the output model result, the optimized feature vectors are returned to the feature engineering stage for further extraction of semantics and grammar, and the resulting feature vectors are then input into the model again for model design and training.
Preferably, during data collection, data collection is restarted according to the data quality, and the newly collected data replaces the previously collected data as the updated data set.
Preferably, the language class data is collected by a crawler.
Preferably, the maximum value of the number of times threshold is 100.
Through the technical innovation of the invention, the invention can realize the functions of more efficient, more accurate and more intelligent code generation, complementation, error correction and the like. Meanwhile, the method has better data adaptability and generalization capability, can exert excellent performance in different fields and tasks, and improves programming experience and efficiency of developers.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below illustrate only some embodiments of the present invention, and that other embodiments and drawings can be derived from them by a person skilled in the art without inventive effort.
FIG. 1 illustrates a basic flow diagram of a large language model based code generation gap-filling method in accordance with the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made in detail with reference to the accompanying drawings, wherein it is apparent that the embodiments described are only some, but not all embodiments of the present invention. All other embodiments, which can be made by a person of ordinary skill in the art without the need for inventive faculty, are within the scope of the invention, based on the embodiments described in the present invention.
The invention provides a code generation and deficiency supplementing method based on a large language model. FIG. 1 illustrates a basic flow diagram of a large language model based code generation gap-filling method in accordance with the present invention.
According to FIG. 1, language data collection is performed first. At this stage, a multi-source data collection strategy can be adopted; specifically, crawler technology can be combined with data synthesis methods to construct a more comprehensive and richer data set, which increases the scale and diversity of the data set and improves the generalization capability and robustness of the large model.
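The description only names crawler technology without fixing an implementation. The following is a minimal illustrative sketch in Python, assuming the requests and beautifulsoup4 packages are available; the seed URLs and the rule of harvesting <pre>/<code> blocks are assumptions made for illustration, not details of the invention.

import requests
from bs4 import BeautifulSoup

# Hypothetical seed pages; in practice these would be code-hosting or
# documentation pages chosen for the target domain.
SEED_URLS = [
    "https://example.com/snippets/page1",
    "https://example.com/snippets/page2",
]

def collect_code_samples(urls):
    """Fetch each page and keep the text of <pre>/<code> blocks as raw samples."""
    samples = []
    for url in urls:
        resp = requests.get(url, timeout=10)
        if resp.status_code != 200:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for block in soup.find_all(["pre", "code"]):
            text = block.get_text().strip()
            if text:
                samples.append({"source": url, "code": text})
    return samples

if __name__ == "__main__":
    dataset = collect_code_samples(SEED_URLS)
    print(f"collected {len(dataset)} raw code samples")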
Data preprocessing is then performed. In the data preprocessing stage, if the quality of the collected language data is found to be poor, data collection can be restarted, and the newly collected data replaces the data obtained in the original collection stage. In other words, the data preprocessing here filters the collected raw language data, for example by using the Grubbs criterion to determine which data points are outliers, so that the relevant abnormal data are eliminated in the preprocessing stage.
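The description names the outlier criterion but not how it is applied. The sketch below assumes that each sample is first reduced to a single numeric statistic (here simply its length, an illustrative choice) and shows how a two-sided Grubbs test could flag abnormal samples; it uses numpy and scipy.

import numpy as np
from scipy import stats

def grubbs_outlier_indices(values, alpha=0.05):
    """Iteratively flag single outliers with the two-sided Grubbs test."""
    data = np.asarray(values, dtype=float)
    idx = np.arange(len(data))
    outliers = []
    while len(data) > 2:
        mean, std = data.mean(), data.std(ddof=1)
        if std == 0:
            break
        deviations = np.abs(data - mean)
        i_max = int(np.argmax(deviations))
        g = deviations[i_max] / std
        n = len(data)
        # Critical value of the Grubbs statistic from the t distribution.
        t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))
        if g <= g_crit:
            break
        outliers.append(int(idx[i_max]))
        data = np.delete(data, i_max)
        idx = np.delete(idx, i_max)
    return outliers

# Example: flag samples whose length deviates abnormally from the rest.
sample_lengths = [120, 118, 131, 125, 122, 119, 980]
print(grubbs_outlier_indices(sample_lengths))   # -> [6]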
The preprocessed language data is passed through feature engineering to extract semantics and grammar, yielding feature vectors. The feature vectors are input into a large language model for design and training and a model result is output; the feature vectors are then optimized on the basis of the output result, the optimized feature vectors are returned to the feature engineering stage for further extraction of semantics and grammar, and the result is again input into the large language model for design and training. This is an iterative process in which the large language model is continuously refined, and this iteration is referred to herein as the first iteration. During one run of the invention, the number of first iterations is counted as N1. Each first iteration traverses three nodes (feature engineering, model design and training, and feature optimization), so the count N1 is given a weight of 3.
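The concrete form of the feature vectors is not fixed by the description. The following sketch assumes Python source snippets and, purely for illustration, builds a small vector from token-category counts (grammar) and AST-node counts (semantics) using only the standard library and numpy.

import ast
import io
import tokenize
import numpy as np

def code_to_feature_vector(source: str) -> np.ndarray:
    """Tiny 'grammar + semantics' feature vector for one Python snippet."""
    # Grammar-level features: counts of token categories.
    counts = {"NAME": 0, "OP": 0, "NUMBER": 0, "STRING": 0}
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            name = tokenize.tok_name[tok.type]
            if name in counts:
                counts[name] += 1
    except (tokenize.TokenError, IndentationError):
        pass
    # Semantic-level features: counts of structural AST nodes.
    n_funcs = n_loops = n_calls = 0
    try:
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.FunctionDef):
                n_funcs += 1
            elif isinstance(node, (ast.For, ast.While)):
                n_loops += 1
            elif isinstance(node, ast.Call):
                n_calls += 1
    except SyntaxError:
        pass
    return np.array([counts["NAME"], counts["OP"], counts["NUMBER"],
                     counts["STRING"], n_funcs, n_loops, n_calls], dtype=float)

print(code_to_feature_vector("def add(a, b):\n    return a + b\n"))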
As can be seen in FIG. 1, model design and training is the core hub of the whole process. During model design and training, generalization evaluation can be performed continuously on the large language model; once the generalization of the model is found to be poor, the model can be redesigned and the redesigned model put back into training.
During model evaluation, the model can be further post-processed and optimized; the parameters of the model are then adjusted according to the evaluation results of the post-processing and optimization, and the adjusted parameters are input into the model to perfect its construction. This is another iterative process around model design and training, referred to herein as the second iteration. In one run of the method of the invention, the number of second iterations is counted as N2. Each second iteration passes through four nodes (model design and training, model evaluation, post-processing and optimization, and model parameter adjustment), so the weight of the second iteration is 4.
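Neither the model architecture nor the adjustment rule is specified in the description. The sketch below therefore uses a deliberately tiny stand-in network, random data, and a simple rule (halve the learning rate whenever the validation loss stops improving) purely to illustrate how the evaluate / post-process / adjust-parameters cycle and the counting of N2 could be wired together; it assumes PyTorch is installed.

import torch
from torch import nn

# Tiny stand-in for the large language model, with synthetic data.
model = nn.Sequential(nn.Linear(7, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
x_train, y_train = torch.randn(256, 7), torch.randn(256, 1)
x_val, y_val = torch.randn(64, 7), torch.randn(64, 1)

best_val, n2 = float("inf"), 0
for epoch in range(20):
    # Model design and training node.
    model.train()
    optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer.step()

    # Model evaluation node.
    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()

    # Post-processing / optimization node: adjust a parameter (here the
    # learning rate) from the evaluation result and feed it back into training.
    if val_loss >= best_val:
        for group in optimizer.param_groups:
            group["lr"] *= 0.5
        n2 += 1                      # one completed second iteration
    best_val = min(best_val, val_loss)

print(f"second iterations N2 = {n2}, best validation loss = {best_val:.4f}")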
On the other hand, after the large language model has been post-processed and optimized, it can be actually deployed and applied; application feedback is obtained in actual use and fed back into the model for another round of post-processing and optimization, so that the model is adjusted again in practice.
Once the deep learning of the feature vectors in the large language model is completed, a return to adaptive pre-training may be performed. During deployment and application, the adaptive pre-training results can likewise be returned for further adaptive pre-training. If, during adaptive pre-training, the adaptability of the model is found to be insufficient, the pre-training strategy is adjusted and adaptive pre-training is carried out again on the basis of the adjusted strategy.
The introduction of adaptive pre-training is a key step in the overall process, through which large language models can be personalized based on data of a particular domain or task. Such adaptive pre-training can enhance the model's code reasoning and generalization capabilities for specific areas, making it more adaptable to different programming tasks.
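The description characterizes adaptive pre-training only as continued, domain-specific training. As a sketch of what that could look like in practice, the snippet below continues causal-language-model training on a handful of domain code strings; the use of the Hugging Face transformers library, the small gpt2 checkpoint as a stand-in model, and the two example snippets are all assumptions made for illustration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical domain-specific code snippets gathered during deployment.
domain_snippets = [
    "def read_config(path):\n    import json\n    return json.load(open(path))",
    "total = 0\nfor row in rows:\n    total += row['amount']",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(2):                                        # adaptive pre-training passes
    for text in domain_snippets:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        out = model(**batch, labels=batch["input_ids"])   # causal-LM loss
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
print("domain-adaptive pre-training pass finished")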
The adaptively pre-trained model can also be combined with an MOE model, and the combined model is then subjected to model design and training. Introducing the MOE (Mixture of Experts) model enables the large model to better exploit the strengths of multiple sub-models and improves the accuracy and efficiency of code reasoning through expert combination. The MOE model can automatically select the most suitable sub-model in different scenarios, thereby realizing more intelligent code generation and completion functions.
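The description does not commit to a particular MOE formulation. The sketch below implements the simplest dense (softly gated) variant in PyTorch with illustrative layer sizes; production systems often use sparse top-k routing instead, a refinement omitted here for brevity.

import torch
from torch import nn

class TinyMoE(nn.Module):
    """Minimal mixture-of-experts layer: a gate blends several expert networks."""
    def __init__(self, d_model: int = 64, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, d_model)
        weights = torch.softmax(self.gate(x), dim=-1)       # (batch, n_experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)
        return (weights.unsqueeze(-1) * expert_outs).sum(dim=1)

moe = TinyMoE()
print(moe(torch.randn(2, 64)).shape)    # torch.Size([2, 64])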
The path that runs from model design and training through model evaluation, post-processing and optimization, deployment and application, and adaptive pre-training, and then through the combination with the MOE model back to model design and training, is referred to as the third iteration. In one run of the invention its count is N3; since this iteration path passes through six nodes in total, its weight is 6.
Therefore, in the code generation and deficiency supplementing method based on the large language model, the total number of iterations M can be counted at any time during operation. It should be noted that the total number of iterations is not simply N1 + N2 + N3; taking the weights of the three kinds of iteration into account, it is counted as M = 3N1 + 4N2 + 6N3. Once M exceeds the times threshold, this indicates that the quality of the preprocessed data is still poor, so the collection of language data must be performed again; this constitutes the overall iteration process.
Of course, in practical application, the times threshold set for the total number of iterations M has a maximum value, namely 100.
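The bookkeeping behind this criterion can be stated directly; in the following sketch only the weights and the threshold of 100 come from the description, while the particular counts in the example are arbitrary.

def total_iterations(n1: int, n2: int, n3: int) -> int:
    """Weighted total iteration count M = 3*N1 + 4*N2 + 6*N3."""
    return 3 * n1 + 4 * n2 + 6 * n3

def need_recollect(n1: int, n2: int, n3: int, threshold: int = 100) -> bool:
    """True once M exceeds the times threshold, i.e. data collection must restart."""
    return total_iterations(n1, n2, n3) > threshold

# Example: 12 first, 9 second and 5 third iterations give M = 36 + 36 + 30 = 102.
print(total_iterations(12, 9, 5), need_recollect(12, 9, 5))   # 102 True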
By iterating several times and organically combining the counts of each branch iteration into an overall iteration statistic, the method provided by the invention can quantitatively judge whether the large language model is running smoothly, thereby ensuring the precision and rigor of the design of the large language model.
The invention has now been described in general terms. In summary, the code generation and deficiency supplementing method based on the large language model provided by the invention integrates the above steps comprehensively. In particular, by organically combining data collection, preprocessing, feature engineering, model design and training, model evaluation, post-processing and optimization, and deployment and application, a complete large-model training process is formed. This comprehensive flow can realize tasks such as code generation, completion and error correction more efficiently, and improves overall development efficiency.
Further, the introduction of adaptive pre-training as a key step enables large models to be personalized based on data from a particular domain or task. Such adaptive pre-training can enhance the code reasoning ability and generalization ability of large models for specific fields, making them more suitable for different programming tasks.
Furthermore, the introduction of the MOE (Mixture of Experts) model enables the large model to better exploit the strengths of multiple sub-models and improves the accuracy and efficiency of code reasoning through expert combination. The MOE model can automatically select the most suitable sub-model in different scenarios, thereby realizing more intelligent code generation and completion functions.
Still further, a multi-source data collection strategy is employed in the data collection phase: crawler technology and data synthesis methods are combined to construct a more comprehensive and richer data set. This increases the scale and diversity of the data set and improves the generalization capability and robustness of the large model.
Through the technical innovation of the invention, the invention can realize the functions of more efficient, more accurate and more intelligent code generation, complementation, error correction and the like. Meanwhile, the method has better data adaptability and generalization capability, can exert excellent performance in different fields and tasks, and improves programming experience and efficiency of developers.
The foregoing description of the exemplary embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any modifications, equivalents, and variations which fall within the spirit and scope of the invention are intended to be included in the scope of the invention.

Claims (5)

1. A code generation and deficiency supplementing method based on a large language model, the method comprising:
collecting language data by utilizing a crawler technology;
preprocessing the collected language data;
extracting semantics and grammar from the preprocessed data through feature engineering, thereby obtaining feature vectors;
the feature vector is input into a large language model, so that the large language model is designed and trained, and a model result is output;
during the design and training of the large language model, evaluating the large language model, then performing post-processing and optimization on the model, adjusting the parameters of the large language model according to the evaluation results of the post-processing and optimization, and inputting the adjusted parameters into the large language model to perfect its construction, this perfecting of the construction of the large language model forming a first iteration, whose number of occurrences is counted as N1 and whose weight is 3;
after the post-processing and optimization of the large language model, actually deploying and applying the large language model, obtaining application feedback during application, inputting the application feedback into the large language model, performing post-processing and optimization on the large language model again, adjusting the parameters of the large language model according to the evaluation results of this post-processing and optimization, and inputting the adjusted parameters into the large language model to perfect its construction, this perfecting process being called a second iteration, whose number of occurrences is N2 and whose weight is 4;
once the feature vectors have finished deep learning in the large language model, returning the large language model to adaptive pre-training, and during actual deployment and application, returning the adaptive pre-training results for further adaptive pre-training;
combining the adaptively pre-trained large language model with an MOE model and then performing model design and training on the combined model, the combination with the MOE model and the subsequent model design and training forming a third iteration, whose number of occurrences is N3 and whose weight is 6; and
during execution of the method, counting at any time the total number of iterations M = 3N1 + 4N2 + 6N3, wherein once M exceeds a preset times threshold this indicates that the quality of the preprocessed data is still poor, so that the collection of language data must be performed again.
2. The method of claim 1, wherein the feature vectors are optimized on the basis of the output model result, the optimized feature vectors are returned to the feature engineering stage to further extract semantics and grammar, and the resulting feature vectors are input into the model for model design and model training.
3. The method of claim 1, wherein, during data collection, data collection is restarted according to the data quality, and the newly collected data replaces the previously collected data as the updated data set.
4. The method of claim 1, wherein the language class data is collected by a crawler.
5. The method of claim 1, wherein the maximum value of the number of times threshold is 100.
CN202311243279.1A 2023-09-26 2023-09-26 Code generation and deficiency supplementing method based on large language model Active CN116991391B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311243279.1A CN116991391B (en) 2023-09-26 2023-09-26 Code generation and deficiency supplementing method based on large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311243279.1A CN116991391B (en) 2023-09-26 2023-09-26 Code generation and deficiency supplementing method based on large language model

Publications (2)

Publication Number Publication Date
CN116991391A (en) 2023-11-03
CN116991391B (en) 2023-12-08

Family

ID=88528641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311243279.1A Active CN116991391B (en) 2023-09-26 2023-09-26 Code generation and deficiency supplementing method based on large language model

Country Status (1)

Country Link
CN (1) CN116991391B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112799655A (en) * 2021-01-26 2021-05-14 浙江香侬慧语科技有限责任公司 Multi-type code automatic generation method, device and medium based on pre-training
CN114186609A (en) * 2021-11-09 2022-03-15 阿里巴巴云计算(北京)有限公司 Model training method and device
CN116680575A (en) * 2023-08-04 2023-09-01 腾讯科技(深圳)有限公司 Model processing method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220383206A1 (en) * 2021-05-28 2022-12-01 Google Llc Task Augmentation and Self-Training for Improved Few-Shot Learning

Also Published As

Publication number Publication date
CN116991391A (en) 2023-11-03

Similar Documents

Publication Publication Date Title
US10417329B2 (en) Dialogue act estimation with learning model
Yuan et al. Modular deep reinforcement learning with temporal logic specifications
CN111967594A (en) Neural network compression method, device, equipment and storage medium
CN113342318B (en) Fine-grained code automatic generation method and system based on multi-view code characteristics
CN110956309A (en) Flow activity prediction method based on CRF and LSTM
CN117634867B (en) RPA flow automatic construction method and system combining large language model and reinforcement learning
CN112000793B (en) Man-machine interaction oriented dialogue target planning method
CN116991391B (en) Code generation and deficiency supplementing method based on large language model
CN117037789B (en) Customer service voice recognition method and device, computer equipment and storage medium
Krishnamoorthy et al. Deep learning techniques and optimization strategies in big data analytics: automated transfer learning of convolutional neural networks using Enas algorithm
CN117220266A (en) New energy predicted output scene generation method and system
CN112257872A (en) Target planning method for reinforcement learning
CN111553142A (en) Natural language reasoning method and system
CN116347504A (en) Communication base station flow prediction method based on EMD-MWOA-LSTM
Xue et al. A risk analysis and prediction model of electric power GIS based on deep learning
CN112181420B (en) Compiler defect positioning method based on reinforcement learning
CN112633516B (en) Performance prediction and machine learning compiling optimization method and device
CN115168864A (en) Intelligent cross contract vulnerability detection method based on feature cross
Lee et al. A deep learning model generation method for code reuse and automatic machine learning
CN116958752B (en) Power grid infrastructure archiving method, device and equipment based on IPKCNN-SVM
CN116527411B (en) Data security intelligent protection model construction method and device and collaboration platform
Pascual De La Puente Efficient, end-to-end and self-supervised methods for speech processing and generation
Arora Action Model Learning for Socio-Communicative Human Robot Interaction
JP6712540B2 (en) Model parameter generation device, model parameter generation method, speech recognition device generation method, program
McCormack Parameter Adaptation using a Meta Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant