CN116991391B - Code generation and deficiency supplementing method based on large language model - Google Patents

Code generation and deficiency supplementing method based on large language model

Info

Publication number
CN116991391B
CN116991391B (application CN202311243279.1A)
Authority
CN
China
Prior art keywords
model
language model
training
data
large language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311243279.1A
Other languages
Chinese (zh)
Other versions
CN116991391A (en)
Inventor
刘春江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yifang Technology Co ltd
Original Assignee
Beijing Yifang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yifang Technology Co ltd
Priority to CN202311243279.1A
Publication of CN116991391A
Application granted
Publication of CN116991391B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/35 Creation or generation of source code model driven
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/31 Programming languages or programming paradigms
    • G06F8/315 Object-oriented languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a code generation and deficiency supplementing method based on a large language model, which comprises the following steps: collecting code data with crawler technology; preprocessing the data; extracting semantics and grammar from the data through feature engineering to obtain feature vectors; inputting the feature vectors into a model, performing model design and training, and outputting a model result; during this process, evaluating the model, performing post-processing and optimization, adjusting the model parameters according to the evaluation result, and feeding the adjusted parameters back into the model; then deploying and applying the model, obtaining application feedback, inputting the feedback into the model, and performing post-processing and optimization on the model again; once deep learning is finished, the model returns to adaptive pre-training, and adaptive pre-training is performed on the returned results; the adaptively pre-trained model is combined with an MOE model, and the combined model is subjected to model design and training.

Description

Code generation and deficiency supplementing method based on large language model
Technical Field
The invention relates to a code generation and deficiency supplementing method based on a large language model.
Background
Currently, small models (e.g., recurrent neural networks, RNNs for short) are often used to perform simple code generation tasks, such as generating basic code segments, functions, or simple code sequences. However, such small models tend to struggle with complex code generation tasks and perform poorly.
Meanwhile, small models have some application in code completion: given part of the code or a prompt, a small model can infer and complete the remaining code, but the accuracy and diversity of the completion are poor. In addition, small models have some application in simple code error correction.
However, small models have weak language modeling capability. This means that a small model has difficulty capturing complex grammatical structures and code context, which limits the quality and accuracy of the generated code.
Moreover, training an accurate small model usually requires a sufficient number of code samples. Because pre-trained models for a particular task are relatively scarce, accuracy in a specific domain or for a specific programming language may be poor.
Further, the code generated by a small model may be relatively conservative and monotonous, lacking diversity. This makes it perform poorly on complex or creative code generation tasks.
At the same time, a small model may ignore the structure and format of the code, producing code that is not clean and is hard to read. Owing to its limited language modeling capability, it may also generate unreasonable or erroneous code and may even mislead developers. In addition, small models have limited accuracy in code error correction: they may detect some common programming errors, but they struggle to capture more complex code defects.
Disclosure of Invention
The invention provides a code generation and deficiency supplementing method based on a large language model, which effectively solves the technical problems existing in the prior art.
Specifically, the invention provides a code generation and deficiency supplementing method based on a large language model, which comprises the following steps: collecting language data by means of crawler technology; preprocessing the collected language data; extracting semantics and grammar from the preprocessed data through feature engineering, thereby obtaining feature vectors; inputting the feature vectors into a large language model, so that the large language model is designed and trained and a model result is output; during the design and training of the large language model, evaluating the large language model, then performing post-processing and optimization on the model, adjusting the parameters of the large language model according to the evaluation results of the post-processing and optimization, and inputting the adjusted parameters into the large language model to perfect its construction, this perfecting of the construction of the large language model forming a first iteration, whose number of occurrences is counted as N1 and whose weight is 3; after the post-processing and optimization of the large language model, actually deploying and applying the large language model, obtaining application feedback during application, inputting the application feedback into the large language model, performing post-processing and optimization on the large language model again, adjusting the parameters of the large language model according to the evaluation results of this post-processing and optimization, and inputting the adjusted parameters into the large language model to perfect its construction, this perfecting process being called a second iteration, whose number of occurrences is N2 and whose weight is 4; once the feature vectors have finished deep learning in the large language model, returning the large language model to adaptive pre-training, and during actual deployment and application, returning the adaptive pre-training results for further adaptive pre-training; combining the adaptively pre-trained large language model with an MOE model and then performing model design and training on the combined model, the combination with the MOE model and the subsequent model design and training forming a third iteration, whose number of occurrences is N3 and whose weight is 6; and, during execution of the method, counting at any time the total number of iterations M = 3N1 + 4N2 + 6N3, where once M exceeds a preset times threshold this indicates that the quality of the preprocessed data is still poor, so that the collection of language data must be performed again.
Preferably, the feature vectors are optimized on the basis of the output model result, the optimized feature vectors are returned to the feature engineering stage for further extraction of semantics and grammar, and the resulting feature vectors are then input into the model again for model design and training.
Preferably, during data collection, data collection is restarted according to the data quality, and the newly collected data replaces the previously collected data as the updated data set.
Preferably, the language class data is collected by a crawler.
Preferably, the maximum value of the number of times threshold is 100.
Through the technical innovation of the invention, the invention can realize the functions of more efficient, more accurate and more intelligent code generation, complementation, error correction and the like. Meanwhile, the method has better data adaptability and generalization capability, can exert excellent performance in different fields and tasks, and improves programming experience and efficiency of developers.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below illustrate only some embodiments of the present invention, and that other embodiments and drawings can be derived from them by a person skilled in the art without inventive effort.
FIG. 1 illustrates a basic flow diagram of a large language model based code generation gap-filling method in accordance with the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made in detail with reference to the accompanying drawings, wherein it is apparent that the embodiments described are only some, but not all embodiments of the present invention. All other embodiments, which can be made by a person of ordinary skill in the art without the need for inventive faculty, are within the scope of the invention, based on the embodiments described in the present invention.
The invention provides a code generation and deficiency supplementing method based on a large language model. FIG. 1 illustrates a basic flow diagram of a large language model based code generation gap-filling method in accordance with the present invention.
According to FIG. 1, language data collection is performed first. At this stage, a multi-source data collection strategy can be adopted; specifically, crawler technology can be combined with data synthesis methods to construct a more comprehensive and richer data set, which increases the scale and diversity of the data set and improves the generalization capability and robustness of the large model.
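The description only names crawler technology without fixing an implementation. The following is a minimal illustrative sketch in Python, assuming the requests and beautifulsoup4 packages are available; the seed URLs and the rule of harvesting <pre>/<code> blocks are assumptions made for illustration, not details of the invention.

import requests
from bs4 import BeautifulSoup

# Hypothetical seed pages; in practice these would be code-hosting or
# documentation pages chosen for the target domain.
SEED_URLS = [
    "https://example.com/snippets/page1",
    "https://example.com/snippets/page2",
]

def collect_code_samples(urls):
    """Fetch each page and keep the text of <pre>/<code> blocks as raw samples."""
    samples = []
    for url in urls:
        resp = requests.get(url, timeout=10)
        if resp.status_code != 200:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        for block in soup.find_all(["pre", "code"]):
            text = block.get_text().strip()
            if text:
                samples.append({"source": url, "code": text})
    return samples

if __name__ == "__main__":
    dataset = collect_code_samples(SEED_URLS)
    print(f"collected {len(dataset)} raw code samples")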
Data preprocessing is then performed. In the data preprocessing stage, if the quality of the collected language data is found to be poor, data collection can be restarted, and the newly collected data replaces the data obtained in the original collection stage. In other words, the data preprocessing here filters the collected raw language data, for example by using the Grubbs criterion to determine which data points are outliers, so that the relevant abnormal data are eliminated in the preprocessing stage.
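The description names the outlier criterion but not how it is applied. The sketch below assumes that each sample is first reduced to a single numeric statistic (here simply its length, an illustrative choice) and shows how a two-sided Grubbs test could flag abnormal samples; it uses numpy and scipy.

import numpy as np
from scipy import stats

def grubbs_outlier_indices(values, alpha=0.05):
    """Iteratively flag single outliers with the two-sided Grubbs test."""
    data = np.asarray(values, dtype=float)
    idx = np.arange(len(data))
    outliers = []
    while len(data) > 2:
        mean, std = data.mean(), data.std(ddof=1)
        if std == 0:
            break
        deviations = np.abs(data - mean)
        i_max = int(np.argmax(deviations))
        g = deviations[i_max] / std
        n = len(data)
        # Critical value of the Grubbs statistic from the t distribution.
        t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))
        if g <= g_crit:
            break
        outliers.append(int(idx[i_max]))
        data = np.delete(data, i_max)
        idx = np.delete(idx, i_max)
    return outliers

# Example: flag samples whose length deviates abnormally from the rest.
sample_lengths = [120, 118, 131, 125, 122, 119, 980]
print(grubbs_outlier_indices(sample_lengths))   # -> [6]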
The preprocessed language data is passed through feature engineering to extract semantics and grammar, yielding feature vectors. The feature vectors are input into a large language model for design and training and a model result is output; the feature vectors are then optimized on the basis of the output result, the optimized feature vectors are returned to the feature engineering stage for further extraction of semantics and grammar, and the result is again input into the large language model for design and training. This is an iterative process in which the large language model is continuously refined, and this iteration is referred to herein as the first iteration. During one run of the invention, the number of first iterations is counted as N1. Each first iteration traverses three nodes (feature engineering, model design and training, and feature optimization), so the count N1 is given a weight of 3.
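The concrete form of the feature vectors is not fixed by the description. The following sketch assumes Python source snippets and, purely for illustration, builds a small vector from token-category counts (grammar) and AST-node counts (semantics) using only the standard library and numpy.

import ast
import io
import tokenize
import numpy as np

def code_to_feature_vector(source: str) -> np.ndarray:
    """Tiny 'grammar + semantics' feature vector for one Python snippet."""
    # Grammar-level features: counts of token categories.
    counts = {"NAME": 0, "OP": 0, "NUMBER": 0, "STRING": 0}
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            name = tokenize.tok_name[tok.type]
            if name in counts:
                counts[name] += 1
    except (tokenize.TokenError, IndentationError):
        pass
    # Semantic-level features: counts of structural AST nodes.
    n_funcs = n_loops = n_calls = 0
    try:
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.FunctionDef):
                n_funcs += 1
            elif isinstance(node, (ast.For, ast.While)):
                n_loops += 1
            elif isinstance(node, ast.Call):
                n_calls += 1
    except SyntaxError:
        pass
    return np.array([counts["NAME"], counts["OP"], counts["NUMBER"],
                     counts["STRING"], n_funcs, n_loops, n_calls], dtype=float)

print(code_to_feature_vector("def add(a, b):\n    return a + b\n"))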
As can be seen in FIG. 1, model design and training is the core hub of the whole process. During model design and training, generalization evaluation can be performed continuously on the large language model; once the generalization of the model is found to be poor, the model can be redesigned and the redesigned model put back into training.
During model evaluation, the model can be further post-processed and optimized; the parameters of the model are then adjusted according to the evaluation results of the post-processing and optimization, and the adjusted parameters are input into the model to perfect its construction. This is another iterative process around model design and training, referred to herein as the second iteration. In one run of the method of the invention, the number of second iterations is counted as N2. Each second iteration passes through four nodes (model design and training, model evaluation, post-processing and optimization, and model parameter adjustment), so the weight of the second iteration is 4.
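Neither the model architecture nor the adjustment rule is specified in the description. The sketch below therefore uses a deliberately tiny stand-in network, random data, and a simple rule (halve the learning rate whenever the validation loss stops improving) purely to illustrate how the evaluate / post-process / adjust-parameters cycle and the counting of N2 could be wired together; it assumes PyTorch is installed.

import torch
from torch import nn

# Tiny stand-in for the large language model, with synthetic data.
model = nn.Sequential(nn.Linear(7, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
x_train, y_train = torch.randn(256, 7), torch.randn(256, 1)
x_val, y_val = torch.randn(64, 7), torch.randn(64, 1)

best_val, n2 = float("inf"), 0
for epoch in range(20):
    # Model design and training node.
    model.train()
    optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer.step()

    # Model evaluation node.
    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()

    # Post-processing / optimization node: adjust a parameter (here the
    # learning rate) from the evaluation result and feed it back into training.
    if val_loss >= best_val:
        for group in optimizer.param_groups:
            group["lr"] *= 0.5
        n2 += 1                      # one completed second iteration
    best_val = min(best_val, val_loss)

print(f"second iterations N2 = {n2}, best validation loss = {best_val:.4f}")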
On the other hand, after the large language model has been post-processed and optimized, it can be actually deployed and applied; application feedback is obtained in actual use and fed back into the model for another round of post-processing and optimization, so that the model is adjusted again in practice.
Once the deep learning of the feature vectors in the large language model is completed, a return to adaptive pre-training may be performed. During deployment and application, the adaptive pre-training results can likewise be returned for further adaptive pre-training. If, during adaptive pre-training, the adaptability of the model is found to be insufficient, the pre-training strategy is adjusted and adaptive pre-training is carried out again on the basis of the adjusted strategy.
The introduction of adaptive pre-training is a key step in the overall process, through which large language models can be personalized based on data of a particular domain or task. Such adaptive pre-training can enhance the model's code reasoning and generalization capabilities for specific areas, making it more adaptable to different programming tasks.
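The description characterizes adaptive pre-training only as continued, domain-specific training. As a sketch of what that could look like in practice, the snippet below continues causal-language-model training on a handful of domain code strings; the use of the Hugging Face transformers library, the small gpt2 checkpoint as a stand-in model, and the two example snippets are all assumptions made for illustration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical domain-specific code snippets gathered during deployment.
domain_snippets = [
    "def read_config(path):\n    import json\n    return json.load(open(path))",
    "total = 0\nfor row in rows:\n    total += row['amount']",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(2):                                        # adaptive pre-training passes
    for text in domain_snippets:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        out = model(**batch, labels=batch["input_ids"])   # causal-LM loss
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
print("domain-adaptive pre-training pass finished")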
The adaptively pre-trained model can also be combined with an MOE model, and the combined model is then subjected to model design and training. Introducing the MOE (Mixture of Experts) model enables the large model to better exploit the strengths of multiple sub-models and improves the accuracy and efficiency of code reasoning through expert combination. The MOE model can automatically select the most suitable sub-model in different scenarios, thereby realizing more intelligent code generation and completion functions.
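The description does not commit to a particular MOE formulation. The sketch below implements the simplest dense (softly gated) variant in PyTorch with illustrative layer sizes; production systems often use sparse top-k routing instead, a refinement omitted here for brevity.

import torch
from torch import nn

class TinyMoE(nn.Module):
    """Minimal mixture-of-experts layer: a gate blends several expert networks."""
    def __init__(self, d_model: int = 64, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, d_model)
        weights = torch.softmax(self.gate(x), dim=-1)       # (batch, n_experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)
        return (weights.unsqueeze(-1) * expert_outs).sum(dim=1)

moe = TinyMoE()
print(moe(torch.randn(2, 64)).shape)    # torch.Size([2, 64])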
The path that runs from model design and training through model evaluation, post-processing and optimization, deployment and application, and adaptive pre-training, and then through the combination with the MOE model back to model design and training, is referred to as the third iteration. In one run of the invention its count is N3; since this iteration path passes through six nodes in total, its weight is 6.
Therefore, in the code generation and deficiency supplementing method based on the large language model, the total number of iterations M can be counted at any time during operation. It should be noted that the total number of iterations is not simply N1 + N2 + N3; taking the weights of the three kinds of iteration into account, it is counted as M = 3N1 + 4N2 + 6N3. Once M exceeds the times threshold, this indicates that the quality of the preprocessed data is still poor, so the collection of language data must be performed again; this constitutes the overall iteration process.
Of course, in practical application, the times threshold set for the total number of iterations M has a maximum value, namely 100.
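The bookkeeping behind this criterion can be stated directly; in the following sketch only the weights and the threshold of 100 come from the description, while the particular counts in the example are arbitrary.

def total_iterations(n1: int, n2: int, n3: int) -> int:
    """Weighted total iteration count M = 3*N1 + 4*N2 + 6*N3."""
    return 3 * n1 + 4 * n2 + 6 * n3

def need_recollect(n1: int, n2: int, n3: int, threshold: int = 100) -> bool:
    """True once M exceeds the times threshold, i.e. data collection must restart."""
    return total_iterations(n1, n2, n3) > threshold

# Example: 12 first, 9 second and 5 third iterations give M = 36 + 36 + 30 = 102.
print(total_iterations(12, 9, 5), need_recollect(12, 9, 5))   # 102 True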
By iterating several times and organically combining the counts of each branch iteration into an overall iteration statistic, the method provided by the invention can quantitatively judge whether the large language model is running smoothly, thereby ensuring the precision and rigor of the design of the large language model.
The invention has now been described in general terms. In summary, the code generation and deficiency supplementing method based on the large language model provided by the invention integrates the above steps comprehensively. In particular, by organically combining data collection, preprocessing, feature engineering, model design and training, model evaluation, post-processing and optimization, and deployment and application, a complete large-model training process is formed. This comprehensive flow can realize tasks such as code generation, completion and error correction more efficiently, and improves overall development efficiency.
Further, the introduction of adaptive pre-training as a key step enables large models to be personalized based on data from a particular domain or task. Such adaptive pre-training can enhance the code reasoning ability and generalization ability of large models for specific fields, making them more suitable for different programming tasks.
Furthermore, the introduction of the MOE (Mixture of Experts) model enables the large model to better exploit the strengths of multiple sub-models and improves the accuracy and efficiency of code reasoning through expert combination. The MOE model can automatically select the most suitable sub-model in different scenarios, thereby realizing more intelligent code generation and completion functions.
Still further, a multi-source data collection strategy is employed in the data collection phase: crawler technology and data synthesis methods are combined to construct a more comprehensive and richer data set. This increases the scale and diversity of the data set and improves the generalization capability and robustness of the large model.
Through the technical innovation of the invention, the invention can realize the functions of more efficient, more accurate and more intelligent code generation, complementation, error correction and the like. Meanwhile, the method has better data adaptability and generalization capability, can exert excellent performance in different fields and tasks, and improves programming experience and efficiency of developers.
The foregoing description of the exemplary embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any modifications, equivalents, and variations which fall within the spirit and scope of the invention are intended to be included in the scope of the invention.

Claims (5)

1. A code generation and deficiency supplementing method based on a large language model, the method comprising:
collecting language data by utilizing a crawler technology;
preprocessing the collected language data;
extracting semantics and grammar from the preprocessed data through feature engineering, thereby obtaining feature vectors;
the feature vector is input into a large language model, so that the large language model is designed and trained, and a model result is output;
during the design and training of the large language model, evaluating the large language model, then performing post-processing and optimization on the model, adjusting the parameters of the large language model according to the evaluation results of the post-processing and optimization, and inputting the adjusted parameters into the large language model to perfect its construction, this perfecting of the construction of the large language model forming a first iteration, whose number of occurrences is counted as N1 and whose weight is 3;
after the post-processing and optimization of the large language model, actually deploying and applying the large language model, obtaining application feedback during application, inputting the application feedback into the large language model, performing post-processing and optimization on the large language model again, adjusting the parameters of the large language model according to the evaluation results of this post-processing and optimization, and inputting the adjusted parameters into the large language model to perfect its construction, this perfecting process being called a second iteration, whose number of occurrences is N2 and whose weight is 4;
once the feature vectors have finished deep learning in the large language model, returning the large language model to adaptive pre-training, and during actual deployment and application, returning the adaptive pre-training results for further adaptive pre-training;
combining the adaptively pre-trained large language model with an MOE model and then performing model design and training on the combined model, the combination with the MOE model and the subsequent model design and training forming a third iteration, whose number of occurrences is N3 and whose weight is 6; and
during execution of the method, counting at any time the total number of iterations M = 3N1 + 4N2 + 6N3, wherein once M exceeds a preset times threshold this indicates that the quality of the preprocessed data is still poor, so that the collection of language data must be performed again.
2. The method of claim 1, wherein the feature vectors are optimized on the basis of the output model result, the optimized feature vectors are returned to the feature engineering stage to further extract semantics and grammar, and the resulting feature vectors are input into the model for model design and model training.
3. The method of claim 1, wherein, during data collection, data collection is restarted according to the data quality, and the newly collected data replaces the previously collected data as the updated data set.
4. The method of claim 1, wherein the language class data is collected by a crawler.
5. The method of claim 1, wherein the maximum value of the number of times threshold is 100.
CN202311243279.1A 2023-09-26 2023-09-26 Code generation and deficiency supplementing method based on large language model Active CN116991391B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311243279.1A CN116991391B (en) 2023-09-26 2023-09-26 Code generation and deficiency supplementing method based on large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311243279.1A CN116991391B (en) 2023-09-26 2023-09-26 Code generation and deficiency supplementing method based on large language model

Publications (2)

Publication Number Publication Date
CN116991391A (en) 2023-11-03
CN116991391B (en) 2023-12-08

Family

ID=88528641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311243279.1A Active CN116991391B (en) 2023-09-26 2023-09-26 Code generation and deficiency supplementing method based on large language model

Country Status (1)

Country Link
CN (1) CN116991391B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112799655A (en) * 2021-01-26 2021-05-14 浙江香侬慧语科技有限责任公司 Multi-type code automatic generation method, device and medium based on pre-training
CN114186609A (en) * 2021-11-09 2022-03-15 阿里巴巴云计算(北京)有限公司 Model training method and device
CN116680575A (en) * 2023-08-04 2023-09-01 腾讯科技(深圳)有限公司 Model processing method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220383206A1 (en) * 2021-05-28 2022-12-01 Google Llc Task Augmentation and Self-Training for Improved Few-Shot Learning

Also Published As

Publication number Publication date
CN116991391A (en) 2023-11-03

Similar Documents

Publication Publication Date Title
US10417329B2 (en) Dialogue act estimation with learning model
Yuan et al. Modular deep reinforcement learning with temporal logic specifications
CN111967594A (en) Neural network compression method, device, equipment and storage medium
CN113342318B (en) Fine-grained code automatic generation method and system based on multi-view code characteristics
CN110956309A (en) Flow activity prediction method based on CRF and LSTM
CN117634867B (en) RPA flow automatic construction method and system combining large language model and reinforcement learning
CN112000793B (en) Man-machine interaction oriented dialogue target planning method
CN116991391B (en) Code generation and deficiency supplementing method based on large language model
CN117037789B (en) Customer service voice recognition method and device, computer equipment and storage medium
Krishnamoorthy et al. Deep learning techniques and optimization strategies in big data analytics: automated transfer learning of convolutional neural networks using Enas algorithm
CN117220266A (en) New energy predicted output scene generation method and system
CN112257872A (en) Target planning method for reinforcement learning
CN111553142A (en) Natural language reasoning method and system
CN116347504A (en) Communication base station flow prediction method based on EMD-MWOA-LSTM
Xue et al. A risk analysis and prediction model of electric power GIS based on deep learning
CN112181420B (en) Compiler defect positioning method based on reinforcement learning
CN112633516B (en) Performance prediction and machine learning compiling optimization method and device
CN115168864A (en) Intelligent cross contract vulnerability detection method based on feature cross
Lee et al. A deep learning model generation method for code reuse and automatic machine learning
CN116958752B (en) Power grid infrastructure archiving method, device and equipment based on IPKCNN-SVM
CN116527411B (en) Data security intelligent protection model construction method and device and collaboration platform
Pascual De La Puente Efficient, end-to-end and self-supervised methods for speech processing and generation
Arora Action Model Learning for Socio-Communicative Human Robot Interaction
JP6712540B2 (en) Model parameter generation device, model parameter generation method, speech recognition device generation method, program
McCormack Parameter Adaptation using a Meta Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant