CN116910535A - Programming-based large language model fine tuning-free pre-training method and device - Google Patents

Programming-based large language model fine tuning-free pre-training method and device

Info

Publication number
CN116910535A
Authority
CN
China
Prior art keywords
language model
code
keyword
training
large language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310671309.2A
Other languages
Chinese (zh)
Inventor
刘英博
吕武谦
王建民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202310671309.2A priority Critical patent/CN116910535A/en
Publication of CN116910535A publication Critical patent/CN116910535A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/14Tree-structured documents
    • G06F40/146Coding or compression of tree-structured data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a programming-based large language model fine-tuning-free pre-training method and device, comprising the following steps: extracting the keywords of each sample in a code prompt training set to generate a keyword set corresponding to the code prompt training set; assigning a compressed symbol to each keyword in the keyword set; replacing each keyword in the code prompt training set with its compressed symbol to obtain a first code prompt training set; and pre-training the universal large language model in a prompt-learning manner based on the first code prompt training set. Exploiting the fact that a program compiler is insensitive to the natural-language meaning of identifier names, the invention symbolically compresses and encodes the keywords in the code prompt training set, which greatly shortens the code prompt text fed to the large language model and enables efficient pre-training with as few prompt inputs as possible despite the model's length limit on dialogue input.

Description

Programming-based large language model fine tuning-free pre-training method and device
Technical Field
The invention relates to the technical field of computer model training, in particular to a programming-based large language model fine-tuning-free pre-training method and device.
Background
Today, generative pre-trained large language models, represented by the GPT models released by OpenAI, are developing rapidly; they have strong dialogue-generation capability and can answer a large number of knowledge questions through question-and-answer replies. A domain-specific pre-trained large language model is obtained by fine-tuning a general large language model with domain-specific knowledge, which further improves the quality of the model's answers in that domain.
However, for a particular domain, fine-tuning a general large language model faces the following difficulties:
1. A sizeable and effective training set must be constructed in advance, and the fine-tuned base model must be verified repeatedly to avoid degrading its generation capability.
2. If the domain-specific knowledge changes, additional training on the changed knowledge is required; without it, the fine-tuned large language model does not possess the new knowledge.
3. Once the underlying base model needs to be replaced, the domain model must be fine-tuned again from scratch.
4. If the training information used for fine-tuning is sensitive, it is internalized into the large model as soon as fine-tuning begins, creating a risk of leakage to others.
5. Although a large language model has contextual memory, it imposes a strict limit on input length, so a large training set cannot be fed in directly through dialogue.
Considering the above difficulties, and given that a general large language model already has a certain code-generation capability, a large language model fine-tuning-free pre-training method is proposed in which the general large language model is taught through prompt learning with domain-specific code prompt text, so that the model gains question-answer code-generation capability for the specific domain. However, the model limits how much prompt-learning text can be entered at once, while the code prompt training set is relatively large, which affects how well the model learns the domain-specific knowledge.
Therefore, a new method for pre-training a large language model without fine tuning needs to be provided.
Disclosure of Invention
In order to solve the above problems, the invention provides a programming-based large language model fine-tuning-free pre-training method and device. Exploiting the fact that a program compiler is insensitive to the natural-language meaning of identifier names, the keywords in the code prompt training set are symbolically compressed and encoded, which greatly shortens the code prompt text fed to the large language model and enables efficient pre-training with as few prompt inputs as possible despite the model's length limit on dialogue input. The method relies on the large language model's basic programming capability: when the fine-tuning-free pre-trained large language model is applied, it returns a reply code for an input question code whose keywords have been symbolically compressed and encoded, and applying the inverse symbolic decoding to the reply code yields a human-understandable program. The application process is simple and easy to implement.
In a first aspect, the present invention provides a programming-based large language model fine-tuning-free pre-training method, the method comprising:
extracting keywords of each sample in the code prompt training set to generate a keyword set corresponding to the code prompt training set;
assigning a compressed symbol to each keyword in the keyword set;
replacing each keyword in the code prompt training set with a compressed symbol to obtain a first code prompt training set;
based on the first code prompt training set, the universal large language model is pre-trained by adopting a prompt learning mode.
According to the programming-based large language model fine-tuning-free pre-training method provided by the invention, the samples in the code prompt training set are presented in a "prompt-completion" form;
wherein the prompt is a question or an exemplary code description;
and the completion is a code description that guides the model to generate the optimal reply.
According to the programming-based large language model fine-tuning-free pre-training method provided by the invention, extracting the keywords of each sample in the code prompt training set comprises the following steps:
generating an abstract syntax tree corresponding to each sample;
extracting words of specific types from the abstract syntax tree and taking the extracted words as the keywords of each sample.
In the programming-based large language model fine-tuning-free pre-training method provided by the invention, the specific types of words include, but are not limited to, variable names, method names and special symbols.
According to the programming-based large language model fine-tuning-free pre-training method provided by the invention, when a compressed symbol is assigned to each keyword in the keyword set, the following conditions are satisfied simultaneously:
condition 1: different keywords in the keyword set have different compression symbols;
condition 2: replacing each keyword in the code prompt training set with its compressed symbol must not change the code semantics.
According to the programming-based large language model fine-tuning-free pre-training method provided by the invention, after a compressed symbol is assigned to each keyword in the keyword set, the method further comprises the following step:
establishing a mapping table recording the mapping relation between each keyword in the keyword set and its compressed symbol.
According to the programming-based large language model fine-tuning-free pre-training method provided by the invention, the application process of the pre-trained large language model comprises the following steps:
acquiring a code representation of the question to be asked;
inputting the code representation into the pre-trained large language model to obtain a first answer;
and replacing each compressed symbol in the first answer with its keyword, based on the mapping table, to obtain the answer to the question.
In a second aspect, the present invention provides a programming-based large language model fine-tuning-free pretraining apparatus, the apparatus comprising:
the generation module is used for extracting the keywords of each sample in the code prompt training set so as to generate a keyword set corresponding to the code prompt training set;
the distribution module is used for distributing a compression symbol for each keyword in the keyword set;
the replacing module is used for replacing each keyword in the code prompt training set with a compressed symbol thereof to obtain a first code prompt training set;
and the pre-training module is used for pre-training the universal large language model in a prompt learning mode based on the first code prompt training set.
In a third aspect, the present invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the programming-based large language model fine-tuning free pre-training method of the first aspect when executing the program.
In a fourth aspect, the present invention provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the programming-based large language model fine-tuning free pre-training method of the first aspect.
The invention provides a programming-based large language model fine-tuning-free pre-training method and device, comprising the following steps: extracting the keywords of each sample in a code prompt training set to generate a keyword set corresponding to the code prompt training set; assigning a compressed symbol to each keyword in the keyword set; replacing each keyword in the code prompt training set with its compressed symbol to obtain a first code prompt training set; and pre-training the universal large language model in a prompt-learning manner based on the first code prompt training set. Exploiting the fact that a program compiler is insensitive to the natural-language meaning of identifier names, the invention symbolically compresses and encodes the keywords in the code prompt training set, which greatly shortens the code prompt text fed to the large language model and enables efficient pre-training with as few prompt inputs as possible despite the model's length limit on dialogue input.
Drawings
In order to more clearly illustrate the technical solutions of the invention or of the prior art, the drawings required in the description of the embodiments or of the prior art are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the invention, and that a person skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a programming-based large language model fine-tuning-free pre-training method provided by the invention;
FIG. 2 is a flow diagram of the pre-training and application of the large language model provided by the present invention;
FIG. 3 is a schematic diagram of a programming-based large language model fine-tuning-free pre-training device;
fig. 4 is a schematic structural diagram of an electronic device provided by the present invention;
reference numerals:
410: a processor; 420: a communication interface; 430: a memory; 440: a communication bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The programming-based large language model fine-tuning-free pre-training method and apparatus of the present invention are described below in conjunction with fig. 1-4.
In a first aspect, large language models impose length limits on dialogue input, and the programming code prompt data used for pre-training easily exceeds those limits and can no longer be entered. To this end, the present invention provides a programming-based large language model fine-tuning-free pre-training method, as shown in fig. 1, comprising:
S11: extracting keywords of each sample in the code prompt training set to generate a keyword set corresponding to the code prompt training set;
As large language models become deeply coupled with more and more domains, the demand for domain-specific pre-training grows ever more pressing. The code prompt training set mentioned here is constructed by domain experts who consolidate the domain-specific knowledge, so that the pre-trained large language model gains question-answer code-generation capability for that specific domain.
S12: assigning a compressed symbol to each keyword in the keyword set;
Here, there are no duplicate keywords in the keyword set.
S13: replacing each keyword in the code prompt training set with a compressed symbol to obtain a first code prompt training set;
S14: based on the first code prompt training set, the universal large language model is pre-trained by adopting a prompt learning mode.
The invention provides a programming-based large language model fine-tuning-free pre-training method. Exploiting the fact that a program compiler is insensitive to the natural-language meaning of identifier names, the keywords in the code prompt training set are symbolically compressed and encoded, which greatly shortens the code prompt text fed to the large language model and enables efficient pre-training with as few prompt inputs as possible despite the model's length limit on dialogue input.
Specifically, in the code prompt training set of S11, each sample is presented in a "prompt-completion" form;
wherein the prompt is a question or an exemplary code description;
and the completion is a code description that guides the model to generate the optimal reply.
Example: each piece of prompt training data must give a prompt, i.e. a description of a problem or an example, such as "Java code for calculating the area of a square"; in addition, each prompt must be given a completion that shows the large language model how to generate the optimal reply, and the large language model learns this pattern through meta-learning. For the prompt "Java code for calculating the area of a square" above, a piece of Java code implementing that function serves as the completion.
That is, by designing a large number of prompts according to the characteristics of the domain and taking their implementation code as the completions, a "prompt-completion" code prompt training set with domain characteristics can be constructed; one such sample is sketched below.
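For illustration only, the following minimal sketch (in Python, which the invention does not prescribe) shows what one such "prompt-completion" sample could look like; the field names and the Java snippet are assumed examples rather than content taken from the patent.

# One hypothetical "prompt-completion" training sample; the dictionary keys and
# the Java text are illustrative assumptions, not mandated by the invention.
sample = {
    "prompt": "Java code for calculating the area of a square",
    "completion": (
        "public class Square {\n"
        "    public static double area(double sideLength) {\n"
        "        return sideLength * sideLength;\n"
        "    }\n"
        "}\n"
    ),
}

# A code prompt training set is simply a collection of such samples.
code_prompt_training_set = [sample]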
Specifically, in S11, extracting keywords of each sample in the code prompt training set includes:
generating an abstract syntax tree corresponding to each sample;
extracting specific type words from the abstract syntax tree, and taking the extracted specific type words as the keywords of each sample.
The specific types of words referred to herein include, but are not limited to, variable names, method names, and special symbols.
Taking the JavaScript language as an example, the abstract syntax tree of the input code is parsed and traversed, and for each tree node:
it is judged whether the node is a variable declaration statement or an assignment statement; if so,
the variable name of this node is added to the variable table;
it is judged whether the node is a function call node; if so,
the function name of this node is added to the variable table.
This completes the extraction of the set of long names to be replaced.
In computer science, an abstract syntax tree (AST) is a tree representation of the abstract syntax structure of source code written in a programming language; it is composed of a data structure called a node, and each tree node carries the class-name information of that node. The keywords of each sample can therefore be extracted by traversing the tree nodes of the abstract syntax tree of the sample; a runnable sketch of this extraction follows.
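The sketch below illustrates the extraction step. It is an assumption rather than the invention's implementation: the example above walks a JavaScript AST, whereas this sketch uses Python's built-in ast module so that it stays self-contained and runnable; the function name and the sample code are hypothetical.

import ast

def extract_keywords(source: str) -> list[str]:
    """Collect assigned variable names and called-function names from source code."""
    keywords: list[str] = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Declaration/assignment statements: record the assigned variable names.
        if isinstance(node, (ast.Assign, ast.AnnAssign)):
            targets = node.targets if isinstance(node, ast.Assign) else [node.target]
            for target in targets:
                if isinstance(target, ast.Name):
                    keywords.append(target.id)
        # Function-call nodes: record the name of the called function.
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            keywords.append(node.func.id)
    # The keyword set contains no duplicates; keep first-seen order.
    return list(dict.fromkeys(keywords))

sample_code = "side_length = 4\nsquare_area = compute_square_area(side_length)"
print(extract_keywords(sample_code))
# ['side_length', 'square_area', 'compute_square_area']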
Specifically, in S12, when a compressed symbol is allocated to each keyword in the keyword set, the following conditions are satisfied at the same time:
condition 1: different keywords in the keyword set have different compression symbols;
condition 2: replacing each keyword in the code prompt training set with its compressed symbol must not change the code semantics.
The invention thus replaces the long-named keywords with short compressed symbols, solving the input-length problem of pre-training with programming code.
The invention may sort the keywords in the keyword set and then assign compressed symbols in order from shortest to longest, for example using "a" as the compressed symbol of the 1st keyword, "b" for the 2nd, "aa" for the 27th and "ab" for the 28th; a sketch of such a symbol generator follows. Two points must be satisfied during assignment. First, different keywords in the keyword set must have different compressed symbols; for example, the first and second keywords cannot both be mapped to "a", otherwise the model would learn incorrectly. Second, replacing each keyword in the code prompt training set with its compressed symbol must not change the code semantics, which guarantees the quality of the first code prompt training set generated subsequently.
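As one possible realization of such an assignment (assumed for illustration; the invention fixes only the two conditions, not the naming scheme), the Python sketch below generates symbols in the order a, b, ..., z, aa, ab, ... and maps each keyword to a distinct one.

import itertools
import string

def symbol_stream():
    """Yield compressed symbols in the order a, b, ..., z, aa, ab, ..."""
    for length in itertools.count(1):
        for letters in itertools.product(string.ascii_lowercase, repeat=length):
            yield "".join(letters)

def assign_symbols(keywords):
    """Assign a distinct compressed symbol to every keyword (condition 1)."""
    stream = symbol_stream()
    mapping = {}
    for keyword in keywords:
        # A fuller implementation would also skip symbols that collide with
        # reserved words or identifiers already present, to help satisfy condition 2.
        mapping[keyword] = next(stream)
    return mapping

print(assign_symbols(["side_length", "square_area", "compute_square_area"]))
# {'side_length': 'a', 'square_area': 'b', 'compute_square_area': 'c'}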
Specifically, after assigning a compressed symbol to each keyword in the keyword set, the method further includes:
establishing a mapping table recording the mapping relation between each keyword in the keyword set and its compressed symbol.
In fact, step S13 amounts to replacing, based on this mapping table, each keyword in the code prompt training set with its compressed symbol, so as to obtain the first code prompt training set; a sketch of this replacement follows.
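A minimal sketch of the replacement step, assuming the mapping table produced above: it uses whole-word textual substitution for brevity, whereas an implementation faithful to condition 2 would rewrite the AST nodes directly so that string literals and comments are never altered.

import re

def compress_sample(code: str, mapping: dict[str, str]) -> str:
    """Replace each keyword in a sample with its compressed symbol."""
    for keyword, symbol in mapping.items():
        # \b restricts matches to whole identifiers, so a keyword is never
        # replaced inside a longer identifier that merely contains it.
        code = re.sub(rf"\b{re.escape(keyword)}\b", symbol, code)
    return code

mapping = {"side_length": "a", "square_area": "b", "compute_square_area": "c"}
print(compress_sample("square_area = compute_square_area(side_length)", mapping))
# b = c(a)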
The invention provides a programming-based large language model fine-tuning-free pre-training method, which has the following advantages:
1. The difficulty of fine-tuning is reduced. Since the model does not need to be fine-tuned and the pre-training data only has to be entered through an ordinary dialogue, the hardware resources consumed by fine-tuning and the complex process of repeated verification are both avoided.
2. The method supplements the basic dialogue capability of a large language model and can quickly complete training on domain-specific knowledge within the dialogue context, which is very beneficial for pre-training work whose scenario changes frequently.
3. Build once, reuse many times. There is no need to repeatedly construct the specific training-set formats required by the base large language models of different vendors; the method is highly general and can be applied to various large language models.
4. The pre-training takes place within each dialogue context and leaves no permanent training data in the large language model, ensuring that the data information is not leaked.
5. The limitation that a large language model places on pre-training input of programming code is effectively overcome.
In conclusion, the method has good practicality and promotion value, and is expected to provide an efficient and convenient solution for pre-training work in the field of large language model programming.
Specifically, the application process of the pre-trained large language model comprises the following steps:
acquiring a code representation of the question to be asked;
inputting the code representation into the pre-trained large language model to obtain a first answer;
and replacing each compressed symbol in the first answer with its keyword, based on the mapping table, to obtain the answer to the question.
The invention relies on the basic programming capability of the large language model: when the fine-tuning-free pre-trained large language model is applied, it returns a reply code for an input question code whose keywords have been symbolically compressed and encoded, and applying the inverse symbolic decoding to the reply code yields a human-understandable program. The application process is simple and easy to implement.
Fig. 2 is a schematic flow chart of the pre-training and application of the large language model. As shown in fig. 2, the core of the pre-training of the invention is to extract the keywords of the code prompt training set from the abstract syntax tree (Abstract Syntax Tree, AST), automatically generate a mapping table recording the mapping relation between the keywords and their symbol codes, and replace the keywords in the code prompt training set with their symbol codes to obtain the first code prompt training set; these three steps are abbreviated as AST extraction, mapping-table generation and mapping-table code replacement. In actual pre-training, the three steps can be packaged in an encoder placed in front of the large language model: the code prompt training set is input into the encoder to obtain the first code prompt training set output by the encoder, and the large language model is then pre-trained with the first code prompt training set.
At application time, a decoder containing the mapping table generated by the encoder is created; the dialogue question is input into the pre-trained large language model to obtain a raw output, and the raw output is decoded by the decoder to obtain the dialogue answer; a sketch of this decoding step follows.
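The decoding (inverse symbolic compression) performed by the decoder can be sketched as follows; the mapping table and the raw model reply shown here are hypothetical, and the dialogue interface to the large language model itself is not modeled.

import re

def decompress_reply(reply_code: str, mapping: dict[str, str]) -> str:
    """Map compressed symbols in the model's raw output back to the original keywords."""
    inverse = {symbol: keyword for keyword, symbol in mapping.items()}
    for symbol, keyword in inverse.items():
        # \b restricts matches to whole identifiers, so "a" is never replaced inside "aa".
        reply_code = re.sub(rf"\b{re.escape(symbol)}\b", keyword, reply_code)
    return reply_code

# Hypothetical raw reply of the pre-trained model to a compressed question.
mapping = {"side_length": "a", "square_area": "b", "compute_square_area": "c"}
raw_reply = "def c(a):\n    b = a * a\n    return b"
print(decompress_reply(raw_reply, mapping))
# def compute_square_area(side_length):
#     square_area = side_length * side_length
#     return square_area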
In a second aspect, the programming-based large language model fine-tuning-free pre-training device provided by the invention is described; the device described below and the programming-based large language model fine-tuning-free pre-training method described above may be referred to in correspondence with each other. FIG. 3 is a schematic structural diagram of a programming-based large language model fine-tuning-free pre-training device according to the present invention; as shown in FIG. 3, the device comprises:
the generating module 21 is configured to extract keywords of each sample in the code prompt training set, so as to generate a keyword set corresponding to the code prompt training set;
an allocation module 22, configured to allocate a compressed symbol to each keyword in the keyword set;
a replacing module 23, configured to replace each keyword in the code prompt training set with a compressed symbol thereof, so as to obtain a first code prompt training set;
the pre-training module 24 is configured to pre-train the universal large language model in a prompt learning manner based on the first code prompt training set.
The invention provides a programming-based large language model fine-tuning-free pre-training device. Exploiting the fact that a program compiler is insensitive to the natural-language meaning of identifier names, the keywords in the code prompt training set are symbolically compressed and encoded, which greatly shortens the code prompt text fed to the large language model and enables efficient pre-training with as few prompt inputs as possible despite the model's length limit on dialogue input.
Based on the above embodiments, as an alternative embodiment, the samples in the code prompt training set are presented in a "prompt-completion" form;
wherein the prompt is a question or an exemplary code description;
and the completion is a code description that guides the model to generate the optimal reply.
On the basis of the foregoing embodiments, as an optional embodiment, the generating module includes:
the generation unit is used for generating an abstract syntax tree corresponding to each sample;
and the setting unit is used for extracting the specific type words from the abstract syntax tree and taking the extracted specific type words as the keywords of each sample.
On the basis of the above embodiments, as an alternative embodiment, the specific type words include, but are not limited to, variable names, method names and special symbols.
On the basis of the above embodiments, as an alternative embodiment, when assigning a compressed symbol to each keyword in the keyword set, the following conditions are satisfied at the same time:
condition 1: different keywords in the keyword set have different compression symbols;
condition 2: replacing each keyword in the code prompt training set with its compressed symbol must not change the code semantics.
On the basis of the foregoing embodiments, as an optional embodiment, after assigning a compressed symbol to each keyword in the keyword set, the method further includes:
establishing a mapping table recording the mapping relation between each keyword in the keyword set and its compressed symbol.
Based on the foregoing embodiments, as an alternative embodiment, the application process of the pre-trained large language model includes:
acquiring a code representation of the question to be asked;
inputting the code representation into the pre-trained large language model to obtain a first answer;
and replacing each compressed symbol in the first answer with its keyword, based on the mapping table, to obtain the answer to the question.
In a third aspect, fig. 4 illustrates a physical schematic diagram of an electronic device, as shown in fig. 4, where the electronic device may include: processor 410, communication interface (Communications Interface) 420, memory 430 and communication bus 440, wherein processor 410, communication interface 420 and memory 430 communicate with each other via communication bus 440. Processor 410 may invoke logic instructions in memory 430 to perform a programming-based large language model fine-tuning free pre-training method comprising: extracting keywords of each sample in the code prompt training set to generate a keyword set corresponding to the code prompt training set; assigning a compressed symbol to each keyword in the keyword set; replacing each keyword in the code prompt training set with a compressed symbol to obtain a first code prompt training set; based on the first code prompt training set, the universal large language model is pre-trained by adopting a prompt learning mode.
Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In a fourth aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the programming-based large language model fine-tuning-free pre-training method provided by the above methods, the method comprising: extracting keywords of each sample in the code prompt training set to generate a keyword set corresponding to the code prompt training set; assigning a compressed symbol to each keyword in the keyword set; replacing each keyword in the code prompt training set with a compressed symbol to obtain a first code prompt training set; based on the first code prompt training set, the universal large language model is pre-trained by adopting a prompt learning mode.
In a fifth aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the programming-based large language model fine-tuning free pre-training method provided by the above methods, the method comprising: extracting keywords of each sample in the code prompt training set to generate a keyword set corresponding to the code prompt training set; assigning a compressed symbol to each keyword in the keyword set; replacing each keyword in the code prompt training set with a compressed symbol to obtain a first code prompt training set; based on the first code prompt training set, the universal large language model is pre-trained by adopting a prompt learning mode.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A programming-based large language model fine-tuning-free pre-training method, the method comprising:
extracting keywords of each sample in the code prompt training set to generate a keyword set corresponding to the code prompt training set;
assigning a compressed symbol to each keyword in the keyword set;
replacing each keyword in the code prompt training set with a compressed symbol to obtain a first code prompt training set;
based on the first code prompt training set, the universal large language model is pre-trained by adopting a prompt learning mode.
2. The programming-based large language model fine-tuning-free pre-training method of claim 1, wherein the samples in the code prompt training set are presented in a "prompt-completion" form;
wherein the prompt is a question or an exemplary code description;
and the completion is a code description that guides the model to generate the optimal reply.
3. The programming-based large language model fine-tuning-free pre-training method of claim 1, wherein the extracting keywords of each sample in the code prompt training set comprises:
generating an abstract syntax tree corresponding to each sample;
extracting words of specific types from the abstract syntax tree and taking the extracted words as the keywords of each sample.
4. A programming-based large language model fine-tuning-free pre-training method as claimed in claim 3, wherein the specific type of words include, but are not limited to, variable names, method names and special symbols.
5. The programming-based large language model fine-tuning-free pre-training method of claim 1, wherein each keyword in the keyword set is assigned a compressed symbol while satisfying the following conditions:
condition 1: different keywords in the keyword set have different compression symbols;
condition 2: replacing each keyword in the code prompt training set with its compressed symbol must not change the code semantics.
6. The programming-based large language model fine-tuning-free pre-training method of any one of claims 1-5, further comprising, after assigning a compressed symbol to each keyword in the keyword set:
establishing a mapping table recording the mapping relation between each keyword in the keyword set and its compressed symbol.
7. The programming-based large language model fine-tuning-free pre-training method of claim 6, wherein the application process of the pre-trained large language model comprises:
acquiring a code representation of the question to be asked;
inputting the code representation into the pre-trained large language model to obtain a first answer;
and replacing each compressed symbol in the first answer with its keyword, based on the mapping table, to obtain the answer to the question.
8. A programming-based large language model fine-tuning-free pre-training device, the device comprising:
the generation module is used for extracting the keywords of each sample in the code prompt training set so as to generate a keyword set corresponding to the code prompt training set;
the distribution module is used for distributing a compression symbol for each keyword in the keyword set;
the replacing module is used for replacing each keyword in the code prompt training set with a compressed symbol thereof to obtain a first code prompt training set;
and the pre-training module is used for pre-training the universal large language model in a prompt learning mode based on the first code prompt training set.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the programming-based large language model fine-tuning-free pre-training method of any one of claims 1 to 7 when the program is executed by the processor.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the programming-based large language model fine-tuning free pre-training method of any one of claims 1 to 7.
CN202310671309.2A 2023-06-07 2023-06-07 Programming-based large language model fine tuning-free pre-training method and device Pending CN116910535A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310671309.2A CN116910535A (en) 2023-06-07 2023-06-07 Programming-based large language model fine tuning-free pre-training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310671309.2A CN116910535A (en) 2023-06-07 2023-06-07 Programming-based large language model fine tuning-free pre-training method and device

Publications (1)

Publication Number Publication Date
CN116910535A true CN116910535A (en) 2023-10-20

Family

ID=88365697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310671309.2A Pending CN116910535A (en) 2023-06-07 2023-06-07 Programming-based large language model fine tuning-free pre-training method and device

Country Status (1)

Country Link
CN (1) CN116910535A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117252251A (en) * 2023-11-20 2023-12-19 新华三技术有限公司 Private domain data generation method, device, equipment and storage medium
CN117271780A (en) * 2023-11-20 2023-12-22 苏州大学 Method and system for compressing context based on large language model
CN117271780B (en) * 2023-11-20 2024-02-23 苏州大学 Method and system for compressing context based on large language model
CN117252251B (en) * 2023-11-20 2024-03-12 新华三技术有限公司 Private domain data generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN116910535A (en) Programming-based large language model fine tuning-free pre-training method and device
US11334692B2 (en) Extracting a knowledge graph from program source code
CN110543644B (en) Machine translation method and device containing term translation and electronic equipment
DeRemer Practical translators for LR (k) languages.
CN110018829B (en) Method and device for improving execution efficiency of PL/SQL language interpreter
CN109933602B (en) Method and device for converting natural language and structured query language
CN107330014B (en) Data table creating method and device
CN110597844A (en) Heterogeneous database data unified access method and related equipment
CN113779062A (en) SQL statement generation method and device, storage medium and electronic equipment
CN111443901A (en) Business expansion method and device based on Java reflection
CN111176656B (en) Complex data matching method and medium
CN112102840A (en) Semantic recognition method, device, terminal and storage medium
CN115168402A (en) Method and device for generating model by training sequence
EP3974964A1 (en) Automated generation of software patches
CN111240681B (en) Conversion method and device for different programming languages
CN113741864B (en) Automatic semantic service interface design method and system based on natural language processing
CN109361399A (en) A kind of method, apparatus, equipment and storage medium obtaining byte sequence
CN110990000B (en) Data request processing method, device and equipment of MVC pattern design model layer
CN111581047B (en) Supervision method for intelligent contract behavior
CN110134775B (en) Question and answer data generation method and device and storage medium
CN106682221B (en) Question-answer interaction response method and device and question-answer system
CN112948419A (en) Query statement processing method and device
CN107092515B (en) LPMLN reasoning method and system based on answer set logic program
CN112631567A (en) Method and device for generating database operation file
Grelck et al. Axis control in SAC

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination