CN115906918B - Fine tuning method and device for pre-training model - Google Patents

Fine tuning method and device for pre-training model

Info

Publication number
CN115906918B
CN115906918B
Authority
CN
China
Prior art keywords
layer
full
input
character
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211502211.6A
Other languages
Chinese (zh)
Other versions
CN115906918A (en)
Inventor
尚骏远
赵晏彬
丁思宇
王硕寰
孙宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211502211.6A
Publication of CN115906918A
Application granted
Publication of CN115906918B
Legal status: Active
Anticipated expiration

Abstract

The disclosure provides a fine tuning method and device for a pre-training model, relates to the field of deep learning, and in particular to the technical field of model processing. The method comprises the following steps: determining a pre-training model, wherein the pre-training model comprises N Transformer layers and N fully connected layers, each Transformer layer is connected to one fully connected layer, and the parameters of each fully connected layer are determined based on the parameters of the corresponding Transformer layer; determining a prompt word corresponding to a target downstream task based on a pre-training database; and inputting the prompt word and the input text corresponding to the target downstream task into the pre-training model to obtain an output result, calculating a loss value based on the output result, and updating the parameters of the N fully connected layers based on the loss value. The method and device can effectively inherit the multi-task knowledge of the pre-training stage, improve the convergence speed and effect on small-sample tasks, improve the modeling capability of the model, handle generation tasks with greater modeling difficulty, improve the stability of model training, alleviate the problem that a large model cannot converge, and keep deployment development difficulty and deployment cost low.

Description

Fine tuning method and device for pre-training model
Technical Field
The present disclosure relates to the field of deep learning, and in particular to the field of model processing technology, and may be applied to fine tuning a model.
Background
Natural language processing has entered the era of very large-scale models. For models of such scale, a knowledge-inheriting prompt-word fine-tuning technique allows the model to be fine-tuned more effectively under both few-sample and full-sample settings with limited computing power.
With limited computing power, in order to improve the accuracy of a very large-scale model on a specific downstream task, fine-tuning approaches can be divided into the following three categories according to whether parameters requiring fine tuning are added in the pre-training stage and the fine-tuning stage:
First, neither the pre-training stage nor the fine-tuning stage adds parameters that need fine tuning.
Second, only the fine-tuning stage adds parameters that need fine tuning.
Third, both the pre-training stage and the fine-tuning stage add parameters that need fine tuning.
The fine-tuning effect of the first approach is poor. In the second approach, the parameters added during the fine-tuning stage are not initialized from pre-training, which may cause the model to perform poorly on small-sample tasks and also slows down convergence. The third approach is not compatible with generation tasks.
Disclosure of Invention
The disclosure provides a fine tuning method and a device for a pre-training model.
According to an aspect of the present disclosure, there is provided a method of fine tuning a pre-trained model, comprising:
determining a pre-training model, wherein the pre-training model comprises N Transformer layers and N fully connected layers, each Transformer layer is connected to one fully connected layer, N is a positive integer, the parameters corresponding to each fully connected layer are determined based on the parameters of the Transformer layer, and the parameters of the Transformer layer are pre-trained;
determining a prompt word corresponding to a target downstream task based on a pre-training database;
and inputting the prompt word and the input text corresponding to the target downstream task into the pre-training model to obtain an output result, calculating a loss value based on the output result, and adjusting and updating the parameters corresponding to the N fully connected layers based on the loss value, so as to fine-tune the pre-training model.
The method and device can effectively inherit the multi-task knowledge of the pre-training stage, improve the convergence speed and effect on small-sample tasks, improve the modeling capability of the model, handle generation tasks with greater modeling difficulty more effectively, improve the stability of model training, and alleviate the problem that a large model cannot converge. Deployment development difficulty and deployment cost are also low.
According to another aspect of the present disclosure, there is provided a fine tuning device of a pre-trained model, comprising:
a first determining module, configured to determine a pre-training model, wherein the pre-training model comprises N Transformer layers and N fully connected layers, each Transformer layer is connected to one fully connected layer, N is a positive integer, the parameters corresponding to each fully connected layer are determined based on the parameters of the Transformer layer, and the parameters of the Transformer layer are pre-trained;
a second determining module, configured to determine a prompt word corresponding to a target downstream task based on a pre-training database;
and an input module, configured to input the prompt word and the input text corresponding to the target downstream task into the pre-training model to obtain an output result, calculate a loss value based on the output result, and adjust and update the parameters corresponding to the N fully connected layers based on the loss value, so as to fine-tune the pre-training model.
According to an aspect of the present disclosure, an electronic device is provided, comprising at least one processor, and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of fine tuning a pre-trained model of an embodiment of the first aspect of the present disclosure.
According to an aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a fine tuning method of a pre-trained model of an embodiment of the first aspect of the present disclosure is presented.
According to an aspect of the present disclosure, a computer program product is presented, comprising a computer program which, when being executed by a processor, implements the steps of the method of fine tuning of a pre-trained model of an embodiment of the first aspect of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1a is a flow chart of a method of fine tuning a pre-trained model according to one embodiment of the present disclosure;
FIG. 1b is a schematic diagram of the structure of a pre-training model of one embodiment of the present disclosure;
FIG. 2 is a block diagram of a fine tuning device of a pre-trained model according to one embodiment of the present disclosure;
fig. 3 is a block diagram of an electronic device used to implement an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
For a better understanding of the present disclosure, the following description refers to the field to which the present disclosure relates.
Deep learning is a branch of machine learning: an algorithm family that performs representation learning on data using artificial neural networks as the framework. Through multi-layer processing, an initial low-level feature representation is gradually transformed into a high-level feature representation, after which complex learning tasks such as classification can be completed with a simple model. Deep learning can therefore be understood as "representation learning" or "feature learning".
FIG. 1a is a flow chart of a method of fine tuning a pre-trained model according to one embodiment of the present disclosure. As shown in FIG. 1a, the method comprises the following steps:
s101, determining a pre-training model.
Fig. 1b is a schematic structural diagram of a pre-training model according to an embodiment of the present disclosure. As shown in Fig. 1b, the pre-training model may include N Transformer layers and N fully connected layers (i.e., the fc layers in Fig. 1b), where N is a positive integer and each Transformer layer is connected to one fully connected layer.
In embodiments of the present disclosure, the parameters of each Transformer layer are pre-trained. The parameters of the Transformer layer may include the parameter W_k corresponding to the key vector, the parameter W_v corresponding to the value vector, and the parameter W_q corresponding to the query vector.
The parameters corresponding to each fully connected layer are initialized based on the parameters of the Transformer layer. The parameters corresponding to the fully connected layer may include the parameter W'_k corresponding to the key vector and the parameter W'_v corresponding to the value vector.
As shown in Fig. 1b, the parameters corresponding to each fully connected layer may be initialized based on the parameters in the multi-head attention module of the Transformer layer connected to that fully connected layer (shown in the figure as the Multi-Head Attention module and the ADD & Norm & FFN module). Specifically, the parameters of each fully connected layer may directly inherit the parameters in the multi-head attention module of the connected Transformer layer; that is, those parameters are used directly to initialize the fully connected layer.
For example, the parameter W_k corresponding to the key vector in the multi-head attention module of the Transformer layer may be directly used to initialize the parameter W'_k corresponding to the key vector in the fully connected layer, and the parameter W_v corresponding to the value vector in the multi-head attention module of the Transformer layer may be directly used to initialize the parameter W'_v corresponding to the value vector in the fully connected layer.
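As an illustration of this inheritance step, a minimal PyTorch-style sketch is given below. The module layout (for example, an attribute such as layer.attn.w_k) and the single-matrix view of the multi-head attention parameters are assumptions made for readability, not details of the disclosed implementation.

import torch.nn as nn

def init_fc_layers_from_transformer(transformer_layers):
    """Create one fully connected (fc) layer per Transformer layer, with its
    key/value parameters W'_k and W'_v initialized by directly copying the
    pre-trained W_k and W_v of that layer's multi-head attention module."""
    fc_layers = nn.ModuleList()
    for layer in transformer_layers:
        d_model = layer.attn.w_k.weight.shape[1]  # input feature dimension
        fc = nn.ModuleDict({
            "w_k_prime": nn.Linear(d_model, d_model, bias=False),
            "w_v_prime": nn.Linear(d_model, d_model, bias=False),
        })
        # Direct inheritance: copy the pre-trained attention projections.
        fc["w_k_prime"].weight.data.copy_(layer.attn.w_k.weight.data)
        fc["w_v_prime"].weight.data.copy_(layer.attn.w_v.weight.data)
        fc_layers.append(fc)
    return fc_layers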
Therefore, in the embodiment of the disclosure, a fully connected layer is introduced for each Transformer layer. By letting each fully connected layer inherit the pre-trained parameters of its Transformer layer, the multi-task knowledge of the pre-training stage can be effectively inherited, and the convergence speed and effect on small-sample tasks can be improved when the model is subsequently fine-tuned.
S102, determining a prompt word (i.e., the Soft Prompt in Fig. 1b) corresponding to the target downstream task based on a pre-training database.
The pre-training database may include the pre-training tasks of the pre-training stage and their corresponding prompt words. The pre-training tasks may include understanding tasks (e.g., emotion classification, natural language inference) and/or generation tasks (e.g., summarization, translation) in natural language processing. The prompt word corresponding to a pre-training task may be determined using the ERNIE 3.0 Zeus multi-task pre-training technique.
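For concreteness, the pre-training database can be pictured as a simple mapping from pre-training tasks to the soft prompts learned for them during multi-task pre-training; the field names below are illustrative assumptions rather than a disclosed schema.

# Hypothetical layout: each entry keeps the task type and the soft prompt
# (prompt-word embeddings) learned for that task during pre-training.
pretraining_database = {
    "sentiment_classification":   {"type": "understanding", "soft_prompt": ...},
    "natural_language_inference": {"type": "understanding", "soft_prompt": ...},
    "summarization":              {"type": "generation", "soft_prompt": ...},
    "translation":                {"type": "generation", "soft_prompt": ...},
}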
And, in an embodiment of the present disclosure, the above-mentioned "determining the prompt word corresponding to the target downstream task based on the pre-training database" may include the following steps:
among the pre-training tasks in the pre-training database, determining a specific pre-training task whose similarity to the target downstream task is higher than a threshold;
and determining the prompt word corresponding to the specific pre-training task as the prompt word corresponding to the target downstream task.
That is, in the embodiment of the present disclosure, the prompt word corresponding to the target downstream task is warm-started from the prompt word of a similar pre-training task, so that the prompt-tuning knowledge accumulated by the language model in the pre-training stage can be further inherited effectively and the target downstream task can be adapted to better.
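A hedged sketch of this warm start is shown below, reusing the database layout sketched above. How task similarity is actually measured and which threshold is used are not fixed by the disclosure; the cosine similarity over task embeddings produced by an assumed embed_task function is only one plausible choice.

import torch.nn.functional as F

def warm_start_prompt(target_task_embedding, pretraining_database, embed_task,
                      threshold=0.8):
    """Pick the soft prompt of the pre-training task whose similarity to the
    target downstream task is highest and exceeds the threshold."""
    best_name, best_score = None, threshold
    for name in pretraining_database:
        score = F.cosine_similarity(target_task_embedding,
                                    embed_task(name), dim=0).item()
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None:
        raise ValueError("no pre-training task is similar enough to warm-start from")
    return pretraining_database[best_name]["soft_prompt"]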
S103, inputting the prompt word and the input text corresponding to the target downstream task (i.e., the Text Input in Fig. 1b) into the pre-training model to obtain an output result, calculating a loss value based on the output result, and adjusting and updating the parameters corresponding to the N fully connected layers based on the loss value, so as to fine-tune the pre-training model.
In step S103, in the embodiment of the present disclosure, only the parameters corresponding to the fully connected layers are adjusted in the fine-tuning stage, while the other parameters in the model (such as the parameters of the Transformer layers) are frozen. The amount of parameters to be adjusted is therefore small, which improves the fine-tuning efficiency of the model.
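A minimal sketch of this freezing scheme is shown below, assuming the fine-tuned model exposes the added fully connected layers through an fc_layers attribute (the attribute name is an assumption, not part of the disclosure).

def trainable_parameters(model):
    """Freeze everything, including the pre-trained Transformer layers, and
    leave only the per-layer fully connected layers trainable."""
    for p in model.parameters():
        p.requires_grad = False
    for fc in model.fc_layers:
        for p in fc.parameters():
            p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

# Only the small trainable subset is handed to the optimizer, e.g.:
# optimizer = torch.optim.AdamW(trainable_parameters(model), lr=1e-4)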
The step of inputting the prompt word and the input text corresponding to the target downstream task into the pre-training model to obtain the output result may include: inputting the prompt word and the input text into the Transformer layer; and inputting the prompt word into the fully connected layer.
The fully connected layer may specifically be used for:
calculating the key vector and value vector corresponding to each character of the input prompt word, and inputting the key vectors and value vectors corresponding to the characters of the prompt word to the Transformer layer connected to the fully connected layer;
and the Transformer layer may be used for:
calculating the query vector corresponding to each character of the input prompt word, determining each updated character of the prompt word based on the query vector, key vector, and value vector corresponding to each character of the prompt word, and using each updated character of the prompt word as the input of the next Transformer layer and the next fully connected layer; and
calculating the query vector, key vector, and value vector corresponding to each character of the input text, and inputting them to the next Transformer layer.
Specifically, the Transformer layer may calculate the query vector corresponding to each character of the input prompt word, and the query vector, key vector, and value vector corresponding to each character of the input text, based on the following formula one;
Formula one:
q_i, k_i, v_i = W_q·x_i, W_k·x_i, W_v·x_i
q'_i = W_q·x'_i
where x_i denotes the i-th character of the input text, q_i denotes the query vector corresponding to the i-th character of the input text, k_i denotes the key vector corresponding to the i-th character of the input text, v_i denotes the value vector corresponding to the i-th character of the input text, x'_i denotes the i-th character of the prompt word input to the Transformer layer, q'_i denotes the query vector corresponding to the i-th character of the prompt word input to the Transformer layer, and W_q, W_k, W_v are the pre-trained parameters of the Transformer layer.
The fully connected layer may calculate the key vector and value vector corresponding to each character of the input prompt word based on the following formula two;
Formula two:
k'_i, v'_i = W'_k·x'_i, W'_v·x'_i
where x'_i denotes the i-th character of the prompt word input to the fully connected layer, k'_i denotes the key vector corresponding to the i-th character of the prompt word input to the fully connected layer, and v'_i denotes the value vector corresponding to the i-th character of the prompt word input to the fully connected layer.
Further, the Transformer layer may determine each updated character of the prompt word based on the following formula three;
Formula three:
x'_{i,l} = Σ_j softmax_j(q'_{i,l-1} · k'_{j,l-1}) · v'_{j,l-1}
where j indexes the characters of the prompt word input to the (l-1)-th fully connected layer and the (l-1)-th Transformer layer, x'_{i,l} denotes the i-th updated character of the prompt word input to the l-th fully connected layer and the l-th Transformer layer, as computed by the (l-1)-th Transformer layer, q'_{i,l-1} denotes the query vector corresponding to the i-th character of the prompt word input to the (l-1)-th Transformer layer, as computed by the (l-1)-th Transformer layer, k'_{j,l-1} denotes the key vector corresponding to the j-th character of the prompt word input to the (l-1)-th fully connected layer, as computed by the (l-1)-th fully connected layer, and v'_{j,l-1} denotes the value vector corresponding to the j-th character of the prompt word input to the (l-1)-th fully connected layer, as computed by the (l-1)-th fully connected layer.
As can be seen from the above, in the embodiment of the present disclosure, when the pre-training model performs these computations based on the prompt word and the input text corresponding to the target downstream task, the characters of the prompt word input to each fully connected layer and Transformer layer differ from layer to layer: the characters of the prompt word input to the l-th fully connected layer and Transformer layer are computed from the characters of the prompt word input to the (l-1)-th fully connected layer and Transformer layer. In this way, the prior knowledge of the pre-training distribution can be effectively utilized, and convergence in the fine-tuning stage is stabilized.
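Putting formulas one to three together, the per-layer computation for the prompt characters can be sketched as follows. This is a simplified single-head view in PyTorch: the 1/sqrt(d) scaling, multi-head splitting, and the residual/FFN parts of the Transformer layer are omitted as simplifying assumptions rather than details taken from the disclosure.

import torch.nn.functional as F

def prompt_layer_step(prompt_x, text_x, W_q, W_k, W_v, W_k_prime, W_v_prime):
    """One Transformer layer plus its fully connected layer.

    prompt_x: (P, d) characters of the prompt word fed to this layer
    text_x:   (T, d) characters of the input text fed to this layer
    Returns the updated prompt characters for the next layer and the
    q/k/v projections of the input text (formula one)."""
    # Formula one: the Transformer layer projects the text and prompt characters
    # with its frozen pre-trained parameters W_q, W_k, W_v.
    q_text, k_text, v_text = text_x @ W_q.T, text_x @ W_k.T, text_x @ W_v.T
    q_prompt = prompt_x @ W_q.T

    # Formula two: the fully connected layer projects the prompt characters with
    # its trainable parameters W'_k, W'_v (initialized from W_k, W_v).
    k_prompt = prompt_x @ W_k_prime.T
    v_prompt = prompt_x @ W_v_prime.T

    # Formula three: each prompt character is updated by attending over the
    # prompt keys and values produced by the fully connected layer.
    attn = F.softmax(q_prompt @ k_prompt.T, dim=-1)   # (P, P)
    updated_prompt = attn @ v_prompt                  # (P, d)
    return updated_prompt, (q_text, k_text, v_text)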
Further, in the embodiment of the present disclosure, after fine tuning of the pre-training model is completed, the k'_i and v'_i corresponding to each fully connected layer under the target downstream task may be saved for that task, while the other parameters of the fully connected layer (such as the key-vector parameter W'_k and the value-vector parameter W'_v) may be discarded. Because the k'_i and v'_i of the fully connected layers account for less than one thousandth of the total parameters, the subsequent deployment stage only needs to add a task-specific parameter packet for each downstream task, without major changes to the serving framework. This parameter packet may include the k'_i and v'_i vectors corresponding to the downstream task, so that thousands of downstream tasks can be served by the same model, which greatly reduces the cost of deploying each individual task and makes it possible to handle generation tasks with greater modeling difficulty more effectively. Each downstream task only needs to provide a parameter packet of per-layer key/value vectors as input to the model to obtain inference service from the shared general-purpose model, which greatly reduces the inference requirements of specific tasks as well as the unified deployment development difficulty and deployment cost of the multi-task model.
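As a hedged sketch of this deployment idea, the per-task parameter packet can be reduced to the per-layer prompt key/value vectors, saved after fine tuning and later loaded next to a shared serving model. The file layout and function names here are assumptions, not the disclosed serving framework.

import torch

def save_task_packet(task_name, per_layer_prompt_kv):
    """Persist only the per-layer (k'_i, v'_i) vectors of one downstream task;
    the shared backbone and the W'_k / W'_v matrices are not shipped."""
    packet = {f"layer_{l}": {"k": k.detach().cpu(), "v": v.detach().cpu()}
              for l, (k, v) in enumerate(per_layer_prompt_kv)}
    path = f"{task_name}_prompt_kv.pt"
    torch.save(packet, path)
    return path

def load_task_packet(path):
    """Load a task packet so the shared general-purpose model can serve that task."""
    return torch.load(path, map_location="cpu")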
In summary, in the embodiment of the disclosure, a pre-training model is determined, where the pre-training model includes N Transformer layers and N fully connected layers, each Transformer layer is connected to one fully connected layer, N is a positive integer, and the parameters corresponding to each fully connected layer are determined based on the parameters of the Transformer layer, which are pre-trained. Then, a prompt word corresponding to the target downstream task is determined based on the pre-training database, the prompt word and the input text corresponding to the target downstream task are input into the pre-training model to obtain an output result, a loss value is calculated based on the output result, and the parameters corresponding to the N fully connected layers are adjusted and updated based on the loss value, so as to fine-tune the pre-training model. The fine-tuning method of the disclosure can effectively inherit the multi-task knowledge of the pre-training stage, improves the convergence speed and effect on small-sample tasks, improves the modeling capability of the model, can handle generation tasks with greater modeling difficulty more effectively, improves the stability of model training, and alleviates the problem that a large model cannot converge. Deployment development difficulty and deployment cost are also low.
FIG. 2 is a block diagram of a fine tuning device of a pre-training model according to one embodiment of the present disclosure. As shown in FIG. 2, the fine tuning device 200 of the pre-training model includes:
a first determining module 210, configured to determine a pre-training model, where the pre-training model includes N Transformer layers and N fully connected layers, each Transformer layer is connected to one fully connected layer, N is a positive integer, and the parameters corresponding to each fully connected layer are determined based on the parameters of the Transformer layer, where the parameters of the Transformer layer are pre-trained;
a second determining module 220, configured to determine a prompt word corresponding to the target downstream task based on a pre-training database;
and an input module 230, configured to input the prompt word and the input text corresponding to the target downstream task into the pre-training model to obtain an output result, calculate a loss value based on the output result, and adjust and update the parameters corresponding to the N fully connected layers based on the loss value, so as to fine-tune the pre-training model.
In some implementations, the parameters corresponding to each fully connected layer being determined based on the parameters of the Transformer layer includes: the parameters corresponding to each fully connected layer are determined based on the parameters in the multi-head attention module of the Transformer layer connected to that fully connected layer.
In some implementations, the parameters corresponding to the fully connected layer include: the parameter W'_k corresponding to the key vector and the parameter W'_v corresponding to the value vector.
In some implementations, the pre-training database includes: the pre-training tasks of the pre-training stage and the prompt words corresponding to the pre-training tasks.
In some implementations, the second determining module 220 is further configured to:
among the pre-training tasks of the pre-training database, determining a specific pre-training task whose similarity to the target downstream task is higher than a threshold;
and determining the prompt word corresponding to the specific pre-training task as the prompt word corresponding to the target downstream task.
In some implementations, the input module 230 is further configured to:
input the prompt word and the input text to the Transformer layer;
and input the prompt word into the fully connected layer.
In some implementations, the fully connected layer is configured to:
calculate the key vector and value vector corresponding to each character of the input prompt word, and input the key vectors and value vectors corresponding to the characters of the prompt word to the Transformer layer connected to the fully connected layer.
In some implementations, the Transformer layer is configured to:
calculate the query vector corresponding to each character of the input prompt word, determine each updated character of the prompt word based on the query vector, key vector, and value vector corresponding to each character of the prompt word, and use each updated character of the prompt word as the input of the next Transformer layer and the next fully connected layer; and
calculate the query vector, key vector, and value vector corresponding to each character of the input text, and input them to the next Transformer layer.
In some implementations, the Transformer layer is configured to calculate the query vector corresponding to each character of the input prompt word, and the query vector, key vector, and value vector corresponding to each character of the input text, based on the following formula one;
Formula one:
q_i, k_i, v_i = W_q·x_i, W_k·x_i, W_v·x_i
q'_i = W_q·x'_i
where x_i denotes the i-th character of the input text, q_i denotes the query vector corresponding to the i-th character of the input text, k_i denotes the key vector corresponding to the i-th character of the input text, v_i denotes the value vector corresponding to the i-th character of the input text, x'_i denotes the i-th character of the prompt word input to the Transformer layer, q'_i denotes the query vector corresponding to the i-th character of the prompt word input to the Transformer layer, and W_q, W_k, W_v are the pre-trained parameters of the Transformer layer.
In some implementations, the fully connected layer is configured to calculate the key vector and value vector corresponding to each character of the input prompt word based on the following formula two;
Formula two:
k'_i, v'_i = W'_k·x'_i, W'_v·x'_i
where x'_i denotes the i-th character of the prompt word input to the fully connected layer, k'_i denotes the key vector corresponding to the i-th character of the prompt word input to the fully connected layer, and v'_i denotes the value vector corresponding to the i-th character of the prompt word input to the fully connected layer.
In some implementations, the Transformer layer is configured to determine each updated character of the prompt word based on the following formula three;
Formula three:
x'_{i,l} = Σ_j softmax_j(q'_{i,l-1} · k'_{j,l-1}) · v'_{j,l-1}
where j indexes the characters of the prompt word input to the (l-1)-th fully connected layer and the (l-1)-th Transformer layer, x'_{i,l} denotes the i-th updated character of the prompt word input to the l-th fully connected layer and the l-th Transformer layer, as computed by the (l-1)-th Transformer layer, q'_{i,l-1} denotes the query vector corresponding to the i-th character of the prompt word input to the (l-1)-th Transformer layer, as computed by the (l-1)-th Transformer layer, k'_{j,l-1} denotes the key vector corresponding to the j-th character of the prompt word input to the (l-1)-th fully connected layer, as computed by the (l-1)-th fully connected layer, and v'_{j,l-1} denotes the value vector corresponding to the j-th character of the prompt word input to the (l-1)-th fully connected layer, as computed by the (l-1)-th fully connected layer.
In some implementations, the apparatus further comprises:
a storage module, configured to save, for the target downstream task, the k'_i and v'_i corresponding to each fully connected layer in response to completion of the fine tuning of the pre-training model.
The method and device can effectively inherit the multi-task knowledge of the pre-training stage, improve the convergence speed and effect on small-sample tasks, improve the modeling capability of the model, handle generation tasks with greater modeling difficulty more effectively, improve the stability of model training, and alleviate the problem that a large model cannot converge. Deployment development difficulty and deployment cost are also low.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
FIG. 3 illustrates a schematic block diagram of an example electronic device 300 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in Fig. 3, the device 300 includes a computing unit 301 that may perform various suitable actions and processes according to a computer program stored in a read-only memory (ROM) 302 or a computer program loaded from a storage unit 308 into a random access memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the device 300 may also be stored. The computing unit 301, the ROM 302, and the RAM 303 are connected to each other by a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.
Various components in device 300 are connected to I/O interface 305, including: an input unit 306 such as a keyboard, a mouse, etc.; an output unit 307 such as various types of displays, speakers, and the like; a storage unit 308 such as a magnetic disk, an optical disk, or the like; and a communication unit 309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 309 allows the device 300 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 301 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The calculation unit 301 performs the respective methods and processes described above, for example, the fine tuning method of the pre-training model of the embodiment of the first aspect or the fine tuning method of the pre-training model of the embodiment of the second aspect. For example, in some embodiments, the method of tuning a pre-trained model of the first aspect embodiment or the method of tuning a pre-trained model of the second aspect embodiment may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 308. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 300 via the ROM 302 and/or the communication unit 309. When the computer program is loaded into the RAM 303 and executed by the computing unit 301, one or more steps of the method of fine tuning of a pre-trained model of the embodiment of the first aspect or the method of fine tuning of a pre-trained model of the embodiment of the second aspect described above may be performed. Alternatively, in other embodiments, the computing unit 301 may be configured by any other suitable means (e.g. by means of firmware) to perform the tuning method of the pre-training model of the first aspect embodiment or the tuning method of the pre-training model of the second aspect embodiment.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can include or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (19)

1. A method of fine tuning a pre-trained model, comprising:
determining a pre-training model, wherein the pre-training model comprises N Transformer layers and N fully connected layers, each Transformer layer is connected to one fully connected layer, each Transformer layer is located after the fully connected layer to which it is connected, and N is a positive integer, wherein the parameters corresponding to each fully connected layer are determined based on the parameters of the Transformer layer, and the parameters of the Transformer layer are pre-trained;
determining a prompt word corresponding to a target downstream task based on a pre-training database;
inputting the prompt word and the input text corresponding to the target downstream task into the pre-training model to obtain an output result, calculating a loss value based on the output result, and adjusting and updating the parameters corresponding to the N fully connected layers based on the loss value, so as to fine-tune the pre-training model;
wherein the parameters corresponding to each fully connected layer being determined based on the parameters of the Transformer layer comprises:
the parameters corresponding to each fully connected layer are determined based on the parameters in the multi-head attention module of the Transformer layer connected to the fully connected layer;
wherein the pre-training database comprises: the pre-training tasks of the pre-training stage and the prompt words corresponding to the pre-training tasks, and the determining of the prompt word corresponding to the target downstream task based on the pre-training database comprises:
among the pre-training tasks of the pre-training database, determining a specific pre-training task whose similarity to the target downstream task is higher than a threshold;
determining the prompt word corresponding to the specific pre-training task as the prompt word corresponding to the target downstream task;
wherein the inputting of the prompt word and the input text corresponding to the target downstream task into the pre-training model comprises:
inputting the prompt word and the input text into the Transformer layer;
and inputting the prompt word into the fully connected layer.
2. The method of claim 1, wherein the parameters corresponding to the fully connected layer comprise: the parameter W'_k corresponding to the key vector and the parameter W'_v corresponding to the value vector.
3. The method of claim 1, wherein,
the fully connected layer is configured to:
calculate the key vector and value vector corresponding to each character of the input prompt word, and input the key vectors and value vectors corresponding to the characters of the prompt word to the Transformer layer connected to the fully connected layer.
4. A method according to claim 1 or 3, wherein,
the Transformer layer is configured to:
calculate the query vector corresponding to each character of the input prompt word, determine each updated character of the prompt word based on the query vector, key vector, and value vector corresponding to each character of the prompt word, and use each updated character of the prompt word as the input of the next Transformer layer and the next fully connected layer; and
calculate the query vector, key vector, and value vector corresponding to each character of the input text, and input them to the next Transformer layer.
5. The method of claim 4, wherein the Transformer layer calculates the query vector corresponding to each character of the input prompt word and the query vector, key vector, and value vector corresponding to each character of the input text based on the following formula one;
Formula one:
q_i, k_i, v_i = W_q·x_i, W_k·x_i, W_v·x_i; q'_i = W_q·x'_i,
wherein x_i denotes the i-th character of the input text, q_i denotes the query vector corresponding to the i-th character of the input text, k_i denotes the key vector corresponding to the i-th character of the input text, v_i denotes the value vector corresponding to the i-th character of the input text, x'_i denotes the i-th character of the prompt word input to the Transformer layer, q'_i denotes the query vector corresponding to the i-th character of the prompt word input to the Transformer layer, and W_q, W_k, W_v are the pre-trained parameters of the Transformer layer.
6. The method of claim 3, wherein the fully connected layer calculates the key vector and value vector corresponding to each character of the input prompt word based on the following formula two;
Formula two:
k'_i, v'_i = W'_k·x'_i, W'_v·x'_i,
wherein x'_i denotes the i-th character of the prompt word input to the fully connected layer, k'_i denotes the key vector corresponding to the i-th character of the prompt word input to the fully connected layer, and v'_i denotes the value vector corresponding to the i-th character of the prompt word input to the fully connected layer.
7. The method of claim 4, wherein the Transformer layer determines each updated character of the prompt word based on the following formula three;
Formula three:
x'_{i,l} = Σ_j softmax_j(q'_{i,l-1} · k'_{j,l-1}) · v'_{j,l-1},
wherein j indexes the characters of the prompt word input to the (l-1)-th fully connected layer and the (l-1)-th Transformer layer, x'_{i,l} denotes the i-th updated character of the prompt word input to the l-th fully connected layer and the l-th Transformer layer, as computed by the (l-1)-th Transformer layer, q'_{i,l-1} denotes the query vector corresponding to the i-th character of the prompt word input to the (l-1)-th Transformer layer, as computed by the (l-1)-th Transformer layer, k'_{j,l-1} denotes the key vector corresponding to the j-th character of the prompt word input to the (l-1)-th fully connected layer, as computed by the (l-1)-th fully connected layer, and v'_{j,l-1} denotes the value vector corresponding to the j-th character of the prompt word input to the (l-1)-th fully connected layer, as computed by the (l-1)-th fully connected layer.
8. The method of claim 6, wherein the method further comprises:
in response to completion of the fine tuning of the pre-training model, saving, for the target downstream task, the k'_i and v'_i corresponding to each fully connected layer.
9. A device for fine tuning a pre-trained model, comprising:
a first determining module, configured to determine a pre-training model, wherein the pre-training model comprises N Transformer layers and N fully connected layers, each Transformer layer is connected to one fully connected layer, each Transformer layer is located after the fully connected layer to which it is connected, N is a positive integer, the parameters corresponding to each fully connected layer are determined based on the parameters of the Transformer layer, and the parameters of the Transformer layer are pre-trained;
a second determining module, configured to determine a prompt word corresponding to a target downstream task based on a pre-training database;
an input module, configured to input the prompt word and the input text corresponding to the target downstream task into the pre-training model to obtain an output result, calculate a loss value based on the output result, and adjust and update the parameters corresponding to the N fully connected layers based on the loss value, so as to fine-tune the pre-training model;
wherein the parameters corresponding to each fully connected layer being determined based on the parameters of the Transformer layer comprises:
the parameters corresponding to each fully connected layer are determined based on the parameters in the multi-head attention module of the Transformer layer connected to the fully connected layer;
wherein the pre-training database comprises: the pre-training tasks of the pre-training stage and the prompt words corresponding to the pre-training tasks, and the second determining module is further configured to:
among the pre-training tasks of the pre-training database, determine a specific pre-training task whose similarity to the target downstream task is higher than a threshold;
and determine the prompt word corresponding to the specific pre-training task as the prompt word corresponding to the target downstream task;
wherein the input module is further configured to:
input the prompt word and the input text into the Transformer layer;
and input the prompt word into the fully connected layer.
10. The apparatus of claim 9, wherein the parameters corresponding to the fully connected layer comprise: the parameter W'_k corresponding to the key vector and the parameter W'_v corresponding to the value vector.
11. The apparatus of claim 9, wherein,
the fully connected layer is configured to:
calculate the key vector and value vector corresponding to each character of the input prompt word, and input the key vectors and value vectors corresponding to the characters of the prompt word to the Transformer layer connected to the fully connected layer.
12. The device according to claim 9 or 11, wherein,
the Transformer layer is configured to:
calculate the query vector corresponding to each character of the input prompt word, determine each updated character of the prompt word based on the query vector, key vector, and value vector corresponding to each character of the prompt word, and use each updated character of the prompt word as the input of the next Transformer layer and the next fully connected layer; and
calculate the query vector, key vector, and value vector corresponding to each character of the input text, and input them to the next Transformer layer.
13. The apparatus of claim 12, wherein the Transformer layer is configured to calculate the query vector corresponding to each character of the input prompt word and the query vector, key vector, and value vector corresponding to each character of the input text based on the following formula one;
Formula one:
q_i, k_i, v_i = W_q·x_i, W_k·x_i, W_v·x_i; q'_i = W_q·x'_i,
wherein x_i denotes the i-th character of the input text, q_i denotes the query vector corresponding to the i-th character of the input text, k_i denotes the key vector corresponding to the i-th character of the input text, v_i denotes the value vector corresponding to the i-th character of the input text, x'_i denotes the i-th character of the prompt word input to the Transformer layer, q'_i denotes the query vector corresponding to the i-th character of the prompt word input to the Transformer layer, and W_q, W_k, W_v are the pre-trained parameters of the Transformer layer.
14. The apparatus of claim 11, wherein the fully connected layer is configured to calculate the key vector and value vector corresponding to each character of the input prompt word based on the following formula two;
Formula two:
k'_i, v'_i = W'_k·x'_i, W'_v·x'_i,
wherein x'_i denotes the i-th character of the prompt word input to the fully connected layer, k'_i denotes the key vector corresponding to the i-th character of the prompt word input to the fully connected layer, and v'_i denotes the value vector corresponding to the i-th character of the prompt word input to the fully connected layer.
15. The apparatus of claim 12, wherein the Transformer layer is configured to determine each updated character of the prompt word based on the following formula three;
Formula three:
x'_{i,l} = Σ_j softmax_j(q'_{i,l-1} · k'_{j,l-1}) · v'_{j,l-1},
wherein j indexes the characters of the prompt word input to the (l-1)-th fully connected layer and the (l-1)-th Transformer layer, x'_{i,l} denotes the i-th updated character of the prompt word input to the l-th fully connected layer and the l-th Transformer layer, as computed by the (l-1)-th Transformer layer, q'_{i,l-1} denotes the query vector corresponding to the i-th character of the prompt word input to the (l-1)-th Transformer layer, as computed by the (l-1)-th Transformer layer, k'_{j,l-1} denotes the key vector corresponding to the j-th character of the prompt word input to the (l-1)-th fully connected layer, as computed by the (l-1)-th fully connected layer, and v'_{j,l-1} denotes the value vector corresponding to the j-th character of the prompt word input to the (l-1)-th fully connected layer, as computed by the (l-1)-th fully connected layer.
16. The apparatus of claim 14, wherein the apparatus further comprises:
a storage module, configured to save, for the target downstream task, the k'_i and v'_i corresponding to each fully connected layer in response to completion of the fine tuning of the pre-training model.
17. An electronic device, comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method according to any of claims 1-8.
CN202211502211.6A 2022-11-28 2022-11-28 Fine tuning method and device for pre-training model Active CN115906918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211502211.6A CN115906918B (en) 2022-11-28 2022-11-28 Fine tuning method and device for pre-training model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211502211.6A CN115906918B (en) 2022-11-28 2022-11-28 Fine tuning method and device for pre-training model

Publications (2)

Publication Number Publication Date
CN115906918A CN115906918A (en) 2023-04-04
CN115906918B true CN115906918B (en) 2024-05-17

Family

ID=86482990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211502211.6A Active CN115906918B (en) 2022-11-28 2022-11-28 Fine tuning method and device for pre-training model

Country Status (1)

Country Link
CN (1) CN115906918B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117474084B (en) * 2023-12-25 2024-05-03 淘宝(中国)软件有限公司 Bidirectional iteration method, equipment and medium for pre-training model and downstream sequence task
CN117573857A (en) * 2024-01-16 2024-02-20 北京凌云雀科技有限公司 Intelligent document implementation method, device equipment and medium based on large model
CN117787422B (en) * 2024-02-27 2024-04-26 四川金信石信息技术有限公司 Switching operation task extraction method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113159168A (en) * 2021-04-19 2021-07-23 清华大学 Pre-training model accelerated reasoning method and system based on redundant word deletion
CN114091452A (en) * 2021-11-23 2022-02-25 润联软件系统(深圳)有限公司 Adapter-based transfer learning method, device, equipment and storage medium
EP4020305A1 (en) * 2020-12-22 2022-06-29 Ricoh Company, Ltd. Pre-trained language model fine-tuning method and apparatus and non-transitory computer-readable medium
CN114841282A (en) * 2022-05-20 2022-08-02 北京百度网讯科技有限公司 Training method of pre-training model, and generation method and device of solution model
CN114970519A (en) * 2022-03-17 2022-08-30 北京邮电大学 Traffic flow data processing method based on data word segmentation


Also Published As

Publication number Publication date
CN115906918A (en) 2023-04-04

Similar Documents

Publication Publication Date Title
CN115906918B (en) Fine tuning method and device for pre-training model
CN113239705B (en) Pre-training method and device of semantic representation model, electronic equipment and storage medium
CN112561068B (en) Simulation method, computing device, classical device, storage device and product
JP7351942B2 (en) Field phrase mining methods, devices and electronic equipment
CN112861548B (en) Training method, device, equipment and storage medium for natural language generation and model
AU2023206122A1 (en) Superconducting quantum chip design method and apparatus, electronic device and medium
CN113792880A (en) Pulse-based quantum gate implementation method and device, electronic equipment and medium
CN112749300A (en) Method, apparatus, device, storage medium and program product for video classification
US11853896B2 (en) Neural network model, method, electronic device, and readable medium
CN113468857B (en) Training method and device for style conversion model, electronic equipment and storage medium
CN113641829B (en) Training and knowledge graph completion method and device for graph neural network
JP2022031854A (en) Generation method of reply content, device, apparatus and storage medium
CN115953651B (en) Cross-domain equipment-based model training method, device, equipment and medium
CN116341454B (en) Method, device and medium for generating coupling-off point information of superconducting quantum chip
CN116306849A (en) Training of reverse neural network model and determining method and device of optical processor
WO2023019996A1 (en) Image feature fusion method and apparatus, electronic device, and storage medium
CN114171043B (en) Echo determination method, device, equipment and storage medium
CN113033205B (en) Method, device, equipment and storage medium for entity linking
JP2022024081A (en) Data processing method, apparatus, device, and memory medium for neural network accelerator
CN116257611B (en) Question-answering model training method, question-answering processing device and storage medium
CN113033196B (en) Word segmentation method, device, equipment and storage medium
CN116451770B (en) Compression method, training method, processing method and device of neural network model
CN112559843B (en) Method, apparatus, electronic device, medium and program product for determining a set
CN115630630B (en) Language model processing method, service processing method, device, equipment and medium
CN113362428B (en) Method, apparatus, device, medium, and product for configuring color

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant