CN115906918A - Method and device for fine tuning of pre-training model - Google Patents

Method and device for fine tuning of pre-training model

Info

Publication number
CN115906918A
CN115906918A
Authority
CN
China
Prior art keywords
layer
input
character
training
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211502211.6A
Other languages
Chinese (zh)
Other versions
CN115906918B (en)
Inventor
尚骏远
赵晏彬
丁思宇
王硕寰
孙宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211502211.6A priority Critical patent/CN115906918B/en
Publication of CN115906918A publication Critical patent/CN115906918A/en
Application granted granted Critical
Publication of CN115906918B publication Critical patent/CN115906918B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The disclosure provides a fine-tuning method and apparatus for a pre-training model, and relates to the field of deep learning, in particular to the technical field of model processing. The method comprises the following steps: determining a pre-training model, wherein the pre-training model comprises N Transformer layers and N fully-connected layers, each Transformer layer is connected to one fully-connected layer, and the parameters corresponding to each fully-connected layer are determined based on the parameters of the Transformer layer; determining prompt words corresponding to a target downstream task based on a pre-training database; and inputting the prompt words and the input text corresponding to the target downstream task into the pre-training model to obtain an output result, calculating a loss value based on the output result, and updating the parameters corresponding to the N fully-connected layers based on the loss value. The method can effectively inherit the multi-task knowledge of the pre-training stage, improves the convergence speed and performance on small-sample tasks, improves the modeling capability of the model so that generation tasks of higher modeling difficulty can be handled, improves the stability of model training, solves the problem that large models fail to converge, and keeps both the deployment development difficulty and the deployment cost low.

Description

Method and device for fine tuning of pre-training model
Technical Field
The disclosure relates to the field of deep learning, in particular to the technical field of model processing, and can be applied to fine tuning of models.
Background
The field of natural language processing has entered the era of very large-scale models. For models of this scale, a knowledge-inheriting prompt-word fine-tuning technique allows the model to be fine-tuned more effectively under both few-sample and full-sample settings when computing power is limited.
With limited computing power, fine-tuning methods for improving the accuracy of a very large-scale model on a specific downstream task can be divided into the following three categories, according to whether new parameters requiring fine tuning are added in the pre-training stage and the fine-tuning stage:
First, no parameters requiring fine tuning are added in either the pre-training stage or the fine-tuning stage.
Second, parameters requiring fine tuning are added only in the fine-tuning stage.
Third, parameters requiring fine tuning are added in both the pre-training stage and the fine-tuning stage.
The first method yields poor fine-tuning results. In the second method, because the parameters added in the fine-tuning stage are not initialized from pre-training, the model may perform poorly on small-sample tasks and converge more slowly. The third method is not compatible with generation tasks.
Disclosure of Invention
The disclosure provides a fine tuning method and a fine tuning device for a pre-training model.
According to an aspect of the present disclosure, there is provided a method of fine tuning a pre-trained model,
determining a pre-training model, wherein the pre-training model comprises N Transformer layers and N fully-connected layers, each Transformer layer is connected to one fully-connected layer, N is a positive integer, the parameters corresponding to each fully-connected layer are determined based on the parameters of the Transformer layer, and the parameters of the Transformer layers are pre-trained;
determining prompt words corresponding to a target downstream task based on a pre-training database;
and inputting the prompt words and the input text corresponding to the target downstream task into the pre-training model to obtain an output result, calculating a loss value based on the output result, and adjusting and updating the parameters corresponding to the N fully-connected layers based on the loss value to fine-tune the pre-training model.
The method can effectively inherit the multi-task knowledge of the pre-training stage, improves the convergence speed and performance on small-sample tasks, improves the modeling capability of the model so that generation tasks of higher modeling difficulty can be handled more effectively, improves the stability of model training, and solves the problem that large models fail to converge. In addition, both the deployment development difficulty and the deployment cost are low.
According to another aspect of the present disclosure, there is provided a fine tuning apparatus of a pre-trained model,
the apparatus comprises a first determining module, a second determining module and an input module, wherein the first determining module is configured to determine a pre-training model, the pre-training model comprises N Transformer layers and N fully-connected layers, each Transformer layer is connected to one fully-connected layer, N is a positive integer, the parameters corresponding to each fully-connected layer are determined based on the parameters of the Transformer layer, and the parameters of the Transformer layers are pre-trained;
the second determining module is configured to determine, based on a pre-training database, the prompt words corresponding to a target downstream task;
and the input module is configured to input the prompt words and the input text corresponding to the target downstream task into the pre-training model to obtain an output result, calculate a loss value based on the output result, and adjust and update the parameters corresponding to the N fully-connected layers based on the loss value to fine-tune the pre-training model.
According to an aspect of the disclosure, an electronic device is proposed, comprising at least one processor, and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of fine tuning a pre-trained model as embodied in the first aspect of the disclosure.
According to an aspect of the present disclosure, a non-transitory computer-readable storage medium is provided, in which computer instructions are stored, where the computer instructions are configured to cause a computer to perform a method for fine tuning a pre-trained model according to an embodiment of the first aspect of the present disclosure.
According to an aspect of the present disclosure, a computer program product is proposed, which comprises a computer program that, when being executed by a processor, implements the steps of the fine tuning method of the pre-trained model of the first aspect of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1a is a flow diagram of a method of fine tuning of a pre-trained model according to one embodiment of the present disclosure;
FIG. 1b is a schematic structural diagram of a pre-trained model according to an embodiment of the present disclosure;
FIG. 2 is a block diagram of a fine-tuning apparatus of a pre-trained model according to one embodiment of the present disclosure;
FIG. 3 is a block diagram of an electronic device used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
For a better understanding of the present disclosure, reference is made to the following description of the field to which the present disclosure pertains.
Deep learning is a branch of machine learning; it is a family of algorithms that use artificial neural networks as the framework to perform representation learning on data. Through multi-layer processing, the initial low-level feature representations are gradually transformed into high-level feature representations, after which complex learning tasks such as classification can be completed with a simple model. Deep learning can therefore be understood as "feature learning" or "representation learning".
Fig. 1a is a flowchart of a fine-tuning method of a pre-training model according to an embodiment of the present disclosure. As shown in Fig. 1a, the method includes the following steps:
s101, determining a pre-training model.
Fig. 1b is a schematic structural diagram of a pre-training model according to an embodiment of the present disclosure. As shown in Fig. 1b, the pre-training model may include N Transformer layers and N fully-connected layers (i.e., the FC layers in Fig. 1b), where N is a positive integer and each Transformer layer is connected to one fully-connected layer.
In embodiments of the present disclosure, the parameters of each Transformer layer are pre-trained. The parameters of the Transformer layer may include the parameter W_k corresponding to the key vectors, the parameter W_v corresponding to the value vectors, and the parameter W_q corresponding to the query vectors.
The parameters corresponding to each fully-connected layer are determined by initialization from the parameters of the Transformer layer. The parameters corresponding to the fully-connected layer may include the parameter W'_k corresponding to the key vectors and the parameter W'_v corresponding to the value vectors.
As shown in Fig. 1b, the parameters corresponding to each fully-connected layer may be initially determined based on the parameters in the multi-head attention module of the Transformer layer to which the fully-connected layer is connected (each Transformer layer in the figure consists of a Multi-Head Attention module and an Add & Norm & FFN module). Specifically, the parameters corresponding to each fully-connected layer may directly inherit the parameters in the multi-head attention module of the Transformer layer to which the fully-connected layer is connected; that is, those parameters are directly used to initialize the fully-connected layer.
For example, the parameter W_k corresponding to the key vectors in the multi-head attention module of the Transformer layer can be used directly to initialize the parameter W'_k corresponding to the key vectors in the fully-connected layer, and the parameter W_v corresponding to the value vectors in the multi-head attention module of the Transformer layer can be used directly to initialize the parameter W'_v corresponding to the value vectors in the fully-connected layer.
Therefore, in the embodiment of the present disclosure, a fully-connected layer is introduced for each Transformer layer, and each fully-connected layer inherits the pre-trained parameters of its Transformer layer, so that the multi-task knowledge of the pre-training stage can be effectively inherited, and the convergence speed and performance on small-sample tasks can be improved when the model is subsequently fine-tuned.
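As an illustration of this initialization, the following is a minimal sketch in PyTorch (the framework, the class name PromptKVLayer and the tensor shapes are assumptions for illustration; the patent does not specify an implementation):

```python
import torch
import torch.nn as nn

class PromptKVLayer(nn.Module):
    """Fully-connected (FC) layer attached to one Transformer layer.

    A minimal sketch: it holds key/value projections W'_k, W'_v for the prompt
    characters, initialized by copying the pre-trained W_k, W_v of the
    multi-head attention module of the connected Transformer layer.
    """

    def __init__(self, attn_wk: nn.Linear, attn_wv: nn.Linear):
        super().__init__()
        hidden = attn_wk.in_features
        self.w_k = nn.Linear(hidden, attn_wk.out_features, bias=False)
        self.w_v = nn.Linear(hidden, attn_wv.out_features, bias=False)
        # Inherit the pre-trained multi-head attention parameters directly.
        with torch.no_grad():
            self.w_k.weight.copy_(attn_wk.weight)
            self.w_v.weight.copy_(attn_wv.weight)

    def forward(self, prompt_states: torch.Tensor):
        # prompt_states: (prompt_len, hidden) representations of the prompt words
        return self.w_k(prompt_states), self.w_v(prompt_states)
```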
S102, determining the prompt words (i.e., the Soft Prompt in Fig. 1b) corresponding to the target downstream task based on the pre-training database.
The pre-training database may include the pre-training tasks of the pre-training stage and the prompt words corresponding to each pre-training task. The pre-training tasks may include understanding tasks (e.g., sentiment classification, natural language inference) and/or generation tasks (e.g., summarization, translation) in natural language processing. The prompt words corresponding to a pre-training task may be determined using the ERNIE 3.0 Zeus multi-task pre-training technique.
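A minimal sketch of what such a database might look like in code is given below (the task names, tensor shapes and use of randomly initialized tensors are purely illustrative assumptions):

```python
# Mapping from each pre-training task to the soft-prompt embeddings learned
# for it during multi-task pre-training.
import torch

pretrain_prompt_db = {
    "sentiment_classification": torch.randn(20, 768),  # (prompt_len, hidden)
    "natural_language_inference": torch.randn(20, 768),
    "summarization": torch.randn(20, 768),
    "translation": torch.randn(20, 768),
}
```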
In an embodiment of the present disclosure, the above "determining the prompt words corresponding to the target downstream task based on the pre-training database" may include the following steps:
determining, among the pre-training tasks in the pre-training database, a specific pre-training task whose similarity to the target downstream task is higher than a threshold;
and determining the prompt words corresponding to the specific pre-training task as the prompt words corresponding to the target downstream task.
That is, in the embodiment of the present disclosure, the prompt words corresponding to the target downstream task are warm-started from the prompt words of a similar pre-training task, so that the knowledge acquired by the language model in the pre-training phase can be further inherited effectively and the model can better adapt to the target downstream task.
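The selection step could look like the following hedged sketch (the cosine-similarity measure over task embeddings, the encoder producing those embeddings, and the 0.7 threshold are assumptions; the patent only requires a similarity above a threshold):

```python
import torch
import torch.nn.functional as F

def warm_start_prompt(task_embedding, db_task_embeddings, prompt_db, threshold=0.7):
    # task_embedding: 1-D embedding of the target downstream task description
    # db_task_embeddings: dict task name -> 1-D embedding of the pre-training task
    # prompt_db: dict task name -> soft-prompt tensor (prompt_len, hidden)
    best_task, best_sim = None, threshold
    for name, emb in db_task_embeddings.items():
        sim = F.cosine_similarity(task_embedding, emb, dim=0).item()
        if sim > best_sim:
            best_task, best_sim = name, sim
    if best_task is None:
        # No sufficiently similar pre-training task: fall back to a fresh prompt.
        return torch.randn_like(next(iter(prompt_db.values())))
    return prompt_db[best_task].clone()   # warm start from the similar task
```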
S103, inputting the prompt words and the input text (Text Input in Fig. 1b) corresponding to the target downstream task into the pre-training model to obtain an output result, calculating a loss value based on the output result, and adjusting and updating the parameters corresponding to the N fully-connected layers based on the loss value to fine-tune the pre-training model.
As can be seen from step S103, in the embodiment of the present disclosure, only the parameters corresponding to the fully-connected layers are adjusted in the fine-tuning stage, while the other parameters in the model (such as the parameters of the Transformer layers) are frozen. The number of parameters that need to be adjusted is therefore small, which improves the fine-tuning efficiency of the model.
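A minimal fine-tuning loop consistent with this description is sketched below (PyTorch-style; the model call signature, data loader, loss function and learning rate are placeholders, not taken from the patent):

```python
import torch

def finetune(model, fc_layers, soft_prompt, loader, loss_fn, lr=1e-4, steps=1000):
    # Freeze every pre-trained Transformer parameter.
    for p in model.parameters():
        p.requires_grad_(False)
    # Only the soft prompt and the FC-layer (prompt key/value) parameters train.
    trainable = [soft_prompt] + [p for fc in fc_layers for p in fc.parameters()]
    for p in trainable:
        p.requires_grad_(True)
    optim = torch.optim.AdamW(trainable, lr=lr)

    for step, (text_ids, labels) in zip(range(steps), loader):
        # Hypothetical call signature: prompt and FC layers passed alongside text.
        output = model(text_ids, soft_prompt=soft_prompt, fc_layers=fc_layers)
        loss = loss_fn(output, labels)    # loss value computed from the output result
        optim.zero_grad()
        loss.backward()
        optim.step()
```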
The above-mentioned "inputting the prompt words and the input text corresponding to the target downstream task into the pre-training model to obtain an output result" may include: inputting the prompt words and the input text into the Transformer layers; and inputting the prompt words into the fully-connected layers.
The fully-connected layer may specifically be used for:
calculating the key vector and the value vector corresponding to each character of the input prompt words, and inputting the key vector and the value vector corresponding to each character of the prompt words into the Transformer layer connected to the fully-connected layer;
the Transformer layer may be used for:
calculating the query vector corresponding to each character of the input prompt words, determining each updated character of the prompt words based on the query vector, the key vector and the value vector corresponding to each character of the prompt words, and taking each updated character of the prompt words as the input of the next Transformer layer and the next fully-connected layer; and
calculating the query vector, the key vector and the value vector corresponding to each character of the input text, and inputting them to the next Transformer layer.
Specifically, the Transformer layer may calculate the query vector corresponding to each character of the input prompt words, and the query vector, key vector and value vector corresponding to each character of the input text, based on the following Formula I:
Formula I:
q_i, k_i, v_i = W_q x_i, W_k x_i, W_v x_i;
q'_i = W_q x'_i;
wherein x_i denotes the i-th character of the input text, q_i denotes the query vector corresponding to the i-th character of the input text, k_i denotes the key vector corresponding to the i-th character of the input text, v_i denotes the value vector corresponding to the i-th character of the input text, x'_i denotes the i-th character of the prompt words input to the Transformer layer, q'_i denotes the query vector corresponding to the i-th character of the prompt words input to the Transformer layer, and W_q, W_k and W_v are the pre-trained parameters of the Transformer layer.
The fully-connected layer may calculate the key vector and the value vector corresponding to each character of the input prompt words based on the following Formula II:
Formula II:
k'_i, v'_i = W'_k x'_i, W'_v x'_i;
wherein x'_i denotes the i-th character of the prompt words input to the fully-connected layer, k'_i denotes the key vector corresponding to the i-th character of the prompt words input to the fully-connected layer, and v'_i denotes the value vector corresponding to the i-th character of the prompt words input to the fully-connected layer.
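The projections of Formula I and Formula II for one layer can be sketched as follows (single-head, with weight matrices stored in the (output, input) layout of nn.Linear; scaling, masking and multi-head splitting are omitted because they are not specified here):

```python
import torch

def project_layer_inputs(text_x, prompt_x, W_q, W_k, W_v, W_k_fc, W_v_fc):
    # text_x: (text_len, hidden), prompt_x: (prompt_len, hidden)
    # Formula I (Transformer layer, pre-trained and frozen parameters)
    q, k, v = text_x @ W_q.T, text_x @ W_k.T, text_x @ W_v.T
    q_prompt = prompt_x @ W_q.T
    # Formula II (fully-connected layer, trainable inherited parameters)
    k_prompt, v_prompt = prompt_x @ W_k_fc.T, prompt_x @ W_v_fc.T
    return q, k, v, q_prompt, k_prompt, v_prompt
```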
Further, the Transformer layer may determine each updated character of the prompt words based on the following Formula III:
Formula III (reproduced as an image in the original publication; it combines the prompt-word query, key and value vectors defined below):
wherein j ranges over the characters included in the prompt words input to the (l-1)-th fully-connected layer and the (l-1)-th Transformer layer, x'_{i,l} denotes the i-th updated character of the prompt words input to the l-th fully-connected layer and the l-th Transformer layer, as calculated by the (l-1)-th Transformer layer, q'_{i,l-1} denotes the query vector corresponding to the i-th character of the prompt words input to the (l-1)-th Transformer layer, as calculated by the (l-1)-th Transformer layer, k'_{j,l-1} denotes the key vector corresponding to the j-th character of the prompt words input to the (l-1)-th fully-connected layer, as calculated by the (l-1)-th fully-connected layer, and v'_{j,l-1} denotes the value vector corresponding to the j-th character of the prompt words input to the (l-1)-th fully-connected layer, as calculated by the (l-1)-th fully-connected layer.
It can be seen from the above that, in the embodiment of the present disclosure, when performing these calculations based on the prompt words and the input text corresponding to the target downstream task, the characters of the prompt words input to each fully-connected layer and Transformer layer are different: the representations of the prompt-word characters input to the l-th fully-connected layer and Transformer layer are calculated based on the representations of the prompt-word characters input to the (l-1)-th fully-connected layer and Transformer layer. This effectively exploits the prior knowledge of the pre-training distribution and thereby stabilizes convergence in the fine-tuning stage.
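Since Formula III itself is reproduced only as an image, the following is a hedged reconstruction of the update it describes, assuming a standard scaled softmax attention over the prompt-word key and value vectors (the scaling factor and the absence of residual/FFN terms are assumptions):

```python
import math
import torch

def update_prompt_characters(q_prompt, k_prompt, v_prompt):
    # q_prompt, k_prompt, v_prompt: (prompt_len, hidden) from the (l-1)-th layer
    d = q_prompt.size(-1)
    scores = q_prompt @ k_prompt.T / math.sqrt(d)   # (prompt_len, prompt_len)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v_prompt                       # updated characters x'_{i,l} for layer l
```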
Further, in the embodiment of the present disclosure, after fine-tuning of the pre-training model is completed, the k'_i and v'_i corresponding to each fully-connected layer under the target downstream task may be saved for that task, while the other parameters of the fully-connected layers (such as the parameter W'_k corresponding to the key vectors and the parameter W'_v corresponding to the value vectors) may be discarded. The k'_i and v'_i corresponding to the fully-connected layers account for less than one thousandth of the total parameters. Therefore, in the subsequent deployment stage, the framework does not need to be modified substantially; only one additional task-specific parameter package needs to be added for each downstream task, and this package may include the k'_i and v'_i vectors corresponding to that task. In this way, thousands of downstream tasks can be deployed on the same model, which greatly reduces the cost of deploying each task separately and makes it possible to effectively realize generation tasks of higher modeling difficulty. Each downstream task only needs to provide a parameter package containing the key and value vectors of each layer as input to the model, and the inference requirements of a specific task can then be met by the general model inference service, which greatly reduces the difficulty and cost of unified deployment and development of the multi-task model.
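The per-task parameter package described above might be handled as in the following sketch (the file layout, function names and use of torch.save are assumptions for illustration):

```python
import torch

def save_task_package(path, per_layer_kv):
    # per_layer_kv: list of (k_prompt, v_prompt) tensor pairs, one pair per layer
    torch.save({"kv": [(k.cpu(), v.cpu()) for k, v in per_layer_kv]}, path)

def load_task_package(path):
    return torch.load(path)["kv"]

# Usage: the general inference service loads the shared pre-trained model once,
# then attaches the parameter package of whichever downstream task a request targets.
```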
In summary, in the embodiment of the present disclosure, a pre-training model is determined, where the pre-training model includes N Transformer layers and N fully-connected layers, each Transformer layer is connected to one fully-connected layer, N is a positive integer, the parameters corresponding to each fully-connected layer are determined based on the parameters of the Transformer layer, and the parameters of the Transformer layers are pre-trained. Prompt words corresponding to the target downstream task are then determined based on the pre-training database. The prompt words and the input text corresponding to the target downstream task are input into the pre-training model to obtain an output result, a loss value is calculated based on the output result, and the parameters corresponding to the N fully-connected layers are adjusted and updated based on the loss value to fine-tune the pre-training model. The fine-tuning method of the present disclosure can effectively inherit the multi-task knowledge of the pre-training stage, improves the convergence speed and performance on small-sample tasks, improves the modeling capability of the model so that generation tasks of higher modeling difficulty can be realized more effectively, improves the stability of model training, and solves the problem that large models fail to converge. In addition, both the deployment development difficulty and the deployment cost are low.
Fig. 2 is a block diagram of a fine tuning apparatus for a pre-trained model according to an embodiment of the present disclosure, and as shown in fig. 2, the fine tuning apparatus 200 for the pre-trained model includes:
a first determining module 210, configured to determine a pre-training model, where the pre-training model includes N Transformer layers and N fully-connected layers, each Transformer layer is connected to one fully-connected layer, and N is a positive integer, where the parameters corresponding to each fully-connected layer are determined based on the parameters of the Transformer layer, and the parameters of the Transformer layers are pre-trained;
a second determining module 220, configured to determine, based on a pre-training database, the prompt words corresponding to a target downstream task;
an input module 230, configured to input the prompt words and the input text corresponding to the target downstream task into the pre-training model to obtain an output result, calculate a loss value based on the output result, and adjust and update the parameters corresponding to the N fully-connected layers based on the loss value to fine-tune the pre-training model.
In some implementations, the parameters corresponding to each fully-connected layer being determined based on the parameters of the Transformer layer includes: the parameters corresponding to each fully-connected layer are determined based on the parameters in the multi-head attention module of the Transformer layer connected to the fully-connected layer.
In some implementations, the parameters corresponding to the fully-connected layer include: the parameter W'_k corresponding to the key vectors and the parameter W'_v corresponding to the value vectors.
In some implementations, the pre-training database includes: the pre-training tasks of the pre-training stage and the prompt words corresponding to the pre-training tasks.
In some implementations, the second determining module 220 is further configured to:
determining, among the pre-training tasks in the pre-training database, a specific pre-training task whose similarity to the target downstream task is higher than a threshold;
and determining the prompt words corresponding to the specific pre-training task as the prompt words corresponding to the target downstream task.
In some implementations, the input module 230 is further configured to:
input the prompt words and the input text into the Transformer layer;
and input the prompt words into the fully-connected layer.
In some implementations, the fully-connected layer is configured to:
calculate the key vector and the value vector corresponding to each character of the input prompt words, and input the key vector and the value vector corresponding to each character of the prompt words into the Transformer layer connected to the fully-connected layer.
in some implementations, the transform layer is to:
calculating query vectors corresponding to all characters of input prompt words, determining all updating characters of the prompt words based on the query vectors, key vectors and value vectors corresponding to all characters of the prompt words, and taking all updating characters of the prompt words as input of a next transform layer and a next full connection layer; and
calculating a query vector, a key vector and a value vector corresponding to each character of the input text, and inputting the query vector, the key vector and the value vector corresponding to each character of the input text to a next transform layer.
In some implementations, the Transformer layer is configured to calculate the query vector corresponding to each character of the input prompt words, and the query vector, key vector and value vector corresponding to each character of the input text, based on the following Formula I:
Formula I:
q_i, k_i, v_i = W_q x_i, W_k x_i, W_v x_i;
q'_i = W_q x'_i;
wherein x_i denotes the i-th character of the input text, q_i denotes the query vector corresponding to the i-th character of the input text, k_i denotes the key vector corresponding to the i-th character of the input text, v_i denotes the value vector corresponding to the i-th character of the input text, x'_i denotes the i-th character of the prompt words input to the Transformer layer, q'_i denotes the query vector corresponding to the i-th character of the prompt words input to the Transformer layer, and W_q, W_k and W_v are the pre-trained parameters of the Transformer layer.
In some implementations, the fully-connected layer is configured to calculate the key vector and the value vector corresponding to each character of the input prompt words based on the following Formula II:
Formula II:
k'_i, v'_i = W'_k x'_i, W'_v x'_i;
wherein x'_i denotes the i-th character of the prompt words input to the fully-connected layer, k'_i denotes the key vector corresponding to the i-th character of the prompt words input to the fully-connected layer, and v'_i denotes the value vector corresponding to the i-th character of the prompt words input to the fully-connected layer.
In some implementations, the Transformer layer is configured to determine each updated character of the prompt words based on the following Formula III:
Formula III (reproduced as an image in the original publication):
wherein j ranges over the characters included in the prompt words input to the (l-1)-th fully-connected layer and the (l-1)-th Transformer layer, x'_{i,l} denotes the i-th updated character of the prompt words input to the l-th fully-connected layer and the l-th Transformer layer, as calculated by the (l-1)-th Transformer layer, q'_{i,l-1} denotes the query vector corresponding to the i-th character of the prompt words input to the (l-1)-th Transformer layer, as calculated by the (l-1)-th Transformer layer, k'_{j,l-1} denotes the key vector corresponding to the j-th character of the prompt words input to the (l-1)-th fully-connected layer, as calculated by the (l-1)-th fully-connected layer, and v'_{j,l-1} denotes the value vector corresponding to the j-th character of the prompt words input to the (l-1)-th fully-connected layer, as calculated by the (l-1)-th fully-connected layer.
In some implementations, the apparatus further includes:
a saving module, configured to, in response to the fine-tuning of the pre-training model being completed, save the k'_i and v'_i corresponding to each fully-connected layer for the target downstream task.
The apparatus can effectively inherit the multi-task knowledge of the pre-training stage, improves the convergence speed and performance on small-sample tasks, improves the modeling capability of the model so that generation tasks of higher difficulty can be modeled more effectively, improves the stability of model training, and solves the problem that large models fail to converge. In addition, both the deployment development difficulty and the deployment cost are low.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 3 illustrates a schematic block diagram of an example electronic device 300 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 3, the apparatus 300 includes a computing unit 301 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 302 or a computer program loaded from a storage unit 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the device 300 can also be stored. The calculation unit 301, ROM 302, and RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
Various components in device 300 are connected to I/O interface 305, including: an input unit 306 such as a keyboard, a mouse, or the like; an output unit 307 such as various types of displays, speakers, and the like; a storage unit 308 such as a magnetic disk, optical disk, or the like; and a communication unit 309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 309 allows the device 300 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 301 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 301 performs the above described methods and processes, such as the fine tuning method of the pre-trained model of the first aspect embodiment or the fine tuning method of the pre-trained model of the second aspect embodiment. For example, in some embodiments, the method for tuning the pre-trained model of the first aspect embodiment or the method for tuning the pre-trained model of the second aspect embodiment may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 308. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto device 300 via ROM 302 and/or communications unit 309. When the computer program is loaded into the RAM 303 and executed by the computing unit 301, one or more steps of the above-described method for fine tuning of the pre-trained model of the first aspect embodiment or the method for fine tuning of the pre-trained model of the second aspect embodiment may be performed. Alternatively, in other embodiments, the computing unit 301 may be configured in any other suitable way (e.g. by means of firmware) to perform the method of fine tuning of the pre-trained model of the first aspect embodiment or the method of fine tuning of the pre-trained model of the second aspect embodiment.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combining a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (27)

1. A method of fine tuning a pre-trained model, comprising:
determining a pre-training model, wherein the pre-training model comprises N Transformer layers and N fully-connected layers, each Transformer layer is connected to one fully-connected layer, N is a positive integer, the parameters corresponding to each fully-connected layer are determined based on the parameters of the Transformer layer, and the parameters of the Transformer layers are pre-trained;
determining prompt words corresponding to a target downstream task based on a pre-training database;
and inputting the prompt words and the input text corresponding to the target downstream task into the pre-training model to obtain an output result, calculating a loss value based on the output result, and adjusting and updating the parameters corresponding to the N fully-connected layers based on the loss value to fine-tune the pre-training model.
2. The method of claim 1, wherein the parameters corresponding to each fully-connected layer being determined based on the parameters of the Transformer layer comprises:
determining the parameters corresponding to each fully-connected layer based on the parameters in the multi-head attention module of the Transformer layer connected to the fully-connected layer.
3. The method of claim 1 or 2, wherein the parameters corresponding to the fully-connected layer comprise: the parameter W'_k corresponding to the key vectors and the parameter W'_v corresponding to the value vectors.
4. The method of claim 1, wherein the pre-training database comprises: the pre-training task in the pre-training stage and the cue words corresponding to the pre-training task.
5. The method of claim 4, wherein the determining, based on the pre-training database, the cue words corresponding to the target downstream task comprises:
determining, among the pre-training tasks in the pre-training database, a specific pre-training task whose similarity to the target downstream task is higher than a threshold;
and determining the cue words corresponding to the specific pre-training task as cue words corresponding to the target downstream task.
6. The method of claim 3, wherein the inputting the cue words and the input text corresponding to the target downstream task to the pre-trained model comprises:
inputting the cue words and the input text into the Transformer layer;
and inputting the cue words into the full connection layer.
7. The method of claim 6, wherein,
the full connection layer is used for:
calculating a key vector and a value vector corresponding to each character of the input cue word, and inputting the key vector and the value vector corresponding to each character of the cue word into a Transformer layer connected with the full-link layer.
8. The method of claim 6 or 7,
the Transformer layer is used for:
calculating query vectors corresponding to all characters of input prompt words, determining all updating characters of the prompt words based on the query vectors, key vectors and value vectors corresponding to all characters of the prompt words, and taking all updating characters of the prompt words as input of a next transform layer and a next full connection layer; and
calculating a query vector, a key vector and a value vector corresponding to each character of an input text, and inputting the query vector, the key vector and the value vector corresponding to each character of the input text into a next Transformer layer.
9. The method of claim 8, wherein the Transformer layer calculates the query vector corresponding to each character of the input prompt words, and the query vector, key vector and value vector corresponding to each character of the input text, based on the following Formula I:
Formula I:
q_i, k_i, v_i = W_q x_i, W_k x_i, W_v x_i;
q'_i = W_q x'_i;
wherein x_i denotes the i-th character of the input text, q_i denotes the query vector corresponding to the i-th character of the input text, k_i denotes the key vector corresponding to the i-th character of the input text, v_i denotes the value vector corresponding to the i-th character of the input text, x'_i denotes the i-th character of the prompt words input to the Transformer layer, q'_i denotes the query vector corresponding to the i-th character of the prompt words input to the Transformer layer, and W_q, W_k and W_v are the pre-trained parameters of the Transformer layer.
10. The method according to claim 7, wherein the fully-connected layer calculates the key vector and the value vector corresponding to each character of the input prompt words based on the following Formula II:
Formula II:
k'_i, v'_i = W'_k x'_i, W'_v x'_i;
wherein x'_i denotes the i-th character of the prompt words input to the fully-connected layer, k'_i denotes the key vector corresponding to the i-th character of the prompt words input to the fully-connected layer, and v'_i denotes the value vector corresponding to the i-th character of the prompt words input to the fully-connected layer.
11. The method of claim 8, wherein the Transformer layer determines each updated character of the prompt words based on the following Formula III:
Formula III (reproduced as an image in the original publication):
wherein j ranges over the characters included in the prompt words input to the (l-1)-th fully-connected layer and the (l-1)-th Transformer layer, x'_{i,l} denotes the i-th updated character of the prompt words input to the l-th fully-connected layer and the l-th Transformer layer, as calculated by the (l-1)-th Transformer layer, q'_{i,l-1} denotes the query vector corresponding to the i-th character of the prompt words input to the (l-1)-th Transformer layer, as calculated by the (l-1)-th Transformer layer, k'_{j,l-1} denotes the key vector corresponding to the j-th character of the prompt words input to the (l-1)-th fully-connected layer, as calculated by the (l-1)-th fully-connected layer, and v'_{j,l-1} denotes the value vector corresponding to the j-th character of the prompt words input to the (l-1)-th fully-connected layer, as calculated by the (l-1)-th fully-connected layer.
12. The method of claim 10, wherein the method further comprises:
in response to the fine-tuning of the pre-training model being completed, saving the k'_i and v'_i corresponding to each fully-connected layer for the target downstream task.
13. A device for fine tuning a pre-trained model, comprising:
a first determining module, configured to determine a pre-training model, wherein the pre-training model comprises N Transformer layers and N fully-connected layers, each Transformer layer is connected to one fully-connected layer, N is a positive integer, the parameters corresponding to each fully-connected layer are determined based on the parameters of the Transformer layer, and the parameters of the Transformer layers are pre-trained;
a second determining module, configured to determine prompt words corresponding to a target downstream task based on a pre-training database;
and an input module, configured to input the prompt words and the input text corresponding to the target downstream task into the pre-training model to obtain an output result, calculate a loss value based on the output result, and adjust and update the parameters corresponding to the N fully-connected layers based on the loss value to fine-tune the pre-training model.
14. The apparatus of claim 13, wherein the parameters corresponding to each fully-connected layer being determined based on the parameters of the Transformer layer comprise:
the parameters corresponding to each fully-connected layer are determined based on the parameters in the multi-head attention module of the Transformer layer connected to the fully-connected layer.
15. The apparatus of claim 13 or 14, wherein the parameters corresponding to the fully-connected layer comprise: the parameter W'_k corresponding to the key vectors and the parameter W'_v corresponding to the value vectors.
16. The apparatus of claim 13, wherein the pre-training database comprises: the pre-training task in the pre-training stage and the cue words corresponding to the pre-training task.
17. The apparatus of claim 16, wherein the second determining means is further configured to:
determining, among the pre-training tasks in the pre-training database, a specific pre-training task whose similarity to the target downstream task is higher than a threshold;
and determining the cue words corresponding to the specific pre-training task as cue words corresponding to the target downstream task.
18. The apparatus of claim 15, wherein the input module is further configured to:
inputting the cue words and the input text into the Transformer layer;
and inputting the cue words into the full connection layer.
19. The apparatus of claim 18, wherein,
the full connection layer is used for:
calculating a key vector and a value vector corresponding to each character of the input cue word, and inputting the key vector and the value vector corresponding to each character of the cue word into a Transformer layer connected with the full-link layer.
20. The apparatus of claim 18 or 19,
the Transformer layer is used for:
calculating query vectors corresponding to all characters of input prompt words, determining all updating characters of the prompt words based on the query vectors, key vectors and value vectors corresponding to all characters of the prompt words, and taking all updating characters of the prompt words as input of a next transform layer and a next full connection layer; and
calculating a query vector, a key vector and a value vector corresponding to each character of the input text, and inputting the query vector, the key vector and the value vector corresponding to each character of the input text to a next transform layer.
21. The apparatus of claim 20, wherein the Transformer layer is configured to calculate the query vector corresponding to each character of the input prompt words, and the query vector, key vector and value vector corresponding to each character of the input text, based on the following Formula I:
Formula I:
q_i, k_i, v_i = W_q x_i, W_k x_i, W_v x_i;
q'_i = W_q x'_i;
wherein x_i denotes the i-th character of the input text, q_i denotes the query vector corresponding to the i-th character of the input text, k_i denotes the key vector corresponding to the i-th character of the input text, v_i denotes the value vector corresponding to the i-th character of the input text, x'_i denotes the i-th character of the prompt words input to the Transformer layer, q'_i denotes the query vector corresponding to the i-th character of the prompt words input to the Transformer layer, and W_q, W_k and W_v are the pre-trained parameters of the Transformer layer.
22. The apparatus of claim 19, wherein the fully-connected layer is configured to calculate the key vector and the value vector corresponding to each character of the input prompt words based on the following Formula II:
Formula II:
k'_i, v'_i = W'_k x'_i, W'_v x'_i;
wherein x'_i denotes the i-th character of the prompt words input to the fully-connected layer, k'_i denotes the key vector corresponding to the i-th character of the prompt words input to the fully-connected layer, and v'_i denotes the value vector corresponding to the i-th character of the prompt words input to the fully-connected layer.
23. The apparatus of claim 20, wherein the Transformer layer is configured to determine each updated character of the prompt words based on the following Formula III:
Formula III (reproduced as an image in the original publication):
wherein j ranges over the characters included in the prompt words input to the (l-1)-th fully-connected layer and the (l-1)-th Transformer layer, x'_{i,l} denotes the i-th updated character of the prompt words input to the l-th fully-connected layer and the l-th Transformer layer, as calculated by the (l-1)-th Transformer layer, q'_{i,l-1} denotes the query vector corresponding to the i-th character of the prompt words input to the (l-1)-th Transformer layer, as calculated by the (l-1)-th Transformer layer, k'_{j,l-1} denotes the key vector corresponding to the j-th character of the prompt words input to the (l-1)-th fully-connected layer, as calculated by the (l-1)-th fully-connected layer, and v'_{j,l-1} denotes the value vector corresponding to the j-th character of the prompt words input to the (l-1)-th fully-connected layer, as calculated by the (l-1)-th fully-connected layer.
24. The apparatus of claim 22, wherein the apparatus further comprises:
a saving module, configured to, in response to the fine-tuning of the pre-training model being completed, save the k'_i and v'_i corresponding to each fully-connected layer for the target downstream task.
25. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.
26. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-12.
27. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method according to any one of claims 1-12.
CN202211502211.6A 2022-11-28 2022-11-28 Fine tuning method and device for pre-training model Active CN115906918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211502211.6A CN115906918B (en) 2022-11-28 2022-11-28 Fine tuning method and device for pre-training model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211502211.6A CN115906918B (en) 2022-11-28 2022-11-28 Fine tuning method and device for pre-training model

Publications (2)

Publication Number Publication Date
CN115906918A true CN115906918A (en) 2023-04-04
CN115906918B CN115906918B (en) 2024-05-17

Family

ID=86482990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211502211.6A Active CN115906918B (en) 2022-11-28 2022-11-28 Fine tuning method and device for pre-training model

Country Status (1)

Country Link
CN (1) CN115906918B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116797829A (en) * 2023-06-13 2023-09-22 北京百度网讯科技有限公司 Model generation method, image classification method, device, equipment and medium
CN117474084A (en) * 2023-12-25 2024-01-30 淘宝(中国)软件有限公司 Bidirectional iteration method, equipment and medium for pre-training model and downstream sequence task
CN117573857A (en) * 2024-01-16 2024-02-20 北京凌云雀科技有限公司 Intelligent document implementation method, device equipment and medium based on large model
CN117666812A (en) * 2023-10-16 2024-03-08 百度时代网络技术(北京)有限公司 Prompt word processing method and device, electronic equipment and storage medium
CN117787422A (en) * 2024-02-27 2024-03-29 四川金信石信息技术有限公司 Switching operation task extraction method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113159168A (en) * 2021-04-19 2021-07-23 清华大学 Pre-training model accelerated reasoning method and system based on redundant word deletion
CN114091452A (en) * 2021-11-23 2022-02-25 润联软件系统(深圳)有限公司 Adapter-based transfer learning method, device, equipment and storage medium
EP4020305A1 (en) * 2020-12-22 2022-06-29 Ricoh Company, Ltd. Pre-trained language model fine-tuning method and apparatus and non-transitory computer-readable medium
CN114841282A (en) * 2022-05-20 2022-08-02 北京百度网讯科技有限公司 Training method of pre-training model, and generation method and device of solution model
CN114970519A (en) * 2022-03-17 2022-08-30 北京邮电大学 Traffic flow data processing method based on data word segmentation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4020305A1 (en) * 2020-12-22 2022-06-29 Ricoh Company, Ltd. Pre-trained language model fine-tuning method and apparatus and non-transitory computer-readable medium
CN113159168A (en) * 2021-04-19 2021-07-23 清华大学 Pre-training model accelerated reasoning method and system based on redundant word deletion
CN114091452A (en) * 2021-11-23 2022-02-25 润联软件系统(深圳)有限公司 Adapter-based transfer learning method, device, equipment and storage medium
CN114970519A (en) * 2022-03-17 2022-08-30 北京邮电大学 Traffic flow data processing method based on data word segmentation
CN114841282A (en) * 2022-05-20 2022-08-02 北京百度网讯科技有限公司 Training method of pre-training model, and generation method and device of solution model

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116797829A (en) * 2023-06-13 2023-09-22 北京百度网讯科技有限公司 Model generation method, image classification method, device, equipment and medium
CN117666812A (en) * 2023-10-16 2024-03-08 百度时代网络技术(北京)有限公司 Prompt word processing method and device, electronic equipment and storage medium
CN117474084A (en) * 2023-12-25 2024-01-30 淘宝(中国)软件有限公司 Bidirectional iteration method, equipment and medium for pre-training model and downstream sequence task
CN117474084B (en) * 2023-12-25 2024-05-03 淘宝(中国)软件有限公司 Bidirectional iteration method, equipment and medium for pre-training model and downstream sequence task
CN117573857A (en) * 2024-01-16 2024-02-20 北京凌云雀科技有限公司 Intelligent document implementation method, device equipment and medium based on large model
CN117787422A (en) * 2024-02-27 2024-03-29 四川金信石信息技术有限公司 Switching operation task extraction method and system
CN117787422B (en) * 2024-02-27 2024-04-26 四川金信石信息技术有限公司 Switching operation task extraction method and system

Also Published As

Publication number Publication date
CN115906918B (en) 2024-05-17

Similar Documents

Publication Publication Date Title
CN115906918B (en) Fine tuning method and device for pre-training model
CN113239705B (en) Pre-training method and device of semantic representation model, electronic equipment and storage medium
CN112561068B (en) Simulation method, computing device, classical device, storage device and product
JP7273108B2 (en) MODEL TRAINING METHOD, APPARATUS, ELECTRONIC DEVICE, STORAGE MEDIUM, PROGRAM
JP2022018095A (en) Multi-modal pre-training model acquisition method, apparatus, electronic device and storage medium
CN113342345A (en) Operator fusion method and device of deep learning framework
WO2021218480A1 (en) Analog quantum algorithm based data search method and apparatus and device
CN111709234A (en) Training method and device of text processing model and electronic equipment
JP7297038B2 (en) Neural network model pre-training method, device, electronic device and medium
CN112749300B (en) Method, apparatus, device, storage medium and program product for video classification
US20220398834A1 (en) Method and apparatus for transfer learning
CN112989170A (en) Keyword matching method applied to information search, information search method and device
CN117539975A (en) Method, device, equipment and medium for generating prompt word information of large language model
CN114186681A (en) Method, apparatus and computer program product for generating model clusters
CN116580223A (en) Data processing and model fine tuning method and device, electronic equipment and storage medium
CN115170815A (en) Method, device and medium for processing visual task and training model
CN113468857B (en) Training method and device for style conversion model, electronic equipment and storage medium
CN114817473A (en) Methods, apparatus, devices, media and products for compressing semantic understanding models
JP2022031854A (en) Generation method of reply content, device, apparatus and storage medium
CN116257611B (en) Question-answering model training method, question-answering processing device and storage medium
CN117351299A (en) Image generation and model training method, device, equipment and storage medium
CN116597831A (en) Semantic recognition method, semantic recognition device, semantic recognition equipment, semantic recognition storage medium and vehicle
CN115456184B (en) Quantum circuit processing method, quantum state preparation device, quantum state preparation equipment and quantum state preparation medium
KR20240067967A (en) Voice wake-up method, voice wake-up device, electronic equipment, storage media, and computer program
CN113642654B (en) Image feature fusion method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant