CN116090508A - Fine tuning method and device for pre-training language model - Google Patents

Fine tuning method and device for pre-training language model

Info

Publication number
CN116090508A
Authority
CN
China
Prior art keywords
training
model
module
view
fine tuning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211227948.1A
Other languages
Chinese (zh)
Inventor
刘林林
李星漩
邴立东
李昕
司罗
沙菲克.爵蒂
梅根.维普尔.塔卡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Nanyang Technological University
Original Assignee
Alibaba China Co Ltd
Nanyang Technological University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd, Nanyang Technological University
Priority to CN202211227948.1A
Publication of CN116090508A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/268Morphological analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The embodiment of the application discloses a fine tuning method and device for a pre-training language model. The method comprises the following steps: obtaining a pre-constructed enhancement model, wherein the enhancement model is formed by adding a multi-view compressed representation module between at least two hidden layers of a pre-training language model, and the multi-view compressed representation module comprises N hierarchical self-encoders; performing fine-tuning training on the enhancement model using training data to obtain a target model, wherein the target model comprises the pre-trained enhancement model and a downstream prediction model; all model parameters are updated during the fine-tuning training, and the training objective is to minimize the difference between the output result of the downstream prediction module and the expected value; after the fine-tuning training is finished, the multi-view compressed representation module is removed from the target model to obtain a prediction model. The method and device can reduce the risk of overfitting and improve the robustness of the model.

Description

Fine tuning method and device for pre-training language model
Technical Field
The present disclosure relates to the field of deep learning technologies, and in particular, to a method and an apparatus for fine tuning a pre-training language model.
Background
In recent years, fine-tuning a large pre-trained language model to obtain a prediction model has become one of the most commonly used approaches in natural language processing, achieving excellent performance on many tasks. However, the pre-trained model is prone to overfitting in low-resource scenarios, i.e., when training data is scarce, resulting in reduced performance. Labeling large amounts of training data tends to be costly in terms of time and money. Therefore, there is a need for a model training method that can reduce the risk of overfitting and improve the prediction effect of the prediction model.
Disclosure of Invention
In view of this, the present application provides a method and apparatus for fine tuning a pre-training language model, so as to reduce the risk of overfitting and improve the prediction effect of the prediction model.
The application provides the following scheme:
in a first aspect, a method for fine tuning a pre-trained language model is provided, the method comprising:
obtaining a pre-built enhancement model, wherein the enhancement model is obtained by adding a multi-view compression representation module between at least two hidden layers of a pre-training language model, the multi-view compression representation module comprises N hierarchical self-encoders, and N is a positive integer;
performing fine tuning training on the enhancement model by using training data to obtain a target model, wherein the target model comprises the enhancement model and a downstream prediction model; updating parameters of the pre-training language model, the multi-view compression representation module and the downstream prediction model in the fine-tuning training process, wherein a training target is to minimize the difference between an output result of the downstream prediction module and an expected value;
and after the fine tuning training is finished, removing the multi-view compression representation module from the target model obtained by training to obtain a prediction model.
According to one implementation in an embodiment of the present application, the N hierarchical self-encoders employ different compression dimensions.
According to an implementation manner of the embodiment of the present application, before the training data is used to perform fine-tuning training on the enhancement model, the method further includes:
pre-training the enhancement model by using the training data, wherein only parameters of the multi-view compression representation module are updated in the pre-training process, and the training target is to minimize the difference between the input and the output of the multi-view compression representation module;
performing fine tuning training on the enhancement model using the training data includes: and performing fine tuning training on the enhancement model obtained by pre-training by utilizing the training data.
According to one possible implementation manner in the embodiments of the present application, during the pre-training and the fine-tuning training, the hidden layer preceding the multi-view compressed representation module outputs an implicit expression to the multi-view compressed representation module, and the multi-view compressed representation module randomly selects one of the N hierarchical self-encoders, or selects none, to output the implicit expression to the next hidden layer.
According to an implementation manner in the embodiment of the present application, the hierarchical self-encoder includes an encoding module, an intra-layer encoding module, and a decoding module, where the intra-layer encoding module includes M intra-layer self-encoders, and M is a positive integer;
In the hierarchical self-encoder to which the implicit expression is input, the encoding module outputs the implicit expression to the intra-layer encoding module, and the intra-layer encoding module randomly selects one of the M intra-layer self-encoders, or selects none, to output the implicit expression to the decoding module.
According to an implementation manner in the embodiment of the present application, the adding a multi-view compressed representation module between at least two hidden layers of the pre-training language model includes:
adding a multi-view compression representation module between a top hidden layer and an adjacent hidden layer of the pre-training language model; and/or the number of the groups of groups,
and adding a multi-view compressed representation module between the bottom hidden layer of the pre-training language model and the adjacent hidden layer.
According to one implementation manner in the embodiments of the present application, N is 3, and the compression dimensions of the 3 hierarchical self-encoders are 128, 256, and 512, respectively.
According to an implementation manner in the embodiment of the present application, the training data is a text sequence pair, and the expected value is a relationship type of the text sequence pair; or,
the training data is a text sequence, and the expected value is the emotion type of the text sequence; or,
the training data is a text sequence, and the expected value is a named entity in the text sequence; or,
The training data is a text sequence, and the expected value is the part of speech of at least one word in the text sequence.
In a second aspect, a fine tuning apparatus for a pre-trained language model is provided, the apparatus comprising:
the acquisition unit is configured to acquire a pre-constructed enhancement model, wherein the enhancement model is obtained by adding a multi-view compression representation module between at least two hidden layers of a pre-training language model, the multi-view compression representation module comprises N hierarchical self-encoders, and N is a positive integer;
the fine tuning unit is configured to perform fine tuning training on the enhancement model by using training data to obtain a target model, wherein the target model comprises the enhancement model obtained by pre-training and a downstream prediction model; updating parameters of the pre-training language model, the multi-view compression representation module and the downstream prediction model in the fine-tuning training process, wherein a training target is to minimize the difference between an output result of the downstream prediction module and an expected value; and after the fine tuning training is finished, removing the multi-view compression representation module from the target model obtained by training to obtain a prediction model.
According to an implementation manner in an embodiment of the present application, the apparatus further includes:
A pre-training unit configured to pre-train the enhancement model using the training data, only updating parameters of the multi-view compressed representation module during the pre-training, the training objective being to minimize differences between the input and the output of the multi-view compressed representation module;
the fine tuning unit is specifically configured to perform fine tuning training on the pre-trained enhancement model by using the training data to obtain a target model.
According to a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method of any of the first aspects described above.
According to a fourth aspect, there is provided an electronic device characterized by comprising:
one or more processors; and
a memory associated with the one or more processors, the memory for storing program instructions that, when read for execution by the one or more processors, perform the steps of the method of any of the first aspects above.
According to a specific embodiment provided by the application, the application discloses the following technical effects:
1) In the method and device of the present application, a multi-view compressed representation module containing N hierarchical self-encoders is added between at least two hidden layers of the pre-training language model. This increases the diversity of the implicit expressions processed during fine-tuning, and the hierarchical self-encoders reduce the noise in the implicit expressions, so that the weight of the noise is reduced when the downstream prediction module learns. Overfitting is thereby reduced, and the robustness and prediction accuracy of the prediction model obtained by fine-tuning are improved.
2) Before fine tuning the enhancement model in the present application, the enhancement model is first pre-trained, and only the parameters of the multi-view compressed representation module are updated to minimize the difference between the input and output of the multi-view compressed representation module. The pre-training can keep the prior knowledge of the pre-training language model to the greatest extent, and reduce the influence of the multi-view compression representation module on the prior knowledge of the original pre-training language model, thereby ensuring the prediction effect of the finally obtained prediction model.
3) In the method and device of the present application, the N hierarchical self-encoders added between at least two hidden layers of the pre-training language model adopt different compression dimensions, which further improves the diversity of implicit expressions and reduces overfitting.
4) In the pre-training and fine-tuning training process, one or no hierarchical self-encoder is randomly selected in the multi-view compression representation module for processing, so that the diversity of implicit expression is further improved, and the phenomenon of overfitting is reduced.
5) The intra-layer encoding module of the hierarchical self-encoder comprises M intra-layer self-encoders, and during pre-training and fine-tuning training one of them, or none, is randomly selected for processing, which further improves the diversity of implicit expressions and the degree of information retention, and reduces overfitting.
6) Because the bottom layers mainly play a role in data generation, inserting the multi-view compressed representation module between the bottom hidden layer and its adjacent hidden layer can add more diversity to the model. Inserting the multi-view compressed representation module between the top hidden layer and its adjacent hidden layer can more effectively mitigate overfitting of the model on the downstream task.
7) The method and device can be applied to relationship type recognition for text sequence pairs, emotion type recognition for text sequences, named entity recognition in text sequences, part-of-speech recognition for at least one word in a text sequence, and the like, and the prediction model obtained by fine-tuning the pre-training language model in this way can improve the prediction effect for these tasks.
Of course, not all of the above-described advantages need be achieved at the same time in practicing any one of the products of the present application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an exemplary system architecture diagram to which embodiments of the present application may be applied;
FIG. 2 is a flowchart of a method for fine tuning a pre-trained language model according to an embodiment of the present application;
FIG. 3 is a schematic block diagram of an enhancement model provided by an embodiment of the present application;
FIG. 4 is a schematic block diagram of a hierarchical self-encoder provided by an embodiment of the present application;
FIG. 5 is a schematic block diagram of a target model provided by an embodiment of the present application;
FIG. 6 is a schematic block diagram of a fine tuning device for a pre-training language model according to an embodiment of the present application;
fig. 7 is a schematic block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application are within the scope of the protection of the present application.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein is merely one relationship describing the association of the associated objects, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting". Similarly, the phrase "if determined" or "if (stated condition or event) is detected" may be interpreted as "when determined" or "in response to determining" or "when (stated condition or event) is detected" or "in response to detecting (stated condition or event)", depending on the context.
Fine-tuning a pre-trained language model is an effective way to transfer knowledge from a large-scale text corpus to downstream NLP (natural language processing) tasks. Most language models are designed to be generic across multiple NLP tasks, so when a pre-trained language model is used for feature extraction in a downstream task, the resulting implicit expression (hidden representation) tends to contain a large number of features that are not relevant to the downstream task, which leads to overfitting. Overfitting means that when training data is scarce, the error rate of the model on the training data gradually decreases as training progresses, but the learned knowledge generalizes poorly, resulting in an increased error rate on the test data.
Existing approaches mainly include the following two:
One is the Dropout approach, i.e., randomly setting part of the values in the implicit expression to 0. This is simple to implement, but it adds noise indiscriminately to both task-related and task-unrelated features.
The other is the Mixout approach, which randomly replaces part of the model parameters with their initial values during fine-tuning of the pre-trained language model. However, this approach requires the initial model parameters to be saved, increasing memory usage.
In view of this, the present application provides a new idea to achieve increasing training data diversity and reducing noise in features in implicit expressions by inserting a dynamic random neural network between hidden layers of a pre-training language model.
To facilitate an understanding of the present application, a brief description of the system architecture to which the present application applies is first provided. As shown in FIG. 1, an exemplary system architecture to which embodiments of the present application may be applied includes a model training device that performs model training offline and a prediction processing device that performs prediction online.
After training data is acquired, the model training device can adopt the fine tuning method of the pre-training language model provided by the embodiment of the application to process, so as to obtain a prediction model.
After receiving a prediction request, the prediction processing device generates a prediction result using the established prediction model.
The model training device and the prediction processing device may each be provided as an independent server, may be provided in the same server or server group, or may be provided in independent or the same cloud server. A cloud server, also called a cloud computing server or cloud host, is a host product in a cloud computing service system intended to overcome the drawbacks of traditional physical hosts and VPS (Virtual Private Server) services, namely high management difficulty and weak service scalability. The model training device and the prediction processing device may also be provided on a computer terminal with strong computing power.
It should be understood that the number of model training means, prediction processing means, pre-training language models and prediction models in fig. 1 is merely illustrative. There may be any number of model training means, predictive processing means, pre-trained language models and predictive models, as desired for implementation.
FIG. 2 is a flowchart of a method for fine tuning a pre-trained language model according to an embodiment of the present application, as shown in FIG. 2, the method may include the following steps:
step 202: and obtaining a pre-constructed enhancement model, wherein the enhancement model is obtained by adding a multi-view compressed representation module between at least two hidden layers of the pre-trained language model, and the multi-view compressed representation model comprises N layers of self-encoders, and N is a positive integer.
Step 206: performing fine tuning training on the enhancement model by utilizing training data to obtain a target model, wherein the target model comprises the enhancement model obtained by pre-training and a downstream prediction model; and updating parameters of the pre-training language model, the multi-view compression representation module and the downstream prediction model in the fine-tuning training process, wherein the training target is to minimize the difference between the output result of the downstream prediction module and the expected value.
Step 208: and after the fine tuning training is finished, removing the multi-view compression representation module from the target model obtained by the training to obtain a prediction model.
According to the method and the device, N layers of self-encoders are added between at least two layers of the pre-training language model, so that the diversity of the implicit expression can be increased when the implicit expression is processed in the fine tuning process, noise in the implicit expression can be reduced by using the layers of self-encoders, the weight of the noise can be reduced when a downstream prediction module learns, the phenomenon of fitting is reduced, and the robustness of fine tuning is improved.
In addition, in the whole training process, random replacement is not needed by using initial model parameters, so that compared with a Mixout mode, the memory usage amount can be reduced.
The steps in the above-described flow are described in detail one by one. The above step 202, i.e. "acquire pre-built enhancement model", will first be described in detail in connection with an embodiment.
The enhancement model referred to in the embodiments of the present application is built on the basis of a pre-trained language model. The pre-trained language model may be UniLM (Unified Language Model), GPT (Generative Pre-Training), BERT (Bidirectional Encoder Representations from Transformers), XLNet, etc.
The pre-trained language model is one of the cores of NLP. Because it is trained in an unsupervised manner, massive training samples are easy to acquire, and the trained language model contains rich semantic and grammatical knowledge, which significantly improves the effect on downstream tasks. However, the pre-trained language model is prone to overfitting during fine-tuning in low-resource scenarios, that is, when training data is relatively scarce. After analysis and research, it is found that inserting a multi-view compressed representation module between hidden layers of the pre-trained language model can effectively achieve data enhancement of the implicit expressions.
The pre-trained language model typically has a multi-layer structure. When data enters the model, each layer processes the output of the previous layer and outputs it to the next layer. Each layer in the pre-trained language model is referred to as a hidden layer, and the information passed between hidden layers is typically in the form of vectors, referred to as implicit expressions (also called hidden vectors or hidden vector representations). The term "implicit expression" is used in the description of the embodiments of the present application.
In order to enhance the diversity of implicit expressions while reducing noise in the features, a multi-view compressed representation module may be added between at least two hidden layers, each multi-view compressed representation module comprising N self-encoders, N being a positive integer. It has been verified that when the number of hierarchical self-encoders exceeds 3, the improvement in model performance is no longer significant, so the value of N is preferably 3.
To distinguish them from the intra-layer self-encoders referred to later, the N self-encoders are referred to herein as HAEs (Hierarchical AutoEncoders). As shown in FIG. 3, the pre-trained language model includes S hidden layers, denoted as hidden layer 1 through hidden layer S. A multi-view compressed representation module is added between the nth hidden layer and the (n+1)th hidden layer of the pre-trained language model, and each multi-view compressed representation module comprises 3 HAEs as an example. The I block in the multi-view compressed representation module represents a pass-through channel, whose role is described in the subsequent embodiments.
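As a minimal illustrative sketch (not the patent's actual implementation), the following PyTorch-style code shows how such a multi-view compressed representation module could be organized: N hierarchical self-encoders plus an identity pass-through channel, with the forward pass randomly routing the implicit expression through one of them. All names (MVCRModule, SimpleHAE, hidden_dim, compression_dims) are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn

class SimpleHAE(nn.Module):
    """Placeholder hierarchical self-encoder: compress, then restore the dimension."""
    def __init__(self, hidden_dim: int, compressed_dim: int):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, compressed_dim)
        self.decoder = nn.Linear(compressed_dim, hidden_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.decoder(torch.relu(self.encoder(h)))

class MVCRModule(nn.Module):
    """Multi-view compressed representation module: N HAEs + identity channel."""
    def __init__(self, hidden_dim: int, compression_dims=(128, 256, 512)):
        super().__init__()
        # One HAE per compression dimension (different "views" of the hidden state).
        self.haes = nn.ModuleList(
            SimpleHAE(hidden_dim, d) for d in compression_dims
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if not self.training:
            # At inference time the module is removed / bypassed.
            return hidden_states
        # Randomly pick one HAE, or none (the "I" pass-through channel).
        choice = random.randint(0, len(self.haes))  # len(haes) means "no HAE"
        if choice == len(self.haes):
            return hidden_states
        return self.haes[choice](hidden_states)
```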
As one of the realizations, the N-level self-encoders described above may employ the same compression dimension.
But to further increase the diversity of the implicit expression, the N-level self-encoders described above may employ different compression dimensions. It was verified that the preferred combination of compression dimensions is 128, 256 and 512 when N takes 3.
A self-encoder is a special neural network that reduces noise in the input features that is unrelated to downstream tasks, by compressing the input to a lower dimension and then restoring it to the same dimension as the input.
As one of the realizations, the hierarchical self-encoder may be a general self-encoder, that is, a self-encoder composed of an encoding module and a decoding module.
However, in order to further increase the diversity of implicit expressions and the degree of information retention, the present application provides a more preferred embodiment, i.e. each level self-encoder may include an encoding module, an intra-layer encoding module and a decoding module as shown in fig. 4. Wherein the intra-layer coding module may include M intra-layer self-encoders, M being a positive integer. In fig. 4, 2 is taken as an example of M. The I-block in the intra-layer coding block represents a pass-through channel, the function of which will be specifically referred to in the following embodiments.
The M intra-layer self-encoders may use the same compression dimension or different compression dimensions. As one realization, the compression dimension of an intra-layer self-encoder may be half of the compression dimension of the HAE to which it belongs.
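The following is a hedged sketch, under the same illustrative assumptions as above, of such a hierarchical self-encoder: an encoding module, an intra-layer encoding module holding M intra-layer self-encoders plus an identity channel, and a decoding module, with the intra-layer compression dimension set to half of the HAE's dimension as described. A module of this kind could stand in for the SimpleHAE placeholder in the earlier sketch.

```python
import random
import torch
import torch.nn as nn

class IntraLayerAE(nn.Module):
    """One intra-layer self-encoder inside a HAE."""
    def __init__(self, dim: int, inner_dim: int):
        super().__init__()
        self.enc = nn.Linear(dim, inner_dim)
        self.dec = nn.Linear(inner_dim, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.dec(torch.relu(self.enc(z)))

class HierarchicalAE(nn.Module):
    """Encoding module -> intra-layer encoding module -> decoding module."""
    def __init__(self, hidden_dim: int, compressed_dim: int, num_intra: int = 2):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, compressed_dim)
        # M intra-layer self-encoders, each compressing to half the HAE dimension.
        self.intra = nn.ModuleList(
            IntraLayerAE(compressed_dim, compressed_dim // 2)
            for _ in range(num_intra)
        )
        self.decoder = nn.Linear(compressed_dim, hidden_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.encoder(h))
        # Randomly route through one intra-layer self-encoder, or none (identity).
        choice = random.randint(0, len(self.intra))
        if choice < len(self.intra):
            z = self.intra[choice](z)
        return self.decoder(z)
```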
The multi-view compressed representation module has great flexibility and can be inserted between any two hidden layers of the pre-training language model. Through verification, a multi-view compression representation module is inserted between the top hidden layer and the adjacent hidden layer (namely the penultimate layer), or between the bottom hidden layer and the adjacent hidden layer (namely the second layer), so that a better effect can be achieved. The underlying module is mainly used for generating data, so that the multi-view compressed representation module is inserted between the underlying hidden layer and the adjacent hidden layer, and more diversity can be added to the model. And the multi-view compressed representation module is inserted between the top hidden layer and the adjacent hidden layers, so that the overfitting of the model on the downstream task can be more effectively slowed down.
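As a toy illustration of these insertion points (not tied to any particular library), the sketch below wraps a stack of hidden layers and routes the hidden states through an MVCR module after the bottom hidden layer and after the penultimate hidden layer, reusing the MVCRModule sketch above. ToyEncoder and EnhancedModel are assumed stand-ins for the pre-training language model and the enhancement model.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for a pre-trained language model with S hidden layers."""
    def __init__(self, hidden_dim: int = 768, num_layers: int = 12):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=12, batch_first=True)
            for _ in range(num_layers)
        )

class EnhancedModel(nn.Module):
    """Encoder with MVCR modules inserted between selected hidden layers."""
    def __init__(self, encoder: ToyEncoder, hidden_dim: int = 768):
        super().__init__()
        self.encoder = encoder
        n = len(encoder.layers)
        # "Insert after layer i" -> MVCR module; here after the bottom hidden
        # layer (index 0) and after the penultimate hidden layer (index n - 2).
        self.mvcr = nn.ModuleDict({
            str(0): MVCRModule(hidden_dim),
            str(n - 2): MVCRModule(hidden_dim),
        })

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, layer in enumerate(self.encoder.layers):
            x = layer(x)
            if str(i) in self.mvcr:          # pass through MVCR where inserted
                x = self.mvcr[str(i)](x)
        return x
```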
In the embodiment shown in fig. 2, the fine tuning training is directly performed on the enhancement model to obtain the target model. The method can alleviate the phenomenon of over fitting to a certain extent and improve the robustness of fine tuning. In order to better ensure the model effect, between the above steps 202 and 206, a step 204, as indicated by a dashed box in fig. 2, may be further included, which is a preferred step.
Step 204: the enhancement model is pre-trained with training data, only parameters of the multi-view compressed representation module are updated during the pre-training process, and the training goal is to minimize differences between the input and the output of the multi-view compressed representation module.
The pre-training of the enhancement model in the step is to pre-train the multi-view compression representation module in the enhancement model in practice, so as to keep the prior knowledge of the pre-training language model to the greatest extent, avoid unreasonable influence on the implicit expression of the pre-training language model caused by the addition of the multi-view compression representation module, and further ensure the prediction effect of the prediction model finally obtained after the subsequent fine adjustment.
The training data involved is determined by the downstream prediction task. Several scenarios of downstream prediction tasks are listed below:
First kind: the training data includes a large number of text sequence pairs and labels for the relationship types of the text sequence pairs, which are expected values for subsequent fine-tuning training.
The text sequence pairs may be taken from, for example, the SNLI (Stanford Natural Language Inference) dataset, the MNLI (Multi-Genre Natural Language Inference) dataset, etc. The relationship types of the text sequence pairs may include, for example, entailment, contradiction, and neutral relationships.
The relationship type of the input text sequence pair can be predicted by utilizing the prediction model obtained by training the training data.
Second kind: the training data includes a number of text sequences and labels for emotion types of the text sequences, which are expected values for subsequent fine-tuning training.
These text sequences, mostly sentence-level content, may be taken from, for example, the IMDB (Internet Movie Database) dataset, the Yelp dataset, etc. The IMDB dataset contains 50,000 highly polarized reviews from the Internet Movie Database, which reflect user-specific emotion types. The Yelp dataset consists of approximately 160,000 businesses, 8.63 million reviews, and 200,000 pictures from 8 metropolitan areas; the 8.63 million reviews likewise reflect user-specific emotion types, such as like, dislike, and neutral.
The prediction model obtained by training the training data can be used for predicting the emotion type corresponding to the input text sequence.
Third kind: the training data includes a number of text sequences and labels for named entities in the text sequences as expected values for subsequent fine-tuning training.
These text sequences may be taken from, for example, the WikiAnn dataset. The WikiAnn dataset is a multilingual named entity recognition dataset composed of Wikipedia articles, in which locations, persons, organizations, and the like are annotated.
The named entity in the input text sequence can be predicted by using the prediction model obtained by training the training data.
Fourth kind: the training data includes a plurality of text sequences and labels for parts of speech of at least one word in the text sequences.
These text sequences may be taken from, for example, the Universal Dependencies v2.5 dataset, which contains annotations of parts of speech, morphological features, syntactic features, etc. for text in multiple languages. The embodiments of the present application use the part-of-speech annotations.
The part of speech of the target word in the input text sequence can be predicted by using the prediction model obtained by training the training data.
In the pre-training process, an input sequence is constructed from the text sequence pairs or text sequences in the training data, and the input sequence is input into the enhancement model.
Assume the hidden layer preceding the multi-view compressed representation module is the nth hidden layer. After the nth hidden layer outputs the implicit expression to the multi-view compressed representation module, the module randomly selects one of the N HAEs it contains, or selects none. If one HAE is selected, the selected HAE processes the input implicit expression and outputs the processed implicit expression to the (n+1)th hidden layer. If no HAE is selected, the implicit expression output by the nth hidden layer goes through the I block shown in FIG. 3, which represents a pass-through channel, i.e., the implicit expression output by the nth hidden layer is input directly to the (n+1)th hidden layer. Such randomness can effectively increase the diversity of implicit expressions.
If the HAE adopts the structure shown in FIG. 4, i.e., includes M intra-layer self-encoders, then in the hierarchical self-encoder to which the implicit expression is input (i.e., the randomly selected one), the encoding module encodes the implicit expression and outputs it to the intra-layer encoding module. The intra-layer encoding module randomly selects one of the M intra-layer self-encoders, or selects none, and outputs an implicit expression to the decoding module. If one intra-layer self-encoder is selected, it processes the input implicit expression and outputs the processed implicit expression to the decoding module. If none is selected, the implicit expression output by the encoding module goes through the I block shown in FIG. 4, which represents a pass-through channel, i.e., the implicit expression output by the encoding module is input directly to the decoding module.
The initial parameters of the multi-view compressed representation module may be randomly initialized for pre-training. In the process of pre-training the enhancement model, the parameters of the pre-trained language model are kept unchanged, and only the parameters of the multi-view compressed representation module are updated. The goal of the pre-training is that the addition of the multi-view compressed representation module should not unreasonably affect the implicit expressions of the original pre-trained language model, i.e., the difference between the input and the output of the multi-view compressed representation module should be as small as possible. Therefore, the parameters of the multi-view compressed representation module may be optimized using a reconstruction loss function L_MSE, which can be expressed as:

L_{MSE} = \frac{1}{M \cdot L} \sum_{m=1}^{M} \sum_{i=1}^{L} \left\| HAE_m\left(h_i^n\right) - h_i^n \right\|^2

where h_i^n is the implicit expression output by the nth hidden layer for the ith Token (element) of the enhancement model's input sequence, HAE_m(·) is the processing function of the mth HAE in the multi-view compressed representation module, M is the number of HAEs in the multi-view compressed representation module, and L is the length of the input sequence of the enhancement model, i.e., the number of Tokens it contains.
In each iteration of pre-training, the value of the reconstruction loss function is used to update the parameters of the multi-view compressed representation module in the enhancement model by gradient descent, until a preset training end condition is met. The training end condition may include, for example, the value of the loss function being less than or equal to a preset loss function threshold, the number of iterations reaching a preset threshold, etc.
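A hedged sketch of this pre-training step, continuing the illustrative ToyEncoder/EnhancedModel/MVCRModule sketches above, is shown below: the pre-trained language model is frozen, only the MVCR parameters are optimized, and the objective is the reconstruction loss between each HAE's input and output. The batch_hidden_states argument (a mapping from insertion point to the implicit expressions feeding that MVCR module) is an assumed helper output, not part of any real library.

```python
import torch
import torch.nn.functional as F

model = EnhancedModel(ToyEncoder())          # illustrative sketches from above

# Freeze the pre-trained language model; train only the MVCR modules.
for p in model.encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(model.mvcr.parameters(), lr=1e-4)

def reconstruction_loss(hidden: torch.Tensor, mvcr: MVCRModule) -> torch.Tensor:
    # Average squared difference between each HAE's input and output (L_MSE).
    losses = [F.mse_loss(hae(hidden), hidden) for hae in mvcr.haes]
    return torch.stack(losses).mean()

def pretrain_step(batch_hidden_states: dict) -> float:
    # batch_hidden_states: {insertion point -> implicit expressions output by
    # the hidden layer preceding that MVCR module} for one batch (assumed input).
    optimizer.zero_grad()
    loss = torch.stack([
        reconstruction_loss(h, model.mvcr[pos])
        for pos, h in batch_hidden_states.items()
    ]).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```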
The above step 206, i.e., "performing fine-tuning training on the enhancement model using training data to obtain a target model", is described in detail below in connection with an embodiment.
After the pre-training is finished, the target model can be constructed from the pre-trained enhancement model according to the specific downstream prediction task: a downstream prediction model is added on top of the enhancement model to obtain the target model, as shown in FIG. 5.
Fine-tuning is a common training method in the deep learning field: a pre-training language model is applied to a downstream task (in this embodiment, the prediction task corresponding to the downstream prediction model), and the parameters of the pre-training language model are then further adjusted with a smaller learning rate, so as to optimize its performance on the downstream task. This process is called fine-tuning.
In the fine-tuning training process, an input sequence is constructed from the text sequence pairs or text sequences in the training data, the input sequence is input into the enhancement model, the enhancement model outputs the implicit expression of the input sequence to the downstream prediction model, and the downstream prediction model obtains a prediction result from the implicit expression. The downstream prediction module may be a regression model or a classification model, depending on the specific training task. The training goal of the fine-tuning training is to minimize the difference between the prediction result of the downstream prediction module and the expected value. For example, for relationship type prediction of text sequence pairs, the difference between the prediction result and the relationship type annotated in the training data is minimized. For prediction of the emotion type of a text sequence, the difference between the prediction result and the emotion type annotated in the training data is minimized. For prediction of named entities in a text sequence, the difference between the prediction result and the named entities annotated in the training data is minimized. For prediction of the part of speech of at least one word in a text sequence, the difference between the predicted part of speech and the part of speech annotated in the training data is minimized.
In this embodiment of the present disclosure, the loss function may be constructed according to the training target, and the model parameters may be updated by using the value of the loss function in each iteration, and using a manner such as gradient descent, until a preset training end condition is satisfied. The training ending condition may include, for example, the value of the loss function being less than or equal to a preset loss function threshold, the number of iterations reaching a preset number of times threshold, etc.
And updating all model parameters in the fine tuning training process, wherein the parameters comprise parameters of a language model, a multi-view compressed representation module and a downstream prediction model.
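A minimal sketch of one fine-tuning step under the same illustrative assumptions is given below: the enhancement model feeds its implicit expression to a downstream prediction head, all parameters (language model, MVCR modules, prediction head) are updated, and the loss measures the gap between the prediction and the expected value (cross-entropy for a classification task here). Using the first token's representation for prediction, and passing already-embedded inputs, are simplifications made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enhanced = EnhancedModel(ToyEncoder())       # illustrative sketches from above
head = nn.Linear(768, 3)                     # e.g. 3 relation/emotion classes
optimizer = torch.optim.AdamW(
    list(enhanced.parameters()) + list(head.parameters()), lr=2e-5
)

def finetune_step(inputs: torch.Tensor, labels: torch.Tensor) -> float:
    # inputs: already-embedded token representations (batch, seq_len, 768);
    # the tokenization / embedding step is omitted for brevity.
    enhanced.train()
    optimizer.zero_grad()
    hidden = enhanced(inputs)                # implicit expression of the sequence
    logits = head(hidden[:, 0])              # predict from the first token
    loss = F.cross_entropy(logits, labels)   # gap between prediction and expected value
    loss.backward()
    optimizer.step()
    return loss.item()
```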
Similarly to the pre-training process, assume the hidden layer preceding the multi-view compressed representation module is the nth hidden layer. After the nth hidden layer outputs the implicit expression to the multi-view compressed representation module, the module randomly selects one of the N HAEs it contains, or selects none. If one HAE is selected, the selected HAE processes the input implicit expression and outputs the processed implicit expression to the (n+1)th hidden layer. If no HAE is selected, the implicit expression output by the nth hidden layer goes through the I block shown in FIG. 3, which represents a pass-through channel, i.e., it is input directly to the (n+1)th hidden layer. Such randomness can effectively increase the diversity of implicit expressions.
If the HAE adopts the structure shown in FIG. 4, i.e., includes M intra-layer self-encoders, then in the hierarchical self-encoder to which the implicit expression is input (i.e., the randomly selected one), the encoding module encodes the implicit expression and outputs it to the intra-layer encoding module. The intra-layer encoding module randomly selects one of the M intra-layer self-encoders, or selects none, and outputs an implicit expression to the decoding module. If one intra-layer self-encoder is selected, it processes the input implicit expression and outputs the processed implicit expression to the decoding module. If none is selected, the implicit expression output by the encoding module goes through the I block shown in FIG. 4, i.e., it is input directly to the decoding module.
By adding the self-encoder, noise in implicit expression is reduced, so that the weight of the noise can be reduced when a downstream prediction module learns in the fine-tuning training process, and the phenomenon of over-fitting is reduced.
The above step 208, i.e., "after the fine tuning training is finished, removing the multi-view compressed representation module from the target model obtained by training to obtain a prediction model", is described below.
After the multi-view compressed representation module is removed, the pre-training language model obtained through fine-tuning training and the downstream prediction module form the prediction model, which can output the corresponding prediction result for an input sequence. For example, for an input sequence composed of a text sequence pair, the relationship type of the text sequence pair is predicted. For an input sequence composed of a text sequence, the emotion type of the text sequence is predicted, or the named entities in the text sequence are predicted, or the part of speech of at least one word in the text sequence is predicted, and so on.
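Continuing the same illustrative sketch, obtaining the prediction model could look like the following: the MVCR modules are dropped, and only the fine-tuned language model plus the downstream prediction head are kept on the inference path.

```python
import torch
import torch.nn as nn

class PredictionModel(nn.Module):
    """Fine-tuned language model + downstream prediction head, with MVCR removed."""
    def __init__(self, enhanced: EnhancedModel, head: nn.Module):
        super().__init__()
        self.encoder = enhanced.encoder      # fine-tuned language model only
        self.head = head                     # downstream prediction module

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.encoder.layers:    # no MVCR modules on this path
            x = layer(x)
        return self.head(x[:, 0]).argmax(dim=-1)

predictor = PredictionModel(enhanced, head)  # 'enhanced' and 'head' from the sketch above
```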
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
According to an embodiment of another aspect, a fine tuning device of a pre-trained language model is provided. FIG. 6 illustrates a schematic block diagram of a fine tuning device for a pre-trained language model according to one embodiment, which is disposed in the model training device in the architecture shown in FIG. 1. As shown in fig. 6, the apparatus 600 includes: the acquisition unit 601 and the fine tuning unit 603 may further comprise a pre-training unit 602. Wherein the main functions of each constituent unit are as follows:
the obtaining unit 601 is configured to obtain a pre-built enhancement model, where the enhancement model is obtained by adding a multi-view compressed representation module between at least two hidden layers of the pre-training language model, and the multi-view compressed representation module comprises N hierarchical self-encoders, where N is a positive integer.
And a fine tuning unit 603 configured to perform fine-tuning training on the enhancement model by using the training data, to obtain a target model. The target model comprises the pre-trained enhancement model and a downstream prediction model; parameters of the pre-training language model, the multi-view compressed representation module and the downstream prediction model are updated during the fine-tuning training, and the training target is to minimize the difference between the output result of the downstream prediction module and the expected value; after the fine-tuning training is finished, the multi-view compressed representation module is removed from the target model obtained by training to obtain a prediction model.
As one of the realizations, the pre-training unit 602 is configured to pre-train the enhancement model with training data, only the parameters of the multi-view compressed representation module are updated during the pre-training, the training objective being to minimize the difference between the input and the output of the multi-view compressed representation module.
Accordingly, the fine tuning unit 603 is specifically configured to perform fine tuning training on the pre-trained enhancement model by using the training data, so as to obtain the target model.
As one of the preferred ways, the N hierarchical self-encoders employ different compression dimensions.
As one of the realizations, during the pre-training and fine-tuning training, the hidden layer preceding the multi-view compressed representation module outputs an implicit expression to the multi-view compressed representation module, which randomly selects one of the N hierarchical self-encoders, or none, to output the implicit expression to the next hidden layer.
As a more preferable mode, the hierarchical self-encoder comprises an encoding module, an intra-layer encoding module and a decoding module, wherein the intra-layer encoding module comprises M intra-layer self-encoders, and M is a positive integer.
The coding module in the hierarchical self-encoder, to which the implicit expression is input, outputs the implicit expression to the intra-layer coding module, and the intra-layer coding module randomly selects one or none of the M intra-layer self-encoders to output the implicit expression to the decoding module.
As a more preferred mode, the multi-view compressed representation module is added between at least two hidden layers of the pre-training language model and comprises the following components:
adding a multi-view compression representation module between a top hidden layer of the pre-training language model and an adjacent hidden layer of the pre-training language model; and/or the number of the groups of groups,
a multi-view compressed representation module is added between the underlying hidden layer of the pre-training language model and the adjacent hidden layer.
As one of the preferred modes, N is 3, and the compression dimensions of the 3 hierarchical self-encoders are 128, 256 and 512, respectively.
Depending on the application scenario, the following situations may be included, but are not limited:
the training data is a text sequence pair, and the expected value is the relation type of the text sequence pair; or,
training data is a text sequence, and expected values are emotion types of the text sequence; or,
the training data is a text sequence, and the expected value is a named entity in the text sequence; or,
the training data is a text sequence, and the expected value is the part of speech of at least one word in the text sequence.
In addition, the embodiment of the application further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method of any one of the foregoing method embodiments.
And an electronic device comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read for execution by the one or more processors, perform the steps of the method of any of the preceding method embodiments.
The present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any of the preceding method embodiments.
Fig. 7 illustrates an architecture of an electronic device, which may include a processor 710, a video display adapter 711, a disk drive 712, an input/output interface 713, a network interface 714, and a memory 720, among others. The processor 710, the video display adapter 711, the disk drive 712, the input/output interface 713, the network interface 714, and the memory 720 may be communicatively connected via a communication bus 730.
The processor 710 may be implemented by a general-purpose CPU, a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc. for executing related programs to implement the technical solutions provided herein.
The Memory 720 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory ), static storage device, dynamic storage device, or the like. The memory 720 may store an operating system 721 for controlling the operation of the electronic device 700, and a Basic Input Output System (BIOS) 722 for controlling the low-level operation of the electronic device 700. In addition, a web browser 723, a data storage management system 724, and a fine tuning device 725 for the pre-trained language model, etc. may also be stored. The foregoing fine tuning device 725 of the pre-training language model may be an application program that specifically implements the foregoing operations of each step in the embodiments of the present application. In general, when implemented in software or firmware, the relevant program code is stored in memory 720 and executed by processor 710.
The input/output interface 713 is used to connect with an input/output module to enable information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
The network interface 714 is used to connect communication modules (not shown) to enable communication interactions of the device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 730 includes a path to transfer information between various components of the device (e.g., processor 710, video display adapter 711, disk drive 712, input/output interface 713, network interface 714, and memory 720).
It should be noted that although the above devices illustrate only the processor 710, the video display adapter 711, the disk drive 712, the input/output interface 713, the network interface 714, the memory 720, the bus 730, etc., the device may include other components necessary to achieve proper operation in an implementation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the present application, and not all the components shown in the drawings.
From the above description of embodiments, it will be apparent to those skilled in the art that the present application may be implemented in software plus a necessary general purpose hardware platform. Based on such understanding, the technical solutions of the present application may be embodied essentially or in a part contributing to the prior art in the form of a computer program product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and include several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for a system or system embodiment, since it is substantially similar to a method embodiment, the description is relatively simple, with reference to the description of the method embodiment being made in part. The systems and system embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The foregoing has outlined the detailed description of the preferred embodiment of the present application, and the detailed description of the principles and embodiments of the present application has been provided herein by way of example only to facilitate the understanding of the method and core concepts of the present application; also, as will occur to those of ordinary skill in the art, many modifications are possible in view of the teachings of the present application, both in the detailed description and the scope of its applications. In view of the foregoing, this description should not be construed as limiting the application.

Claims (12)

1. A method of fine tuning a pre-trained language model, the method comprising:
obtaining a pre-built enhancement model, wherein the enhancement model is obtained by adding a multi-view compression representation module between at least two hidden layers of a pre-training language model, the multi-view compression representation module comprises N hierarchical self-encoders, and N is a positive integer;
performing fine tuning training on the enhancement model by using training data to obtain a target model, wherein the target model comprises the enhancement model and a downstream prediction model; updating parameters of the pre-training language model, the multi-view compression representation module and the downstream prediction model in the fine-tuning training process, wherein a training target is to minimize the difference between an output result of the downstream prediction module and an expected value;
and after the fine tuning training is finished, removing the multi-view compression representation module from the target model obtained by training to obtain a prediction model.
2. The method of claim 1, wherein the N levels of self-encoders employ different compression dimensions.
3. The method of claim 1, further comprising, before the performing fine tuning training on the enhancement model by using training data:
pre-training the enhancement model by using the training data, wherein only parameters of the multi-view compressed representation module are updated during the pre-training, and the training objective is to minimize the difference between the input and the output of the multi-view compressed representation module;
wherein the performing fine tuning training on the enhancement model by using the training data comprises: performing fine tuning training, by using the training data, on the enhancement model obtained by the pre-training.
4. The method of claim 3, wherein, during the pre-training and the fine tuning training, the hidden representation output by the hidden layer preceding the multi-view compressed representation module is input to the multi-view compressed representation module, and the multi-view compressed representation module randomly selects one of the N levels of self-encoders, or selects none of them, to process the hidden representation and output it to the next hidden layer.
5. The method of claim 1, 3 or 4, wherein each level of self-encoder comprises an encoding module, an intra-layer encoding module and a decoding module, the intra-layer encoding module comprising M intra-layer self-encoders, M being a positive integer;
the encoding module of the level of self-encoder that receives the hidden representation outputs the encoded representation to the intra-layer encoding module, and the intra-layer encoding module randomly selects one of the M intra-layer self-encoders, or selects none of them, to output the representation to the decoding module.
6. The method of claim 1, wherein adding a multi-view compressed representation module between at least two hidden layers of the pre-training language model comprises:
adding a multi-view compressed representation module between the top hidden layer of the pre-training language model and its adjacent hidden layer; and/or
adding a multi-view compressed representation module between the bottom hidden layer of the pre-training language model and its adjacent hidden layer.
7. The method of claim 2, wherein N is 3, and the compression dimensions of the 3 levels of self-encoders are 128, 256 and 512, respectively.
8. The method of any one of claims 1 to 4, 6 and 7, wherein the training data is a text sequence pair and the expected value is a relationship type of the text sequence pair; or,
the training data is a text sequence, and the expected value is the emotion type of the text sequence; or,
the training data is a text sequence, and the expected value is a named entity in the text sequence; or,
the training data is a text sequence, and the expected value is the part of speech of at least one word in the text sequence.
9. A device for fine tuning a pre-trained language model, the device comprising:
an acquisition unit configured to acquire a pre-constructed enhancement model, wherein the enhancement model is obtained by adding a multi-view compressed representation module between at least two hidden layers of a pre-training language model, the multi-view compressed representation module comprises N levels of self-encoders, and N is a positive integer;
a fine tuning unit configured to perform fine tuning training on the enhancement model by using training data to obtain a target model, wherein the target model comprises the enhancement model obtained by pre-training and a downstream prediction model; parameters of the pre-training language model, the multi-view compressed representation module and the downstream prediction model are all updated during the fine tuning training, and the training objective is to minimize the difference between an output result of the downstream prediction model and an expected value; and after the fine tuning training is finished, the multi-view compressed representation module is removed from the target model obtained by the training to obtain a prediction model.
10. The device of claim 9, wherein the device further comprises:
a pre-training unit configured to pre-train the enhancement model by using the training data, wherein only parameters of the multi-view compressed representation module are updated during the pre-training, and the training objective is to minimize the difference between the input and the output of the multi-view compressed representation module;
wherein the fine tuning unit is specifically configured to perform fine tuning training, by using the training data, on the enhancement model obtained by the pre-training, to obtain the target model.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
12. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors, the memory being configured to store program instructions that, when read and executed by the one or more processors, perform the steps of the method according to any one of claims 1 to 8.
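
By way of a non-limiting illustration of the multi-view compressed representation module recited in claims 1, 2, 4, 5 and 7, the following PyTorch-style sketch shows N levels of self-encoders with different compression dimensions, each containing M intra-layer self-encoders, where one level (or none) is randomly selected per forward pass. The class names, layer sizes, activation functions and the value M=2 are illustrative assumptions, not the patented implementation.

# A minimal sketch, assuming a PyTorch environment; names and sizes are illustrative.
import random
import torch.nn as nn

class IntraLayerAE(nn.Module):
    # One intra-layer self-encoder operating inside the bottleneck space.
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, z):
        return self.net(z)

class LevelSelfEncoder(nn.Module):
    # One level of self-encoder (cf. claim 5): an encoding module, M intra-layer
    # self-encoders of which one (or none) is randomly applied, and a decoding module.
    def __init__(self, hidden_size, compression_dim, m=2):
        super().__init__()
        self.encoder = nn.Linear(hidden_size, compression_dim)
        self.intra = nn.ModuleList([IntraLayerAE(compression_dim) for _ in range(m)])
        self.decoder = nn.Linear(compression_dim, hidden_size)

    def forward(self, h):
        z = self.encoder(h)
        choice = random.randint(-1, len(self.intra) - 1)   # -1 means "select none"
        if choice >= 0:
            z = self.intra[choice](z)
        return self.decoder(z)

class MVCR(nn.Module):
    # Multi-view compressed representation module: N levels of self-encoders with
    # different compression dimensions; per pass, one level (or none) is applied
    # to the hidden states passed between two hidden layers of the language model.
    def __init__(self, hidden_size=768, compression_dims=(128, 256, 512)):
        super().__init__()
        self.levels = nn.ModuleList(
            [LevelSelfEncoder(hidden_size, d) for d in compression_dims])

    def forward(self, h):
        if not self.training:
            return h                                        # inert at inference time
        choice = random.randint(-1, len(self.levels) - 1)   # -1 means "select none"
        if choice < 0:
            return h
        return self.levels[choice](h)

Because the module passes hidden states through unchanged at inference time and is removed after training, the deployed prediction model keeps the original architecture of the pre-training language model; inserting such a module both after the bottom hidden layer and before the top hidden layer, as in claim 6, amounts to instantiating the class twice.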
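
The two-stage procedure of claims 1, 3 and 10 could likewise be sketched as below, building on the MVCR class above. For simplicity, the pre-training stage reconstructs the hidden states through every level of self-encoder rather than through a randomly selected one; the helpers hidden_states_before_mvcr and remove_mvcr, the optimizers and the hyper-parameters are assumptions made only for illustration.

import torch
import torch.nn.functional as F

def pretrain_mvcr(enhanced_model, mvcr, train_loader, steps=1000, lr=1e-4):
    # Stage 1: only the parameters of the MVCR module are updated; the objective
    # is to minimize the difference between the module's input and its output.
    opt = torch.optim.AdamW(mvcr.parameters(), lr=lr)
    for _, batch in zip(range(steps), train_loader):
        with torch.no_grad():
            h = enhanced_model.hidden_states_before_mvcr(batch["inputs"])  # assumed helper
        loss = sum(F.mse_loss(level(h), h) for level in mvcr.levels)
        opt.zero_grad()
        loss.backward()
        opt.step()

def finetune(enhanced_model, predictor, train_loader, epochs=3, lr=2e-5):
    # Stage 2: the pre-training language model, the MVCR module and the downstream
    # prediction model are all updated; the objective is the task loss between the
    # output of the downstream prediction model and the expected value.
    params = list(enhanced_model.parameters()) + list(predictor.parameters())
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        for batch in train_loader:
            logits = predictor(enhanced_model(batch["inputs"]))
            loss = F.cross_entropy(logits, batch["labels"])
            opt.zero_grad()
            loss.backward()
            opt.step()

# After fine tuning, the module would be stripped, e.g. enhanced_model.remove_mvcr()
# (assumed method), so that the deployed prediction model has the original architecture.

For the tasks enumerated in claim 8, only the composition of batch["labels"] changes: relationship types for text sequence pairs, emotion types of text sequences, named-entity tags, or part-of-speech tags.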
CN202211227948.1A 2022-10-08 2022-10-08 Fine tuning method and device for pre-training language model Pending CN116090508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211227948.1A CN116090508A (en) 2022-10-08 2022-10-08 Fine tuning method and device for pre-training language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211227948.1A CN116090508A (en) 2022-10-08 2022-10-08 Fine tuning method and device for pre-training language model

Publications (1)

Publication Number Publication Date
CN116090508A true CN116090508A (en) 2023-05-09

Family

ID=86205243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211227948.1A Pending CN116090508A (en) 2022-10-08 2022-10-08 Fine tuning method and device for pre-training language model

Country Status (1)

Country Link
CN (1) CN116090508A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994099A (en) * 2023-09-28 2023-11-03 北京科技大学 Feature decoupling small amount of sample pre-training model robustness fine adjustment method and device
CN116994099B (en) * 2023-09-28 2023-12-22 北京科技大学 Feature decoupling small amount of sample pre-training model robustness fine adjustment method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination