CN116362351A - Method and device for training pre-training language model by using noise disturbance - Google Patents

Method and device for training pre-training language model by using noise disturbance

Info

Publication number
CN116362351A
Authority
CN
China
Prior art keywords
training
language model
data set
parameter matrix
target task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310614779.5A
Other languages
Chinese (zh)
Other versions
CN116362351B (en)
Inventor
吴亚军
暴宇健
王芳
徐琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xumi Yuntu Space Technology Co Ltd
Original Assignee
Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xumi Yuntu Space Technology Co Ltd filed Critical Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority to CN202310614779.5A priority Critical patent/CN116362351B/en
Publication of CN116362351A publication Critical patent/CN116362351A/en
Application granted granted Critical
Publication of CN116362351B publication Critical patent/CN116362351B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to the technical field of machine learning, and provides a method and a device for training a pre-training language model by using noise disturbance. The method comprises the following steps: acquiring a training data set and a pre-training language model corresponding to a target task; calculating the noise disturbance corresponding to each parameter matrix in the pre-training language model, and updating each parameter matrix according to its noise disturbance; and optimizing, based on the target task, the bias items and the updated parameter matrices in the pre-training language model by using the training data set. These technical means solve the prior-art problem that a model obtained by further training an already pre-trained large-scale model often overfits and has low generalization capability.

Description

Method and device for training pre-training language model by using noise disturbance
Technical Field
The present disclosure relates to the field of machine learning technologies, and in particular, to a method and apparatus for training a pre-training language model using noise disturbance.
Background
In recent years, with the development of machine learning technology, more and more large-scale models have been applied to the language field. To ensure that a large-scale model meets the requirements and to improve training efficiency, it is currently common to further train a model that has already been pre-trained. However, the model obtained by further training a pre-trained large-scale model often suffers from overfitting and low generalization capability.
Disclosure of Invention
In view of this, the embodiments of the present application provide a method, an apparatus, an electronic device, and a computer readable storage medium for training a pre-training language model by using noise disturbance, so as to solve the prior-art problem that a model obtained by further training an already pre-trained large-scale model often overfits and has low generalization capability.
In a first aspect of an embodiment of the present application, there is provided a method for training a pre-training language model using noise disturbance, including: acquiring a training data set and a pre-training language model corresponding to a target task; calculating noise disturbance corresponding to each parameter matrix in the pre-training language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix; based on the target task, bias items and updated parameter matrices in the pre-training language model are optimized using the training data set.
In a second aspect of the embodiments of the present application, there is provided an apparatus for training a pre-training language model using noise disturbance, including: the acquisition module is configured to acquire a training data set and a pre-training language model corresponding to the target task; the computing module is configured to compute noise disturbance corresponding to each parameter matrix in the pre-training language model and update the parameter matrix according to the noise disturbance corresponding to each parameter matrix; the training module is configured to optimize bias items and updated parameter matrices in the pre-training language model using the training dataset based on the target task.
In a third aspect of the embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present application, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
Compared with the prior art, the embodiments of the present application have the following beneficial effects: a training data set and a pre-training language model corresponding to the target task are acquired; the noise disturbance corresponding to each parameter matrix in the pre-training language model is calculated, and each parameter matrix is updated according to its noise disturbance; and, based on the target task, the bias items and the updated parameter matrices in the pre-training language model are optimized by using the training data set. By these technical means, the prior-art problem that a model obtained by further training an already pre-trained large-scale model often overfits and has low generalization capability can be solved, so that overfitting of the model is avoided and the generalization capability of the model is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 is a flow diagram of a method for training a pre-training language model using noise perturbations provided in an embodiment of the present application;
FIG. 2 is a flow chart of another method for training a pre-training language model using noise disturbance provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a pre-training language model according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an apparatus for training a pre-training language model using noise disturbance according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
FIG. 1 is a flow chart of a method for training a pre-training language model using noise disturbance, as provided in an embodiment of the present application. The method of FIG. 1 may be performed by a computer or a server, or by software running on a computer or server. As shown in FIG. 1, the method for training a pre-training language model by using noise disturbance includes:
s101, acquiring a training data set and a pre-training language model corresponding to a target task;
s102, calculating noise disturbance corresponding to each parameter matrix in the pre-training language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix;
s103, optimizing bias items and updated parameter matrixes in the pre-training language model by utilizing the training data set based on the target task.
The pre-training language model has a large number of bias terms and parameter matrices. Wherever the optimization of the parameter matrices in the pre-training language model is mentioned below, it refers to the optimization of the updated parameter matrices in the pre-training language model.
The bias term, also called a bias unit or interference term, has the same meaning as b in the linear equation y = Wx + b, where b represents the intercept of the function on the y-axis and controls the distance of the function from the origin. A neural network model (the pre-training language model is a pre-training model, that is, a neural network model that has been pre-trained) may also be represented by y = Wx + b; unlike the linear equation, W and b in the neural network model are matrices, and the trainable parameters of the neural network model may be written as (W, b), where W represents the parameter matrix and b represents the bias term. The parameters of a neural network model are divided into fixed parameters and trainable parameters, the trainable parameters including the parameter matrices and the bias terms. Training a neural network model is the process of optimizing its trainable parameters.
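As a minimal illustration (assuming a PyTorch environment, which the application does not mandate), the snippet below shows how a single linear layer y = Wx + b exposes exactly these two kinds of trainable parameters, a parameter matrix and a bias term:

```python
import torch.nn as nn

# One linear layer y = Wx + b: "weight" is the parameter matrix W, "bias" is the bias term b.
layer = nn.Linear(in_features=4, out_features=2)
for name, param in layer.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)
# weight (2, 4) True
# bias (2,) True
```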
The method and the device can be used in any scene in the language field, such as text translation, word order prediction, next sentence prediction, question and answer tasks, named entity recognition tasks, text classification and the like. For example, in a text translation scenario, the target task is a text translation task; the training data set is a labeling corpus of text translation; the pre-training language model is a model obtained by pre-training the language model based on a text translation task; optimizing a parameter matrix and bias items in the pre-training language model by utilizing a training data set based on a text translation task; the final trained model is used for text translation. Other scenes are similar to the text translation scene.
According to the technical scheme provided by the embodiments of the present application, a training data set and a pre-training language model corresponding to the target task are obtained; the noise disturbance corresponding to each parameter matrix in the pre-training language model is calculated, and each parameter matrix is updated according to its noise disturbance; and, based on the target task, the bias items and the updated parameter matrices in the pre-training language model are optimized by using the training data set. By adding noise disturbance to the parameter matrices, the influence of pre-training on the overfitting and generalization behavior of the language model is weakened. This solves the prior-art problem that a model obtained by further training an already pre-trained large-scale model often overfits and has low generalization capability, so that overfitting of the model is avoided and the generalization capability of the model is improved.
Further, the noise disturbance corresponding to each parameter matrix in the pre-training language model is calculated by the following formula:

ε_i = u_i · σ_i

where ε_i is the noise disturbance corresponding to the i-th parameter matrix, u_i is uniformly distributed noise ranging from -λ/2 to λ/2, λ is a hyperparameter controlling the noise intensity in the pre-training language model, and σ_i is the standard deviation of the data inside the i-th parameter matrix.
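A minimal sketch of this computation, assuming a PyTorch tensor weight holding the i-th parameter matrix and a variable lam standing for the hyperparameter λ (both names are illustrative, not taken from the application):

```python
import torch

def noise_for_matrix(weight: torch.Tensor, lam: float) -> torch.Tensor:
    """Uniform noise in [-lam/2, lam/2], scaled by the standard deviation of the matrix entries."""
    uniform = torch.empty_like(weight).uniform_(-lam / 2, lam / 2)
    return uniform * weight.std()
```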
Further, each parameter matrix is updated by the following formula:

W_i' = W_i + ε_i

where ε_i is the noise disturbance corresponding to the i-th parameter matrix, W_i is the i-th parameter matrix before updating, and W_i' is the i-th parameter matrix after updating. If the dimensions of ε_i and W_i are inconsistent, padding may be performed so that the dimensions of ε_i and W_i are consistent.
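A sketch of applying this update to every parameter matrix of a model is given below. Treating every parameter with two or more dimensions as a parameter matrix, and the default value lam = 0.15, are assumptions of the sketch rather than requirements of the application:

```python
import torch

@torch.no_grad()
def add_noise_to_parameter_matrices(model: torch.nn.Module, lam: float = 0.15) -> None:
    """Apply W_i' = W_i + eps_i to every parameter matrix; bias terms are left untouched."""
    for param in model.parameters():
        if param.ndim >= 2:  # parameter matrices; 1-D parameters are treated as bias terms
            noise = torch.empty_like(param).uniform_(-lam / 2, lam / 2) * param.std()
            # The noise is generated with the same shape as the matrix, so no padding is
            # needed here; noise of a smaller shape could instead be zero-padded to match.
            param.add_(noise)
```

The perturbed model is then optimized on the target task as described in the following paragraphs.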
Further, optimizing bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task, comprising: dividing the training data set into a first training data set and a second training data set according to a first preset proportion, and carrying out multi-stage training on the pre-training language model: freezing a parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by using a first training data set based on a target task to complete first-stage training of the pre-training language model; after the first stage training is completed, thawing the parameter matrix in the pre-training language model, and optimizing the bias items and the parameter matrix in the pre-training language model by using the second training data set based on the target task to complete the second stage training of the pre-training language model.
The first preset ratio is related to the proportion of the trainable parameters in the pre-training language model accounted for by the parameter matrices and by the bias terms, respectively. For example, the first preset ratio may be 1:9, that is, the ratio of the data amount of the first training data set to that of the second training data set is 1:9.
In this embodiment, the first stage training: freezing the parameter matrix, and training only the bias items by using the first training data set; after the first stage training is completed, thawing the parameter matrix; training in the second stage: and training the bias items and the parameter matrix by using a second training data set, wherein the second stage training is the training of the whole pre-training language model.
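A minimal sketch of this two-stage schedule, assuming PyTorch; train_one_epoch is a hypothetical helper that runs one optimization pass over a data loader, and treating 1-D parameters as bias terms is likewise an assumption of the sketch:

```python
import torch

def set_trainable(model: torch.nn.Module, train_bias: bool, train_matrices: bool) -> None:
    # Freeze or thaw parameters by toggling requires_grad; 1-D parameters are taken as bias terms.
    for param in model.parameters():
        param.requires_grad = train_bias if param.ndim == 1 else train_matrices

def two_stage_training(model, first_loader, second_loader, train_one_epoch):
    # First stage: freeze the parameter matrices and optimize only the bias terms.
    set_trainable(model, train_bias=True, train_matrices=False)
    train_one_epoch(model, first_loader)
    # Second stage: thaw the parameter matrices and optimize bias terms and matrices jointly.
    set_trainable(model, train_bias=True, train_matrices=True)
    train_one_epoch(model, second_loader)
```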
Further, optimizing bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task, comprising: dividing the training data set into a first training data set and a second training data set according to a third preset proportion, and carrying out multi-stage training on the pre-training language model: freezing bias items in the pre-training language model, and optimizing a parameter matrix in the pre-training language model by using a first training data set based on a target task so as to complete first-stage training of the pre-training language model; after the first stage training is completed, the bias items in the pre-training language model are unfrozen, and based on the target task, the bias items and the parameter matrix in the pre-training language model are optimized by using the second training data set so as to complete the second stage training of the pre-training language model.
In this embodiment, the first stage training: freezing the bias term, and training only the parameter matrix by using the first training data set; after the first stage training is completed, thawing the bias items; training in the second stage: and training the bias items and the parameter matrix by using a second training data set, wherein the second stage training is the training of the whole pre-training language model.
Further, optimizing bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task, comprising: determining a data volume of the training data set; training the pre-training language model according to the data volume: when the data volume is smaller than a first preset size, freezing the updated parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by utilizing a training data set based on a target task; and when the data volume is not smaller than the first preset size, optimizing the bias items and the updated parameter matrix in the pre-training language model by utilizing the training data set based on the target task.
The parameter matrices account for more than ninety-nine percent of the trainable parameters in the pre-training language model, and the bias terms account for less than one percent. In this method, when the data volume is smaller than the first preset size (a small-sample scenario, that is, one with few training samples), only the bias items in the pre-training language model are optimized by using the training data set; this greatly reduces the number of optimized parameters and the training time, and avoids model overfitting when the number of training samples is small. Practice shows that optimizing only the bias items in the pre-training language model can already achieve a good effect.
Further, optimizing bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task, comprising: determining a data volume of the training data set; training the pre-training language model according to the data volume: when the data volume is smaller than a first preset size, freezing the updated parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by utilizing a training data set based on a target task; when the data volume is larger than or equal to the first preset size but smaller than the second preset size, freezing the bias items in the pre-training language model, and optimizing an updated parameter matrix in the pre-training language model by utilizing the training data set based on the target task; and when the data volume is larger than or equal to a second preset size, optimizing the bias items and the updated parameter matrix in the pre-training language model by using the training data set based on the target task.
According to the embodiment of the application, the corresponding training method is selected according to the data volume of the training data set, so that training efficiency is improved.
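A sketch of this data-volume-based selection follows; the concrete thresholds stand in for the 'first preset size' and 'second preset size', which the application leaves unspecified, and treating 1-D parameters as bias terms is again an assumption:

```python
import torch

def configure_training(model: torch.nn.Module, num_samples: int,
                       first_size: int = 1_000, second_size: int = 10_000) -> None:
    """Freeze or thaw trainable parameters according to the size of the training data set."""
    for param in model.parameters():
        is_bias = param.ndim == 1
        if num_samples < first_size:
            param.requires_grad = is_bias        # small data: optimize bias terms only
        elif num_samples < second_size:
            param.requires_grad = not is_bias    # medium data: optimize parameter matrices only
        else:
            param.requires_grad = True           # large data: optimize everything
```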
Further, before obtaining the pre-training language model corresponding to the target task, the method further includes: sequentially connecting a plurality of linear layers and nonlinear activation functions to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full connection layer and a classification layer to obtain a language model; and pre-training the language model based on the target task to obtain a pre-trained language model.
A plurality of linear layers are connected in series and followed by a nonlinear activation function to form the feedforward layer. The residual layer behind the multi-head attention network adds the output of the multi-head attention network to its input; the residual layer behind the feedforward layer adds the output of the feedforward layer to its input.
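A structural sketch of such a language model in PyTorch is shown below; the layer sizes, the GELU activation, and the mean pooling before the fully connected layer are illustrative choices, not details taken from the application:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Several linear layers in series followed by a nonlinear activation function."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Linear(hidden, dim), nn.GELU())

    def forward(self, x):
        return self.net(x)

class LanguageModel(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 256, heads: int = 4, num_classes: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)        # embedded layer
        self.attention = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)                        # normalization layer
        self.feed_forward = FeedForward(dim, 4 * dim)         # feedforward layer
        self.norm2 = nn.LayerNorm(dim)                        # normalization layer
        self.fully_connected = nn.Linear(dim, dim)            # full connection layer
        self.classifier = nn.Linear(dim, num_classes)         # classification layer

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(token_ids)
        attn_out, _ = self.attention(x, x, x)                 # multi-head attention network
        x = self.norm1(x + attn_out)                          # residual layer + normalization
        x = self.norm2(x + self.feed_forward(x))              # residual layer + normalization
        x = self.fully_connected(x.mean(dim=1))               # full connection on pooled features
        return self.classifier(x)
```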
FIG. 2 is a flow chart of another method for training a pre-training language model using noise disturbance provided in an embodiment of the present application. As shown in FIG. 2, the method includes:
s201, dividing the training data set into a first training data set, a second training data set and a third training data set according to a second preset proportion, and performing multi-stage training on the pre-training language model:
s202, freezing a parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by using a first training data set based on a target task to complete first-stage training of the pre-training language model;
s203, after the first-stage training is completed, thawing the parameter matrix in the pre-training language model, freezing the bias items in the pre-training language model, and optimizing the parameter matrix in the pre-training language model by using the second training data set based on the target task so as to complete the second-stage training of the pre-training language model;
S204, after the second stage training is completed, the bias items in the pre-training language model are unfrozen, and the bias items and the parameter matrix in the pre-training language model are optimized by using the third training data set based on the target task, so that the third stage training of the pre-training language model is completed.
The second preset ratio is related to the proportion of the trainable parameters in the pre-training language model accounted for by the parameter matrices and by the bias terms, respectively. For example, the second preset ratio may be 1:6:3, that is, the ratio of the data amounts of the first training data set, the second training data set, and the third training data set is 1:6:3.
First stage training: freezing the parameter matrix, and training only the bias items by using the first training data set; after the first stage training is completed, thawing the parameter matrix; training in the second stage: freezing the bias term, and training the parameter matrix by using the second training data set; after the second stage training is completed, thawing the bias items; training in a third stage: and training the parameter matrix and the bias term by using a third training data set, wherein the third stage training is the training of the whole pre-training language model.
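A sketch of this three-stage schedule follows; as before, train_one_epoch is a hypothetical helper running one optimization pass over a data loader, and treating 1-D parameters as bias terms is an assumption of the sketch:

```python
import torch

def three_stage_training(model: torch.nn.Module, first_loader, second_loader,
                         third_loader, train_one_epoch) -> None:
    def set_trainable(train_bias: bool, train_matrices: bool) -> None:
        for param in model.parameters():
            param.requires_grad = train_bias if param.ndim == 1 else train_matrices

    # First stage: freeze the parameter matrices and train only the bias terms.
    set_trainable(train_bias=True, train_matrices=False)
    train_one_epoch(model, first_loader)
    # Second stage: thaw the matrices, freeze the bias terms, and train only the matrices.
    set_trainable(train_bias=False, train_matrices=True)
    train_one_epoch(model, second_loader)
    # Third stage: thaw everything and train the whole pre-training language model.
    set_trainable(train_bias=True, train_matrices=True)
    train_one_epoch(model, third_loader)
```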
According to the method and the device, the accuracy of the final model can be greatly improved through multi-stage training of the pre-training language model.
In an alternative embodiment, a plurality of linear layers and nonlinear activation functions are sequentially connected to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full-connection layer and a classification layer to obtain language networks, and serially connecting a plurality of language networks to obtain a language model; acquiring a training data set corresponding to a target task; calculating noise disturbance corresponding to each parameter matrix in the language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix; based on the target task, bias items and updated parameter matrices in the language model are optimized using the training dataset.
In this embodiment, the language model is not pre-trained, but is directly formally trained. By adopting the technical means, the problem that the model obtained through training often has over-fitting and low generalization capability in the prior art can be solved, so that the over-fitting of the model is avoided and the generalization capability of the model is improved.
Fig. 3 is a schematic structural diagram of a language model according to an embodiment of the present application. As shown in fig. 3, the language model sequentially includes, from an input end to an output end: an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full connection layer and a classification layer.
The residual layer after the feedforward layer is used for adding the output of the feedforward layer and the input of the feedforward layer, and outputting the added result; the residual layer behind the multi-head attention network is used for adding the output of the multi-head attention network and the input of the multi-head attention network, and outputting the added result.
FIG. 3 is also a schematic structural diagram of the pre-training language model, since the pre-training language model is simply the language model after being pre-trained.
The language model may also be a BERT model, an XLNet model, a RoBERTa model, an ELECTRA model, or the like. In model training, the optimizer used may be an Adam, AdamW, AdaGrad, or RMSProp optimizer.
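For instance, with the AdamW option and a PyTorch model, the optimizer could be set up over the currently thawed parameters as follows; the learning rate and weight decay are illustrative values, and the placeholder model stands for whichever language model is being trained:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # placeholder for the (pre-)training language model
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),  # optimize only thawed parameters
    lr=2e-5,
    weight_decay=0.01,
)
```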
In an alternative embodiment, a training data set and a pre-training language model corresponding to the target task are obtained; calculating noise disturbance corresponding to network parameters of each network layer in the pre-training language model, and updating the network parameters according to the noise disturbance corresponding to each network parameter; and optimizing updated network parameters in the pre-training language model by utilizing the training data set based on the target task so as to complete training of the pre-training language model.
Further, the noise disturbance corresponding to each network parameter in the pre-training language model is calculated by the following formula:

ε_i = u_i · σ_i

where ε_i is the noise disturbance corresponding to the i-th network parameter, u_i is uniformly distributed noise ranging from -λ/2 to λ/2, λ is a hyperparameter controlling the noise intensity in the pre-training language model, and σ_i is the standard deviation of the data inside the i-th network parameter.
Further, each network parameter is updated by the following formula:

θ_i' = θ_i + ε_i

where ε_i is the noise disturbance corresponding to the i-th network parameter, θ_i is the i-th network parameter before updating, and θ_i' is the i-th network parameter after updating. If the dimensions of ε_i and θ_i are inconsistent, padding may be performed so that the dimensions of ε_i and θ_i are consistent.
In an alternative embodiment, a plurality of linear layers and nonlinear activation functions are sequentially connected to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full-connection layer and a classification layer to obtain language networks, and serially connecting a plurality of language networks to obtain a language model; acquiring a training data set corresponding to a target task; calculating noise disturbance corresponding to each network parameter in the language model, and updating the network parameter according to the noise disturbance corresponding to each network parameter; based on the target task, the updated network parameters in the language model are optimized using the training dataset.
In an alternative embodiment, a training data set and a pre-training language model corresponding to the target task are obtained; determining a first network parameter and a second network parameter which correspond to the bias item and the parameter matrix in the pre-training language model respectively; calculating noise disturbance corresponding to each second network parameter in the pre-training language model, and updating the second network parameters according to the noise disturbance corresponding to each parameter matrix; based on the target task, the first network parameters and the updated second network parameters in the pre-training language model are optimized using the training data set.
In an alternative embodiment, optimizing the bias terms and the updated parameter matrices in the language model using the training data set based on the target task includes: acquiring a trained target language model; inputting a plurality of training samples in the training data set into the language model and the target language model, and outputting, for each training sample, a first processing result and a second processing result respectively; calculating a contrast loss by using a triplet loss function according to the first processing result and the second processing result corresponding to each training sample and the second processing result corresponding to another training sample whose semantics differ from those of the training sample; calculating a classification loss by using a cross entropy loss function according to the first processing result and the label corresponding to each training sample; and updating the network parameters of the language model according to the contrast loss and the classification loss so as to complete the training of the language model.
Denote the triplet loss function by triplet(). Suppose the first processing result and the second processing result corresponding to a certain training sample are A1 and A2, respectively, and the second processing result corresponding to another training sample whose semantics differ from those of this training sample is A3 (that other training sample is randomly selected from the training data set); then the loss value corresponding to this training sample equals triplet(A1, A2, A3), and the loss values corresponding to all training samples are added to obtain the contrast loss. The contrast loss and the classification loss are weighted and summed according to a preset weight to obtain the total loss, and the model parameters of the language model are updated according to the total loss. By introducing the contrast loss into model training, the embodiment of the application can solve the prior-art problem of model overfitting and improve the generalization performance of the model.
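A sketch of the combined loss, assuming the processing results A1, A2, and A3 are logit vectors, using the built-in triplet margin loss of PyTorch, and with alpha standing in for the unspecified preset weight:

```python
import torch
import torch.nn.functional as F

triplet = torch.nn.TripletMarginLoss(margin=1.0)

def combined_loss(a1: torch.Tensor, a2: torch.Tensor, a3: torch.Tensor,
                  labels: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """a1: first processing result (language model being trained);
    a2: second processing result (trained target model, same sample);
    a3: second processing result for a randomly chosen sample with different semantics."""
    contrast = triplet(a1, a2, a3)                 # pulls a1 toward a2 and away from a3
    classification = F.cross_entropy(a1, labels)   # classification loss against the labels
    return alpha * contrast + (1 - alpha) * classification
```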
Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, which is not described herein in detail.
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
FIG. 4 is a schematic diagram of an apparatus for training a pre-trained language model using noise perturbations, as provided in an embodiment of the present application. As shown in fig. 4, the apparatus for training a pre-trained language model using noise disturbance includes:
an acquisition module 401 configured to acquire a training data set and a pre-training language model corresponding to a target task;
a calculation module 402 configured to calculate a noise disturbance corresponding to each parameter matrix in the pre-training language model, and update the parameter matrix according to the noise disturbance corresponding to each parameter matrix;
the training module 403 is configured to optimize bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task.
The pre-training language model has a large number of bias terms and parameter matrices. Wherever the optimization of the parameter matrices in the pre-training language model is mentioned below, it refers to the optimization of the updated parameter matrices in the pre-training language model.
The bias term, also called a bias unit or interference term, has the same meaning as b in the linear equation y = Wx + b, where b represents the intercept of the function on the y-axis and controls the distance of the function from the origin. A neural network model (the pre-training language model is a pre-training model, that is, a neural network model that has been pre-trained) may also be represented by y = Wx + b; unlike the linear equation, W and b in the neural network model are matrices, and the trainable parameters of the neural network model may be written as (W, b), where W represents the parameter matrix and b represents the bias term. The parameters of a neural network model are divided into fixed parameters and trainable parameters, the trainable parameters including the parameter matrices and the bias terms. Training a neural network model is the process of optimizing its trainable parameters.
The method and the device can be used in any scene in the language field, such as text translation, word order prediction, next sentence prediction, question and answer tasks, named entity recognition tasks, text classification and the like. For example, in a text translation scenario, the target task is a text translation task; the training data set is a labeling corpus of text translation; the pre-training language model is a model obtained by pre-training the language model based on a text translation task; optimizing a parameter matrix and bias items in the pre-training language model by utilizing a training data set based on a text translation task; the final trained model is used for text translation. Other scenes are similar to the text translation scene.
According to the technical scheme provided by the embodiments of the present application, a training data set and a pre-training language model corresponding to the target task are obtained; the noise disturbance corresponding to each parameter matrix in the pre-training language model is calculated, and each parameter matrix is updated according to its noise disturbance; and, based on the target task, the bias items and the updated parameter matrices in the pre-training language model are optimized by using the training data set. By adding noise disturbance to the parameter matrices, the influence of pre-training on the overfitting and generalization behavior of the language model is weakened. This solves the prior-art problem that a model obtained by further training an already pre-trained large-scale model often overfits and has low generalization capability, so that overfitting of the model is avoided and the generalization capability of the model is improved.
Optionally, the calculation module 402 is further configured to calculate the noise disturbance corresponding to each parameter matrix in the pre-training language model by the following formula:

ε_i = u_i · σ_i

where ε_i is the noise disturbance corresponding to the i-th parameter matrix, u_i is uniformly distributed noise ranging from -λ/2 to λ/2, λ is a hyperparameter controlling the noise intensity in the pre-training language model, and σ_i is the standard deviation of the data inside the i-th parameter matrix.
Optionally, the calculation module 402 is further configured to update each parameter matrix by the following formula:

W_i' = W_i + ε_i

where ε_i is the noise disturbance corresponding to the i-th parameter matrix, W_i is the i-th parameter matrix before updating, and W_i' is the i-th parameter matrix after updating. If the dimensions of ε_i and W_i are inconsistent, padding may be performed so that the dimensions of ε_i and W_i are consistent.
Optionally, the training module 403 is further configured to divide the training data set into a first training data set and a second training data set according to a first preset ratio, and perform multi-stage training on the pre-training language model: freezing a parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by using a first training data set based on a target task to complete first-stage training of the pre-training language model; after the first stage training is completed, thawing the parameter matrix in the pre-training language model, and optimizing the bias items and the parameter matrix in the pre-training language model by using the second training data set based on the target task to complete the second stage training of the pre-training language model.
The first preset ratio is related to the proportion of the trainable parameters in the pre-training language model accounted for by the parameter matrices and by the bias terms, respectively. For example, the first preset ratio may be 1:9, that is, the ratio of the data amount of the first training data set to that of the second training data set is 1:9.
In this embodiment, the first stage training: freezing the parameter matrix, and training only the bias items by using the first training data set; after the first stage training is completed, thawing the parameter matrix; training in the second stage: and training the bias items and the parameter matrix by using a second training data set, wherein the second stage training is the training of the whole pre-training language model.
Optionally, the training module 403 is further configured to divide the training data set into the first training data set and the second training data set according to a third preset ratio, and perform multi-stage training on the pre-training language model: freezing bias items in the pre-training language model, and optimizing a parameter matrix in the pre-training language model by using a first training data set based on a target task so as to complete first-stage training of the pre-training language model; after the first stage training is completed, the bias items in the pre-training language model are unfrozen, and based on the target task, the bias items and the parameter matrix in the pre-training language model are optimized by using the second training data set so as to complete the second stage training of the pre-training language model.
In this embodiment, the first stage training: freezing the bias term, and training only the parameter matrix by using the first training data set; after the first stage training is completed, thawing the bias items; training in the second stage: and training the bias items and the parameter matrix by using a second training data set, wherein the second stage training is the training of the whole pre-training language model.
Optionally, the training module 403 is further configured to determine the data amount of the training data set; training the pre-training language model according to the data volume: when the data volume is smaller than a first preset size, freezing the updated parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by utilizing a training data set based on a target task; and when the data volume is not smaller than the first preset size, optimizing the bias items and the updated parameter matrix in the pre-training language model by utilizing the training data set based on the target task.
The parameter matrices account for more than ninety-nine percent of the trainable parameters in the pre-training language model, and the bias terms account for less than one percent. In this embodiment, when the data volume is smaller than the first preset size (a small-sample scenario, that is, one with few training samples), only the bias items in the pre-training language model are optimized by using the training data set; this greatly reduces the number of optimized parameters and the training time, and avoids model overfitting when the number of training samples is small. Practice shows that optimizing only the bias items in the pre-training language model can already achieve a good effect.
Optionally, the training module 403 is further configured to determine the data amount of the training data set; training the pre-training language model according to the data volume: when the data volume is smaller than a first preset size, freezing the updated parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by utilizing a training data set based on a target task; when the data volume is larger than or equal to the first preset size but smaller than the second preset size, freezing the bias items in the pre-training language model, and optimizing an updated parameter matrix in the pre-training language model by utilizing the training data set based on the target task; and when the data volume is larger than or equal to a second preset size, optimizing the bias items and the updated parameter matrix in the pre-training language model by using the training data set based on the target task.
According to the embodiment of the application, the corresponding training method is selected according to the data volume of the training data set, so that training efficiency is improved.
Optionally, the obtaining module 401 is further configured to sequentially connect the plurality of linear layers and the nonlinear activation function to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full connection layer and a classification layer to obtain a language model; and pre-training the language model based on the target task to obtain a pre-trained language model.
A plurality of linear layers are connected in series and followed by a nonlinear activation function to form the feedforward layer. The residual layer behind the multi-head attention network adds the output of the multi-head attention network to its input; the residual layer behind the feedforward layer adds the output of the feedforward layer to its input.
Optionally, the training module 403 is further configured to divide the training data set into the first training data set, the second training data set and the third training data set according to a second preset ratio, and perform multi-stage training on the pre-training language model: freezing a parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by using a first training data set based on a target task to complete first-stage training of the pre-training language model; after the first-stage training is completed, thawing the parameter matrix in the pre-training language model, freezing the bias items in the pre-training language model, and optimizing the parameter matrix in the pre-training language model by utilizing a second training data set based on the target task so as to complete the second-stage training of the pre-training language model; after the second stage training is completed, the bias items in the pre-training language model are unfrozen, and based on the target task, the bias items and the parameter matrix in the pre-training language model are optimized by utilizing the third training data set so as to complete the third stage training of the pre-training language model.
The second preset ratio is related to the proportion of the trainable parameters in the pre-training language model accounted for by the parameter matrices and by the bias terms, respectively. For example, the second preset ratio may be 1:6:3, that is, the ratio of the data amounts of the first training data set, the second training data set, and the third training data set is 1:6:3.
First stage training: freezing the parameter matrix, and training only the bias items by using the first training data set; after the first stage training is completed, thawing the parameter matrix; training in the second stage: freezing the bias term, and training the parameter matrix by using the second training data set; after the second stage training is completed, thawing the bias items; training in a third stage: and training the parameter matrix and the bias term by using a third training data set, wherein the third stage training is the training of the whole pre-training language model.
According to the method and the device, the accuracy of the final model can be greatly improved through multi-stage training of the pre-training language model.
Optionally, the training module 403 is further configured to sequentially connect the plurality of linear layers and the nonlinear activation function to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full-connection layer and a classification layer to obtain language networks, and serially connecting a plurality of language networks to obtain a language model; acquiring a training data set corresponding to a target task; calculating noise disturbance corresponding to each parameter matrix in the language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix; based on the target task, bias items and updated parameter matrices in the language model are optimized using the training dataset.
In this embodiment, the language model is not pre-trained, but is directly formally trained. By adopting the technical means, the problem that the model obtained through training often has over-fitting and low generalization capability in the prior art can be solved, so that the over-fitting of the model is avoided and the generalization capability of the model is improved.
Optionally, the training module 403 is further configured to obtain a training data set and a pre-training language model corresponding to the target task; calculating noise disturbance corresponding to network parameters of each network layer in the pre-training language model, and updating the network parameters according to the noise disturbance corresponding to each network parameter; and optimizing updated network parameters in the pre-training language model by utilizing the training data set based on the target task so as to complete training of the pre-training language model.
Optionally, the calculation module 402 is further configured to calculate the noise disturbance corresponding to each network parameter in the pre-training language model by the following formula:

ε_i = u_i · σ_i

where ε_i is the noise disturbance corresponding to the i-th network parameter, u_i is uniformly distributed noise ranging from -λ/2 to λ/2, λ is a hyperparameter controlling the noise intensity in the pre-training language model, and σ_i is the standard deviation of the data inside the i-th network parameter.
Optionally, the calculation module 402 is further configured to update each network parameter by the following formula:

θ_i' = θ_i + ε_i

where ε_i is the noise disturbance corresponding to the i-th network parameter, θ_i is the i-th network parameter before updating, and θ_i' is the i-th network parameter after updating. If the dimensions of ε_i and θ_i are inconsistent, padding may be performed so that the dimensions of ε_i and θ_i are consistent.
Optionally, the training module 403 is further configured to sequentially connect the plurality of linear layers and the nonlinear activation function to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full-connection layer and a classification layer to obtain language networks, and serially connecting a plurality of language networks to obtain a language model; acquiring a training data set corresponding to a target task; calculating noise disturbance corresponding to each network parameter in the language model, and updating the network parameter according to the noise disturbance corresponding to each network parameter; based on the target task, the updated network parameters in the language model are optimized using the training dataset.
Optionally, the training module 403 is further configured to obtain a training data set and a pre-training language model corresponding to the target task; determining a first network parameter and a second network parameter which correspond to the bias item and the parameter matrix in the pre-training language model respectively; calculating noise disturbance corresponding to each second network parameter in the pre-training language model, and updating the second network parameters according to the noise disturbance corresponding to each parameter matrix; based on the target task, the first network parameters and the updated second network parameters in the pre-training language model are optimized using the training data set.
Optionally, the training module 403 is further configured to: acquire a trained target language model; input a plurality of training samples in the training data set into the language model and the target language model, and output, for each training sample, a first processing result and a second processing result respectively; calculate a contrast loss by using a triplet loss function according to the first processing result and the second processing result corresponding to each training sample and the second processing result corresponding to another training sample whose semantics differ from those of the training sample; calculate a classification loss by using a cross entropy loss function according to the first processing result and the label corresponding to each training sample; and update the network parameters of the language model according to the contrast loss and the classification loss so as to complete the training of the language model.
Denote the triplet loss function by triplet(). Suppose the first processing result and the second processing result corresponding to a certain training sample are A1 and A2, respectively, and the second processing result corresponding to another training sample whose semantics differ from those of this training sample is A3 (that other training sample is randomly selected from the training data set); then the loss value corresponding to this training sample equals triplet(A1, A2, A3), and the loss values corresponding to all training samples are added to obtain the contrast loss. The contrast loss and the classification loss are weighted and summed according to a preset weight to obtain the total loss, and the model parameters of the language model are updated according to the total loss. By introducing the contrast loss into model training, the embodiment of the application can solve the prior-art problem of model overfitting and improve the generalization performance of the model.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Fig. 5 is a schematic diagram of an electronic device 5 provided in an embodiment of the present application. As shown in fig. 5, the electronic apparatus 5 of this embodiment includes: a processor 501, a memory 502 and a computer program 503 stored in the memory 502 and executable on the processor 501. The steps of the various method embodiments described above are implemented by processor 501 when executing computer program 503. Alternatively, the processor 501, when executing the computer program 503, performs the functions of the modules/units in the above-described apparatus embodiments.
The electronic device 5 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 5 may include, but is not limited to, a processor 501 and a memory 502. It will be appreciated by those skilled in the art that fig. 5 is merely an example of the electronic device 5 and is not limiting of the electronic device 5, which may include more or fewer components than shown, or different components.
The processor 501 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
The memory 502 may be an internal storage unit of the electronic device 5, for example, a hard disk or a memory of the electronic device 5. The memory 502 may also be an external storage device of the electronic device 5, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, or the like, provided on the electronic device 5. The memory 502 may also include both an internal storage unit and an external storage device of the electronic device 5. The memory 502 is used to store the computer program and other programs and data required by the electronic device 5.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is illustrated by way of example; in practical applications, the above functions may be distributed among different functional units and modules as needed, i.e., the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiment may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit, and the integrated units may be implemented in the form of hardware or in the form of software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application implements all or part of the flow of the methods in the above embodiments, which may also be accomplished by instructing related hardware through a computer program; the computer program may be stored in a computer readable storage medium, and when executed by a processor, the computer program may implement the steps of the respective method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, the computer readable medium does not include electrical carrier signals and telecommunications signals.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included within the scope of the present application.

Claims (10)

1. A method for training a pre-trained language model using noise perturbations, comprising:
acquiring a training data set and a pre-training language model corresponding to a target task;
calculating noise disturbance corresponding to each parameter matrix in the pre-training language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix;
and optimizing bias terms and updated parameter matrices in the pre-training language model by utilizing the training data set based on the target task.
2. The method of claim 1, wherein the noise disturbance corresponding to each parameter matrix in the pre-trained language model is calculated by the following formula:

$$\delta_i = U\left(-\frac{\lambda}{2},\ \frac{\lambda}{2}\right) \cdot \sigma_i$$

where $\delta_i$ is the noise disturbance corresponding to the i-th parameter matrix, $U\left(-\frac{\lambda}{2}, \frac{\lambda}{2}\right)$ is uniformly distributed noise ranging from $-\lambda/2$ to $\lambda/2$, $\lambda$ is a hyperparameter controlling the noise intensity in the pre-trained language model, and $\sigma_i$ is the standard deviation of the data inside the i-th parameter matrix.
3. The method of claim 1, wherein each parameter matrix is updated by the following formula:

$$W_i' = W_i + \delta_i$$

where $\delta_i$ is the noise disturbance corresponding to the i-th parameter matrix, $W_i$ is the i-th parameter matrix before updating, and $W_i'$ is the i-th parameter matrix after updating.
4. The method of claim 1, wherein optimizing bias terms and updated parameter matrices in the pre-trained language model using the training dataset based on the target task comprises:
dividing the training data set into a first training data set and a second training data set according to a first preset proportion, and carrying out multi-stage training on the pre-training language model:
freezing the parameter matrix in the pre-training language model, and optimizing the bias terms in the pre-training language model by using the first training data set based on the target task, so as to complete the first-stage training of the pre-training language model;
after the first-stage training is completed, unfreezing the parameter matrix in the pre-training language model, and optimizing the bias terms and the parameter matrix in the pre-training language model by using the second training data set based on the target task, so as to complete the second-stage training of the pre-training language model.
5. The method of claim 1, wherein optimizing bias terms and updated parameter matrices in the pre-trained language model using the training dataset based on the target task comprises:
dividing the training data set into a first training data set, a second training data set and a third training data set according to a second preset proportion, and carrying out multi-stage training on the pre-training language model:
freezing the parameter matrix in the pre-training language model, and optimizing the bias terms in the pre-training language model by using the first training data set based on the target task, so as to complete the first-stage training of the pre-training language model;
after the first-stage training is completed, unfreezing the parameter matrix in the pre-training language model, freezing the bias terms in the pre-training language model, and optimizing the parameter matrix in the pre-training language model by utilizing the second training data set based on the target task, so as to complete the second-stage training of the pre-training language model;
after the second-stage training is completed, unfreezing the bias terms in the pre-training language model, and optimizing the bias terms and the parameter matrix in the pre-training language model by using the third training data set based on the target task, so as to complete the third-stage training of the pre-training language model.
6. The method of claim 1, wherein optimizing bias terms and updated parameter matrices in the pre-trained language model using the training dataset based on the target task comprises:
determining a data volume of the training data set;
training the pre-training language model according to the data volume:
when the data volume is smaller than a first preset size, freezing the updated parameter matrix in the pre-training language model, and optimizing the bias terms in the pre-training language model by utilizing the training data set based on the target task;
and when the data volume is not smaller than the first preset size, optimizing the bias terms and the updated parameter matrices in the pre-training language model by utilizing the training data set based on the target task.
7. The method of claim 1, wherein prior to obtaining the pre-trained language model corresponding to the target task, the method further comprises:
Sequentially connecting a plurality of linear layers and nonlinear activation functions to obtain a feedforward layer;
sequentially connecting an embedding layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a fully connected layer and a classification layer to obtain a language model;
and pre-training the language model based on the target task to obtain the pre-training language model.
8. The method according to claim 1, wherein the method further comprises:
sequentially connecting a plurality of linear layers and nonlinear activation functions to obtain a feedforward layer;
sequentially connecting an embedding layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a fully connected layer and a classification layer to obtain a language network, and serially connecting a plurality of language networks to obtain a language model;
acquiring a training data set corresponding to a target task;
calculating noise disturbance corresponding to each parameter matrix in the language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix;
and optimizing bias terms and updated parameter matrices in the language model by utilizing the training data set based on the target task.
9. The method according to claim 1, wherein the method further comprises:
Acquiring a training data set and a pre-training language model corresponding to a target task;
calculating noise disturbance corresponding to network parameters of each network layer in the pre-training language model, and updating the network parameters according to the noise disturbance corresponding to each network parameter;
and optimizing updated network parameters in the pre-training language model by utilizing the training data set based on the target task so as to complete training of the pre-training language model.
10. An apparatus for training a pre-trained language model using noise perturbations, comprising:
the acquisition module is configured to acquire a training data set and a pre-training language model corresponding to the target task;
the computing module is configured to compute noise disturbance corresponding to each parameter matrix in the pre-training language model and update the parameter matrix according to the noise disturbance corresponding to each parameter matrix;
a training module configured to optimize bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task.
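For illustration only, and not as part of the claims, the following minimal PyTorch sketch shows one way the noise disturbance of claims 2-3 and the two-stage training of claim 4 could be realized. It assumes that every parameter tensor with two or more dimensions is treated as a "parameter matrix" and every one-dimensional parameter as a "bias term"; the helper `train_one_epoch`, the value of `lam`, and the data-loader names are hypothetical placeholders, not details taken from the claims.

```python
import torch

def perturb_parameter_matrices(model, lam=0.15):
    """Claims 2-3: delta_i = U(-lam/2, lam/2) * std(W_i), then W_i <- W_i + delta_i."""
    with torch.no_grad():
        for p in model.parameters():
            if p.dim() >= 2:  # treat multi-dimensional tensors as parameter matrices
                noise = torch.empty_like(p).uniform_(-lam / 2.0, lam / 2.0)
                p.add_(noise * p.std())

def two_stage_finetune(model, first_loader, second_loader, train_one_epoch):
    """Claim 4: stage one freezes the parameter matrices and tunes only the bias
    terms on the first split; stage two unfreezes the matrices and tunes both the
    bias terms and the updated matrices on the second split."""
    for p in model.parameters():           # stage one: freeze parameter matrices
        p.requires_grad = p.dim() < 2
    train_one_epoch(model, first_loader)
    for p in model.parameters():           # stage two: unfreeze everything
        p.requires_grad = True
    train_one_epoch(model, second_loader)
```

Under the same assumptions, a typical call order would be to run perturb_parameter_matrices once after loading the pre-trained weights and then two_stage_finetune on the two splits of the target-task training data set; the data-volume branch of claim 6 would simply select between the bias-only and the full optimization regime depending on the size of that data set.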
CN202310614779.5A 2023-05-29 2023-05-29 Method and device for training pre-training language model by using noise disturbance Active CN116362351B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310614779.5A CN116362351B (en) 2023-05-29 2023-05-29 Method and device for training pre-training language model by using noise disturbance


Publications (2)

Publication Number Publication Date
CN116362351A (en) 2023-06-30
CN116362351B (en) 2023-09-26

Family

ID=86939890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310614779.5A Active CN116362351B (en) 2023-05-29 2023-05-29 Method and device for training pre-training language model by using noise disturbance

Country Status (1)

Country Link
CN (1) CN116362351B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190355366A1 (en) * 2018-05-18 2019-11-21 Emotech Ltd Speaker recognition
CN109919320A (en) * 2019-01-23 2019-06-21 西北工业大学 Triplet online learning methods based on Semantic hierarchy
US20210319176A1 (en) * 2020-04-13 2021-10-14 Capital One Services, Llc Efficient automatic punctuation with robust inference
CN112070010A (en) * 2020-09-08 2020-12-11 长沙理工大学 Pedestrian re-recognition method combining multi-loss dynamic training strategy to enhance local feature learning
CN112183468A (en) * 2020-10-27 2021-01-05 南京信息工程大学 Pedestrian re-identification method based on multi-attention combined multi-level features
CN113052324A (en) * 2021-03-24 2021-06-29 支付宝(杭州)信息技术有限公司 User abnormal pattern recognition method, device and equipment
CN113111663A (en) * 2021-04-28 2021-07-13 东南大学 Abstract generation method fusing key information
CN113468854A (en) * 2021-06-24 2021-10-01 浙江华巽科技有限公司 Multi-document automatic abstract generation method
CN114972904A (en) * 2022-04-18 2022-08-30 北京理工大学 Zero sample knowledge distillation method and system based on triple loss resistance
CN114818902A (en) * 2022-04-21 2022-07-29 浪潮云信息技术股份公司 Text classification method and system based on knowledge distillation
CN115734029A (en) * 2022-11-07 2023-03-03 中国电信股份有限公司 Terminal suitability judgment method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHUHAN WU ET AL: "NoisyTune: A Little Noise Can Help You Finetune Pretrained Language Models Better", pages 1 - 6, Retrieved from the Internet <URL:https://arxiv.org/pdf/2202.12024.pdf> *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116522152A (en) * 2023-07-05 2023-08-01 深圳须弥云图空间科技有限公司 Translation model training method and device based on back translation
CN116522152B (en) * 2023-07-05 2023-11-10 深圳须弥云图空间科技有限公司 Translation model training method and device based on back translation
CN116595130A (en) * 2023-07-18 2023-08-15 深圳须弥云图空间科技有限公司 Corpus expansion method and device under multiple tasks based on small language model
CN116595385A (en) * 2023-07-18 2023-08-15 深圳须弥云图空间科技有限公司 Composition generation model training method and device
CN116595385B (en) * 2023-07-18 2023-10-03 深圳须弥云图空间科技有限公司 Composition generation model training method and device
CN116595130B (en) * 2023-07-18 2024-02-20 深圳须弥云图空间科技有限公司 Corpus expansion method and device under multiple tasks based on small language model
CN116603249A (en) * 2023-07-19 2023-08-18 深圳须弥云图空间科技有限公司 Training method of large language model applied to role playing reasoning game
CN116603249B (en) * 2023-07-19 2023-10-03 深圳须弥云图空间科技有限公司 Training method of large language model applied to role playing reasoning game

Also Published As

Publication number Publication date
CN116362351B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
CN116362351B (en) Method and device for training pre-training language model by using noise disturbance
US20230368024A1 (en) Neural architecture search
US20210232929A1 (en) Neural architecture search
EP3564863B1 (en) Apparatus for executing lstm neural network operation, and operational method
WO2021089012A1 (en) Node classification method and apparatus for graph network model, and terminal device
US20200410365A1 (en) Unsupervised neural network training using learned optimizers
EP3362951B1 (en) Neural random access machine
US20220004849A1 (en) Image processing neural networks with dynamic filter activation
CN112116104B (en) Method, device, medium and electronic equipment for automatically integrating machine learning
CN116403250A (en) Face recognition method and device with shielding
CN116912635B (en) Target tracking method and device
CN116595130B (en) Corpus expansion method and device under multiple tasks based on small language model
CN113850298A (en) Image identification method and device and related equipment
CN116542328B (en) Knowledge distillation method and device for CTR prediction model
CN116629342A (en) Model bypass optimization method and device
CN116610788A (en) Method and device for training pre-training language model based on data volume of training data
CN116341640B (en) Text processing model training method and device
TWI763975B (en) System and method for reducing computational complexity of artificial neural network
CN116502640B (en) Text characterization model training method and device based on context
CN117474037B (en) Knowledge distillation method and device based on space distance alignment
CN116151232B (en) Method and device for generating model by multi-stage training text title
CN116306791A (en) Text processing method and device for improving self-attention model
CN118504658A (en) Pre-training federal learning fine tuning method, system, electronic equipment and storage medium
CN116628204A (en) Method and device for training text classification model based on training data volume
CN118333125A (en) Fine tuning training method and device for image generation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant