CN116362351A - Method and device for training pre-training language model by using noise disturbance - Google Patents

Method and device for training pre-training language model by using noise disturbance

Info

Publication number
CN116362351A
Authority
CN
China
Prior art keywords
training
language model
data set
parameter matrix
target task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310614779.5A
Other languages
Chinese (zh)
Other versions
CN116362351B (en)
Inventor
吴亚军
暴宇健
王芳
徐琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xumi Yuntu Space Technology Co Ltd
Original Assignee
Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xumi Yuntu Space Technology Co Ltd filed Critical Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority to CN202310614779.5A priority Critical patent/CN116362351B/en
Publication of CN116362351A publication Critical patent/CN116362351A/en
Application granted granted Critical
Publication of CN116362351B publication Critical patent/CN116362351B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to the technical field of machine learning, and provides a method and a device for training a pre-training language model by using noise disturbance. The method comprises the following steps: acquiring a training data set and a pre-training language model corresponding to a target task; calculating the noise disturbance corresponding to each parameter matrix in the pre-training language model, and updating each parameter matrix according to its noise disturbance; and optimizing, based on the target task, the bias items and the updated parameter matrices in the pre-training language model by using the training data set. These technical means solve the prior-art problem that a model obtained by further training an already pre-trained large-scale model often overfits and has low generalization capability.

Description

Method and device for training pre-training language model by using noise disturbance
Technical Field
The present disclosure relates to the field of machine learning technologies, and in particular, to a method and apparatus for training a pre-training language model using noise disturbance.
Background
In recent years, with the development of machine learning technology, more and more large-scale models have been applied to the language field. To ensure that a large-scale model meets the requirements and to improve training efficiency, it is currently common to further train a model that has already been pre-trained. However, the model obtained by further training a pre-trained large-scale model often suffers from overfitting and low generalization capability.
Disclosure of Invention
In view of this, the embodiments of the present application provide a method, an apparatus, an electronic device, and a computer readable storage medium for training a pre-training language model by using noise disturbance, so as to solve the prior-art problem that a model obtained by further training an already pre-trained large-scale model often overfits and has low generalization capability.
In a first aspect of an embodiment of the present application, there is provided a method for training a pre-training language model using noise disturbance, including: acquiring a training data set and a pre-training language model corresponding to a target task; calculating noise disturbance corresponding to each parameter matrix in the pre-training language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix; based on the target task, bias items and updated parameter matrices in the pre-training language model are optimized using the training data set.
In a second aspect of the embodiments of the present application, there is provided an apparatus for training a pre-training language model using noise disturbance, including: the acquisition module is configured to acquire a training data set and a pre-training language model corresponding to the target task; the computing module is configured to compute noise disturbance corresponding to each parameter matrix in the pre-training language model and update the parameter matrix according to the noise disturbance corresponding to each parameter matrix; the training module is configured to optimize bias items and updated parameter matrices in the pre-training language model using the training dataset based on the target task.
In a third aspect of the embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present application, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
Compared with the prior art, the embodiments of the present application have the following beneficial effects: a training data set and a pre-training language model corresponding to the target task are acquired; the noise disturbance corresponding to each parameter matrix in the pre-training language model is calculated, and each parameter matrix is updated according to its noise disturbance; and, based on the target task, the bias items and the updated parameter matrices in the pre-training language model are optimized by using the training data set. By these technical means, the prior-art problem that a model obtained by further training an already pre-trained large-scale model often overfits and has low generalization capability can be solved, so that overfitting of the model is avoided and the generalization capability of the model is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 is a flow diagram of a method for training a pre-training language model using noise perturbations provided in an embodiment of the present application;
FIG. 2 is a flow chart of another method for training a pre-training language model using noise disturbance provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a pre-training language model according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an apparatus for training a pre-training language model using noise disturbance according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
FIG. 1 is a flow chart of a method for training a pre-training language model using noise disturbance, as provided in an embodiment of the present application. The method of FIG. 1 may be performed by a computer or a server, or by software running on a computer or server. As shown in FIG. 1, the method for training a pre-training language model by using noise disturbance includes:
s101, acquiring a training data set and a pre-training language model corresponding to a target task;
s102, calculating noise disturbance corresponding to each parameter matrix in the pre-training language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix;
s103, optimizing bias items and updated parameter matrixes in the pre-training language model by utilizing the training data set based on the target task.
The pre-training language model has a large number of bias terms and parameter matrices. Wherever the optimization of the parameter matrices in the pre-training language model is mentioned below, it refers to the optimization of the updated parameter matrices in the pre-training language model.
The bias term, also called a bias unit or interference term, has the same meaning as b in the linear equation y = Wx + b, where b represents the intercept of the function on the y-axis and controls the distance of the function from the origin. A neural network model (the pre-training language model is a pre-training model, that is, a neural network model that has been pre-trained) may also be represented by y = Wx + b; unlike the linear equation, W and b in the neural network model are matrices, and the trainable parameters of the neural network model may be written as (W, b), where W represents the parameter matrix and b represents the bias term. The parameters of a neural network model are divided into fixed parameters and trainable parameters, the trainable parameters including the parameter matrices and the bias terms. Training a neural network model is the process of optimizing its trainable parameters.
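As a minimal illustration (assuming a PyTorch environment, which the application does not mandate), the snippet below shows how a single linear layer y = Wx + b exposes exactly these two kinds of trainable parameters, a parameter matrix and a bias term:

```python
import torch.nn as nn

# One linear layer y = Wx + b: "weight" is the parameter matrix W, "bias" is the bias term b.
layer = nn.Linear(in_features=4, out_features=2)
for name, param in layer.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)
# weight (2, 4) True
# bias (2,) True
```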
The method and the device can be used in any scene in the language field, such as text translation, word order prediction, next sentence prediction, question and answer tasks, named entity recognition tasks, text classification and the like. For example, in a text translation scenario, the target task is a text translation task; the training data set is a labeling corpus of text translation; the pre-training language model is a model obtained by pre-training the language model based on a text translation task; optimizing a parameter matrix and bias items in the pre-training language model by utilizing a training data set based on a text translation task; the final trained model is used for text translation. Other scenes are similar to the text translation scene.
According to the technical scheme provided by the embodiments of the present application, a training data set and a pre-training language model corresponding to the target task are obtained; the noise disturbance corresponding to each parameter matrix in the pre-training language model is calculated, and each parameter matrix is updated according to its noise disturbance; and, based on the target task, the bias items and the updated parameter matrices in the pre-training language model are optimized by using the training data set. By adding noise disturbance to the parameter matrices, the influence of pre-training on the overfitting and generalization behavior of the language model is weakened. This solves the prior-art problem that a model obtained by further training an already pre-trained large-scale model often overfits and has low generalization capability, so that overfitting of the model is avoided and the generalization capability of the model is improved.
Further, the noise disturbance corresponding to each parameter matrix in the pre-training language model is calculated by the following formula:

ε_i = u_i · σ_i

where ε_i is the noise disturbance corresponding to the i-th parameter matrix, u_i is uniformly distributed noise ranging from -λ/2 to λ/2, λ is a hyperparameter controlling the noise intensity in the pre-training language model, and σ_i is the standard deviation of the data inside the i-th parameter matrix.
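A minimal sketch of this computation, assuming a PyTorch tensor weight holding the i-th parameter matrix and a variable lam standing for the hyperparameter λ (both names are illustrative, not taken from the application):

```python
import torch

def noise_for_matrix(weight: torch.Tensor, lam: float) -> torch.Tensor:
    """Uniform noise in [-lam/2, lam/2], scaled by the standard deviation of the matrix entries."""
    uniform = torch.empty_like(weight).uniform_(-lam / 2, lam / 2)
    return uniform * weight.std()
```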
Further, each parameter matrix is updated by the following formula:

W_i' = W_i + ε_i

where ε_i is the noise disturbance corresponding to the i-th parameter matrix, W_i is the i-th parameter matrix before updating, and W_i' is the i-th parameter matrix after updating. If the dimensions of ε_i and W_i are inconsistent, padding may be performed so that the dimensions of ε_i and W_i are consistent.
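A sketch of applying this update to every parameter matrix of a model is given below. Treating every parameter with two or more dimensions as a parameter matrix, and the default value lam = 0.15, are assumptions of the sketch rather than requirements of the application:

```python
import torch

@torch.no_grad()
def add_noise_to_parameter_matrices(model: torch.nn.Module, lam: float = 0.15) -> None:
    """Apply W_i' = W_i + eps_i to every parameter matrix; bias terms are left untouched."""
    for param in model.parameters():
        if param.ndim >= 2:  # parameter matrices; 1-D parameters are treated as bias terms
            noise = torch.empty_like(param).uniform_(-lam / 2, lam / 2) * param.std()
            # The noise is generated with the same shape as the matrix, so no padding is
            # needed here; noise of a smaller shape could instead be zero-padded to match.
            param.add_(noise)
```

The perturbed model is then optimized on the target task as described in the following paragraphs.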
Further, optimizing bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task, comprising: dividing the training data set into a first training data set and a second training data set according to a first preset proportion, and carrying out multi-stage training on the pre-training language model: freezing a parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by using a first training data set based on a target task to complete first-stage training of the pre-training language model; after the first stage training is completed, thawing the parameter matrix in the pre-training language model, and optimizing the bias items and the parameter matrix in the pre-training language model by using the second training data set based on the target task to complete the second stage training of the pre-training language model.
The first preset ratio is related to the proportion of the trainable parameters in the pre-training language model accounted for by the parameter matrices and by the bias terms, respectively. For example, the first preset ratio may be 1:9, that is, the ratio of the data amount of the first training data set to that of the second training data set is 1:9.
In this embodiment, the first stage training: freezing the parameter matrix, and training only the bias items by using the first training data set; after the first stage training is completed, thawing the parameter matrix; training in the second stage: and training the bias items and the parameter matrix by using a second training data set, wherein the second stage training is the training of the whole pre-training language model.
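A minimal sketch of this two-stage schedule, assuming PyTorch; train_one_epoch is a hypothetical helper that runs one optimization pass over a data loader, and treating 1-D parameters as bias terms is likewise an assumption of the sketch:

```python
import torch

def set_trainable(model: torch.nn.Module, train_bias: bool, train_matrices: bool) -> None:
    # Freeze or thaw parameters by toggling requires_grad; 1-D parameters are taken as bias terms.
    for param in model.parameters():
        param.requires_grad = train_bias if param.ndim == 1 else train_matrices

def two_stage_training(model, first_loader, second_loader, train_one_epoch):
    # First stage: freeze the parameter matrices and optimize only the bias terms.
    set_trainable(model, train_bias=True, train_matrices=False)
    train_one_epoch(model, first_loader)
    # Second stage: thaw the parameter matrices and optimize bias terms and matrices jointly.
    set_trainable(model, train_bias=True, train_matrices=True)
    train_one_epoch(model, second_loader)
```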
Further, optimizing bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task, comprising: dividing the training data set into a first training data set and a second training data set according to a third preset proportion, and carrying out multi-stage training on the pre-training language model: freezing bias items in the pre-training language model, and optimizing a parameter matrix in the pre-training language model by using a first training data set based on a target task so as to complete first-stage training of the pre-training language model; after the first stage training is completed, the bias items in the pre-training language model are unfrozen, and based on the target task, the bias items and the parameter matrix in the pre-training language model are optimized by using the second training data set so as to complete the second stage training of the pre-training language model.
In this embodiment, the first stage training: freezing the bias term, and training only the parameter matrix by using the first training data set; after the first stage training is completed, thawing the bias items; training in the second stage: and training the bias items and the parameter matrix by using a second training data set, wherein the second stage training is the training of the whole pre-training language model.
Further, optimizing bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task, comprising: determining a data volume of the training data set; training the pre-training language model according to the data volume: when the data volume is smaller than a first preset size, freezing the updated parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by utilizing a training data set based on a target task; and when the data volume is not smaller than the first preset size, optimizing the bias items and the updated parameter matrix in the pre-training language model by utilizing the training data set based on the target task.
The parameter matrices account for more than ninety-nine percent of the trainable parameters in the pre-training language model, and the bias terms account for less than one percent. In this method, when the data volume is smaller than the first preset size (a small-sample scenario, that is, one with few training samples), only the bias items in the pre-training language model are optimized by using the training data set; this greatly reduces the number of optimized parameters and the training time, and avoids model overfitting when the number of training samples is small. Practice shows that optimizing only the bias items in the pre-training language model can already achieve a good effect.
Further, optimizing bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task, comprising: determining a data volume of the training data set; training the pre-training language model according to the data volume: when the data volume is smaller than a first preset size, freezing the updated parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by utilizing a training data set based on a target task; when the data volume is larger than or equal to the first preset size but smaller than the second preset size, freezing the bias items in the pre-training language model, and optimizing an updated parameter matrix in the pre-training language model by utilizing the training data set based on the target task; and when the data volume is larger than or equal to a second preset size, optimizing the bias items and the updated parameter matrix in the pre-training language model by using the training data set based on the target task.
According to the embodiment of the application, the corresponding training method is selected according to the data volume of the training data set, so that training efficiency is improved.
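A sketch of this data-volume-based selection follows; the concrete thresholds stand in for the 'first preset size' and 'second preset size', which the application leaves unspecified, and treating 1-D parameters as bias terms is again an assumption:

```python
import torch

def configure_training(model: torch.nn.Module, num_samples: int,
                       first_size: int = 1_000, second_size: int = 10_000) -> None:
    """Freeze or thaw trainable parameters according to the size of the training data set."""
    for param in model.parameters():
        is_bias = param.ndim == 1
        if num_samples < first_size:
            param.requires_grad = is_bias        # small data: optimize bias terms only
        elif num_samples < second_size:
            param.requires_grad = not is_bias    # medium data: optimize parameter matrices only
        else:
            param.requires_grad = True           # large data: optimize everything
```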
Further, before obtaining the pre-training language model corresponding to the target task, the method further includes: sequentially connecting a plurality of linear layers and nonlinear activation functions to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full connection layer and a classification layer to obtain a language model; and pre-training the language model based on the target task to obtain a pre-trained language model.
A plurality of linear layers are connected in series and followed by a nonlinear activation function to form the feedforward layer. The residual layer behind the multi-head attention network adds the output of the multi-head attention network to its input; the residual layer behind the feedforward layer adds the output of the feedforward layer to its input.
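A structural sketch of such a language model in PyTorch is shown below; the layer sizes, the GELU activation, and the mean pooling before the fully connected layer are illustrative choices, not details taken from the application:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Several linear layers in series followed by a nonlinear activation function."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Linear(hidden, dim), nn.GELU())

    def forward(self, x):
        return self.net(x)

class LanguageModel(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 256, heads: int = 4, num_classes: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)        # embedded layer
        self.attention = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)                        # normalization layer
        self.feed_forward = FeedForward(dim, 4 * dim)         # feedforward layer
        self.norm2 = nn.LayerNorm(dim)                        # normalization layer
        self.fully_connected = nn.Linear(dim, dim)            # full connection layer
        self.classifier = nn.Linear(dim, num_classes)         # classification layer

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(token_ids)
        attn_out, _ = self.attention(x, x, x)                 # multi-head attention network
        x = self.norm1(x + attn_out)                          # residual layer + normalization
        x = self.norm2(x + self.feed_forward(x))              # residual layer + normalization
        x = self.fully_connected(x.mean(dim=1))               # full connection on pooled features
        return self.classifier(x)
```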
FIG. 2 is a flow chart of another method for training a pre-training language model using noise disturbance provided in an embodiment of the present application. As shown in FIG. 2, the method includes:
s201, dividing the training data set into a first training data set, a second training data set and a third training data set according to a second preset proportion, and performing multi-stage training on the pre-training language model:
s202, freezing a parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by using a first training data set based on a target task to complete first-stage training of the pre-training language model;
s203, after the first-stage training is completed, thawing the parameter matrix in the pre-training language model, freezing the bias items in the pre-training language model, and optimizing the parameter matrix in the pre-training language model by using the second training data set based on the target task so as to complete the second-stage training of the pre-training language model;
S204, after the second stage training is completed, the bias items in the pre-training language model are unfrozen, and the bias items and the parameter matrix in the pre-training language model are optimized by using the third training data set based on the target task, so that the third stage training of the pre-training language model is completed.
The second preset ratio is related to the proportion of the trainable parameters in the pre-training language model accounted for by the parameter matrices and by the bias terms, respectively. For example, the second preset ratio may be 1:6:3, that is, the ratio of the data amounts of the first training data set, the second training data set, and the third training data set is 1:6:3.
First stage training: freezing the parameter matrix, and training only the bias items by using the first training data set; after the first stage training is completed, thawing the parameter matrix; training in the second stage: freezing the bias term, and training the parameter matrix by using the second training data set; after the second stage training is completed, thawing the bias items; training in a third stage: and training the parameter matrix and the bias term by using a third training data set, wherein the third stage training is the training of the whole pre-training language model.
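A sketch of this three-stage schedule follows; as before, train_one_epoch is a hypothetical helper running one optimization pass over a data loader, and treating 1-D parameters as bias terms is an assumption of the sketch:

```python
import torch

def three_stage_training(model: torch.nn.Module, first_loader, second_loader,
                         third_loader, train_one_epoch) -> None:
    def set_trainable(train_bias: bool, train_matrices: bool) -> None:
        for param in model.parameters():
            param.requires_grad = train_bias if param.ndim == 1 else train_matrices

    # First stage: freeze the parameter matrices and train only the bias terms.
    set_trainable(train_bias=True, train_matrices=False)
    train_one_epoch(model, first_loader)
    # Second stage: thaw the matrices, freeze the bias terms, and train only the matrices.
    set_trainable(train_bias=False, train_matrices=True)
    train_one_epoch(model, second_loader)
    # Third stage: thaw everything and train the whole pre-training language model.
    set_trainable(train_bias=True, train_matrices=True)
    train_one_epoch(model, third_loader)
```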
According to the method and the device, the accuracy of the final model can be greatly improved through multi-stage training of the pre-training language model.
In an alternative embodiment, a plurality of linear layers and nonlinear activation functions are sequentially connected to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full-connection layer and a classification layer to obtain language networks, and serially connecting a plurality of language networks to obtain a language model; acquiring a training data set corresponding to a target task; calculating noise disturbance corresponding to each parameter matrix in the language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix; based on the target task, bias items and updated parameter matrices in the language model are optimized using the training dataset.
In this embodiment, the language model is not pre-trained, but is directly formally trained. By adopting the technical means, the problem that the model obtained through training often has over-fitting and low generalization capability in the prior art can be solved, so that the over-fitting of the model is avoided and the generalization capability of the model is improved.
Fig. 3 is a schematic structural diagram of a language model according to an embodiment of the present application. As shown in fig. 3, the language model sequentially includes, from an input end to an output end: an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full connection layer and a classification layer.
The residual layer after the feedforward layer is used for adding the output of the feedforward layer and the input of the feedforward layer, and outputting the added result; the residual layer behind the multi-head attention network is used for adding the output of the multi-head attention network and the input of the multi-head attention network, and outputting the added result.
FIG. 3 is also a schematic structural diagram of the pre-training language model, since the pre-training language model is simply the language model after being pre-trained.
The language model may also be a BERT model, an XLNet model, a RoBERTa model, an ELECTRA model, or the like. In model training, the optimizer used may be an Adam, AdamW, AdaGrad, or RMSProp optimizer.
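For instance, with the AdamW option and a PyTorch model, the optimizer could be set up over the currently thawed parameters as follows; the learning rate and weight decay are illustrative values, and the placeholder model stands for whichever language model is being trained:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # placeholder for the (pre-)training language model
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),  # optimize only thawed parameters
    lr=2e-5,
    weight_decay=0.01,
)
```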
In an alternative embodiment, a training data set and a pre-training language model corresponding to the target task are obtained; calculating noise disturbance corresponding to network parameters of each network layer in the pre-training language model, and updating the network parameters according to the noise disturbance corresponding to each network parameter; and optimizing updated network parameters in the pre-training language model by utilizing the training data set based on the target task so as to complete training of the pre-training language model.
Further, the noise disturbance corresponding to each network parameter in the pre-training language model is calculated by the following formula:

ε_i = u_i · σ_i

where ε_i is the noise disturbance corresponding to the i-th network parameter, u_i is uniformly distributed noise ranging from -λ/2 to λ/2, λ is a hyperparameter controlling the noise intensity in the pre-training language model, and σ_i is the standard deviation of the data inside the i-th network parameter.
Further, each network parameter is updated by the following formula:

θ_i' = θ_i + ε_i

where ε_i is the noise disturbance corresponding to the i-th network parameter, θ_i is the i-th network parameter before updating, and θ_i' is the i-th network parameter after updating. If the dimensions of ε_i and θ_i are inconsistent, padding may be performed so that the dimensions of ε_i and θ_i are consistent.
In an alternative embodiment, a plurality of linear layers and nonlinear activation functions are sequentially connected to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full-connection layer and a classification layer to obtain language networks, and serially connecting a plurality of language networks to obtain a language model; acquiring a training data set corresponding to a target task; calculating noise disturbance corresponding to each network parameter in the language model, and updating the network parameter according to the noise disturbance corresponding to each network parameter; based on the target task, the updated network parameters in the language model are optimized using the training dataset.
In an alternative embodiment, a training data set and a pre-training language model corresponding to the target task are obtained; determining a first network parameter and a second network parameter which correspond to the bias item and the parameter matrix in the pre-training language model respectively; calculating noise disturbance corresponding to each second network parameter in the pre-training language model, and updating the second network parameters according to the noise disturbance corresponding to each parameter matrix; based on the target task, the first network parameters and the updated second network parameters in the pre-training language model are optimized using the training data set.
In an alternative embodiment, optimizing the bias terms and the updated parameter matrices in the language model using the training data set based on the target task includes: acquiring a trained target language model; inputting a plurality of training samples in the training data set into the language model and the target language model, and outputting, for each training sample, a first processing result and a second processing result respectively; calculating a contrast loss by using a triplet loss function according to the first processing result and the second processing result corresponding to each training sample and the second processing result corresponding to another training sample whose semantics differ from those of the training sample; calculating a classification loss by using a cross entropy loss function according to the first processing result and the label corresponding to each training sample; and updating the network parameters of the language model according to the contrast loss and the classification loss so as to complete the training of the language model.
Denote the triplet loss function by triplet(). Suppose the first processing result and the second processing result corresponding to a certain training sample are A1 and A2, respectively, and the second processing result corresponding to another training sample whose semantics differ from those of this training sample is A3 (that other training sample is randomly selected from the training data set); then the loss value corresponding to this training sample equals triplet(A1, A2, A3), and the loss values corresponding to all training samples are added to obtain the contrast loss. The contrast loss and the classification loss are weighted and summed according to a preset weight to obtain the total loss, and the model parameters of the language model are updated according to the total loss. By introducing the contrast loss into model training, the embodiment of the application can solve the prior-art problem of model overfitting and improve the generalization performance of the model.
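A sketch of the combined loss, assuming the processing results A1, A2, and A3 are logit vectors, using the built-in triplet margin loss of PyTorch, and with alpha standing in for the unspecified preset weight:

```python
import torch
import torch.nn.functional as F

triplet = torch.nn.TripletMarginLoss(margin=1.0)

def combined_loss(a1: torch.Tensor, a2: torch.Tensor, a3: torch.Tensor,
                  labels: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """a1: first processing result (language model being trained);
    a2: second processing result (trained target model, same sample);
    a3: second processing result for a randomly chosen sample with different semantics."""
    contrast = triplet(a1, a2, a3)                 # pulls a1 toward a2 and away from a3
    classification = F.cross_entropy(a1, labels)   # classification loss against the labels
    return alpha * contrast + (1 - alpha) * classification
```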
Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, which is not described herein in detail.
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
FIG. 4 is a schematic diagram of an apparatus for training a pre-trained language model using noise perturbations, as provided in an embodiment of the present application. As shown in fig. 4, the apparatus for training a pre-trained language model using noise disturbance includes:
an acquisition module 401 configured to acquire a training data set and a pre-training language model corresponding to a target task;
a calculation module 402 configured to calculate a noise disturbance corresponding to each parameter matrix in the pre-training language model, and update the parameter matrix according to the noise disturbance corresponding to each parameter matrix;
the training module 403 is configured to optimize bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task.
The pre-training language model has a large number of bias terms and parameter matrices. Wherever the optimization of the parameter matrices in the pre-training language model is mentioned below, it refers to the optimization of the updated parameter matrices in the pre-training language model.
The bias term, also called a bias unit or interference term, has the same meaning as b in the linear equation y = Wx + b, where b represents the intercept of the function on the y-axis and controls the distance of the function from the origin. A neural network model (the pre-training language model is a pre-training model, that is, a neural network model that has been pre-trained) may also be represented by y = Wx + b; unlike the linear equation, W and b in the neural network model are matrices, and the trainable parameters of the neural network model may be written as (W, b), where W represents the parameter matrix and b represents the bias term. The parameters of a neural network model are divided into fixed parameters and trainable parameters, the trainable parameters including the parameter matrices and the bias terms. Training a neural network model is the process of optimizing its trainable parameters.
The method and the device can be used in any scene in the language field, such as text translation, word order prediction, next sentence prediction, question and answer tasks, named entity recognition tasks, text classification and the like. For example, in a text translation scenario, the target task is a text translation task; the training data set is a labeling corpus of text translation; the pre-training language model is a model obtained by pre-training the language model based on a text translation task; optimizing a parameter matrix and bias items in the pre-training language model by utilizing a training data set based on a text translation task; the final trained model is used for text translation. Other scenes are similar to the text translation scene.
According to the technical scheme provided by the embodiments of the present application, a training data set and a pre-training language model corresponding to the target task are obtained; the noise disturbance corresponding to each parameter matrix in the pre-training language model is calculated, and each parameter matrix is updated according to its noise disturbance; and, based on the target task, the bias items and the updated parameter matrices in the pre-training language model are optimized by using the training data set. By adding noise disturbance to the parameter matrices, the influence of pre-training on the overfitting and generalization behavior of the language model is weakened. This solves the prior-art problem that a model obtained by further training an already pre-trained large-scale model often overfits and has low generalization capability, so that overfitting of the model is avoided and the generalization capability of the model is improved.
Optionally, the calculation module 402 is further configured to calculate the noise disturbance corresponding to each parameter matrix in the pre-training language model by the following formula:

ε_i = u_i · σ_i

where ε_i is the noise disturbance corresponding to the i-th parameter matrix, u_i is uniformly distributed noise ranging from -λ/2 to λ/2, λ is a hyperparameter controlling the noise intensity in the pre-training language model, and σ_i is the standard deviation of the data inside the i-th parameter matrix.
Optionally, the calculation module 402 is further configured to update each parameter matrix by the following formula:

W_i' = W_i + ε_i

where ε_i is the noise disturbance corresponding to the i-th parameter matrix, W_i is the i-th parameter matrix before updating, and W_i' is the i-th parameter matrix after updating. If the dimensions of ε_i and W_i are inconsistent, padding may be performed so that the dimensions of ε_i and W_i are consistent.
Optionally, the training module 403 is further configured to divide the training data set into a first training data set and a second training data set according to a first preset ratio, and perform multi-stage training on the pre-training language model: freezing a parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by using a first training data set based on a target task to complete first-stage training of the pre-training language model; after the first stage training is completed, thawing the parameter matrix in the pre-training language model, and optimizing the bias items and the parameter matrix in the pre-training language model by using the second training data set based on the target task to complete the second stage training of the pre-training language model.
The first preset ratio is related to the proportion of the trainable parameters in the pre-training language model accounted for by the parameter matrices and by the bias terms, respectively. For example, the first preset ratio may be 1:9, that is, the ratio of the data amount of the first training data set to that of the second training data set is 1:9.
In this embodiment, the first stage training: freezing the parameter matrix, and training only the bias items by using the first training data set; after the first stage training is completed, thawing the parameter matrix; training in the second stage: and training the bias items and the parameter matrix by using a second training data set, wherein the second stage training is the training of the whole pre-training language model.
Optionally, the training module 403 is further configured to divide the training data set into the first training data set and the second training data set according to a third preset ratio, and perform multi-stage training on the pre-training language model: freezing bias items in the pre-training language model, and optimizing a parameter matrix in the pre-training language model by using a first training data set based on a target task so as to complete first-stage training of the pre-training language model; after the first stage training is completed, the bias items in the pre-training language model are unfrozen, and based on the target task, the bias items and the parameter matrix in the pre-training language model are optimized by using the second training data set so as to complete the second stage training of the pre-training language model.
In this embodiment, the first stage training: freezing the bias term, and training only the parameter matrix by using the first training data set; after the first stage training is completed, thawing the bias items; training in the second stage: and training the bias items and the parameter matrix by using a second training data set, wherein the second stage training is the training of the whole pre-training language model.
Optionally, the training module 403 is further configured to determine the data amount of the training data set; training the pre-training language model according to the data volume: when the data volume is smaller than a first preset size, freezing the updated parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by utilizing a training data set based on a target task; and when the data volume is not smaller than the first preset size, optimizing the bias items and the updated parameter matrix in the pre-training language model by utilizing the training data set based on the target task.
The parameter matrices account for more than ninety-nine percent of the trainable parameters in the pre-training language model, and the bias terms account for less than one percent. In this embodiment, when the data volume is smaller than the first preset size (a small-sample scenario, that is, one with few training samples), only the bias items in the pre-training language model are optimized by using the training data set; this greatly reduces the number of optimized parameters and the training time, and avoids model overfitting when the number of training samples is small. Practice shows that optimizing only the bias items in the pre-training language model can already achieve a good effect.
Optionally, the training module 403 is further configured to determine the data amount of the training data set; training the pre-training language model according to the data volume: when the data volume is smaller than a first preset size, freezing the updated parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by utilizing a training data set based on a target task; when the data volume is larger than or equal to the first preset size but smaller than the second preset size, freezing the bias items in the pre-training language model, and optimizing an updated parameter matrix in the pre-training language model by utilizing the training data set based on the target task; and when the data volume is larger than or equal to a second preset size, optimizing the bias items and the updated parameter matrix in the pre-training language model by using the training data set based on the target task.
According to the embodiment of the application, the corresponding training method is selected according to the data volume of the training data set, so that training efficiency is improved.
Optionally, the obtaining module 401 is further configured to sequentially connect the plurality of linear layers and the nonlinear activation function to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full connection layer and a classification layer to obtain a language model; and pre-training the language model based on the target task to obtain a pre-trained language model.
A plurality of linear layers are connected in series and followed by a nonlinear activation function to form the feedforward layer. The residual layer behind the multi-head attention network adds the output of the multi-head attention network to its input; the residual layer behind the feedforward layer adds the output of the feedforward layer to its input.
Optionally, the training module 403 is further configured to divide the training data set into the first training data set, the second training data set and the third training data set according to a second preset ratio, and perform multi-stage training on the pre-training language model: freezing a parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by using a first training data set based on a target task to complete first-stage training of the pre-training language model; after the first-stage training is completed, thawing the parameter matrix in the pre-training language model, freezing the bias items in the pre-training language model, and optimizing the parameter matrix in the pre-training language model by utilizing a second training data set based on the target task so as to complete the second-stage training of the pre-training language model; after the second stage training is completed, the bias items in the pre-training language model are unfrozen, and based on the target task, the bias items and the parameter matrix in the pre-training language model are optimized by utilizing the third training data set so as to complete the third stage training of the pre-training language model.
The second preset ratio is related to the proportion of the trainable parameters in the pre-training language model accounted for by the parameter matrices and by the bias terms, respectively. For example, the second preset ratio may be 1:6:3, that is, the ratio of the data amounts of the first training data set, the second training data set, and the third training data set is 1:6:3.
First stage training: freezing the parameter matrix, and training only the bias items by using the first training data set; after the first stage training is completed, thawing the parameter matrix; training in the second stage: freezing the bias term, and training the parameter matrix by using the second training data set; after the second stage training is completed, thawing the bias items; training in a third stage: and training the parameter matrix and the bias term by using a third training data set, wherein the third stage training is the training of the whole pre-training language model.
According to the method and the device, the accuracy of the final model can be greatly improved through multi-stage training of the pre-training language model.
Optionally, the training module 403 is further configured to sequentially connect the plurality of linear layers and the nonlinear activation function to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full-connection layer and a classification layer to obtain language networks, and serially connecting a plurality of language networks to obtain a language model; acquiring a training data set corresponding to a target task; calculating noise disturbance corresponding to each parameter matrix in the language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix; based on the target task, bias items and updated parameter matrices in the language model are optimized using the training dataset.
In this embodiment, the language model is not pre-trained, but is directly formally trained. By adopting the technical means, the problem that the model obtained through training often has over-fitting and low generalization capability in the prior art can be solved, so that the over-fitting of the model is avoided and the generalization capability of the model is improved.
Optionally, the training module 403 is further configured to obtain a training data set and a pre-training language model corresponding to the target task; calculating noise disturbance corresponding to network parameters of each network layer in the pre-training language model, and updating the network parameters according to the noise disturbance corresponding to each network parameter; and optimizing updated network parameters in the pre-training language model by utilizing the training data set based on the target task so as to complete training of the pre-training language model.
Optionally, the calculation module 402 is further configured to calculate the noise disturbance corresponding to each network parameter in the pre-training language model by the following formula:

ε_i = u_i · σ_i

where ε_i is the noise disturbance corresponding to the i-th network parameter, u_i is uniformly distributed noise ranging from -λ/2 to λ/2, λ is a hyperparameter controlling the noise intensity in the pre-training language model, and σ_i is the standard deviation of the data inside the i-th network parameter.
Optionally, the calculation module 402 is further configured to update each network parameter by the following formula:

θ_i' = θ_i + ε_i

where ε_i is the noise disturbance corresponding to the i-th network parameter, θ_i is the i-th network parameter before updating, and θ_i' is the i-th network parameter after updating. If the dimensions of ε_i and θ_i are inconsistent, padding may be performed so that the dimensions of ε_i and θ_i are consistent.
Optionally, the training module 403 is further configured to sequentially connect the plurality of linear layers and the nonlinear activation function to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full-connection layer and a classification layer to obtain language networks, and serially connecting a plurality of language networks to obtain a language model; acquiring a training data set corresponding to a target task; calculating noise disturbance corresponding to each network parameter in the language model, and updating the network parameter according to the noise disturbance corresponding to each network parameter; based on the target task, the updated network parameters in the language model are optimized using the training dataset.
Optionally, the training module 403 is further configured to obtain a training data set and a pre-training language model corresponding to the target task; determining a first network parameter and a second network parameter which correspond to the bias item and the parameter matrix in the pre-training language model respectively; calculating noise disturbance corresponding to each second network parameter in the pre-training language model, and updating the second network parameters according to the noise disturbance corresponding to each parameter matrix; based on the target task, the first network parameters and the updated second network parameters in the pre-training language model are optimized using the training data set.
Optionally, the training module 403 is further configured to: acquire a trained target language model; input a plurality of training samples in the training data set into the language model and the target language model, and output, for each training sample, a first processing result and a second processing result respectively; calculate a contrast loss by using a triplet loss function according to the first processing result and the second processing result corresponding to each training sample and the second processing result corresponding to another training sample whose semantics differ from those of the training sample; calculate a classification loss by using a cross entropy loss function according to the first processing result and the label corresponding to each training sample; and update the network parameters of the language model according to the contrast loss and the classification loss so as to complete the training of the language model.
Denote the triplet loss function by triplet(). Suppose the first processing result and the second processing result corresponding to a certain training sample are A1 and A2, respectively, and the second processing result corresponding to another training sample whose semantics differ from those of this training sample is A3 (that other training sample is randomly selected from the training data set); then the loss value corresponding to this training sample equals triplet(A1, A2, A3), and the loss values corresponding to all training samples are added to obtain the contrast loss. The contrast loss and the classification loss are weighted and summed according to a preset weight to obtain the total loss, and the model parameters of the language model are updated according to the total loss. By introducing the contrast loss into model training, the embodiment of the application can solve the prior-art problem of model overfitting and improve the generalization performance of the model.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Fig. 5 is a schematic diagram of an electronic device 5 provided in an embodiment of the present application. As shown in fig. 5, the electronic apparatus 5 of this embodiment includes: a processor 501, a memory 502 and a computer program 503 stored in the memory 502 and executable on the processor 501. The steps of the various method embodiments described above are implemented by processor 501 when executing computer program 503. Alternatively, the processor 501, when executing the computer program 503, performs the functions of the modules/units in the above-described apparatus embodiments.
The electronic device 5 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 5 may include, but is not limited to, a processor 501 and a memory 502. It will be appreciated by those skilled in the art that fig. 5 is merely an example of the electronic device 5 and is not limiting of the electronic device 5, which may include more or fewer components than shown, or different components.
The processor 501 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
The memory 502 may be an internal storage unit of the electronic device 5, for example, a hard disk or a memory of the electronic device 5. The memory 502 may also be an external storage device of the electronic device 5, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, or the like, provided on the electronic device 5. The memory 502 may also include both an internal storage unit and an external storage device of the electronic device 5. The memory 502 is used to store the computer program and other programs and data required by the electronic device 5.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is illustrated by way of example; in practical applications, the above functions may be distributed among different functional units and modules as needed, i.e., the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiment may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit, and the integrated units may be implemented in the form of hardware or in the form of software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application implements all or part of the flow of the methods in the above embodiments, which may also be accomplished by instructing related hardware through a computer program; the computer program may be stored in a computer readable storage medium, and when executed by a processor, the computer program may implement the steps of the respective method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, the computer readable medium does not include electrical carrier signals and telecommunications signals.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included within the scope of the present application.

Claims (10)

1. A method for training a pre-trained language model using noise perturbations, comprising:
acquiring a training data set and a pre-training language model corresponding to a target task;
calculating noise disturbance corresponding to each parameter matrix in the pre-training language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix;
and optimizing bias terms and updated parameter matrices in the pre-training language model by utilizing the training data set based on the target task.
2. The method of claim 1, wherein the noise disturbance corresponding to each parameter matrix in the pre-trained language model is calculated by the following formula:

$$\delta_i = U\left(-\frac{\lambda}{2},\ \frac{\lambda}{2}\right) \cdot \sigma_i$$

where $\delta_i$ is the noise disturbance corresponding to the i-th parameter matrix, $U\left(-\frac{\lambda}{2}, \frac{\lambda}{2}\right)$ is uniformly distributed noise ranging from $-\lambda/2$ to $\lambda/2$, $\lambda$ is a hyperparameter controlling the noise intensity in the pre-trained language model, and $\sigma_i$ is the standard deviation of the data inside the i-th parameter matrix.
3. The method of claim 1, wherein each parameter matrix is updated by the following formula:

$$W_i' = W_i + \delta_i$$

where $\delta_i$ is the noise disturbance corresponding to the i-th parameter matrix, $W_i$ is the i-th parameter matrix before updating, and $W_i'$ is the i-th parameter matrix after updating.
4. The method of claim 1, wherein optimizing bias terms and updated parameter matrices in the pre-trained language model using the training dataset based on the target task comprises:
dividing the training data set into a first training data set and a second training data set according to a first preset proportion, and carrying out multi-stage training on the pre-training language model:
freezing the parameter matrix in the pre-training language model, and optimizing the bias terms in the pre-training language model by using the first training data set based on the target task, so as to complete the first-stage training of the pre-training language model;
after the first-stage training is completed, unfreezing the parameter matrix in the pre-training language model, and optimizing the bias terms and the parameter matrix in the pre-training language model by using the second training data set based on the target task, so as to complete the second-stage training of the pre-training language model.
5. The method of claim 1, wherein optimizing bias terms and updated parameter matrices in the pre-trained language model using the training dataset based on the target task comprises:
dividing the training data set into a first training data set, a second training data set and a third training data set according to a second preset proportion, and carrying out multi-stage training on the pre-training language model:
freezing the parameter matrix in the pre-training language model, and optimizing the bias terms in the pre-training language model by using the first training data set based on the target task, so as to complete the first-stage training of the pre-training language model;
after the first-stage training is completed, unfreezing the parameter matrix in the pre-training language model, freezing the bias terms in the pre-training language model, and optimizing the parameter matrix in the pre-training language model by utilizing the second training data set based on the target task, so as to complete the second-stage training of the pre-training language model;
after the second-stage training is completed, unfreezing the bias terms in the pre-training language model, and optimizing the bias terms and the parameter matrix in the pre-training language model by using the third training data set based on the target task, so as to complete the third-stage training of the pre-training language model.
6. The method of claim 1, wherein optimizing bias terms and updated parameter matrices in the pre-trained language model using the training dataset based on the target task comprises:
determining a data volume of the training data set;
training the pre-training language model according to the data volume:
when the data volume is smaller than a first preset size, freezing the updated parameter matrix in the pre-training language model, and optimizing the bias terms in the pre-training language model by utilizing the training data set based on the target task;
and when the data volume is not smaller than the first preset size, optimizing the bias terms and the updated parameter matrices in the pre-training language model by utilizing the training data set based on the target task.
7. The method of claim 1, wherein prior to obtaining the pre-trained language model corresponding to the target task, the method further comprises:
Sequentially connecting a plurality of linear layers and nonlinear activation functions to obtain a feedforward layer;
sequentially connecting an embedding layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a fully connected layer and a classification layer to obtain a language model;
and pre-training the language model based on the target task to obtain the pre-training language model.
8. The method according to claim 1, wherein the method further comprises:
sequentially connecting a plurality of linear layers and nonlinear activation functions to obtain a feedforward layer;
sequentially connecting an embedding layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a fully connected layer and a classification layer to obtain a language network, and serially connecting a plurality of language networks to obtain a language model;
acquiring a training data set corresponding to a target task;
calculating noise disturbance corresponding to each parameter matrix in the language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix;
and optimizing bias terms and updated parameter matrices in the language model by utilizing the training data set based on the target task.
9. The method according to claim 1, wherein the method further comprises:
Acquiring a training data set and a pre-training language model corresponding to a target task;
calculating noise disturbance corresponding to network parameters of each network layer in the pre-training language model, and updating the network parameters according to the noise disturbance corresponding to each network parameter;
and optimizing updated network parameters in the pre-training language model by utilizing the training data set based on the target task so as to complete training of the pre-training language model.
10. An apparatus for training a pre-trained language model using noise perturbations, comprising:
the acquisition module is configured to acquire a training data set and a pre-training language model corresponding to the target task;
the computing module is configured to compute noise disturbance corresponding to each parameter matrix in the pre-training language model and update the parameter matrix according to the noise disturbance corresponding to each parameter matrix;
a training module configured to optimize bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task.
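For illustration only, and not as part of the claims, the following minimal PyTorch sketch shows one way the noise disturbance of claims 2-3 and the two-stage training of claim 4 could be realized. It assumes that every parameter tensor with two or more dimensions is treated as a "parameter matrix" and every one-dimensional parameter as a "bias term"; the helper `train_one_epoch`, the value of `lam`, and the data-loader names are hypothetical placeholders, not details taken from the claims.

```python
import torch

def perturb_parameter_matrices(model, lam=0.15):
    """Claims 2-3: delta_i = U(-lam/2, lam/2) * std(W_i), then W_i <- W_i + delta_i."""
    with torch.no_grad():
        for p in model.parameters():
            if p.dim() >= 2:  # treat multi-dimensional tensors as parameter matrices
                noise = torch.empty_like(p).uniform_(-lam / 2.0, lam / 2.0)
                p.add_(noise * p.std())

def two_stage_finetune(model, first_loader, second_loader, train_one_epoch):
    """Claim 4: stage one freezes the parameter matrices and tunes only the bias
    terms on the first split; stage two unfreezes the matrices and tunes both the
    bias terms and the updated matrices on the second split."""
    for p in model.parameters():           # stage one: freeze parameter matrices
        p.requires_grad = p.dim() < 2
    train_one_epoch(model, first_loader)
    for p in model.parameters():           # stage two: unfreeze everything
        p.requires_grad = True
    train_one_epoch(model, second_loader)
```

Under the same assumptions, a typical call order would be to run perturb_parameter_matrices once after loading the pre-trained weights and then two_stage_finetune on the two splits of the target-task training data set; the data-volume branch of claim 6 would simply select between the bias-only and the full optimization regime depending on the size of that data set.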
CN202310614779.5A 2023-05-29 2023-05-29 Method and device for training pre-training language model by using noise disturbance Active CN116362351B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310614779.5A CN116362351B (en) 2023-05-29 2023-05-29 Method and device for training pre-training language model by using noise disturbance


Publications (2)

Publication Number Publication Date
CN116362351A (en) 2023-06-30
CN116362351B (en) 2023-09-26

Family

ID=86939890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310614779.5A Active CN116362351B (en) 2023-05-29 2023-05-29 Method and device for training pre-training language model by using noise disturbance

Country Status (1)

Country Link
CN (1) CN116362351B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190355366A1 (en) * 2018-05-18 2019-11-21 Emotech Ltd Speaker recognition
CN109919320A (en) * 2019-01-23 2019-06-21 西北工业大学 Triplet online learning methods based on Semantic hierarchy
US20210319176A1 (en) * 2020-04-13 2021-10-14 Capital One Services, Llc Efficient automatic punctuation with robust inference
CN112070010A (en) * 2020-09-08 2020-12-11 长沙理工大学 Pedestrian re-recognition method combining multi-loss dynamic training strategy to enhance local feature learning
CN112183468A (en) * 2020-10-27 2021-01-05 南京信息工程大学 Pedestrian re-identification method based on multi-attention combined multi-level features
CN113052324A (en) * 2021-03-24 2021-06-29 支付宝(杭州)信息技术有限公司 User abnormal pattern recognition method, device and equipment
CN113111663A (en) * 2021-04-28 2021-07-13 东南大学 Abstract generation method fusing key information
CN113468854A (en) * 2021-06-24 2021-10-01 浙江华巽科技有限公司 Multi-document automatic abstract generation method
CN114972904A (en) * 2022-04-18 2022-08-30 北京理工大学 Zero sample knowledge distillation method and system based on triple loss resistance
CN114818902A (en) * 2022-04-21 2022-07-29 浪潮云信息技术股份公司 Text classification method and system based on knowledge distillation
CN115734029A (en) * 2022-11-07 2023-03-03 中国电信股份有限公司 Terminal suitability judgment method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHUHAN WU ET AL: "NoisyTune: A Little Noise Can Help You Finetune Pretrained Language Models Better", pages 1 - 6, Retrieved from the Internet <URL:https://arxiv.org/pdf/2202.12024.pdf> *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116522152A (en) * 2023-07-05 2023-08-01 深圳须弥云图空间科技有限公司 Translation model training method and device based on back translation
CN116522152B (en) * 2023-07-05 2023-11-10 深圳须弥云图空间科技有限公司 Translation model training method and device based on back translation
CN116595130A (en) * 2023-07-18 2023-08-15 深圳须弥云图空间科技有限公司 Corpus expansion method and device under multiple tasks based on small language model
CN116595385A (en) * 2023-07-18 2023-08-15 深圳须弥云图空间科技有限公司 Composition generation model training method and device
CN116595385B (en) * 2023-07-18 2023-10-03 深圳须弥云图空间科技有限公司 Composition generation model training method and device
CN116595130B (en) * 2023-07-18 2024-02-20 深圳须弥云图空间科技有限公司 Corpus expansion method and device under multiple tasks based on small language model
CN116603249A (en) * 2023-07-19 2023-08-18 深圳须弥云图空间科技有限公司 Training method of large language model applied to role playing reasoning game
CN116603249B (en) * 2023-07-19 2023-10-03 深圳须弥云图空间科技有限公司 Training method of large language model applied to role playing reasoning game

Also Published As

Publication number Publication date
CN116362351B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
CN116362351B (en) Method and device for training pre-training language model by using noise disturbance
US20230368024A1 (en) Neural architecture search
US20210232929A1 (en) Neural architecture search
EP3564863B1 (en) Apparatus for executing lstm neural network operation, and operational method
WO2021089012A1 (en) Node classification method and apparatus for graph network model, and terminal device
US20200410365A1 (en) Unsupervised neural network training using learned optimizers
EP3362951B1 (en) Neural random access machine
US20220004849A1 (en) Image processing neural networks with dynamic filter activation
CN112116104B (en) Method, device, medium and electronic equipment for automatically integrating machine learning
CN116403250A (en) Face recognition method and device with shielding
CN116912635B (en) Target tracking method and device
CN116595130B (en) Corpus expansion method and device under multiple tasks based on small language model
CN113850298A (en) Image identification method and device and related equipment
CN116542328B (en) Knowledge distillation method and device for CTR prediction model
CN116629342A (en) Model bypass optimization method and device
CN116610788A (en) Method and device for training pre-training language model based on data volume of training data
CN116341640B (en) Text processing model training method and device
TWI763975B (en) System and method for reducing computational complexity of artificial neural network
CN116502640B (en) Text characterization model training method and device based on context
CN117474037B (en) Knowledge distillation method and device based on space distance alignment
CN116151232B (en) Method and device for generating model by multi-stage training text title
CN116306791A (en) Text processing method and device for improving self-attention model
CN118504658A (en) Pre-training federal learning fine tuning method, system, electronic equipment and storage medium
CN116628204A (en) Method and device for training text classification model based on training data volume
CN118333125A (en) Fine tuning training method and device for image generation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant