CN116362351A - Method and device for training pre-training language model by using noise disturbance
- Publication number
- CN116362351A (application CN202310614779.5A)
- Authority
- CN
- China
- Prior art keywords
- training
- language model
- data set
- parameter matrix
- target task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N20/00—Machine learning
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks; G06N3/08—Learning methods
Abstract
The application relates to the technical field of machine learning, and provides a method and a device for training a pre-training language model by using noise disturbance. The method comprises the following steps: acquiring a training data set and a pre-training language model corresponding to a target task; calculating the noise disturbance corresponding to each parameter matrix in the pre-training language model, and updating each parameter matrix according to its corresponding noise disturbance; and, based on the target task, optimizing the bias terms and the updated parameter matrices in the pre-training language model using the training data set. By adopting these technical means, the problem in the prior art that a model obtained by further training a pre-trained large-scale model often suffers from over-fitting and low generalization capability is solved.
Description
Technical Field
The present disclosure relates to the field of machine learning technologies, and in particular, to a method and apparatus for training a pre-training language model using noise disturbance.
Background
In recent years, with the development of machine learning technology, more and more large-scale models are being applied to the language field. In order to ensure that a large-scale model meets the requirements and to improve the training efficiency of the model, it is currently common to further train an already pre-trained large-scale model. However, the model obtained by further training the pre-trained large-scale model often suffers from over-fitting and low generalization capability.
Disclosure of Invention
In view of this, the embodiments of the present application provide a method, apparatus, electronic device, and computer readable storage medium for training a pre-training language model by using noise disturbance, so as to solve the problem in the prior art that a model obtained by further training a pre-trained large-scale model often suffers from over-fitting and low generalization capability.
In a first aspect of an embodiment of the present application, there is provided a method for training a pre-training language model using noise disturbance, including: acquiring a training data set and a pre-training language model corresponding to a target task; calculating noise disturbance corresponding to each parameter matrix in the pre-training language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix; based on the target task, bias items and updated parameter matrices in the pre-training language model are optimized using the training data set.
In a second aspect of the embodiments of the present application, there is provided an apparatus for training a pre-training language model using noise disturbance, including: the acquisition module is configured to acquire a training data set and a pre-training language model corresponding to the target task; the computing module is configured to compute noise disturbance corresponding to each parameter matrix in the pre-training language model and update the parameter matrix according to the noise disturbance corresponding to each parameter matrix; the training module is configured to optimize bias items and updated parameter matrices in the pre-training language model using the training dataset based on the target task.
In a third aspect of the embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present application, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
Compared with the prior art, the embodiments of the present application have the following beneficial effects: a training data set and a pre-training language model corresponding to the target task are obtained; the noise disturbance corresponding to each parameter matrix in the pre-training language model is calculated, and each parameter matrix is updated according to its corresponding noise disturbance; based on the target task, the bias terms and the updated parameter matrices in the pre-training language model are optimized using the training data set. By adopting these technical means, the problem in the prior art that a model obtained by further training a pre-trained large-scale model often suffers from over-fitting and low generalization capability can be solved, so that over-fitting of the model is avoided and the generalization capability of the model is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow diagram of a method for training a pre-training language model using noise perturbations provided in an embodiment of the present application;
FIG. 2 is a flow chart of another method for training a pre-training language model using noise disturbance provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a pre-training language model according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an apparatus for training a pre-training language model using noise disturbance according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
FIG. 1 is a flow chart of a method for training a pre-trained language model using noise perturbations, as provided in an embodiment of the present application. The method of training the pre-trained language model with noise perturbations of fig. 1 may be performed by a computer or server, or software on a computer or server. As shown in fig. 1, the method for training a pre-training language model by using noise disturbance includes:
s101, acquiring a training data set and a pre-training language model corresponding to a target task;
s102, calculating noise disturbance corresponding to each parameter matrix in the pre-training language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix;
s103, optimizing bias items and updated parameter matrixes in the pre-training language model by utilizing the training data set based on the target task.
The pre-training language model has a large number of bias terms and parameter matrices. The optimization of the parameter matrices in the pre-training language model described below refers to the optimization of the updated parameter matrices in the pre-training language model.
The bias term, also referred to as a bias unit or interference term, is consistent with the meaning of b in the linear equation y = Wx + b. In the linear equation y = Wx + b, b represents the intercept of the function on the y-axis and controls the distance of the function from the origin. A neural network model (the pre-training language model is a pre-trained model, and a pre-trained model is a neural network model that has been pre-trained) may also be represented by y = Wx + b; unlike in the linear equation, W and b in the neural network model represent matrices, and the trainable parameters of the neural network model may be represented as (W, b), where W represents a parameter matrix and b represents a bias term. The parameters of the neural network model are divided into fixed parameters and trainable parameters; the trainable parameters include the parameter matrices and the bias terms. Training a neural network model is the process of optimizing its trainable parameters.
The method and the device can be used in any scene in the language field, such as text translation, word order prediction, next sentence prediction, question and answer tasks, named entity recognition tasks, text classification and the like. For example, in a text translation scenario, the target task is a text translation task; the training data set is a labeling corpus of text translation; the pre-training language model is a model obtained by pre-training the language model based on a text translation task; optimizing a parameter matrix and bias items in the pre-training language model by utilizing a training data set based on a text translation task; the final trained model is used for text translation. Other scenes are similar to the text translation scene.
According to the technical scheme provided by the embodiment of the application, a training data set and a pre-training language model corresponding to the target task are obtained; the noise disturbance corresponding to each parameter matrix in the pre-training language model is calculated, and each parameter matrix is updated according to its corresponding noise disturbance; based on the target task, the bias terms and the updated parameter matrices in the pre-training language model are optimized using the training data set. By adding noise disturbance to the parameter matrices, the influence of pre-training on the over-fitting and generalization capability of the language model is weakened, so that the problem in the prior art that a model obtained by further training a pre-trained large-scale model often suffers from over-fitting and low generalization capability can be solved; over-fitting of the model is avoided and the generalization capability of the model is improved.
Further, the noise disturbance corresponding to each parameter matrix in the pre-training language model is calculated by the following formula:
ε_i = U(-λ/2, λ/2) · σ_i
wherein ε_i is the noise disturbance corresponding to the i-th parameter matrix, U(-λ/2, λ/2) is uniformly distributed noise ranging from -λ/2 to λ/2, λ is a hyper-parameter controlling the noise intensity in the pre-training language model, and σ_i is the standard deviation of the data inside the i-th parameter matrix.
Further, each parameter matrix is updated by the following formula:
W̃_i = W_i + ε_i
wherein ε_i is the noise disturbance corresponding to the i-th parameter matrix, W_i is the i-th parameter matrix before updating, and W̃_i is the i-th parameter matrix after updating.
If the dimensions of ε_i and W_i are inconsistent, ε_i may be padded so that the dimensions of ε_i and W_i are consistent.
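For illustration only (not part of the claimed embodiments), the noise disturbance and parameter update described above could be sketched in PyTorch roughly as follows; the function name add_noise_disturbance, the default value of λ, and the rule of treating one-dimensional tensors as bias terms to be skipped are assumptions made for this sketch.

```python
import torch

def add_noise_disturbance(model, lam: float = 0.15):
    """Sketch: perturb each parameter matrix W_i with uniform noise scaled by
    the standard deviation of W_i, as in the formulas above. `lam` plays the
    role of the hyper-parameter lambda controlling the noise intensity (assumed default)."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.dim() < 2:
                continue  # assumed rule: 1-D tensors are bias terms and are not perturbed
            sigma = param.data.std()                                   # sigma_i: std of the data inside the matrix
            noise = torch.empty_like(param).uniform_(-lam / 2, lam / 2)  # U(-lambda/2, lambda/2)
            param.data.add_(noise * sigma)                             # W_i' = W_i + epsilon_i
    return model
```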
Further, optimizing bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task, comprising: dividing the training data set into a first training data set and a second training data set according to a first preset proportion, and carrying out multi-stage training on the pre-training language model: freezing a parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by using a first training data set based on a target task to complete first-stage training of the pre-training language model; after the first stage training is completed, thawing the parameter matrix in the pre-training language model, and optimizing the bias items and the parameter matrix in the pre-training language model by using the second training data set based on the target task to complete the second stage training of the pre-training language model.
The first preset ratio is related to the proportion of the parameter matrices among the trainable parameters and the proportion of the bias terms among the trainable parameters in the pre-training language model. For example, the first preset ratio may be 1:9, in which case the ratio of the data amount of the first training data set to that of the second training data set is 1:9.
In this embodiment, the first stage training: freezing the parameter matrix, and training only the bias items by using the first training data set; after the first stage training is completed, thawing the parameter matrix; training in the second stage: and training the bias items and the parameter matrix by using a second training data set, wherein the second stage training is the training of the whole pre-training language model.
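A minimal sketch of this two-stage freezing/thawing strategy is given below, assuming a PyTorch model whose bias terms can be identified by parameter names ending in "bias"; set_trainable and the train_one_stage callable are hypothetical helpers standing in for an ordinary training loop on the given data set.

```python
def set_trainable(model, train_bias: bool, train_matrix: bool):
    # Freeze or thaw bias terms and parameter matrices by toggling requires_grad.
    for name, param in model.named_parameters():
        is_bias = name.endswith("bias")
        param.requires_grad = train_bias if is_bias else train_matrix

def two_stage_training(model, first_dataset, second_dataset, train_one_stage):
    # Stage 1: freeze the parameter matrices, optimize only the bias terms.
    set_trainable(model, train_bias=True, train_matrix=False)
    train_one_stage(model, first_dataset)
    # Stage 2: thaw the parameter matrices, optimize bias terms and matrices together.
    set_trainable(model, train_bias=True, train_matrix=True)
    train_one_stage(model, second_dataset)
    return model
```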
Further, optimizing bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task, comprising: dividing the training data set into a first training data set and a second training data set according to a third preset proportion, and carrying out multi-stage training on the pre-training language model: freezing bias items in the pre-training language model, and optimizing a parameter matrix in the pre-training language model by using a first training data set based on a target task so as to complete first-stage training of the pre-training language model; after the first stage training is completed, the bias items in the pre-training language model are unfrozen, and based on the target task, the bias items and the parameter matrix in the pre-training language model are optimized by using the second training data set so as to complete the second stage training of the pre-training language model.
In this embodiment, the first stage training: freezing the bias term, and training only the parameter matrix by using the first training data set; after the first stage training is completed, thawing the bias items; training in the second stage: and training the bias items and the parameter matrix by using a second training data set, wherein the second stage training is the training of the whole pre-training language model.
Further, optimizing bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task, comprising: determining a data volume of the training data set; training the pre-training language model according to the data volume: when the data volume is smaller than a first preset size, freezing the updated parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by utilizing a training data set based on a target task; and when the data volume is not smaller than the first preset size, optimizing the bias items and the updated parameter matrix in the pre-training language model by utilizing the training data set based on the target task.
In the pre-training language model, the parameter matrices account for more than ninety-nine percent of the trainable parameters, and the bias terms account for less than one percent. When the data volume is smaller than the first preset size, only the bias terms in the pre-training language model are optimized using the training data set (this applies to the small-sample scenario, i.e. the case where the number of training samples is small), which greatly reduces the number of optimized parameters and the training time, and at the same time avoids model over-fitting when the number of training samples is small. Practice shows that good results can be achieved by optimizing only the bias terms in the pre-training language model.
Further, optimizing bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task, comprising: determining a data volume of the training data set; training the pre-training language model according to the data volume: when the data volume is smaller than a first preset size, freezing the updated parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by utilizing a training data set based on a target task; when the data volume is larger than or equal to the first preset size but smaller than the second preset size, freezing the bias items in the pre-training language model, and optimizing an updated parameter matrix in the pre-training language model by utilizing the training data set based on the target task; and when the data volume is larger than or equal to a second preset size, optimizing the bias items and the updated parameter matrix in the pre-training language model by using the training data set based on the target task.
According to the embodiment of the application, the corresponding training method is selected according to the data volume of the training data set, so that training efficiency is improved.
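The data-volume-based selection described in the two preceding embodiments could be sketched as follows, reusing the hypothetical set_trainable helper from the earlier sketch; the threshold arguments first_size and second_size stand in for the first and second preset sizes and are assumptions for illustration.

```python
def choose_training_strategy(model, dataset, first_size: int, second_size: int = None):
    # Select which parameter groups to optimize based on the training data volume.
    n = len(dataset)
    if n < first_size:
        # Small data set: freeze the updated parameter matrices, optimize only the bias terms.
        set_trainable(model, train_bias=True, train_matrix=False)
    elif second_size is not None and n < second_size:
        # Medium data set: freeze the bias terms, optimize only the updated parameter matrices.
        set_trainable(model, train_bias=False, train_matrix=True)
    else:
        # Large data set: optimize bias terms and updated parameter matrices together.
        set_trainable(model, train_bias=True, train_matrix=True)
    return model
```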
Further, before obtaining the pre-training language model corresponding to the target task, the method further includes: sequentially connecting a plurality of linear layers and nonlinear activation functions to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full connection layer and a classification layer to obtain a language model; and pre-training the language model based on the target task to obtain a pre-trained language model.
A plurality of linear layers are connected in series and then followed by a nonlinear activation function to serve as the feedforward layer. The residual layer behind the multi-head attention network is used for adding the output of the multi-head attention network and the input of the multi-head attention network; the residual layer after the feedforward layer is used to add the output of the feedforward layer to the input of the feedforward layer.
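As a rough illustration only, the layer ordering described above might be expressed in PyTorch as in the following sketch; the layer dimensions, the use of nn.MultiheadAttention, and the choice of classifying from the first token representation are assumptions, not details taken from the embodiment.

```python
import torch.nn as nn

class LanguageNetwork(nn.Module):
    """Sketch of the layer ordering described above; sizes are illustrative assumptions."""
    def __init__(self, vocab_size=30000, d_model=768, n_heads=12, d_ff=3072, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Feedforward layer: linear layers in series followed by a nonlinear activation.
        self.feedforward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.Linear(d_ff, d_model), nn.GELU())
        self.norm2 = nn.LayerNorm(d_model)
        self.fc = nn.Linear(d_model, d_model)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)                # residual: add attention output to its input, then normalize
        x = self.norm2(x + self.feedforward(x))     # residual: add feedforward output to its input, then normalize
        return self.classifier(self.fc(x[:, 0]))    # illustrative: classify from the first token representation
```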
FIG. 2 is a flow chart of another method for training a pre-trained language model using noise perturbations provided in an embodiment of the present application. As shown in fig. 2, includes:
s201, dividing the training data set into a first training data set, a second training data set and a third training data set according to a second preset proportion, and performing multi-stage training on the pre-training language model:
s202, freezing a parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by using a first training data set based on a target task to complete first-stage training of the pre-training language model;
s203, after the first-stage training is completed, thawing the parameter matrix in the pre-training language model, freezing the bias items in the pre-training language model, and optimizing the parameter matrix in the pre-training language model by using the second training data set based on the target task so as to complete the second-stage training of the pre-training language model;
S204, after the second stage training is completed, the bias items in the pre-training language model are unfrozen, and the bias items and the parameter matrix in the pre-training language model are optimized by using the third training data set based on the target task, so that the third stage training of the pre-training language model is completed.
The second preset ratio is related to the proportion of the parameter matrices among the trainable parameters and the proportion of the bias terms among the trainable parameters in the pre-training language model. For example, the second preset ratio may be 1:6:3, in which case the ratio of the data amounts of the first training data set, the second training data set, and the third training data set is 1:6:3.
First stage training: freezing the parameter matrix, and training only the bias items by using the first training data set; after the first stage training is completed, thawing the parameter matrix; training in the second stage: freezing the bias term, and training the parameter matrix by using the second training data set; after the second stage training is completed, thawing the bias items; training in a third stage: and training the parameter matrix and the bias term by using a third training data set, wherein the third stage training is the training of the whole pre-training language model.
According to the method and the device, the accuracy of the final model can be greatly improved through multi-stage training of the pre-training language model.
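A minimal sketch of the three-stage procedure of FIG. 2, again using the hypothetical set_trainable helper and a train_one_stage callable standing in for the per-stage training loop:

```python
def three_stage_training(model, first_dataset, second_dataset, third_dataset, train_one_stage):
    # Stage 1: freeze the parameter matrices, optimize only the bias terms.
    set_trainable(model, train_bias=True, train_matrix=False)
    train_one_stage(model, first_dataset)
    # Stage 2: thaw the matrices, freeze the bias terms, optimize only the matrices.
    set_trainable(model, train_bias=False, train_matrix=True)
    train_one_stage(model, second_dataset)
    # Stage 3: thaw the bias terms, optimize bias terms and matrices together.
    set_trainable(model, train_bias=True, train_matrix=True)
    train_one_stage(model, third_dataset)
    return model
```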
In an alternative embodiment, a plurality of linear layers and nonlinear activation functions are sequentially connected to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full-connection layer and a classification layer to obtain language networks, and serially connecting a plurality of language networks to obtain a language model; acquiring a training data set corresponding to a target task; calculating noise disturbance corresponding to each parameter matrix in the language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix; based on the target task, bias items and updated parameter matrices in the language model are optimized using the training dataset.
In this embodiment, the language model is not pre-trained, but is directly formally trained. By adopting the technical means, the problem that the model obtained through training often has over-fitting and low generalization capability in the prior art can be solved, so that the over-fitting of the model is avoided and the generalization capability of the model is improved.
Fig. 3 is a schematic structural diagram of a language model according to an embodiment of the present application. As shown in fig. 3, the language model sequentially includes, from an input end to an output end: an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full connection layer and a classification layer.
The residual layer after the feedforward layer is used for adding the output of the feedforward layer and the input of the feedforward layer, and outputting the added result; the residual layer behind the multi-head attention network is used for adding the output of the multi-head attention network and the input of the multi-head attention network, and outputting the added result.
FIG. 3 is also a schematic structural diagram of the pre-training language model; the pre-training language model is the language model after it has been pre-trained.
The language model may also be a BERT model, an XLNet model, a RoBERTa model, an ELECTRA model, or the like. In model training, the optimizer used may be an Adam optimizer, an AdamW optimizer, an AdaGrad optimizer, or an RMSProp optimizer.
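For illustration, whichever of the optimizers listed above is chosen, only the parameters left trainable (thawed) at the current stage need to be handed to it; the following sketch uses AdamW with an assumed learning rate.

```python
import torch

def build_optimizer(model, lr: float = 2e-5):
    # Hand only the currently thawed (requires_grad=True) parameters to the optimizer.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)  # Adam, AdaGrad, or RMSProp could be used similarly
```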
In an alternative embodiment, a training data set and a pre-training language model corresponding to the target task are obtained; calculating noise disturbance corresponding to network parameters of each network layer in the pre-training language model, and updating the network parameters according to the noise disturbance corresponding to each network parameter; and optimizing updated network parameters in the pre-training language model by utilizing the training data set based on the target task so as to complete training of the pre-training language model.
Further, the noise disturbance corresponding to each network parameter in the pre-trained language model is calculated by the following formula:
ε_i = U(-λ/2, λ/2) · σ_i
wherein ε_i is the noise disturbance corresponding to the i-th network parameter, U(-λ/2, λ/2) is uniformly distributed noise ranging from -λ/2 to λ/2, λ is a hyper-parameter controlling the noise intensity in the pre-training language model, and σ_i is the standard deviation of the data within the i-th network parameter.
Further, each network parameter is updated by the following formula:
θ̃_i = θ_i + ε_i
wherein ε_i is the noise disturbance corresponding to the i-th network parameter, θ_i is the i-th network parameter before updating, and θ̃_i is the i-th network parameter after updating.
If the dimensions of ε_i and θ_i are inconsistent, ε_i may be padded so that the dimensions of ε_i and θ_i are consistent.
In an alternative embodiment, a plurality of linear layers and nonlinear activation functions are sequentially connected to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full-connection layer and a classification layer to obtain language networks, and serially connecting a plurality of language networks to obtain a language model; acquiring a training data set corresponding to a target task; calculating noise disturbance corresponding to each network parameter in the language model, and updating the network parameter according to the noise disturbance corresponding to each network parameter; based on the target task, the updated network parameters in the language model are optimized using the training dataset.
In an alternative embodiment, a training data set and a pre-training language model corresponding to the target task are obtained; calculating noise disturbance corresponding to network parameters of each network layer in the pre-training language model, and updating the network parameters according to the noise disturbance corresponding to each network parameter; and optimizing updated network parameters in the pre-training language model by utilizing the training data set based on the target task so as to complete training of the pre-training language model.
In an alternative embodiment, a training data set and a pre-training language model corresponding to the target task are obtained; determining a first network parameter and a second network parameter which correspond to the bias item and the parameter matrix in the pre-training language model respectively; calculating noise disturbance corresponding to each second network parameter in the pre-training language model, and updating the second network parameters according to the noise disturbance corresponding to each parameter matrix; based on the target task, the first network parameters and the updated second network parameters in the pre-training language model are optimized using the training data set.
In an alternative embodiment, optimizing bias terms and updated parameter matrices in the language model using the training dataset based on the target task includes: acquiring a trained target language model; inputting a plurality of training samples in a training data set into a language model and a target language model, and respectively outputting a first processing result and a second processing result corresponding to each training sample; calculating contrast loss by using a triplet loss function according to a first processing result and a second processing result corresponding to each training sample and a second processing result corresponding to another training sample with different semantics of the training sample; calculating classification loss by using a cross entropy loss function according to a first processing result and a label corresponding to each training sample; and updating the network parameters of the language model according to the comparison loss and the classification loss so as to complete the training of the language model.
The triplet loss function is triplet(). The first processing result and the second processing result corresponding to a certain training sample are A1 and A2 respectively, and the second processing result corresponding to another training sample with different semantics from that training sample is A3 (the other training sample with different semantics is randomly determined in the training data set); the loss value corresponding to the training sample is equal to triplet(A1, A2, A3), and the loss values corresponding to all the training samples are added to obtain the contrast loss. The contrast loss and the classification loss are weighted and summed according to preset weights to obtain the total loss, and the model parameters of the language model are updated according to the total loss. According to the embodiment of the application, the contrast loss is introduced into model training, so that the problem in the prior art that the model is over-fitted can be solved and the generalization performance of the model is improved.
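A minimal sketch of combining the contrast (triplet) loss and the classification (cross-entropy) loss described above; the equal weights, the batch-mean reduction, and the use of the first processing result for both losses are assumptions made for this sketch.

```python
import torch.nn.functional as F

def combined_loss(a1, a2, a3, labels, contrast_weight=0.5, cls_weight=0.5):
    """a1: first processing result (output of the language model being trained),
    a2: second processing result (output of the trained target language model) for the same sample,
    a3: second processing result for another sample with different semantics,
    labels: classification labels for the batch."""
    contrast_loss = F.triplet_margin_loss(a1, a2, a3)  # triplet(A1, A2, A3), averaged over the batch
    cls_loss = F.cross_entropy(a1, labels)             # classification loss from the first processing result and labels
    return contrast_weight * contrast_loss + cls_weight * cls_loss  # weighted sum = total loss
```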
Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, which is not described herein in detail.
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
FIG. 4 is a schematic diagram of an apparatus for training a pre-trained language model using noise perturbations, as provided in an embodiment of the present application. As shown in fig. 4, the apparatus for training a pre-trained language model using noise disturbance includes:
an acquisition module 401 configured to acquire a training data set and a pre-training language model corresponding to a target task;
a calculation module 402 configured to calculate a noise disturbance corresponding to each parameter matrix in the pre-training language model, and update the parameter matrix according to the noise disturbance corresponding to each parameter matrix;
the training module 403 is configured to optimize bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task.
The pre-training language model has a large number of bias terms and parameter matrices. The optimization of the parameter matrices in the pre-training language model described below refers to the optimization of the updated parameter matrices in the pre-training language model.
The bias term, also referred to as a bias unit or interference term, is consistent with the meaning of b in the linear equation y = Wx + b. In the linear equation y = Wx + b, b represents the intercept of the function on the y-axis and controls the distance of the function from the origin. A neural network model (the pre-training language model is a pre-trained model, and a pre-trained model is a neural network model that has been pre-trained) may also be represented by y = Wx + b; unlike in the linear equation, W in the neural network model represents a matrix, and the trainable parameters of the neural network model may be represented as (W, b), where W represents a parameter matrix and b represents a bias term. The parameters of the neural network model are divided into fixed parameters and trainable parameters; the trainable parameters include the parameter matrices and the bias terms. Training a neural network model is the process of optimizing its trainable parameters.
The method and the device can be used in any scene in the language field, such as text translation, word order prediction, next sentence prediction, question and answer tasks, named entity recognition tasks, text classification and the like. For example, in a text translation scenario, the target task is a text translation task; the training data set is a labeling corpus of text translation; the pre-training language model is a model obtained by pre-training the language model based on a text translation task; optimizing a parameter matrix and bias items in the pre-training language model by utilizing a training data set based on a text translation task; the final trained model is used for text translation. Other scenes are similar to the text translation scene.
According to the technical scheme provided by the embodiment of the application, a training data set and a pre-training language model corresponding to the target task are obtained; the noise disturbance corresponding to each parameter matrix in the pre-training language model is calculated, and each parameter matrix is updated according to its corresponding noise disturbance; based on the target task, the bias terms and the updated parameter matrices in the pre-training language model are optimized using the training data set. By adding noise disturbance to the parameter matrices, the influence of pre-training on the over-fitting and generalization capability of the language model is weakened, so that the problem in the prior art that a model obtained by further training a pre-trained large-scale model often suffers from over-fitting and low generalization capability can be solved; over-fitting of the model is avoided and the generalization capability of the model is improved.
Optionally, the calculation module 402 is further configured to calculate the noise disturbance corresponding to each parameter matrix in the pre-trained language model by the following formula:
ε_i = U(-λ/2, λ/2) · σ_i
wherein ε_i is the noise disturbance corresponding to the i-th parameter matrix, U(-λ/2, λ/2) is uniformly distributed noise ranging from -λ/2 to λ/2, λ is a hyper-parameter controlling the noise intensity in the pre-training language model, and σ_i is the standard deviation of the data inside the i-th parameter matrix.
Optionally, the calculation module 402 is further configured to update each parameter matrix by:
W̃_i = W_i + ε_i
wherein ε_i is the noise disturbance corresponding to the i-th parameter matrix, W_i is the i-th parameter matrix before updating, and W̃_i is the i-th parameter matrix after updating.
If the dimensions of ε_i and W_i are inconsistent, ε_i may be padded so that the dimensions of ε_i and W_i are consistent.
Optionally, the training module 403 is further configured to divide the training data set into a first training data set and a second training data set according to a first preset ratio, and perform multi-stage training on the pre-training language model: freezing a parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by using a first training data set based on a target task to complete first-stage training of the pre-training language model; after the first stage training is completed, thawing the parameter matrix in the pre-training language model, and optimizing the bias items and the parameter matrix in the pre-training language model by using the second training data set based on the target task to complete the second stage training of the pre-training language model.
The first preset ratio is related to the proportion of the parameter matrices among the trainable parameters and the proportion of the bias terms among the trainable parameters in the pre-training language model. For example, the first preset ratio may be 1:9, in which case the ratio of the data amount of the first training data set to that of the second training data set is 1:9.
In this embodiment, the first stage training: freezing the parameter matrix, and training only the bias items by using the first training data set; after the first stage training is completed, thawing the parameter matrix; training in the second stage: and training the bias items and the parameter matrix by using a second training data set, wherein the second stage training is the training of the whole pre-training language model.
Optionally, the training module 403 is further configured to divide the training data set into the first training data set and the second training data set according to a third preset ratio, and perform multi-stage training on the pre-training language model: freezing bias items in the pre-training language model, and optimizing a parameter matrix in the pre-training language model by using a first training data set based on a target task so as to complete first-stage training of the pre-training language model; after the first stage training is completed, the bias items in the pre-training language model are unfrozen, and based on the target task, the bias items and the parameter matrix in the pre-training language model are optimized by using the second training data set so as to complete the second stage training of the pre-training language model.
In this embodiment, the first stage training: freezing the bias term, and training only the parameter matrix by using the first training data set; after the first stage training is completed, thawing the bias items; training in the second stage: and training the bias items and the parameter matrix by using a second training data set, wherein the second stage training is the training of the whole pre-training language model.
Optionally, the training module 403 is further configured to determine the data amount of the training data set; training the pre-training language model according to the data volume: when the data volume is smaller than a first preset size, freezing the updated parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by utilizing a training data set based on a target task; and when the data volume is not smaller than the first preset size, optimizing the bias items and the updated parameter matrix in the pre-training language model by utilizing the training data set based on the target task.
In the pre-training language model, the parameter matrices account for more than ninety-nine percent of the trainable parameters, and the bias terms account for less than one percent. When the data volume is smaller than the first preset size, only the bias terms in the pre-training language model are optimized using the training data set (this applies to the small-sample scenario, i.e. the case where the number of training samples is small), which greatly reduces the number of optimized parameters and the training time, and at the same time avoids model over-fitting when the number of training samples is small. Practice shows that good results can be achieved by optimizing only the bias terms in the pre-training language model.
Optionally, the training module 403 is further configured to determine the data amount of the training data set; training the pre-training language model according to the data volume: when the data volume is smaller than a first preset size, freezing the updated parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by utilizing a training data set based on a target task; when the data volume is larger than or equal to the first preset size but smaller than the second preset size, freezing the bias items in the pre-training language model, and optimizing an updated parameter matrix in the pre-training language model by utilizing the training data set based on the target task; and when the data volume is larger than or equal to a second preset size, optimizing the bias items and the updated parameter matrix in the pre-training language model by using the training data set based on the target task.
According to the embodiment of the application, the corresponding training method is selected according to the data volume of the training data set, so that training efficiency is improved.
Optionally, the obtaining module 401 is further configured to sequentially connect the plurality of linear layers and the nonlinear activation function to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full connection layer and a classification layer to obtain a language model; and pre-training the language model based on the target task to obtain a pre-trained language model.
A plurality of linear layers are connected in series and then followed by a nonlinear activation function to serve as the feedforward layer. The residual layer behind the multi-head attention network is used for adding the output of the multi-head attention network and the input of the multi-head attention network; the residual layer after the feedforward layer is used to add the output of the feedforward layer to the input of the feedforward layer.
Optionally, the training module 403 is further configured to divide the training data set into the first training data set, the second training data set and the third training data set according to a second preset ratio, and perform multi-stage training on the pre-training language model: freezing a parameter matrix in the pre-training language model, and optimizing bias items in the pre-training language model by using a first training data set based on a target task to complete first-stage training of the pre-training language model; after the first-stage training is completed, thawing the parameter matrix in the pre-training language model, freezing the bias items in the pre-training language model, and optimizing the parameter matrix in the pre-training language model by utilizing a second training data set based on the target task so as to complete the second-stage training of the pre-training language model; after the second stage training is completed, the bias items in the pre-training language model are unfrozen, and based on the target task, the bias items and the parameter matrix in the pre-training language model are optimized by utilizing the third training data set so as to complete the third stage training of the pre-training language model.
The second preset ratio is related to the proportion of the parameter matrices among the trainable parameters and the proportion of the bias terms among the trainable parameters in the pre-training language model. For example, the second preset ratio may be 1:6:3, in which case the ratio of the data amounts of the first training data set, the second training data set, and the third training data set is 1:6:3.
First stage training: freezing the parameter matrix, and training only the bias items by using the first training data set; after the first stage training is completed, thawing the parameter matrix; training in the second stage: freezing the bias term, and training the parameter matrix by using the second training data set; after the second stage training is completed, thawing the bias items; training in a third stage: and training the parameter matrix and the bias term by using a third training data set, wherein the third stage training is the training of the whole pre-training language model.
According to the method and the device, the accuracy of the final model can be greatly improved through multi-stage training of the pre-training language model.
Optionally, the training module 403 is further configured to sequentially connect the plurality of linear layers and the nonlinear activation function to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full-connection layer and a classification layer to obtain language networks, and serially connecting a plurality of language networks to obtain a language model; acquiring a training data set corresponding to a target task; calculating noise disturbance corresponding to each parameter matrix in the language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix; based on the target task, bias items and updated parameter matrices in the language model are optimized using the training dataset.
In this embodiment, the language model is not pre-trained, but is directly formally trained. By adopting the technical means, the problem that the model obtained through training often has over-fitting and low generalization capability in the prior art can be solved, so that the over-fitting of the model is avoided and the generalization capability of the model is improved.
Optionally, the training module 403 is further configured to obtain a training data set and a pre-training language model corresponding to the target task; calculating noise disturbance corresponding to network parameters of each network layer in the pre-training language model, and updating the network parameters according to the noise disturbance corresponding to each network parameter; and optimizing updated network parameters in the pre-training language model by utilizing the training data set based on the target task so as to complete training of the pre-training language model.
Optionally, the calculation module 402 is further configured to calculate a noise disturbance corresponding to each network parameter in the pre-trained language model by the following formula:
ε_i = U(-λ/2, λ/2) · σ_i
wherein ε_i is the noise disturbance corresponding to the i-th network parameter, U(-λ/2, λ/2) is uniformly distributed noise ranging from -λ/2 to λ/2, λ is a hyper-parameter controlling the noise intensity in the pre-training language model, and σ_i is the standard deviation of the data within the i-th network parameter.
Optionally, the computing module 402 is further configured to update each network parameter by:
θ̃_i = θ_i + ε_i
wherein ε_i is the noise disturbance corresponding to the i-th network parameter, θ_i is the i-th network parameter before updating, and θ̃_i is the i-th network parameter after updating.
If the dimensions of ε_i and θ_i are inconsistent, ε_i may be padded so that the dimensions of ε_i and θ_i are consistent.
Optionally, the training module 403 is further configured to sequentially connect the plurality of linear layers and the nonlinear activation function to obtain a feedforward layer; sequentially connecting an embedded layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full-connection layer and a classification layer to obtain language networks, and serially connecting a plurality of language networks to obtain a language model; acquiring a training data set corresponding to a target task; calculating noise disturbance corresponding to each network parameter in the language model, and updating the network parameter according to the noise disturbance corresponding to each network parameter; based on the target task, the updated network parameters in the language model are optimized using the training dataset.
Optionally, the training module 403 is further configured to obtain a training data set and a pre-training language model corresponding to the target task; calculating noise disturbance corresponding to network parameters of each network layer in the pre-training language model, and updating the network parameters according to the noise disturbance corresponding to each network parameter; and optimizing updated network parameters in the pre-training language model by utilizing the training data set based on the target task so as to complete training of the pre-training language model.
Optionally, the training module 403 is further configured to obtain a training data set and a pre-training language model corresponding to the target task; determining a first network parameter and a second network parameter which correspond to the bias item and the parameter matrix in the pre-training language model respectively; calculating noise disturbance corresponding to each second network parameter in the pre-training language model, and updating the second network parameters according to the noise disturbance corresponding to each parameter matrix; based on the target task, the first network parameters and the updated second network parameters in the pre-training language model are optimized using the training data set.
Optionally, the training module 403 is further configured to obtain a target language model that has been trained; inputting a plurality of training samples in a training data set into a language model and a target language model, and respectively outputting a first processing result and a second processing result corresponding to each training sample; calculating contrast loss by using a triplet loss function according to a first processing result and a second processing result corresponding to each training sample and a second processing result corresponding to another training sample with different semantics of the training sample; calculating classification loss by using a cross entropy loss function according to a first processing result and a label corresponding to each training sample; and updating the network parameters of the language model according to the comparison loss and the classification loss so as to complete the training of the language model.
The triplet loss function is triplet(). The first processing result and the second processing result corresponding to a certain training sample are A1 and A2 respectively, and the second processing result corresponding to another training sample with different semantics from that training sample is A3 (the other training sample with different semantics is randomly determined in the training data set); the loss value corresponding to the training sample is equal to triplet(A1, A2, A3), and the loss values corresponding to all the training samples are added to obtain the contrast loss. The contrast loss and the classification loss are weighted and summed according to preset weights to obtain the total loss, and the model parameters of the language model are updated according to the total loss. According to the embodiment of the application, the contrast loss is introduced into model training, so that the problem in the prior art that the model is over-fitted can be solved and the generalization performance of the model is improved.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Fig. 5 is a schematic diagram of an electronic device 5 provided in an embodiment of the present application. As shown in fig. 5, the electronic apparatus 5 of this embodiment includes: a processor 501, a memory 502 and a computer program 503 stored in the memory 502 and executable on the processor 501. The steps of the various method embodiments described above are implemented by processor 501 when executing computer program 503. Alternatively, the processor 501, when executing the computer program 503, performs the functions of the modules/units in the above-described apparatus embodiments.
The electronic device 5 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 5 may include, but is not limited to, a processor 501 and a memory 502. It will be appreciated by those skilled in the art that fig. 5 is merely an example of the electronic device 5 and is not limiting of the electronic device 5 and may include more or fewer components than shown, or different components.
The processor 501 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
The memory 502 may be an internal storage unit of the electronic device 5, for example, a hard disk or a memory of the electronic device 5. The memory 502 may also be an external storage device of the electronic device 5, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the electronic device 5. Memory 502 may also include both internal storage units and external storage devices of electronic device 5. The memory 502 is used to store computer programs and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is illustrated. In practical applications, the above functions may be distributed among different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit, and the integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the present application may implement all or part of the flow of the method embodiments described above by instructing the relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the respective method embodiments. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on. It should be noted that the content of the computer-readable medium may be increased or decreased as appropriate according to the legislation and patent practice of the jurisdiction; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features thereof may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.
Claims (10)
1. A method for training a pre-trained language model using noise perturbations, comprising:
acquiring a training data set and a pre-training language model corresponding to a target task;
calculating noise disturbance corresponding to each parameter matrix in the pre-training language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix;
and optimizing bias terms and updated parameter matrices in the pre-training language model by utilizing the training data set based on the target task.
2. The method of claim 1, wherein the noise disturbance for each parameter matrix in the pre-trained language model is calculated by the formula:
noise_i = U(-λ/2, λ/2) · σ_i,
wherein noise_i is the noise disturbance corresponding to the i-th parameter matrix, U(-λ/2, λ/2) is uniformly distributed noise ranging from -λ/2 to λ/2, λ is a hyperparameter controlling the noise intensity in the pre-trained language model, and σ_i is the standard deviation of the data inside the i-th parameter matrix.
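For illustration only, the per-matrix noise of claim 2 can be realized with a short PyTorch routine in the spirit of the NoisyTune reference cited below; the function name, the default λ value, and the dimension-based matrix test are assumptions, not part of the claim.

```python
import torch

def add_matrix_noise(model, lam=0.15):
    """Add U(-λ/2, λ/2) · σ_i noise to each parameter matrix (λ = 0.15 is an assumed default)."""
    with torch.no_grad():
        for param in model.parameters():
            if param.dim() < 2:                            # skip bias terms and other vectors
                continue
            sigma = param.data.std()                       # σ_i: std of the matrix entries
            noise = (torch.rand_like(param) - 0.5) * lam   # uniform noise in (-λ/2, λ/2)
            param.data.add_(noise * sigma)                 # update the parameter matrix in place
    return model
```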
4. The method of claim 1, wherein optimizing bias terms and updated parameter matrices in the pre-trained language model using the training dataset based on the target task comprises:
dividing the training data set into a first training data set and a second training data set according to a first preset proportion, and carrying out multi-stage training on the pre-training language model:
freezing the parameter matrix in the pre-training language model, and optimizing the bias terms in the pre-training language model by using the first training data set based on the target task, so as to complete the first-stage training of the pre-training language model;
after the first-stage training is completed, unfreezing the parameter matrix in the pre-training language model, and optimizing the bias terms and the parameter matrix in the pre-training language model by using the second training data set based on the target task, so as to complete the second-stage training of the pre-training language model.
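A minimal sketch of the two-stage schedule recited in claim 4, assuming a PyTorch model whose parameter matrices have dimension ≥ 2 and whose bias terms are one-dimensional; `compute_task_loss`, the optimizer choice, and the hyperparameters are placeholders, not part of the claim.

```python
import torch

def set_trainable(model, train_weights: bool, train_biases: bool):
    """Freeze or unfreeze parameter matrices (dim >= 2) and bias terms (dim < 2)."""
    for param in model.parameters():
        param.requires_grad = train_weights if param.dim() >= 2 else train_biases

def train_stage(model, dataset, epochs=1, lr=1e-4):
    """Run one training stage on the currently trainable parameters."""
    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(epochs):
        for batch in dataset:
            loss = compute_task_loss(model, batch)  # task-specific loss, assumed defined elsewhere
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

The first stage would call `set_trainable(model, train_weights=False, train_biases=True)` and then `train_stage(model, first_dataset)`; the second stage unfreezes the matrices with `set_trainable(model, train_weights=True, train_biases=True)` and trains on the second split.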
5. The method of claim 1, wherein optimizing bias terms and updated parameter matrices in the pre-trained language model using the training dataset based on the target task comprises:
dividing the training data set into a first training data set, a second training data set and a third training data set according to a second preset proportion, and carrying out multi-stage training on the pre-training language model:
freezing the parameter matrix in the pre-training language model, and optimizing the bias terms in the pre-training language model by using the first training data set based on the target task, so as to complete the first-stage training of the pre-training language model;
after the first-stage training is completed, unfreezing the parameter matrix in the pre-training language model, freezing the bias terms in the pre-training language model, and optimizing the parameter matrix in the pre-training language model by utilizing the second training data set based on the target task, so as to complete the second-stage training of the pre-training language model;
after the second-stage training is completed, unfreezing the bias terms in the pre-training language model, and optimizing the bias terms and the parameter matrix in the pre-training language model by using the third training data set based on the target task, so as to complete the third-stage training of the pre-training language model.
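Under the same assumptions, the three-stage schedule of claim 5 differs only in the freeze/unfreeze pattern; reusing the `set_trainable` and `train_stage` helpers sketched after claim 4, with `first_dataset`, `second_dataset` and `third_dataset` standing for the three splits:

```python
# Stage 1: freeze the parameter matrices, optimize only the bias terms.
set_trainable(model, train_weights=False, train_biases=True)
train_stage(model, first_dataset)

# Stage 2: unfreeze the matrices, freeze the bias terms.
set_trainable(model, train_weights=True, train_biases=False)
train_stage(model, second_dataset)

# Stage 3: unfreeze everything and optimize bias terms and matrices together.
set_trainable(model, train_weights=True, train_biases=True)
train_stage(model, third_dataset)
```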
6. The method of claim 1, wherein optimizing bias terms and updated parameter matrices in the pre-trained language model using the training dataset based on the target task comprises:
determining a data volume of the training data set;
training the pre-training language model according to the data volume:
when the data volume is smaller than a first preset size, freezing the updated parameter matrix in the pre-training language model, and optimizing the bias terms in the pre-training language model by utilizing the training data set based on the target task;
and when the data volume is not smaller than the first preset size, optimizing the bias terms and the updated parameter matrices in the pre-training language model by utilizing the training data set based on the target task.
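The size-dependent branch of claim 6 can be expressed with the same helpers; the threshold value below is purely illustrative.

```python
def finetune_by_data_volume(model, dataset, size_threshold=1000):
    """Pick the tuning strategy from the data volume (threshold is an assumed example value)."""
    if len(dataset) < size_threshold:
        # Small dataset: keep the (noise-updated) parameter matrices frozen, tune only bias terms.
        set_trainable(model, train_weights=False, train_biases=True)
    else:
        # Sufficient data: optimize bias terms and the updated parameter matrices together.
        set_trainable(model, train_weights=True, train_biases=True)
    train_stage(model, dataset)
```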
7. The method of claim 1, wherein prior to obtaining the pre-trained language model corresponding to the target task, the method further comprises:
Sequentially connecting a plurality of linear layers and nonlinear activation functions to obtain a feedforward layer;
sequentially connecting an embedding layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full-connection layer and a classification layer to obtain a language model;
and pre-training the language model based on the target task to obtain the pre-training language model.
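One way to read the layer sequence of claim 7 is the following PyTorch module; all sizes (vocabulary, model width, heads, classes), the GELU activation, and the mean pooling before the classifier are assumptions for the sketch rather than something the claim specifies.

```python
import torch
import torch.nn as nn

class LanguageModel(nn.Module):
    """Embedding -> multi-head attention -> residual + norm -> feed-forward
    -> residual + norm -> full connection -> classification (sizes assumed)."""
    def __init__(self, vocab_size=30000, d_model=512, n_heads=8, d_ff=2048, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Feed-forward layer: linear layers connected through a nonlinear activation.
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        self.fc = nn.Linear(d_model, d_model)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)        # residual connection followed by normalization
        x = self.norm2(x + self.ffn(x))     # residual connection followed by normalization
        x = self.fc(x).mean(dim=1)          # pool over the token dimension before classifying
        return self.classifier(x)
```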
8. The method according to claim 1, wherein the method further comprises:
sequentially connecting a plurality of linear layers and nonlinear activation functions to obtain a feedforward layer;
sequentially connecting an embedding layer, a multi-head attention network, a residual layer, a normalization layer, a feedforward layer, a residual layer, a normalization layer, a full-connection layer and a classification layer to obtain a language network, and serially connecting a plurality of language networks to obtain a language model;
acquiring a training data set corresponding to a target task;
calculating noise disturbance corresponding to each parameter matrix in the language model, and updating the parameter matrix according to the noise disturbance corresponding to each parameter matrix;
and optimizing bias terms and updated parameter matrices in the language model by utilizing the training data set based on the target task.
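Claim 8 stacks several such language networks in series before applying the same noise procedure; a compact reading of that stacking, with a single embedding at the input and a single classifier at the output (an assumption, since the claim recites the full layer list per network), might look like:

```python
import torch.nn as nn

class StackedLanguageModel(nn.Module):
    """Several transformer-style blocks connected in series (depth and sizes assumed)."""
    def __init__(self, n_blocks=6, vocab_size=30000, d_model=512, n_heads=8,
                 d_ff=2048, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True)
            for _ in range(n_blocks))
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        for block in self.blocks:           # language networks connected in series
            x = block(x)
        return self.classifier(x.mean(dim=1))
```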
9. The method according to claim 1, wherein the method further comprises:
acquiring a training data set and a pre-training language model corresponding to a target task;
calculating a noise disturbance corresponding to each network parameter of each network layer in the pre-training language model, and updating each network parameter according to its corresponding noise disturbance;
and optimizing updated network parameters in the pre-training language model by utilizing the training data set based on the target task so as to complete training of the pre-training language model.
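Claim 9 generalizes the perturbation from parameter matrices to every network parameter of every layer; a sketch under the same λ-scaling assumption as before:

```python
import torch

def add_parameter_noise(model, lam=0.15):
    """Perturb every network parameter with U(-λ/2, λ/2) noise scaled by that
    parameter's own standard deviation (λ = 0.15 is an assumed default)."""
    with torch.no_grad():
        for param in model.parameters():
            if param.numel() < 2:            # std is undefined for single-element parameters
                continue
            sigma = param.data.std()
            noise = (torch.rand_like(param) - 0.5) * lam
            param.data.add_(noise * sigma)
    return model
```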
10. An apparatus for training a pre-trained language model using noise perturbations, comprising:
an acquisition module configured to acquire a training data set and a pre-training language model corresponding to a target task;
a computing module configured to compute a noise disturbance corresponding to each parameter matrix in the pre-training language model and to update the parameter matrix according to the noise disturbance corresponding to each parameter matrix; and
a training module configured to optimize bias terms and updated parameter matrices in the pre-training language model using the training dataset based on the target task.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310614779.5A CN116362351B (en) | 2023-05-29 | 2023-05-29 | Method and device for training pre-training language model by using noise disturbance |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310614779.5A CN116362351B (en) | 2023-05-29 | 2023-05-29 | Method and device for training pre-training language model by using noise disturbance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116362351A true CN116362351A (en) | 2023-06-30 |
CN116362351B CN116362351B (en) | 2023-09-26 |
Family
ID=86939890
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310614779.5A Active CN116362351B (en) | 2023-05-29 | 2023-05-29 | Method and device for training pre-training language model by using noise disturbance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116362351B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116522152A (en) * | 2023-07-05 | 2023-08-01 | 深圳须弥云图空间科技有限公司 | Translation model training method and device based on back translation |
CN116595130A (en) * | 2023-07-18 | 2023-08-15 | 深圳须弥云图空间科技有限公司 | Corpus expansion method and device under multiple tasks based on small language model |
CN116595385A (en) * | 2023-07-18 | 2023-08-15 | 深圳须弥云图空间科技有限公司 | Composition generation model training method and device |
CN116603249A (en) * | 2023-07-19 | 2023-08-18 | 深圳须弥云图空间科技有限公司 | Training method of large language model applied to role playing reasoning game |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109919320A (en) * | 2019-01-23 | 2019-06-21 | 西北工业大学 | Triplet online learning methods based on Semantic hierarchy |
US20190355366A1 (en) * | 2018-05-18 | 2019-11-21 | Emotech Ltd | Speaker recognition |
CN112070010A (en) * | 2020-09-08 | 2020-12-11 | 长沙理工大学 | Pedestrian re-recognition method combining multi-loss dynamic training strategy to enhance local feature learning |
CN112183468A (en) * | 2020-10-27 | 2021-01-05 | 南京信息工程大学 | Pedestrian re-identification method based on multi-attention combined multi-level features |
CN113052324A (en) * | 2021-03-24 | 2021-06-29 | 支付宝(杭州)信息技术有限公司 | User abnormal pattern recognition method, device and equipment |
CN113111663A (en) * | 2021-04-28 | 2021-07-13 | 东南大学 | Abstract generation method fusing key information |
CN113468854A (en) * | 2021-06-24 | 2021-10-01 | 浙江华巽科技有限公司 | Multi-document automatic abstract generation method |
US20210319176A1 (en) * | 2020-04-13 | 2021-10-14 | Capital One Services, Llc | Efficient automatic punctuation with robust inference |
CN114818902A (en) * | 2022-04-21 | 2022-07-29 | 浪潮云信息技术股份公司 | Text classification method and system based on knowledge distillation |
CN114972904A (en) * | 2022-04-18 | 2022-08-30 | 北京理工大学 | Zero sample knowledge distillation method and system based on triple loss resistance |
CN115734029A (en) * | 2022-11-07 | 2023-03-03 | 中国电信股份有限公司 | Terminal suitability judgment method and device, electronic equipment and storage medium |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190355366A1 (en) * | 2018-05-18 | 2019-11-21 | Emotech Ltd | Speaker recognition |
CN109919320A (en) * | 2019-01-23 | 2019-06-21 | 西北工业大学 | Triplet online learning methods based on Semantic hierarchy |
US20210319176A1 (en) * | 2020-04-13 | 2021-10-14 | Capital One Services, Llc | Efficient automatic punctuation with robust inference |
CN112070010A (en) * | 2020-09-08 | 2020-12-11 | 长沙理工大学 | Pedestrian re-recognition method combining multi-loss dynamic training strategy to enhance local feature learning |
CN112183468A (en) * | 2020-10-27 | 2021-01-05 | 南京信息工程大学 | Pedestrian re-identification method based on multi-attention combined multi-level features |
CN113052324A (en) * | 2021-03-24 | 2021-06-29 | 支付宝(杭州)信息技术有限公司 | User abnormal pattern recognition method, device and equipment |
CN113111663A (en) * | 2021-04-28 | 2021-07-13 | 东南大学 | Abstract generation method fusing key information |
CN113468854A (en) * | 2021-06-24 | 2021-10-01 | 浙江华巽科技有限公司 | Multi-document automatic abstract generation method |
CN114972904A (en) * | 2022-04-18 | 2022-08-30 | 北京理工大学 | Zero sample knowledge distillation method and system based on triple loss resistance |
CN114818902A (en) * | 2022-04-21 | 2022-07-29 | 浪潮云信息技术股份公司 | Text classification method and system based on knowledge distillation |
CN115734029A (en) * | 2022-11-07 | 2023-03-03 | 中国电信股份有限公司 | Terminal suitability judgment method and device, electronic equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
CHUHAN WU ET AL: "NoisyTune: A Little Noise Can Help You Finetune Pretrained Language Models Better", pages 1-6, Retrieved from the Internet <URL:https://arxiv.org/pdf/2202.12024.pdf> *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116522152A (en) * | 2023-07-05 | 2023-08-01 | 深圳须弥云图空间科技有限公司 | Translation model training method and device based on back translation |
CN116522152B (en) * | 2023-07-05 | 2023-11-10 | 深圳须弥云图空间科技有限公司 | Translation model training method and device based on back translation |
CN116595130A (en) * | 2023-07-18 | 2023-08-15 | 深圳须弥云图空间科技有限公司 | Corpus expansion method and device under multiple tasks based on small language model |
CN116595385A (en) * | 2023-07-18 | 2023-08-15 | 深圳须弥云图空间科技有限公司 | Composition generation model training method and device |
CN116595385B (en) * | 2023-07-18 | 2023-10-03 | 深圳须弥云图空间科技有限公司 | Composition generation model training method and device |
CN116595130B (en) * | 2023-07-18 | 2024-02-20 | 深圳须弥云图空间科技有限公司 | Corpus expansion method and device under multiple tasks based on small language model |
CN116603249A (en) * | 2023-07-19 | 2023-08-18 | 深圳须弥云图空间科技有限公司 | Training method of large language model applied to role playing reasoning game |
CN116603249B (en) * | 2023-07-19 | 2023-10-03 | 深圳须弥云图空间科技有限公司 | Training method of large language model applied to role playing reasoning game |
Also Published As
Publication number | Publication date |
---|---|
CN116362351B (en) | 2023-09-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116362351B (en) | Method and device for training pre-training language model by using noise disturbance | |
US20230368024A1 (en) | Neural architecture search | |
US20210232929A1 (en) | Neural architecture search | |
EP3564863B1 (en) | Apparatus for executing lstm neural network operation, and operational method | |
WO2021089012A1 (en) | Node classification method and apparatus for graph network model, and terminal device | |
US20200410365A1 (en) | Unsupervised neural network training using learned optimizers | |
EP3362951B1 (en) | Neural random access machine | |
US20220004849A1 (en) | Image processing neural networks with dynamic filter activation | |
CN112116104B (en) | Method, device, medium and electronic equipment for automatically integrating machine learning | |
CN116403250A (en) | Face recognition method and device with shielding | |
CN116912635B (en) | Target tracking method and device | |
CN116595130B (en) | Corpus expansion method and device under multiple tasks based on small language model | |
CN113850298A (en) | Image identification method and device and related equipment | |
CN116542328B (en) | Knowledge distillation method and device for CTR prediction model | |
CN116629342A (en) | Model bypass optimization method and device | |
CN116610788A (en) | Method and device for training pre-training language model based on data volume of training data | |
CN116341640B (en) | Text processing model training method and device | |
TWI763975B (en) | System and method for reducing computational complexity of artificial neural network | |
CN116502640B (en) | Text characterization model training method and device based on context | |
CN117474037B (en) | Knowledge distillation method and device based on space distance alignment | |
CN116151232B (en) | Method and device for generating model by multi-stage training text title | |
CN116306791A (en) | Text processing method and device for improving self-attention model | |
CN118504658A (en) | Pre-training federal learning fine tuning method, system, electronic equipment and storage medium | |
CN116628204A (en) | Method and device for training text classification model based on training data volume | |
CN118333125A (en) | Fine tuning training method and device for image generation model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |