Disclosure of Invention
One or more embodiments of the present disclosure describe a training method and apparatus for a neural network model for question-answer matching, which can reduce resource consumption and increase processing speed while still accurately identifying user questions.
In a first aspect, a training method for a neural network model for question-answer matching is provided, the method comprising:
acquiring each user question in a sample set and a classification label corresponding to each user question;
predicting a first probability score of each user question on each category by using a trained first neural network model, wherein the number of layers of the first neural network model is N;
predicting a second probability score of each user question on each category by using a second neural network model to be trained, wherein the number of layers of the second neural network model is M, and M < N;
obtaining a first loss function according to the second probability score and the first probability score;
obtaining a second loss function according to the second probability score and the classification label of each user question;
combining the first loss function and the second loss function to obtain a total loss function;
and training the second neural network model according to the total loss function to obtain a preliminarily trained second neural network model.
In one possible implementation, the first neural network model is pre-trained by:
training the first neural network model by taking each user question and the classification label corresponding to each user question as a group of training samples, to obtain the trained first neural network model.
In a possible implementation manner, the obtaining a first loss function according to the second probability score and the first probability score includes:
dividing the second probability score by a preset parameter, and carrying out normalization processing to obtain a first output value of each user question;
obtaining a first loss function according to the first output value of each user question and the first probability score of each user question, wherein the first probability score is obtained by dividing the output of a preset layer of the first neural network model by the preset parameter and carrying out normalization processing.
In a possible implementation manner, the obtaining a second loss function according to the second probability score and the classification label of each user question includes:
normalizing the second probability score to obtain a second output value of each user question;
and obtaining a second loss function according to the second output value of each user question and the classification label of each user question.
In a possible implementation manner, the combining the first loss function and the second loss function to obtain a total loss function includes:
multiplying the first loss function by a first weight, multiplying the second loss function by a second weight, and summing the two to obtain a total loss function, wherein the first weight is greater than the second weight.
In one possible embodiment, after the preliminarily trained second neural network model is obtained, the method further includes:
taking each user question and the classification label corresponding to each user question as a group of training samples, and continuing to train the preliminarily trained second neural network model to obtain a second neural network model after continued training.
Further, the method further comprises:
predicting the category to which the current user question belongs by using the second neural network model after continued training.
In one possible implementation manner, the second neural network model to be trained is a pre-trained contextual omnidirectional prediction model, and the pre-training tasks of the second neural network model include two tasks: cloze filling (fill-in-the-blank) and next-sentence judgment.
In one possible implementation, the number of layers of the second neural network model is 2.
In a second aspect, there is provided a training apparatus for a neural network model for question-answer matching, the apparatus comprising:
the acquisition unit is used for acquiring each user question in the sample set and the classification label corresponding to each user question;
the first prediction unit is used for predicting first probability scores of all user questions on all classifications by using a trained first neural network model, wherein the number of layers of the first neural network model is N;
the second prediction unit is used for predicting second probability scores of all user questions on all classifications by using a second neural network model to be trained, wherein the number of layers of the second neural network model is M, and M < N;
the first comparison unit is used for obtaining a first loss function according to the second probability score predicted by the second prediction unit and the first probability score predicted by the first prediction unit;
the second comparison unit is used for obtaining a second loss function according to the second probability score predicted by the second prediction unit and the classification labels of the user questions acquired by the acquisition unit;
the combining unit is used for combining the first loss function obtained by the first comparing unit with the second loss function obtained by the second comparing unit to obtain a total loss function;
and the first training unit is used for training the second neural network model according to the total loss function obtained by the combining unit, to obtain a preliminarily trained second neural network model.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, there is provided a computing device comprising a memory having executable code stored therein and a processor which, when executing the executable code, implements the method of the first aspect.
According to the method and the apparatus provided by the embodiments of the present specification, unlike the common way of training a question-answer matching model, the prediction result of the trained first neural network model is utilized when training the second neural network model, wherein the first neural network model has a complex structure relative to the second neural network model. By introducing the prediction result of the first neural network model to induce the training of the second neural network model, knowledge migration is realized, so that the second neural network model can reduce resource consumption and increase processing speed while still accurately identifying user questions. That is, this way of training the question-answer matching model saves a large amount of computing resources, with a model effect basically no different from before.
Detailed Description
The following describes the scheme provided in the present specification with reference to the drawings.
Fig. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in the present specification. This implementation scenario involves the training of a neural network model for question-answer matching, which may also be referred to as a question-answer matching model. For a long time, the accuracy with which a question-answer matching model identifies user questions and its processing speed have been a pair of contradictory goals. If a big model (Big Model) with more layers is used, the accuracy of user question recognition is higher, but the processing speed is low; if a small model (Small Model) with fewer layers is used, the processing speed is high, but the accuracy of user question recognition is low. A question-answer matching model is generally applied to answering user questions in real time by a customer-service robot, so both the accuracy and the processing speed of user question identification are required to be high. The embodiments of the present specification provide a solution to this contradiction: the idea of knowledge distillation is introduced into the training process of the question-answer matching model, so that a trained small model can meet the requirements for both accuracy and processing speed in identifying user questions.
Knowledge distillation achieves knowledge migration by introducing a soft target associated with the teacher network as part of the total loss function (total loss) to induce the training of the student network. The teacher network is complex but has superior inference performance; the student network is simple and has low complexity.
As shown in fig. 1, dividing the predicted output of the teacher network (i.e., the large model) by a preset parameter T and then normalizing it (e.g., by a softmax transformation) yields a softened probability distribution, i.e., the soft target, e.g., s_i = [0.1, 0.6, …, 0.1]. The larger the value of the preset parameter T, the gentler the distribution; however, if the value of the preset parameter T is too large, the probabilities of incorrect classifications may be amplified and unnecessary noise may be introduced. The hard target is the true annotation of the sample, which can be represented by a one-hot vector, e.g., y_i = [0, 1, …, 0]. The total loss function (total loss) is designed as a weighted average of the cross entropies corresponding to the soft target and the hard target; the larger the weighting coefficient λ of the soft-target cross entropy, the more the migration induction depends on the contribution of the teacher network. This is necessary in the early stage of training, as it helps the student network identify simple samples more easily; in the later stage of training, however, the weight of the soft target should be reduced appropriately so that the true labels help identify difficult samples. In addition, the inference performance of the teacher network is generally superior to that of the student network, the model capacity is not particularly limited, and the higher the inference precision of the teacher network, the more it benefits the learning of the student network.
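To make the role of the preset parameter T concrete, the following is a minimal sketch (in Python with PyTorch, which the patent does not prescribe; the logits and temperature values are purely illustrative) showing how dividing by T before the softmax softens the output distribution:

```python
# A minimal temperature-softening sketch; PyTorch and all values are assumptions.
import torch
import torch.nn.functional as F

logits = torch.tensor([1.0, 4.0, 0.5, 2.0])  # hypothetical teacher output for one question

for T in (1.0, 2.0, 5.0):
    soft_target = F.softmax(logits / T, dim=0)  # divide by T, then normalize
    print(f"T={T}: {soft_target.tolist()}")
# Larger T flattens the distribution: at T=1 the dominant class stands out
# clearly, while at T=5 the probability mass is spread far more evenly.
```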
According to the embodiments of the present specification, through knowledge migration, a small model more suitable for inference is obtained with the help of the trained large model. The trained small model can then be used to perform question-answer matching on a user question, that is, to predict the category to which the user question belongs. It is understood that the input of the model may be a vector representation of the user question.
Fig. 2 illustrates a flowchart of a training method for a neural network model for question-answer matching according to one embodiment, which may be based on the application scenario illustrated in fig. 1. As shown in fig. 2, the training method for the neural network model for question-answer matching in this embodiment includes the following steps:
step 21, acquiring each user question in a sample set and a classification label corresponding to each user question;
step 22, predicting a first probability score of each user question on each category by using a trained first neural network model, wherein the number of layers of the first neural network model is N;
step 23, predicting a second probability score of each user question on each category by using a second neural network model to be trained, wherein the number of layers of the second neural network model is M, and M < N;
step 24, obtaining a first loss function according to the second probability score and the first probability score;
step 25, obtaining a second loss function according to the second probability score and the classification label of each user question;
step 26, combining the first loss function and the second loss function to obtain a total loss function;
step 27, training the second neural network model according to the total loss function to obtain a preliminarily trained second neural network model.
Specific implementations of the above steps are described below.
First, in step 21, each user question in the sample set and the classification label corresponding to each user question are obtained. The classification label can be understood as the hard target in the application scenario shown in fig. 1; when there are multiple classifications, the classification label corresponding to each user question is uniquely determined. For example, the classification labels corresponding to the respective user questions may be as shown in Table one.
Table one: Correspondence between user questions and classification labels

| User question   | Classification label |
|-----------------|----------------------|
| User question 1 | Classification 1     |
| User question 2 | Classification 1     |
| User question 3 | Classification 2     |
| User question 4 | Classification 3     |
Referring to Table one, the classification labels corresponding to user question 1 and user question 2 are both classification 1; that is, different user questions may correspond to the same classification label, but the classification label corresponding to any one user question is unique.
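For illustration, the sample set of Table one can be represented as simple (question, label index) pairs; the following Python sketch is an assumption about the data layout, not something the patent prescribes:

```python
# Hypothetical representation of the sample set from Table one.
samples = [
    ("user question 1", 0),  # classification 1
    ("user question 2", 0),  # classification 1 (same label, different question)
    ("user question 3", 1),  # classification 2
    ("user question 4", 2),  # classification 3
]
questions, labels = zip(*samples)  # each user question maps to exactly one label
```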
Next, at step 22, a first probability score for each user question over each category is predicted using a trained first neural network model, where the number of layers of the first neural network model is N. It is understood that the first neural network model may be understood as a large model in the application scenario shown in fig. 1, and the first probability score may be understood as a soft target in the application scenario shown in fig. 1.
In one example, the first neural network model is pre-trained by:
training the first neural network model by taking each user question and the classification label corresponding to each user question as a group of training samples, to obtain the trained first neural network model.
In one example, the first neural network model uses a complete bidirectional encoder representations from transformers (BERT) model to classify user questions and output the knowledge points that each user question matches.
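As one possible concretization, the trained first neural network model could be a fine-tuned BERT classifier. The sketch below uses the Hugging Face transformers library; the library, checkpoint name, and label count are assumptions, not part of the patent:

```python
# Teacher sketch: a full BERT classifier producing the first probability scores.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # hypothetical checkpoint
teacher = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=3)  # fine-tuned weights would be loaded here

teacher.eval()
with torch.no_grad():
    inputs = tokenizer(["user question 1"], return_tensors="pt")
    teacher_logits = teacher(**inputs).logits  # pre-normalization score per category
```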
Then, in step 23, a second probability score of each user question on each category is predicted by using the second neural network model to be trained, where the number of layers of the second neural network model is M, and M < N. It will be appreciated that the second neural network model may be understood as the small model in the application scenario shown in fig. 1, and the second probability score may be understood as the prediction result of the second neural network model to be trained; since the second neural network model has not yet been trained, the second probability score may be less accurate than the first probability score.
In one example, the second neural network model to be trained is a pre-trained contextual omnidirectional prediction model, such as a BERT model, and the pre-training tasks of the second neural network model include two tasks: cloze filling (fill-in-the-blank) and next-sentence judgment.
In one example, the number of layers of the second neural network model is 2, e.g., a 2-layer BERT model, whose consumption of computing resources is about one sixth that of the complete BERT model.
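A 2-layer student can be built by shrinking the model configuration. The following sketch again assumes the Hugging Face transformers library; the hyperparameters are illustrative:

```python
# Student sketch: a 2-layer BERT built from a reduced configuration.
from transformers import BertConfig, BertForSequenceClassification

student_config = BertConfig(num_hidden_layers=2, num_labels=3)
student = BertForSequenceClassification(student_config)
# In practice the student would start from weights pre-trained on the cloze
# and next-sentence tasks mentioned above, rather than random initialization.
```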
Next, in step 24, a first loss function is obtained according to the second probability score and the first probability score. It is to be appreciated that the first loss function may employ, but is not limited to, a cross entropy loss function (cross entropy loss).
Referring to the application scenario shown in fig. 1, in one example, the second probability score is divided by a preset parameter and then normalized to obtain a first output value of each user question; a first loss function is then obtained according to the first output value of each user question and the first probability score of each user question, where the first probability score is obtained by dividing the output of a preset layer of the first neural network model by the preset parameter and performing normalization.
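In code, this first loss could look as follows; this is a sketch under the assumption of PyTorch, with T standing for the preset parameter:

```python
# First-loss sketch: cross entropy between temperature-softened distributions.
import torch
import torch.nn.functional as F

def first_loss(student_logits, teacher_logits, T=2.0):  # T is the preset parameter
    soft_target = F.softmax(teacher_logits / T, dim=-1)       # softened first probability score
    first_output = F.log_softmax(student_logits / T, dim=-1)  # first output value (log domain)
    return -(soft_target * first_output).sum(dim=-1).mean()   # cross entropy
```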
And in step 25, a second loss function is obtained according to the second probability score and the classification label of each user question. It is understood that the second loss function may employ, but is not limited to, a cross entropy loss function.
Referring to the application scenario shown in fig. 1, in one example, the second probability score is normalized to obtain a second output value of each user question; a second loss function is then obtained according to the second output value of each user question and the classification label of each user question.
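A corresponding sketch of the second loss, again assuming PyTorch:

```python
# Second-loss sketch: cross entropy against the hard classification labels.
import torch
import torch.nn.functional as F

def second_loss(student_logits, labels):
    # F.cross_entropy performs the softmax normalization internally,
    # so it consumes the raw second probability scores directly.
    return F.cross_entropy(student_logits, labels)
```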
And then, in step 26, the first loss function and the second loss function are combined to obtain a total loss function. It is understood that the manner of combining may be, but is not limited to, a weighted summation manner.
In one example, the first loss function is multiplied by a first weight, the second loss function is multiplied by a second weight, and the two are summed to obtain a total loss function, wherein the first weight is greater than the second weight.
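Combining the two losses, a self-contained sketch of the total loss follows; the weights and T are illustrative, chosen only so that the first weight exceeds the second as the embodiment requires:

```python
# Total-loss sketch; first_weight > second_weight per the embodiment above.
import torch.nn.functional as F

def total_loss(student_logits, teacher_logits, labels,
               first_weight=0.7, second_weight=0.3, T=2.0):
    # first loss: cross entropy between temperature-softened distributions
    soft_target = F.softmax(teacher_logits / T, dim=-1)
    l1 = -(soft_target * F.log_softmax(student_logits / T, dim=-1)).sum(-1).mean()
    # second loss: cross entropy against the hard classification labels
    l2 = F.cross_entropy(student_logits, labels)
    return first_weight * l1 + second_weight * l2
```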
Finally, in step 27, the second neural network model is trained according to the total loss function, so as to obtain a preliminarily trained second neural network model. It will be appreciated that the model can be trained and evaluated by minimizing the total loss function.
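A minimal training-loop sketch for step 27, building on the student and total_loss sketches above; the optimizer, learning rate, and the loader (assumed to yield tokenized inputs, labels, and pre-computed teacher logits) are assumptions:

```python
# Distillation training-loop sketch; all hyperparameters are illustrative.
import torch

optimizer = torch.optim.Adam(student.parameters(), lr=2e-5)

for input_batch, label_batch, teacher_logit_batch in loader:  # hypothetical DataLoader
    student_logits = student(**input_batch).logits
    loss = total_loss(student_logits, teacher_logit_batch, label_batch)
    optimizer.zero_grad()
    loss.backward()   # train by minimizing the total loss
    optimizer.step()
```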
In one example, after step 27, training is continued on the preliminarily trained second neural network model using each user question and the classification label corresponding to each user question as a group of training samples, to obtain a second neural network model after continued training.
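The continued-training stage reuses the same loop but drops the soft target, training only against the hard classification labels; a sketch under the same assumptions as above:

```python
# Continued-training sketch: fine-tune the preliminarily trained student
# on (user question, classification label) pairs only.
import torch.nn.functional as F

for input_batch, label_batch, _ in loader:  # same hypothetical loader as above
    student_logits = student(**input_batch).logits
    loss = F.cross_entropy(student_logits, label_batch)  # hard labels only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```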
It will be appreciated that the total loss function is designed as a weighted average of the cross entropies of the soft target and the hard target, where a larger weighting coefficient for the soft-target cross entropy indicates that the migration induction depends more on the contribution of the teacher network. This is necessary in the early stage of training, helping the student network identify simple samples more easily; in the later stage of training, however, the weight of the soft target should be reduced appropriately, allowing the classification labels to help identify difficult samples.
Further, the category to which the current user question belongs is predicted by using the second neural network model after continued training.
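At inference time, this prediction reduces to an argmax over the student's normalized scores; a sketch reusing the tokenizer and student from the sketches above:

```python
# Inference sketch: predict the classification of a current user question.
import torch

student.eval()
with torch.no_grad():
    inputs = tokenizer(["current user question"], return_tensors="pt")
    probs = torch.softmax(student(**inputs).logits, dim=-1)
    predicted_category = probs.argmax(dim=-1).item()  # index of the matched classification
```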
According to the method provided by the embodiments of the present specification, unlike the common way of training a question-answer matching model, the prediction result of the trained first neural network model is utilized when training the second neural network model, wherein the first neural network model has a complex structure relative to the second neural network model. By introducing the prediction result of the first neural network model to induce the training of the second neural network model, knowledge migration is realized, so that the second neural network model can reduce resource consumption and increase processing speed while still accurately identifying user questions. That is, this way of training the question-answer matching model saves a large amount of computing resources, with a model effect basically no different from before.
According to an embodiment of another aspect, there is further provided a training apparatus for a neural network model for question-answer matching, which is used for executing the training method for the neural network model for question-answer matching provided in the embodiments of the present specification. FIG. 3 illustrates a schematic block diagram of a training apparatus for a neural network model for question-answer matching, according to one embodiment. As shown in fig. 3, the apparatus 300 includes:
an obtaining unit 31, configured to obtain each user question in the sample set and a classification label corresponding to each user question;
a first prediction unit 32, configured to predict a first probability score of each user question on each category by using a trained first neural network model, where the number of layers of the first neural network model is N;
a second prediction unit 33, configured to predict a second probability score of each user question on each category by using a second neural network model to be trained, where the number of layers of the second neural network model is M, M < N;
a first comparing unit 34, configured to obtain a first loss function according to the second probability score predicted by the second predicting unit 33 and the first probability score predicted by the first predicting unit 32;
a second comparing unit 35, configured to obtain a second loss function according to the second probability score predicted by the second predicting unit 33 and the classification label of each user question acquired by the obtaining unit 31;
a combining unit 36, configured to combine the first loss function obtained by the first comparing unit 34 with the second loss function obtained by the second comparing unit 35 to obtain a total loss function;
a first training unit 37, configured to train the second neural network model according to the total loss function obtained by the combining unit 36, to obtain a preliminarily trained second neural network model.
Optionally, as an embodiment, the first neural network model is pre-trained by:
training the first neural network model by taking each user question and the classification label corresponding to each user question as a group of training samples, to obtain the trained first neural network model.
Optionally, as an embodiment, the first comparing unit 34 is specifically configured to:
dividing the second probability score by a preset parameter, and carrying out normalization processing to obtain a first output value of each user question;
obtaining a first loss function according to the first output value of each user question and the first probability score of each user question, where the first probability score is obtained by dividing the output of a preset layer of the first neural network model by the preset parameter and performing normalization.
Optionally, as an embodiment, the second comparing unit 35 is specifically configured to:
normalizing the second probability score to obtain a second output value of each user question;
and obtaining a second loss function according to the second output value of each user question and the classification label of each user question.
Optionally, as an embodiment, the combining unit 36 is specifically configured to multiply the first loss function obtained by the first comparing unit 34 by a first weight, multiply the second loss function obtained by the second comparing unit 35 by a second weight, and sum the two to obtain a total loss function, where the first weight is greater than the second weight.
Optionally, as an embodiment, the apparatus further includes:
the second training unit is configured to, after the first training unit obtains the preliminarily trained second neural network model, use each user question and the classification label corresponding to each user question obtained by the obtaining unit 31 as a group of training samples, and continue training the preliminarily trained second neural network model obtained by the first training unit, to obtain a second neural network model after continued training.
Further, the apparatus further comprises:
and the predicting unit is configured to predict the category to which the current user question belongs by using the second neural network model after continued training obtained by the second training unit.
Optionally, as an embodiment, the second neural network model to be trained is a pre-trained contextual omnidirectional prediction model, and the pre-training tasks of the second neural network model include two tasks: cloze filling (fill-in-the-blank) and next-sentence judgment.
Alternatively, as an embodiment, the number of layers of the second neural network model is 2.
According to the apparatus provided by the embodiments of the present specification, unlike the common way of training a question-answer matching model, the prediction result of the trained first neural network model is utilized when training the second neural network model, wherein the first neural network model has a complex structure relative to the second neural network model. By introducing the prediction result of the first neural network model to induce the training of the second neural network model, knowledge migration is realized, so that the second neural network model can reduce resource consumption and increase processing speed while still accurately identifying user questions. That is, this way of training the question-answer matching model saves a large amount of computing resources, with a model effect basically no different from before.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device including a memory having executable code stored therein and a processor that, when executing the executable code, implements the method described in connection with fig. 2.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention in further detail, and are not to be construed as limiting the scope of the invention, but are merely intended to cover any modifications, equivalents, improvements, etc. based on the teachings of the invention.