CN112686046A - Model training method, device, equipment and computer readable medium - Google Patents

Model training method, device, equipment and computer readable medium

Info

Publication number
CN112686046A
CN112686046A
Authority
CN
China
Prior art keywords
model
recognition result
loss function
probability distribution
training
Prior art date
Legal status
Pending
Application number
CN202110014739.8A
Other languages
Chinese (zh)
Inventor
白强伟
黄艳香
Current Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Original Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority to CN202110014739.8A
Publication of CN112686046A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to a model training method, device, equipment and computer readable medium. The method comprises the following steps: inputting a training sample into a first model, and acquiring a first recognition result output by a full connection layer of the first model; inputting the training sample and the first recognition result into a second model, and acquiring a second recognition result output by a full connection layer of the second model, wherein the parameter quantity of the first model is greater than that of the second model, and the recognition accuracy of the first model is greater than that of the second model; constructing a target loss function by using the first recognition result, the second recognition result and the pre-labeled data of the training sample; and adjusting parameters in the second model by using the target loss function so that the recognition accuracy of the second model reaches a target threshold, where the recognition of the second model is regarded as accurate when its output result is the same as the output result of the first model. The method and the device solve the problem of low recognition efficiency caused by a large number of model parameters.

Description

Model training method, device, equipment and computer readable medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a model training method, apparatus, device, and computer readable medium.
Background
Named entity recognition is a basic task in natural language processing, and aims to recognize and classify the components of a text that represent named entities, such as person names, place names, organization names, and the like. As a basic task, named entity recognition has a wide range of application scenarios. For example, named entity recognition can be used to construct knowledge graphs, to improve the accuracy of search engines, and to help recommendation systems make more accurate recommendations. Therefore, the main optimization direction of recent industrial and academic work on the named entity recognition task has been to increase accuracy. With the rise of models based on pre-trained language models, such as BERT-CRF, the accuracy of named entity recognition has become sufficient for real-world deployment. However, the BERT-CRF model has a large number of parameters, which makes named entity recognition inefficient. As a result, small companies with scarce resources have to increase their capital investment to meet application requirements. In addition, various kinds of intelligent hardware have limited storage and computing resources, so such a model is difficult to deploy on intelligent hardware.
At present, the related-art methods for addressing the large parameter quantity of named entity recognition models follow two ideas. The first idea is to directly train a model with a small number of parameters using labeled samples, for example a single-layer BiLSTM-CRF model; the second idea is to design a new named entity recognition model on top of a compressed pre-trained language model, such as a named entity recognition model based on DistilBERT or TinyBERT. The model of the first idea often performs poorly and cannot meet the requirements of real-world applications, while the parameter quantity of the final model of the second idea depends on the parameter quantity of the pre-trained language model used, so the parameter quantity cannot be well controlled and the approach is not flexible enough.
For the problems of the large parameter quantity and low recognition efficiency of the BERT-CRF model, no effective solution has yet been proposed.
Disclosure of Invention
The application provides a model training method, a device, equipment and a computer readable medium, which are used for solving the technical problems of large parameter quantity and low recognition efficiency of a BERT-CRF model.
According to an aspect of an embodiment of the present application, there is provided a model training method, including: inputting a training sample into a first model, and acquiring a first recognition result output by a full connection layer of the first model; inputting the training sample and the first recognition result into a second model, and acquiring a second recognition result output by a full connection layer of the second model, wherein the parameter quantity of the first model is greater than that of the second model, and the recognition accuracy of the first model is greater than that of the second model; constructing a target loss function by using the first recognition result, the second recognition result and the pre-labeled data of the training sample; and adjusting parameters in the second model by using the target loss function so that the recognition accuracy of the second model reaches a target threshold, where the recognition of the second model is regarded as accurate when its output result is the same as the output result of the first model.
Optionally, inputting the training sample into the first model, and obtaining the first recognition result output by the fully-connected layer of the first model includes: inputting a training sample into a first model, and acquiring a first non-standardized probability distribution output by a full connection layer of the first model; the first non-normalized probability distribution is converted to a first normalized probability distribution using a normalization function, and the first recognition result includes the first normalized probability distribution.
Optionally, the inputting the training sample and the first recognition result into the second model, and obtaining the second recognition result output by the fully-connected layer of the second model includes: inputting the training sample and the first standardized probability distribution into a second model, and acquiring a second non-standardized probability distribution output by a full connection layer of the second model; and converting the second non-standardized probability distribution into a second standardized probability distribution by adopting a normalization function, and determining a training classification label of the training sample by utilizing the second standardized probability distribution, wherein the second recognition result comprises the second standardized probability distribution and the training classification label.
Optionally, constructing the target loss function by using the first recognition result, the second recognition result, and the pre-labeled data of the training sample includes: constructing a first loss function using the first normalized probability distribution and the second normalized probability distribution, the first loss function representing a cross entropy between the first normalized probability distribution and the second normalized probability distribution; constructing a second loss function by utilizing the pre-classification labels and the training classification labels of the training samples, wherein the second loss function represents the cross entropy between the pre-classification labels and the training classification labels, and the pre-labeling data comprises the pre-classification labels; and adding the first loss function and the second loss function to obtain a target loss function.
Optionally, constructing the target loss function using the first recognition result and the second recognition result further includes: traversing each word in the training sample, and determining a target loss function of each word according to the output result of the fully-connected layer of each word in the first model and the second model and the pre-classification label of each word; and adding the target loss functions of all the words in the training sample, and dividing the sum by the number of the words in the training sample to obtain the integral target loss function of the training sample.
Optionally, adjusting parameters in the second model using the target loss function to bring the recognition accuracy of the second model to the target threshold comprises: obtaining a loss value determined by a target loss function, wherein the loss value is used for representing the difference of the identification accuracy between the first model and the second model; and adjusting parameters of the second model by using the loss value until the output precision of the second model reaches a target threshold value.
Optionally, after adjusting parameters in the second model by using the target loss function so that the recognition accuracy of the second model reaches the target threshold, the method further includes performing migration set training as follows: inputting the migration data set into the first model, and acquiring a third recognition result output by a full connection layer of the first model, wherein the migration data set is a data set without data annotation; inputting the migration data set and the third recognition result into the second model, and acquiring a fourth recognition result output by a full connection layer of the second model; converting the third recognition result into a third standardized probability distribution by adopting a normalization function, and converting the fourth recognition result into a fourth standardized probability distribution by adopting the normalization function; constructing a soft label loss function by using the third standardized probability distribution and the fourth standardized probability distribution; and adjusting parameters of the second model by using the loss value determined by the soft label loss function until the output precision of the second model reaches the target threshold of the migration set.
Optionally, the parameters of the second model are adjusted by using the loss value determined by the soft label loss function until the output accuracy of the second model reaches the target threshold of the migration set, and the method further includes performing training set fine adjustment as follows: and under the condition of no first recognition result of the first model, inputting a training sample into the second model to train the second model so as to adjust the parameters of the second model, wherein the adjustment proportion of parameter adjustment on the second model under the condition of no first recognition result is smaller than the adjustment proportion of parameter adjustment on the second model under the condition of the first recognition result.
According to another aspect of the embodiments of the present application, there is provided a model training apparatus, including: a first training module, used for inputting a training sample into the first model and acquiring a first recognition result output by a full connection layer of the first model; a second training module, used for inputting the training sample and the first recognition result into a second model and acquiring a second recognition result output by a full connection layer of the second model, wherein the parameter quantity of the first model is greater than that of the second model, and the recognition accuracy of the first model is greater than that of the second model; a loss function construction module, used for constructing a target loss function by using the first recognition result, the second recognition result and the pre-labeled data of the training sample; and a parameter adjusting module, used for adjusting parameters in the second model by using the target loss function so that the recognition accuracy of the second model reaches a target threshold, where the recognition of the second model is regarded as accurate when its output result is the same as the output result of the first model.
According to another aspect of the embodiments of the present application, there is provided an electronic device, including a memory, a processor, a communication interface, and a communication bus, where the memory stores a computer program executable on the processor, and the memory and the processor communicate with each other through the communication bus and the communication interface, and the processor implements the steps of the method when executing the computer program.
According to another aspect of embodiments of the present application, there is also provided a computer readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the above-mentioned method.
Compared with the related art, the technical scheme provided by the embodiment of the application has the following advantages:
According to the technical scheme, a training sample is input into a first model, and a first recognition result output by a full connection layer of the first model is acquired; the training sample and the first recognition result are input into a second model, and a second recognition result output by a full connection layer of the second model is acquired, wherein the parameter quantity of the first model is greater than that of the second model, and the recognition accuracy of the first model is greater than that of the second model; a target loss function is constructed by using the first recognition result, the second recognition result and the pre-labeled data of the training sample; and parameters in the second model are adjusted by using the target loss function so that the recognition accuracy of the second model reaches a target threshold, where the recognition of the second model is regarded as accurate when its output result is the same as the output result of the first model. The present application performs knowledge distillation using the output of the full connection layer and reduces the output difference between the two models by constructing the target loss function, so that a model with fewer parameters can also achieve high output precision, thereby improving recognition efficiency and solving the problem of low recognition efficiency caused by a large number of model parameters.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the related art, the drawings needed in the description of the embodiments or the related art will be briefly described below; it is obvious that those skilled in the art can obtain other drawings from these drawings without any creative effort.
FIG. 1 is a diagram illustrating an alternative hardware environment for a model training method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of an alternative model training method provided in accordance with an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative first model according to an embodiment of the present application;
FIG. 4 is a diagram illustrating an alternative second model training effect provided in accordance with an embodiment of the present application;
FIG. 5 is a block diagram of an alternative model training apparatus provided in accordance with an embodiment of the present application;
fig. 6 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only for convenience of description of the present application and have no specific meaning in themselves. Thus, "module" and "component" may be used interchangeably.
In the related art, the methods for addressing the large parameter quantity of named entity recognition models follow two ideas. The first idea is to directly train a model with a small number of parameters using labeled samples, for example a single-layer BiLSTM-CRF model; the second idea is to design a new named entity recognition model on top of a compressed pre-trained language model, such as a named entity recognition model based on DistilBERT or TinyBERT. The model of the first idea often performs poorly and cannot meet the requirements of real-world applications, while the parameter quantity of the final model of the second idea depends on the parameter quantity of the pre-trained language model used, so the parameter quantity cannot be well controlled and the approach is not flexible enough.
The BERT-CRF model has a large number of parameters, which makes named entity recognition inefficient. As a result, small companies with scarce resources have to increase their capital investment to meet application requirements. In addition, various kinds of intelligent hardware have limited storage and computing resources, so such a model is difficult to deploy on intelligent hardware.
To address the problems mentioned in the background, according to an aspect of embodiments of the present application, an embodiment of a model training method is provided.
Alternatively, in the embodiment of the present application, the model training method may be applied to a hardware environment formed by the terminal 101 and the server 103 as shown in fig. 1. As shown in fig. 1, a server 103 is connected to a terminal 101 through a network, which may be used to provide services for the terminal or a client installed on the terminal, and a database 105 may be provided on the server or separately from the server, and is used to provide data storage services for the server 103, and the network includes but is not limited to: wide area network, metropolitan area network, or local area network, and the terminal 101 includes but is not limited to a PC, a cell phone, a tablet computer, and the like.
A model training method in this embodiment of the present application may be executed by the server 103, or may be executed by both the server 103 and the terminal 101, as shown in fig. 2, the method may include the following steps:
step S202, inputting a training sample into a first model, and acquiring a first recognition result output by a full connection layer of the first model;
step S204, inputting the training sample and the first recognition result into a second model, and acquiring a second recognition result output by a full connection layer of the second model, wherein the parameter quantity of the first model is greater than that of the second model, and the recognition accuracy of the first model is greater than that of the second model;
step S206, constructing a target loss function by using the first recognition result, the second recognition result and the pre-labeled data of the training sample;
step S208, adjusting parameters in the second model by using the target loss function so that the recognition accuracy of the second model reaches a target threshold, where the recognition of the second model is regarded as accurate when its output result is the same as the output result of the first model.
The model training method provided by the embodiment of the application can be applied to named entity recognition. Named entity recognition is a fundamental task in natural language processing tasks, and aims to recognize and classify components representing named entities in text, such as names of people, places, organizations, time, numbers, and the like.
The tag system for named entity recognition can adopt the BIO labeling scheme, that is, the position of an entity is represented by using "B", "I" and "O" as prefixes, and the category of the entity is represented by a self-defined tag. Specifically, "B" represents the starting position of a named entity, "I" represents the interior of a named entity, and "O" represents a non-named entity. A common named entity recognition task needs to identify person names, place names and organization names, and PER (person), LOC (location) and ORG (organization) can be used to represent the corresponding entity classes. Taking this as an example, there are 7 categories of entity tags: B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG and O, with examples shown in Table 1.
TABLE 1 (entity tag examples; reproduced in the original publication as an image and not available in this text)
In an embodiment of the present application, the training sample is a sample that has been labeled with categories in advance, and the labeled content is the pre-labeled data. Let the training sample be x = (x_1, x_2, ..., x_n) and the pre-labeled data of the training sample be y = (y_1, y_2, ..., y_n), where x_i is the i-th input in the sequence, y_i is the class label corresponding to x_i, and n is the length of the entire sequence. For example, using the tags of Table 1 above, for the training sample "Saudi team coach Perela." (沙特队教练佩雷拉。), the pre-labeling results are shown in Table 2.
TABLE 2 (pre-labeled tags for the example sentence; reproduced in the original publication as an image and not available in this text)
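As an illustration of the kind of pre-labeled sample that Table 2 describes, the snippet below builds a character-level BIO-labeled example in Python. The specific tag assignment is our assumption, since the table image itself is not reproduced; only the 沙/B-ORG pairing is confirmed later in the text.

```python
# A hypothetical BIO-labeled training sample in the spirit of Table 2.
# Only the first character's tag (沙 -> B-ORG) is confirmed by the surrounding description;
# the remaining tags are illustrative assumptions.
sample = {
    "text":   ["沙", "特", "队", "教", "练", "佩", "雷", "拉", "。"],
    "labels": ["B-ORG", "I-ORG", "I-ORG", "O", "O", "B-PER", "I-PER", "I-PER", "O"],
}
label_set = ["B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "O"]
label_ids = [label_set.index(tag) for tag in sample["labels"]]  # integer class labels y
```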
In the embodiment of the present application, the first model may be a BERT-CRF model. The BERT-CRF model is a named entity recognition model based on the pre-trained language model BERT, and its structure can be divided into four parts: an input representation layer, a BERT encoding layer, a full connection layer, and a CRF (Conditional Random Field) layer. The overall structure is shown in FIG. 3. Because the BERT-CRF model has a large number of parameters, named entity recognition with it is slow. The present application therefore provides a model training method based on knowledge distillation, with the first model as the teacher model and the second model as the student model: the student model learns the output of the teacher model, and a target loss function is constructed and back-propagated to reduce the difference between the two models, so that the student model can achieve high recognition accuracy with fewer parameters. The second model can be a BiLSTM-CRF model, either a single-layer BiLSTM-CRF model or a multi-layer BiLSTM-CRF model.
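For concreteness, the following is a minimal sketch of such a student model, assuming PyTorch: a character embedding, a single-layer BiLSTM, and the full connection (linear) layer whose output is used for distillation. The CRF layer is omitted and all hyperparameter values are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class BiLSTMStudent(nn.Module):
    """Sketch of the second (student) model; the CRF decoding layer is omitted for brevity."""
    def __init__(self, vocab_size: int, num_tags: int = 7, emb_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim // 2, num_layers=1,
                              batch_first=True, bidirectional=True)
        # Full connection layer: one non-normalized score (logit) per entity tag and token.
        self.fc = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> logits: (batch, seq_len, num_tags)
        emb = self.embedding(token_ids)
        hidden, _ = self.bilstm(emb)
        return self.fc(hidden)
```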
Knowledge distillation is carried out using the output of the full connection layer, because the full connection layer carries much more information than the CRF layer. For example, when the above sentence "Saudi team coach Perela." (沙特队教练佩雷拉。) is input into the BERT-CRF model as a training sample, the full connection layer output shown in Table 3 is obtained.
TABLE 3
(Columns correspond to the characters of the example sentence "沙特队教练佩雷拉。"; rows are the 7 entity tags; values are the non-normalized scores from the full connection layer.)

Tag      沙        特        队        教        练        佩        雷        拉        。
B-PER    -2.615    -0.694    0.1716    0.0479    0.7058    1.4434    -0.745    -0.736    0.0831
I-PER    -1.177    0.6383    -1.672    -0.645    0.1214    0.7381    2.1199    0.8464    -1.592
B-LOC    -0.943    -0.865    -1.37     -0.179    -0.299    -0.582    1.1152    -1.274    -0.021
I-LOC    -0.128    -1.535    0.5447    -0.36     1.6132    -0.46     -0.662    0.2775    0.1261
B-ORG    0.7966    1.1623    -0.523    -0.285    0.048     1.3867    -2.218    0.2478    0.1132
I-ORG    -0.366    1.8884    2.3072    -0.48     0.8712    0.1873    0.1633    0.0753    -1.036
O        0.0056    0.743     -0.369    0.7831    1.7283    0.3915    0.5212    0.3758    1.3439
Taking the "sand" word as an example, the full link layer output of the column corresponding to the "sand" word, i.e., the "sand" word, is [ -2.615, -1.177, -0.943, -0.128, 0.7966, -0.366, 0.0056], and since 0.7966 is the maximum value in the column, the prediction tag of the "sand" word is B-ORG, consistent with the pre-labeled category tag. The output of the CRF layer is [0, 0, 0, 0, 1, 0, 0] correspondingly, so that the output information of the full connection layer is far more than the output information of the CRF layer, the output of the CRF layer can only judge that the type of the 'sand' word is B-ORG, and the output of the full connection layer not only can judge that the type of the 'sand' word is B-ORG, but also can judge the confidence degree of the model to the type and the difference between the model and other types. Knowledge distillation is carried out by using the output of the full connecting layer, so that the student model can be trained more effectively.
Through the steps of S202 to S208, knowledge distillation can be performed by adopting the output of the full-connection layer, and the output difference of the two models is reduced by constructing the target loss function, so that the models with less parameter quantity can also have higher output precision, the recognition efficiency is improved, and the problem of low recognition efficiency caused by large parameter quantity of the models is solved.
Optionally, the step S202 of inputting the training sample into the first model, and obtaining the first recognition result output by the fully-connected layer of the first model includes: inputting a training sample into a first model, and acquiring a first non-standardized probability distribution output by a full connection layer of the first model; the first non-normalized probability distribution is converted to a first normalized probability distribution using a normalization function, and the first recognition result includes the first normalized probability distribution.
In the embodiment of the present application, the training sample x = (x_1, x_2, ..., x_n) is input into the first model, and the output of the full connection layer of the first model is a matrix of the first non-normalized probability distribution

$$ z = (z_{ij}) \in \mathbb{R}^{M \times n}, $$

where x_i is the i-th input in the sequence, n is the length of the input sequence, M is the number of named entity classes, z_{ij} denotes the element in the i-th row and j-th column of the matrix z, z_{i\cdot} denotes the i-th row vector of z, and z_{\cdot j} denotes the j-th column vector of z. Table 3 is a concrete example of the matrix z output by the full connection layer: each column of the matrix z corresponds to a word in the input text, and the values in that column are the non-normalized probabilities for classifying the word. The present application uses the softmax function as the normalization function to convert the first non-normalized probability distribution, i.e. the matrix z, into the first normalized probability distribution

$$ s = (s_{ij}) \in \mathbb{R}^{M \times n}. $$

The concrete formula is as follows:

$$ s_{ij} = \frac{\exp(z_{ij}/T)}{\sum_{k=1}^{M} \exp(z_{kj}/T)}, $$

where T is a scalar hyper-parameter used for smoothing, and the matrix s is the output of the matrix z after the softmax. The matrix s may be input into the second model as the first recognition result to "guide" the training of the second model.
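A minimal sketch of this column-wise temperature softmax in NumPy is shown below; the temperature value is an illustrative placeholder, since the patent does not state a concrete T.

```python
import numpy as np

def temperature_softmax(z: np.ndarray, T: float = 2.0) -> np.ndarray:
    """Normalize an (M x n) logit matrix column by column with smoothing temperature T."""
    scaled = z / T
    scaled = scaled - scaled.max(axis=0, keepdims=True)   # subtract column max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=0, keepdims=True)           # each column now sums to 1

# Example: z could be the 7 x 9 matrix of Table 3; s = temperature_softmax(z)
# is then the first normalized probability distribution passed to the second model.
```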
Optionally, the step S204 of inputting the training sample and the first recognition result into the second model, and obtaining the second recognition result output by the fully-connected layer of the second model includes: inputting the training sample and the first standardized probability distribution into a second model, and acquiring a second non-standardized probability distribution output by a full connection layer of the second model; and converting the second non-standardized probability distribution into a second standardized probability distribution by adopting a normalization function, and determining a training classification label of the training sample by utilizing the second standardized probability distribution, wherein the second recognition result comprises the second standardized probability distribution and the training classification label.
In the embodiment of the present application, the training sample x = (x_1, x_2, ..., x_n) and the matrix s are input into the second model, and the output of the full connection layer of the second model is a matrix of the second non-normalized probability distribution

$$ z' = (z'_{ij}) \in \mathbb{R}^{M \times n}. $$

The softmax function is again used as the normalization function to convert the second non-normalized probability distribution, i.e. the matrix z', into the second normalized probability distribution s'. The formula is as follows:

$$ s'_{ij} = \frac{\exp(z'_{ij}/T)}{\sum_{k=1}^{M} \exp(z'_{kj}/T)}. $$

The training classification label y' predicted by the second model for the training sample is then determined from the second normalized probability distribution s'. The second normalized probability distribution s' and the training classification label y' together constitute the second recognition result.
Optionally, the step S206 of constructing the target loss function by using the first recognition result, the second recognition result, and the pre-labeled data of the training sample includes: constructing a first loss function using the first normalized probability distribution and the second normalized probability distribution, the first loss function representing a cross entropy between the first normalized probability distribution and the second normalized probability distribution; constructing a second loss function by utilizing the pre-classification labels and the training classification labels of the training samples, wherein the second loss function represents the cross entropy between the pre-classification labels and the training classification labels, and the pre-labeling data comprises the pre-classification labels; and adding the first loss function and the second loss function to obtain a target loss function.
In the embodiment of the present application, a first loss function is constructed from the first normalized probability distribution s obtained by the first model and the second normalized probability distribution s' obtained by the second model. For the j-th word of the sequence,

$$ l^{(soft)} = \mathrm{CE}\bigl(s_{\cdot j},\, s'_{\cdot j}\bigr) = -\sum_{i=1}^{M} s_{ij} \log s'_{ij}, $$

where CE denotes the cross entropy loss function. The first normalized probability distribution is in fact the prediction probability of the first model for the training sample, and the second normalized probability distribution is the prediction probability of the second model for the training sample; the probability distributions predicted by the models are collectively referred to as soft labels, so the first loss function is also called the soft label loss function.
In the embodiment of the present application, a second loss function is constructed from the pre-labeled data (the manually labeled classification labels) y = (y_1, y_2, ..., y_n) of the training text and the training classification label y' obtained by the second model:

$$ l^{(hard)} = \mathrm{CE}\bigl(y_j,\, y'_j\bigr), $$

where CE denotes the cross entropy loss function, y_j is the true label of the j-th word, and y'_j is the label predicted by the second model for the j-th word. Since the pre-labeled data is labeled manually, it is referred to as a hard label, so the second loss function is also called the hard label loss function.
In the embodiment of the present application, the first loss function and the second loss function are added to obtain a final target loss function, which is also referred to as a distillation loss function:
$$ l = l^{(soft)} + l^{(hard)} $$
optionally, the step S206 of constructing the objective loss function by using the first recognition result and the second recognition result further includes: traversing each word in the training sample, and determining a target loss function of each word according to the output result of the fully-connected layer of each word in the first model and the second model and the pre-classification label of each word; and adding the target loss functions of all the words in the training sample, and dividing the sum by the number of the words in the training sample to obtain the integral target loss function of the training sample.
In the embodiment of the present application, the named entity recognition problem can be regarded as classifying each word (word) in the sequence. Thus, the target loss function for each word may be calculated and then the loss function values for all words in the sequence may be averaged, thus obtaining the target loss function for the entire sequence.
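The sketch below computes this per-word target loss and its average over the sequence in NumPy; the expansion of the hard-label term as cross entropy against the student's predicted distribution is a common reading rather than a formula quoted from the patent, and all names are illustrative.

```python
import numpy as np

def sequence_distillation_loss(s: np.ndarray, s_prime: np.ndarray, y: np.ndarray,
                               eps: float = 1e-12) -> float:
    """s, s_prime: (M, n) teacher/student normalized distributions; y: length-n true tag indices."""
    n = s.shape[1]
    soft = -(s * np.log(s_prime + eps)).sum(axis=0)    # l^(soft) for each of the n words
    hard = -np.log(s_prime[y, np.arange(n)] + eps)     # l^(hard) for each word
    per_word = soft + hard                             # target loss of each word
    return float(per_word.mean())                      # averaged over the whole sequence
```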
Optionally, the step S208 of adjusting parameters in the second model using the target loss function so that the recognition accuracy of the second model reaches the target threshold includes: obtaining a loss value determined by a target loss function, wherein the loss value is used for representing the difference of the identification accuracy between the first model and the second model; and adjusting parameters of the second model by using the loss value until the output precision of the second model reaches a target threshold value.
In the embodiment of the application, the loss value determined by the target loss function is back-propagated to adjust the parameters of the second model, so that the output result of the second model gradually becomes consistent with the output result of the first model. In this way the recognition accuracy of the second model, which has fewer parameters, is raised to be consistent with that of the first model, which has a large number of parameters, i.e. the target threshold is reached, finally solving the problem of low recognition efficiency caused by the large parameter quantity of the BERT-CRF model.
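Put together, training-set distillation can be sketched as the following loop, assuming PyTorch, the BiLSTMStudent above, and placeholder helpers for the teacher forward pass, the data loader and the accuracy evaluation; none of these names, values or stopping details come from the patent.

```python
import torch
import torch.nn.functional as F

def distill_on_training_set(student, teacher_logits_fn, loader, eval_accuracy, target_acc, T=2.0):
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
    while eval_accuracy(student) < target_acc:            # stop once the target threshold is reached
        for token_ids, gold_tags in loader:               # gold_tags: pre-labeled tag indices
            with torch.no_grad():
                s = torch.softmax(teacher_logits_fn(token_ids) / T, dim=-1)   # teacher soft labels
            logits = student(token_ids)
            s_prime = torch.softmax(logits / T, dim=-1)
            soft = -(s * torch.log(s_prime + 1e-12)).sum(-1).mean()           # soft label loss
            hard = F.cross_entropy(logits.reshape(-1, logits.size(-1)),       # hard label loss
                                   gold_tags.reshape(-1))
            loss = soft + hard                                                # target (distillation) loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```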
This embodiment provides a knowledge-distillation-based model training method, which can raise the recognition accuracy of a second model with fewer parameters to be consistent with that of a first model with a large number of parameters, finally solving the problem of low recognition efficiency caused by the large parameter quantity of the BERT-CRF model. In order to further improve the distillation effect, the present application also proposes a three-stage knowledge distillation method, consisting of training set distillation, migration set distillation and training set fine-tuning. The embodiment above is the first of the three stages, i.e. training set distillation; migration set distillation and training set fine-tuning are described in detail below.
Optionally, after adjusting parameters in the second model by using the target loss function so that the recognition accuracy of the second model reaches the target threshold, the method further includes performing migration set training as follows:
step 1, inputting a migration data set into a first model, and acquiring a third identification result output by a full connection layer of the first model, wherein the migration data set is a data set without data labeling;
step 2, inputting the migration data set and the third recognition result into a second model, and acquiring a fourth recognition result output by a full connection layer of the second model;
step 3, converting the third recognition result into a third standardized probability distribution by adopting a normalization function, and converting the fourth recognition result into a fourth standardized probability distribution by adopting the normalization function;
step 4, constructing a soft label loss function by utilizing the third standardized probability distribution and the fourth standardized probability distribution;
and 5, adjusting parameters of the second model by using the loss value determined by the soft label loss function until the output precision of the second model reaches the target threshold of the migration set.
In the embodiment of the application, since the original training set (the training samples) may be relatively small, the knowledge distillation method cannot be fully exploited to compress the first model into the second model, and therefore migration set distillation is introduced. The flow of migration set distillation is substantially the same as the flow of training set distillation, except that the migration data set has not been manually labeled in advance, so the loss function includes only the soft label loss function l^{(soft)} and not the hard label loss function l^{(hard)}. The soft label loss function for migration set distillation is constructed from the third normalized probability distribution obtained from the full connection layer of the first model and the fourth normalized probability distribution obtained from the full connection layer of the second model. Finally, the loss value determined by the soft label loss function is back-propagated to adjust the parameters of the second model, so that the output result of the second model gradually becomes consistent with the output result of the first model, i.e. the migration set target threshold is reached, and the recognition accuracy of the second model with fewer parameters is raised to be consistent with that of the first model with a large number of parameters.
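A minimal sketch of the soft-label-only loss used at this stage is shown below; as before, the temperature and the names are illustrative assumptions rather than details taken from the patent.

```python
import torch

def migration_soft_loss(teacher_logits: torch.Tensor, student_logits: torch.Tensor,
                        T: float = 2.0) -> torch.Tensor:
    """Soft label loss on the unlabeled migration set: no hard label term is available."""
    s = torch.softmax(teacher_logits / T, dim=-1)        # third normalized probability distribution
    s_prime = torch.softmax(student_logits / T, dim=-1)  # fourth normalized probability distribution
    return -(s * torch.log(s_prime + 1e-12)).sum(-1).mean()
```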
Optionally, the parameters of the second model are adjusted by using the loss value determined by the soft label loss function until the output accuracy of the second model reaches the target threshold of the migration set, and the method further includes performing training set fine adjustment as follows:
and under the condition of no first recognition result of the first model, inputting a training sample into the second model to train the second model so as to adjust the parameters of the second model, wherein the adjustment proportion of parameter adjustment on the second model under the condition of no first recognition result is smaller than the adjustment proportion of parameter adjustment on the second model under the condition of the first recognition result.
In the embodiment of the present application, the migration data set used in migration set distillation should in theory be distributed similarly to the original training set (the training samples). In practice, however, it is difficult to find a migration data set that is both large and distributed similarly to the original training set. Therefore, after migration set distillation has been performed with a migration data set whose distribution differs considerably from that of the training set, a third stage, training set fine-tuning, is required to adapt the second model to the target domain of named entity recognition.
In the embodiment of the present application, during training set fine-tuning, only the original training set, i.e. the training sample x = (x_1, x_2, ..., x_n), is input into the second model, without inputting the first recognition result of the first model; that is, the second model is trained on the original training set alone to fine-tune its parameters so that it adapts to the target domain.
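A sketch of this fine-tuning stage follows, again assuming PyTorch and the helpers above; interpreting the "smaller adjustment proportion" as a lower learning rate than in the distillation stages is our assumption.

```python
import torch

def finetune_on_training_set(student, loader, epochs: int = 1, lr: float = 1e-4):
    # lr is deliberately smaller than in the distillation stages (assumed reading of
    # "smaller adjustment proportion"); only the labeled training set is used.
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    ce = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for token_ids, gold_tags in loader:
            logits = student(token_ids)                  # no first recognition result is input here
            loss = ce(logits.reshape(-1, logits.size(-1)), gold_tags.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```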
Through these three stages of knowledge distillation, the recognition accuracy of the second model with fewer parameters can be effectively raised to be consistent with that of the first model with a large number of parameters, finally solving the problem of low recognition efficiency caused by the large parameter quantity of the BERT-CRF model. The actual effect of the second model differs slightly when a single-layer, two-layer or three-layer BiLSTM-CRF model is used, as shown in FIG. 4. It can be seen that training set distillation raises the recognition accuracy of the second model to about 0.75; migration set distillation lowers it to about 0.7, which is caused by the large distribution difference between the migration set and the original training set; and training set fine-tuning then further raises the recognition accuracy of the second model to 0.775. Therefore, through three-stage knowledge distillation, the model training method provided by the application can raise the recognition accuracy of the second model with fewer parameters to be consistent with that of the first model with a large number of parameters, finally solving the problem of low recognition efficiency caused by the large parameter quantity of the BERT-CRF model.
According to still another aspect of an embodiment of the present application, as shown in fig. 5, there is provided a model training apparatus including:
the first training module 501 is configured to input a training sample into a first model, and obtain a first recognition result output by a full connection layer of the first model;
the second training module 503 is configured to input the training sample and the first recognition result into the second model, and obtain a second recognition result output by the full connection layer of the second model, where a parameter quantity of the first model is greater than a parameter quantity of the second model, and a recognition accuracy of the first model is greater than a recognition accuracy of the second model;
a loss function constructing module 505, configured to construct a target loss function by using the first recognition result, the second recognition result, and the pre-labeled data of the training sample;
and a parameter adjusting module 507, configured to adjust parameters in the second model by using the target loss function so that the recognition accuracy of the second model reaches a target threshold, where the recognition of the second model is regarded as accurate when its output result is the same as the output result of the first model.
It should be noted that the first training module 501 in this embodiment may be configured to execute step S202 in this embodiment, the second training module 503 in this embodiment may be configured to execute step S204 in this embodiment, the loss function constructing module 505 in this embodiment may be configured to execute step S206 in this embodiment, and the parameter adjusting module 507 in this embodiment may be configured to execute step S208 in this embodiment.
It should be noted here that the modules described above are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the above embodiments. It should be noted that the modules described above as a part of the apparatus may operate in a hardware environment as shown in fig. 1, and may be implemented by software or hardware.
Optionally, the first training module is specifically configured to: inputting a training sample into a first model, and acquiring a first non-standardized probability distribution output by a full connection layer of the first model; the first non-normalized probability distribution is converted to a first normalized probability distribution using a normalization function, and the first recognition result includes the first normalized probability distribution.
Optionally, the second training module is specifically configured to: inputting the training sample and the first standardized probability distribution into a second model, and acquiring a second non-standardized probability distribution output by a full connection layer of the second model; and converting the second non-standardized probability distribution into a second standardized probability distribution by adopting a normalization function, and determining a training classification label of the training sample by utilizing the second standardized probability distribution, wherein the second recognition result comprises the second standardized probability distribution and the training classification label.
Optionally, the loss function constructing module is specifically configured to: constructing a first loss function using the first normalized probability distribution and the second normalized probability distribution, the first loss function representing a cross entropy between the first normalized probability distribution and the second normalized probability distribution; constructing a second loss function by utilizing the pre-classification labels and the training classification labels of the training samples, wherein the second loss function represents the cross entropy between the pre-classification labels and the training classification labels, and the pre-labeling data comprises the pre-classification labels; and adding the first loss function and the second loss function to obtain a target loss function.
Optionally, the loss function constructing module is further configured to: traversing each word in the training sample, and determining a target loss function of each word according to the output result of the fully-connected layer of each word in the first model and the second model and the pre-classification label of each word; and adding the target loss functions of all the words in the training sample, and dividing the sum by the number of the words in the training sample to obtain the integral target loss function of the training sample.
Optionally, the parameter adjusting module is specifically configured to: obtaining a loss value determined by a target loss function, wherein the loss value is used for representing the difference of the identification accuracy between the first model and the second model; and adjusting parameters of the second model by using the loss value until the output precision of the second model reaches a target threshold value.
Optionally, the apparatus further comprises a migration set training module configured to: inputting the migration data set into the first model, and acquiring a third recognition result output by a full connection layer of the first model, wherein the migration data set is a data set without data annotation; inputting the migration data set and the third recognition result into the second model, and acquiring a fourth recognition result output by a full connection layer of the second model; converting the third recognition result into a third standardized probability distribution by adopting a normalization function, and converting the fourth recognition result into a fourth standardized probability distribution by adopting the normalization function; constructing a soft label loss function by using the third standardized probability distribution and the fourth standardized probability distribution; and adjusting parameters of the second model by using the loss value determined by the soft label loss function until the output precision of the second model reaches the target threshold of the migration set.
Optionally, the apparatus further comprises a training set fine tuning module configured to: and under the condition of no first recognition result of the first model, inputting a training sample into the second model to train the second model so as to adjust the parameters of the second model, wherein the adjustment proportion of parameter adjustment on the second model under the condition of no first recognition result is smaller than the adjustment proportion of parameter adjustment on the second model under the condition of the first recognition result.
According to another aspect of the embodiments of the present application, there is provided an electronic device, as shown in fig. 6, including a memory 601, a processor 603, a communication interface 605 and a communication bus 607, where a computer program operable on the processor 603 is stored in the memory 601, the memory 601 and the processor 603 communicate with each other through the communication interface 605 and the communication bus 607, and the steps of the method are implemented when the processor 603 executes the computer program.
The memory and the processor in the electronic equipment are communicated with the communication interface through a communication bus. The communication bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the Integrated Circuit may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.
There is also provided, in accordance with yet another aspect of an embodiment of the present application, a computer-readable medium having non-volatile program code executable by a processor.
Optionally, in an embodiment of the present application, a computer readable medium is configured to store program code for the processor to perform the following steps:
inputting a training sample into a first model, and acquiring a first recognition result output by a full connection layer of the first model;
inputting the training sample and the first recognition result into a second model, and obtaining a second recognition result output by a full connection layer of the second model, wherein the parameter quantity of the first model is greater than that of the second model, and the recognition accuracy of the first model is greater than that of the second model;
constructing a target loss function by using the first recognition result, the second recognition result and the pre-labeled data of the training sample;
and adjusting parameters in the second model by using the target loss function so that the identification accuracy of the second model reaches a target threshold value, and the identification is accurate when the output result of the second model is the same as the output result of the first model.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments, and this embodiment is not described herein again.
When the embodiments of the present application are specifically implemented, reference may be made to the above embodiments, and corresponding technical effects are achieved.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or make a contribution to the prior art, or may be implemented in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk. It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of model training, comprising:
inputting a training sample into a first model, and acquiring a first recognition result output by a full connection layer of the first model;
inputting the training sample and the first recognition result into a second model, and obtaining a second recognition result output by a full connection layer of the second model, wherein the parameter quantity of the first model is greater than that of the second model, and the recognition accuracy of the first model is greater than that of the second model;
constructing a target loss function by using the first recognition result, the second recognition result and the pre-labeled data of the training sample;
and adjusting parameters in the second model by using the target loss function so that the recognition accuracy of the second model reaches a target threshold, wherein a recognition is considered accurate when the output result of the second model is the same as the output result of the first model.
2. The method of claim 1,
wherein inputting the training sample into the first model, and acquiring the first recognition result output by the full connection layer of the first model, comprises:
inputting the training sample into the first model, and acquiring a first non-normalized probability distribution output by the full connection layer of the first model; and converting the first non-normalized probability distribution into a first normalized probability distribution using a normalization function, wherein the first recognition result comprises the first normalized probability distribution;
and wherein inputting the training sample and the first recognition result into the second model, and obtaining the second recognition result output by the full connection layer of the second model, comprises:
inputting the training sample and the first normalized probability distribution into the second model, and obtaining a second non-normalized probability distribution output by the full connection layer of the second model; and converting the second non-normalized probability distribution into a second normalized probability distribution using the normalization function, and determining a training classification label of the training sample using the second normalized probability distribution, wherein the second recognition result comprises the second normalized probability distribution and the training classification label.
3. The method of claim 2, wherein constructing the target loss function using the first recognition result, the second recognition result, and the pre-labeled data of the training sample comprises:
constructing a first loss function using the first normalized probability distribution and the second normalized probability distribution, wherein the first loss function represents a cross entropy between the first normalized probability distribution and the second normalized probability distribution;
constructing a second loss function using the pre-classification label of the training sample and the training classification label, wherein the second loss function represents the cross entropy between the pre-classification label and the training classification label, and the pre-labeled data comprises the pre-classification label;
and adding the first loss function and the second loss function to obtain the target loss function.
4. The method of claim 3, wherein constructing the target loss function using the first recognition result and the second recognition result further comprises:
traversing each word in the training sample, and determining the target loss function for each word according to the outputs of the full connection layers of the first model and the second model for that word and the pre-classification label of that word;
and adding the target loss functions of all words in the training sample and dividing the sum by the number of words in the training sample to obtain the target loss function of the whole training sample (an illustrative sketch of this loss computation follows the claims).
5. The method of claim 4, wherein adjusting parameters in the second model using the target loss function to achieve a target threshold for recognition accuracy of the second model comprises:
obtaining a loss value determined by the target loss function, wherein the loss value represents the difference in recognition accuracy between the first model and the second model;
and adjusting parameters of the second model using the loss value until the recognition accuracy of the second model reaches the target threshold (an illustrative sketch of this training loop follows the claims).
6. The method of any of claims 1 to 5, wherein after adjusting parameters in the second model using the target loss function to achieve a target threshold for recognition accuracy of the second model, the method further comprises performing migration set training as follows:
inputting a migration data set into the first model, and acquiring a third recognition result output by the full connection layer of the first model, wherein the migration data set is a data set without labels;
inputting the migration data set and the third recognition result into the second model, and acquiring a fourth recognition result output by a full connection layer of the second model;
converting the third recognition result into a third normalized probability distribution by using a normalization function, and converting the fourth recognition result into a fourth normalized probability distribution by using the normalization function;
constructing a soft tag loss function using the third normalized probability distribution and the fourth normalized probability distribution;
and adjusting parameters of the second model using the loss value determined by the soft label loss function until the recognition accuracy of the second model reaches a migration set target threshold (an illustrative sketch of this migration set training follows the claims).
7. The method of claim 6, wherein after the parameters of the second model are adjusted using the loss value determined by the soft label loss function until the recognition accuracy of the second model reaches the migration set target threshold, the method further comprises training set fine-tuning as follows:
and inputting the training sample into the second model, without the first recognition result of the first model, to train the second model and adjust its parameters, wherein the proportion by which the parameters of the second model are adjusted without the first recognition result is smaller than the proportion by which they are adjusted with the first recognition result (an illustrative sketch of this fine-tuning step follows the claims).
8. A model training apparatus, comprising:
the first training module is used for inputting a training sample into a first model and acquiring a first recognition result output by a full connection layer of the first model;
the second training module is used for inputting the training sample and the first recognition result into a second model and acquiring a second recognition result output by a full connection layer of the second model, wherein the parameter quantity of the first model is greater than that of the second model, and the recognition accuracy of the first model is greater than that of the second model;
the loss function construction module is used for constructing a target loss function by utilizing the first recognition result, the second recognition result and the pre-labeled data of the training sample;
and the parameter adjusting module is used for adjusting parameters in the second model by using the target loss function so that the recognition accuracy of the second model reaches a target threshold, wherein a recognition is considered accurate when the output result of the second model is the same as the output result of the first model.
9. An electronic device comprising a memory, a processor, a communication interface and a communication bus, wherein the memory stores a computer program operable on the processor, the memory and the processor communicate with each other via the communication bus and the communication interface, and the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to perform the method of any of claims 1 to 7.
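The following is a minimal, non-authoritative sketch of the target loss described in claims 2 to 4, written in PyTorch-style Python. It assumes token-level logits of shape (number of words, number of labels) produced by the full connection layers of the first (teacher) and second (student) models; the function and variable names are illustrative and do not appear in the patent.

```python
# Illustrative sketch only: soft cross entropy against the first model plus
# hard cross entropy against the pre-classification labels, averaged over words.
import torch
import torch.nn.functional as F

def target_loss(teacher_logits: torch.Tensor,
                student_logits: torch.Tensor,
                gold_labels: torch.Tensor) -> torch.Tensor:
    # Claim 2: convert the non-normalized outputs of the full connection layers
    # into normalized probability distributions with a normalization function (softmax).
    teacher_probs = F.softmax(teacher_logits, dim=-1)           # first normalized distribution
    student_log_probs = F.log_softmax(student_logits, dim=-1)   # log of the second normalized distribution

    # Claim 3, first loss: cross entropy between the two normalized distributions.
    soft_loss = -(teacher_probs * student_log_probs).sum(dim=-1)   # one value per word

    # Claim 3, second loss: cross entropy between the student output and the
    # pre-classification label of each word.
    hard_loss = F.cross_entropy(student_logits, gold_labels, reduction="none")

    # Claim 3: add the two losses; claim 4: sum over the words of the sample
    # and divide by the number of words.
    return (soft_loss + hard_loss).mean()
```

Taking the mean over the word dimension implements the sum-then-divide of claim 4; the claims do not specify a weighting between the two cross-entropy terms, so they are added with equal weight here.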
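Below is a minimal sketch of the training procedure of claims 1 and 5, reusing the target_loss function above. The Teacher and Student modules, the data loader yielding (tokens, labels) pairs, and the accuracy_of evaluation helper are all hypothetical; the student interface that also accepts the teacher's normalized distribution follows claim 1.

```python
# Illustrative sketch only: adjust the parameters of the second model with the
# target loss until its recognition accuracy reaches the target threshold.
import torch

def distill(teacher, student, loader, accuracy_of, target_threshold,
            max_epochs=10, lr=1e-3):
    teacher.eval()                                    # the first model is fixed during distillation
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(max_epochs):
        for tokens, gold_labels in loader:
            with torch.no_grad():
                teacher_logits = teacher(tokens)      # first recognition result
            teacher_probs = torch.softmax(teacher_logits, dim=-1)
            # Claim 1: the training sample and the first recognition result are
            # both fed to the second model.
            student_logits = student(tokens, teacher_probs)   # second recognition result
            loss = target_loss(teacher_logits, student_logits, gold_labels)
            optimizer.zero_grad()
            loss.backward()                           # claim 5: the loss value drives the adjustment
            optimizer.step()
        if accuracy_of(student) >= target_threshold:  # claim 5: stop at the target threshold
            break
    return student
```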
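Claim 6 continues training on an unlabeled migration data set using only a soft label loss, i.e. the cross entropy between the teacher's and the student's normalized distributions. The sketch below makes the same assumptions as the previous ones; the loader of unlabeled token batches and the optimizer are supplied by the caller and are hypothetical.

```python
# Illustrative sketch only: migration set training with the soft label loss of claim 6.
import torch
import torch.nn.functional as F

def soft_label_loss(teacher_logits, student_logits):
    teacher_probs = F.softmax(teacher_logits, dim=-1)           # third normalized distribution
    student_log_probs = F.log_softmax(student_logits, dim=-1)   # log of the fourth normalized distribution
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

def train_on_migration_set(teacher, student, unlabeled_loader, optimizer):
    teacher.eval()
    for tokens in unlabeled_loader:                   # the migration data set carries no labels
        with torch.no_grad():
            teacher_logits = teacher(tokens)          # third recognition result
        teacher_probs = F.softmax(teacher_logits, dim=-1)
        student_logits = student(tokens, teacher_probs)   # fourth recognition result
        loss = soft_label_loss(teacher_logits, student_logits)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

As in the previous sketch, this loop would be repeated until the recognition accuracy of the second model reaches the migration set target threshold.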
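Claim 7 finishes with a training set fine-tuning step in which the second model runs without the teacher's recognition result and its parameters are adjusted by a smaller proportion. The sketch below models that smaller adjustment proportion as a reduced learning rate, which is one plausible reading rather than the patented mechanism, and assumes the student can be called without the teacher distribution.

```python
# Illustrative sketch only: teacher-free fine-tuning with a smaller learning rate.
import torch
import torch.nn.functional as F

def fine_tune(student, labeled_loader, lr=1e-5, epochs=1):
    # A smaller learning rate stands in for the "smaller adjustment proportion" of claim 7.
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for tokens, gold_labels in labeled_loader:
            student_logits = student(tokens)          # no first recognition result is supplied
            loss = F.cross_entropy(student_logits, gold_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```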
CN202110014739.8A 2021-01-06 2021-01-06 Model training method, device, equipment and computer readable medium Pending CN112686046A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110014739.8A CN112686046A (en) 2021-01-06 2021-01-06 Model training method, device, equipment and computer readable medium

Publications (1)

Publication Number Publication Date
CN112686046A true CN112686046A (en) 2021-04-20

Family

ID=75456055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110014739.8A Pending CN112686046A (en) 2021-01-06 2021-01-06 Model training method, device, equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN112686046A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339302A (en) * 2020-03-06 2020-06-26 支付宝(杭州)信息技术有限公司 Method and device for training element classification model
CN111460150A (en) * 2020-03-27 2020-07-28 北京松果电子有限公司 Training method, classification method and device of classification model and storage medium
CN111709476A (en) * 2020-06-17 2020-09-25 浪潮集团有限公司 Knowledge distillation-based small classification model training method and device
CN112101526A (en) * 2020-09-15 2020-12-18 京东方科技集团股份有限公司 Knowledge distillation-based model training method and device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158902A (en) * 2021-04-23 2021-07-23 深圳龙岗智能视听研究院 Knowledge distillation-based automatic training recognition model method
CN113158902B (en) * 2021-04-23 2023-08-11 深圳龙岗智能视听研究院 Knowledge distillation-based method for automatically training recognition model
WO2022242459A1 (en) * 2021-05-17 2022-11-24 腾讯科技(深圳)有限公司 Data classification and identification method and apparatus, and device, medium and program product
WO2023040545A1 (en) * 2021-09-17 2023-03-23 北京搜狗科技发展有限公司 Data processing method and apparatus, device, storage medium, and program product
CN113821644A (en) * 2021-09-22 2021-12-21 上海明略人工智能(集团)有限公司 Data enhancement method, system, storage medium and electronic equipment
CN113822684A (en) * 2021-09-28 2021-12-21 北京奇艺世纪科技有限公司 Heikou user recognition model training method and device, electronic equipment and storage medium
CN113822684B (en) * 2021-09-28 2023-06-06 北京奇艺世纪科技有限公司 Black-birth user identification model training method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112686046A (en) Model training method, device, equipment and computer readable medium
CN111897908B (en) Event extraction method and system integrating dependency information and pre-training language model
CN107704625B (en) Method and device for field matching
CN112711660B (en) Method for constructing text classification sample and method for training text classification model
CN111553479A (en) Model distillation method, text retrieval method and text retrieval device
CN111382248B (en) Question replying method and device, storage medium and terminal equipment
CN113312480B (en) Scientific and technological thesis level multi-label classification method and device based on graph volume network
CN111967264B (en) Named entity identification method
CN112800239B (en) Training method of intention recognition model, and intention recognition method and device
CN110929524A (en) Data screening method, device, equipment and computer readable storage medium
CN109948160B (en) Short text classification method and device
US20210406464A1 (en) Skill word evaluation method and device, electronic device, and non-transitory computer readable storage medium
CN116089873A (en) Model training method, data classification and classification method, device, equipment and medium
CN111191031A (en) Entity relation classification method of unstructured text based on WordNet and IDF
CN113849648A (en) Classification model training method and device, computer equipment and storage medium
CN110310012B (en) Data analysis method, device, equipment and computer readable storage medium
CN110705279A (en) Vocabulary selection method and device and computer readable storage medium
CN112699685A (en) Named entity recognition method based on label-guided word fusion
CN113220900B (en) Modeling Method of Entity Disambiguation Model and Entity Disambiguation Prediction Method
CN112182211B (en) Text classification method and device
CN115687934A (en) Intention recognition method and device, computer equipment and storage medium
CN113011162B (en) Reference digestion method, device, electronic equipment and medium
CN110262906B (en) Interface label recommendation method and device, storage medium and electronic equipment
CN111199259B (en) Identification conversion method, device and computer readable storage medium
CN108073704B (en) L IWC vocabulary extension method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination