CN110852426B - Pre-training model integration acceleration method and device based on knowledge distillation - Google Patents

Pre-training model integration acceleration method and device based on knowledge distillation

Info

Publication number
CN110852426B
CN110852426B (application CN201911134079.6A)
Authority
CN
China
Prior art keywords
model
likelihood estimation
teacher
student
estimation probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911134079.6A
Other languages
Chinese (zh)
Other versions
CN110852426A (en)
Inventor
宋子文晗
江岭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Xiaoduo Technology Co ltd
Original Assignee
Chengdu Xiaoduo Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Xiaoduo Technology Co ltd filed Critical Chengdu Xiaoduo Technology Co ltd
Priority to CN201911134079.6A priority Critical patent/CN110852426B/en
Publication of CN110852426A publication Critical patent/CN110852426A/en
Application granted granted Critical
Publication of CN110852426B publication Critical patent/CN110852426B/en
Legal status: Active (granted)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a pre-training model integration acceleration method based on knowledge distillation and a device applying the method. The method comprises: defining a teacher model group and a student model; inputting training data labeled with classification labels into the teacher model group and the student model for training, and outputting the likelihood estimation probability value corresponding to each teacher model as well as the likelihood estimation probability value of the student model; pooling the likelihood estimation probability values output by the teacher model group and outputting the pooled likelihood estimation probability value; measuring the difference between the pooled likelihood estimation probability value of the teacher model group and the likelihood estimation probability value of the student model; updating the parameters of the student model until the likelihood estimation probability value of the student model is closest to the pooled likelihood estimation probability value of the teacher model group; and using the feature extractor and feature encoder of the obtained student model as a student pre-training model that predicts the data to be trained and encodes it into data feature vectors.

Description

Pre-training model integration acceleration method and device based on knowledge distillation
Technical Field
The invention belongs to the technical field of neural network data processing, and particularly relates to a pre-training model integration acceleration method and device based on knowledge distillation.
Background
In recent years, convolutional neural networks have achieved tremendous success in tasks in computer vision and related fields, such as face detection, image classification and natural language processing. For example, Yann LeCun et al. of New York University proposed using a multi-layer convolutional neural network for handwriting recognition, and the Hinton team used deep neural networks to win the ImageNet image classification competition by an overwhelming margin.
With the development of convolutional neural networks, their hierarchical structures have become increasingly complex and their parameter counts ever larger; correspondingly, the training data sets required to train an excellent convolutional neural network have grown enormous. As a result, the time and space complexity and the storage cost during operation increase greatly, and existing large convolutional neural networks depend on high-performance processors and cluster servers with extremely strong computing power. The huge computation load, time consumption and energy consumption make convolutional neural networks difficult to deploy on mobile devices with limited computing resources and energy storage, such as mobile phones and intelligent wearable devices. Therefore, compressing the parameters of large neural networks and reducing their computational complexity are important research directions.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a pre-training model integration acceleration method and device based on knowledge distillation. In the method, models with a huge amount of computation are used as teacher models; the likelihood estimation probability values of all teacher models in a teacher model group are pooled, and the estimation results of the different teacher models are summarized, so that the classification probability of the data is more accurate and the understanding of the data is further improved. The pooled likelihood estimation probability of the teacher model group is compared with the likelihood estimation probability of the student model to obtain a difference value; the student model is updated according to this difference value until its likelihood estimation probability value is closest to the pooled value of the teacher model group, and the feature extractor and feature encoder of the obtained student model are used as a student pre-training model. Through this updating process, the large amount of knowledge learned by the teacher models and their ways of understanding that knowledge are transferred to the student model, so that the effect of the complex teacher models is retained while the speed of recognizing training data in a real scene is ensured. The obtained student pre-training model encodes data to be trained into data feature vectors, which can be applied to different processing tasks and reused after a single pass of processing, reducing computational complexity.
In order to achieve the above purpose, the solution adopted by the invention is as follows: the integrated acceleration method of the pre-training model based on knowledge distillation comprises the following steps:
defining a teacher model group, wherein the teacher model group comprises a plurality of teacher models, each teacher model comprises a first feature extractor, a first feature encoder and a first classifier, and the first feature extractor comprises a convolutional network feature extractor and a combination of a long-short term memory network feature extractor and a convolutional network feature extractor; defining a student model, wherein the student model comprises a second feature extractor, a second feature encoder and a second classifier; the teacher model group consists of teacher models that have already been trained and have excellent recognition capability, the first feature extractor and the first feature encoder of each teacher model having been trained and iterated on labeled training data, while the second feature extractor and the second feature encoder of the student model are original and untrained (an illustrative sketch of these model structures follows).
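By way of illustration only, the following PyTorch-style sketch shows one possible shape of a teacher model and of the student model described above; the class names, layer sizes and hyper-parameters are assumptions made for the sketch and are not the concrete implementation of the invention.

```python
# Minimal sketch of one teacher model and the student model (assumed shapes/sizes).
import torch
import torch.nn as nn

class TeacherModel(nn.Module):
    """LSTM + CNN feature extractor -> linear feature encoder -> classifier."""
    def __init__(self, vocab_size, emb_dim=128, hidden=256, n_classes=30):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)  # context features
        self.conv = nn.Conv1d(2 * hidden, hidden, kernel_size=3, padding=1)         # local n-gram features
        self.encoder = nn.Linear(hidden, hidden // 2)                                # first feature encoder
        self.classifier = nn.Linear(hidden // 2, n_classes)                          # first classifier

    def forward(self, token_ids):
        x = self.embed(token_ids)                      # (B, T, E)
        x, _ = self.lstm(x)                            # (B, T, 2H)
        x = torch.relu(self.conv(x.transpose(1, 2)))   # (B, H, T)
        x = x.max(dim=2).values                        # pool over time -> (B, H)
        feat = torch.relu(self.encoder(x))             # feature code
        return torch.softmax(self.classifier(feat), dim=-1)  # likelihood estimation probabilities

class StudentModel(nn.Module):
    """Lightweight CNN feature extractor -> linear feature encoder -> classifier."""
    def __init__(self, vocab_size, emb_dim=128, hidden=128, n_classes=30):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        self.encoder = nn.Linear(hidden, hidden // 2)
        self.classifier = nn.Linear(hidden // 2, n_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        x = torch.relu(self.conv(x.transpose(1, 2))).max(dim=2).values
        feat = torch.relu(self.encoder(x))
        return torch.softmax(self.classifier(feat), dim=-1)
```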
Respectively inputting the training data labeled with classification labels into the teacher model group and the student model for training, the student model outputting a likelihood estimation probability value and the teacher model group outputting the likelihood estimation probability value corresponding to each teacher model. Because the teacher model group outputs the result of every teacher model, these results complement and corroborate one another in the classification decision, which reduces judgment errors and improves prediction accuracy.
Performing a pooling operation on the likelihood estimation probability values output by the teacher model group, and outputting the pooled likelihood estimation probability value, wherein the pooling operation comprises an averaging operation and a weighted averaging operation. The averaging operation averages the likelihood estimation probability values corresponding to all teacher models output by the teacher model group; the weighted averaging operation weights the likelihood estimation probability values corresponding to all teacher models and then averages them. By computing the average or weighted average of the probabilities output by all teacher models, errors caused by any single teacher model are smoothed out, improving prediction accuracy (a minimal sketch of both pooling variants follows);
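A minimal sketch of the two pooling operations, assuming each teacher model returns a (batch, n_classes) tensor of likelihood estimation probabilities; the weighting scheme used in the weighted average is an illustrative assumption.

```python
import torch

def average_pooling(teacher_probs: list[torch.Tensor]) -> torch.Tensor:
    """Plain average of the likelihood estimation probabilities of all teachers."""
    return torch.stack(teacher_probs, dim=0).mean(dim=0)

def weighted_average_pooling(teacher_probs: list[torch.Tensor],
                             weights: list[float]) -> torch.Tensor:
    """Weighted average; weights are illustrative, e.g. per-teacher validation accuracy."""
    w = torch.tensor(weights, dtype=teacher_probs[0].dtype)
    w = w / w.sum()                                  # normalize so the result stays a distribution
    stacked = torch.stack(teacher_probs, dim=0)      # (n_teachers, batch, n_classes)
    return (w.view(-1, 1, 1) * stacked).sum(dim=0)
```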
measuring a difference value between the likelihood estimation probability value of the teacher model group after pooling and the likelihood estimation probability value of the student model;
updating the parameters of the student model by means of a gradient descent algorithm, so that the likelihood estimation probability value of the student model iterates towards the pooled likelihood estimation probability value of the teacher model group, finally obtaining the student model whose likelihood estimation probability value is closest to the pooled likelihood estimation probability value of the teacher model group;
taking the feature extractor and the feature encoder of the obtained student model as a student pre-training model;
the student pre-training model predicts the data to be trained, encodes the data into a data feature vector, and provides the data feature vector to downstream applications such as classification, clustering and matching.
The difference between the pooled likelihood estimation probability value of the teacher model group and the likelihood estimation probability value of the student model is measured with a cross-entropy loss function or with the KL divergence (which measures the distance between two probability distributions); a sketch of this measurement and of the gradient-descent update follows.
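As an illustration of the difference measurement and the gradient-descent update, the following sketch assumes the StudentModel of the earlier sketch, a pooled teacher probability tensor, and an SGD optimizer; the optimizer choice, learning rate and direction of the KL divergence are assumptions rather than requirements of the invention.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_probs: torch.Tensor,
                      pooled_teacher: torch.Tensor,
                      use_kl: bool = True) -> torch.Tensor:
    eps = 1e-8
    if use_kl:
        # KL(teacher || student): distance between the two probability distributions
        return F.kl_div((student_probs + eps).log(), pooled_teacher, reduction="batchmean")
    # Cross-entropy of the student against the teacher's soft labels
    return -(pooled_teacher * (student_probs + eps).log()).sum(dim=-1).mean()

# One update step (gradient descent on the student's parameters only):
# optimizer = torch.optim.SGD(student.parameters(), lr=0.01)
# loss = distillation_loss(student(token_ids), pooled_teacher)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```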
The device for applying the pre-training model integration acceleration method based on knowledge distillation comprises a teacher model group, a likelihood estimation pooling device, a student model, a knowledge distillation device and a student pre-training model;
the teacher model group comprises a plurality of teacher models and is used for training on the training data labeled with classification labels to obtain the likelihood estimation probability value corresponding to each teacher model; the teacher models may be identical or different, and the pooled likelihood estimation probability value of a teacher model group composed of different teacher models helps the student model absorb more knowledge and more ways of understanding that knowledge;
the student model is used for training the training data marked with the classification labels to obtain likelihood estimation probability values corresponding to the student model;
the likelihood estimation pooling device is used for pooling the likelihood estimation probability values output by the teacher model group and outputting the pooled likelihood estimation probability value; it gathers the prediction results of the several teacher models in the teacher model group and smooths the errors caused by any single model by calculating the average of the probabilities output by all teacher models, improving prediction accuracy;
the knowledge distilling device is used for measuring a difference value between the likelihood estimation probability value of the teacher model group after pooling and the likelihood estimation probability value of the student model, and updating parameters of the student model to obtain the student model of which the likelihood estimation probability value is closest to the likelihood estimation probability value of the teacher model group after pooling;
the student pre-training model is used for a feature extractor and a feature encoder which comprise the obtained student model, and is used for encoding data to be trained into data feature vectors. The data characteristic vectors processed by the student pre-training model can be applied to different processing tasks, and can be repeatedly applied after one-time processing, so that the operation complexity is reduced. For example, the clustering device is used for clustering; classifying the application in a classifier; the method is applied to a matcher for matching.
The teacher model comprises a first feature extractor, a first feature encoder and a first classifier.
The student model comprises a second feature extractor, a second feature encoder and a second classifier.
The invention has the beneficial effects that:
(1) The method uses models with a huge amount of computation as teacher models, pools the likelihood estimation probability values of all teacher models in the teacher model group, and summarizes the estimation results of the different teacher models, so that the classification probability of the data is more accurate and the understanding of the data is further improved. The pooled likelihood estimation probability of the teacher model group is compared with the likelihood estimation probability of the student model to obtain a difference value; the student model is updated according to this difference value until its likelihood estimation probability value is closest to the pooled value of the teacher model group, and the feature extractor and feature encoder of the obtained student model are used as a student pre-training model. Through this updating process, the large amount of knowledge learned by the teacher models and their ways of understanding that knowledge are transferred to the student model, so that the effect of the complex teacher models is retained while the speed of recognizing training data in a real scene is ensured. The obtained student pre-training model encodes data to be trained into data feature vectors, which can be applied to different processing tasks and reused after a single pass of processing, reducing computational complexity.
Drawings
FIG. 1 is a flow chart of a pre-training model integration acceleration method of the present invention;
FIG. 2 is a diagram of an integrated acceleration device for a pre-training model according to the present invention;
FIG. 3 is a diagram of an intelligent customer service pre-training model integrated accelerator according to an embodiment of the present invention;
in the figure: 100-teacher model group, 110-teacher model, 111-first feature extractor, 111A-long short-term memory network feature extractor, 111B-convolutional network feature extractor, 112-first feature encoder, 112B-linear feature encoder, 113-first classifier, 200-student model, 210-second feature extractor, 211-convolutional network feature extractor, 220-second feature encoder, 221-linear feature encoder, 230-second classifier, 300-likelihood estimation pooling device, 400-knowledge distilling device, 500-student pre-training model, 510-feature extractor, 511-convolutional network feature extractor, 520-feature encoder, 521-linear feature encoder.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
as shown in FIG. 1, the method for integrating and accelerating the pre-training model based on knowledge distillation comprises the following steps:
defining a teacher model group 100, wherein the teacher model group 100 comprises a plurality of teacher models 110, each teacher model 110 comprises a first feature extractor 111, a first feature encoder 112 and a first classifier 113, and the first feature extractor 111 comprises a convolutional network feature extractor and a combination of a long-short term memory network feature extractor and a convolutional network feature extractor; defining a student model 200, wherein the student model 200 comprises a second feature extractor 210, a second feature encoder 220 and a second classifier 230; the teacher model group 100 includes a plurality of teacher models 110 that have already been trained and have excellent recognition capability, the first feature extractor 111 and the first feature encoder 112 of each teacher model 110 having been trained and iterated on labeled training data, while the student model 200 uses an original, untrained second feature extractor 210 and second feature encoder 220.
Respectively inputting the training data labeled with classification labels into the teacher model group 100 and the student model 200 for training; the student model 200 directly calculates and outputs a likelihood estimation probability value from the original parameters of the second feature extractor 210 and the second feature encoder 220, without iteration; the teacher model group 100 outputs the likelihood estimation probability value corresponding to each teacher model 110, and the teacher models 110 likewise need no iteration, directly calculating the likelihood estimation probability values of the training data. Because the teacher model group 100 outputs the result of every teacher model 110, these results complement and corroborate one another in the classification decision, reducing judgment errors and improving prediction accuracy.
Pooling the likelihood estimation probability values output by the teacher model group 100 and outputting the pooled likelihood estimation probability value, wherein the pooling comprises an averaging operation and a weighted averaging operation. The averaging operation averages the likelihood estimation probability values corresponding to each teacher model 110 output by the teacher model group 100; the weighted averaging operation weights those values and then averages them. By calculating the average or weighted average of the probabilities output by all teacher models 110, errors caused by any single teacher model 110 are smoothed out, improving prediction accuracy;
measuring a difference value between the likelihood estimation probability value of the teacher model group 100 after pooling and the likelihood estimation probability value of the student model 200;
updating the parameters of the student model 200 by means of a gradient descent algorithm, so that the likelihood estimation probability value of the student model 200 iterates towards the pooled likelihood estimation probability value of the teacher model group 100, finally obtaining the student model 200 whose likelihood estimation probability value is closest to the pooled likelihood estimation probability value of the teacher model group 100;
taking the feature extractor 210 and the feature encoder 220 of the obtained student model 200 as a student pre-training model 500;
the student pre-training model 500 predicts the data to be trained, encodes the data into data feature vectors, and provides the data feature vectors to downstream applications such as classification, clustering, and matching.
The difference between the pooled likelihood estimation probability value of the teacher model group 100 and the likelihood estimation probability value of the student model 200 is measured with a cross-entropy loss function or with the KL divergence (which measures the distance between two probability distributions).
As shown in fig. 2, the apparatus for applying the knowledge distillation-based pre-training model integrated acceleration method includes a teacher model group 100, a likelihood estimation pooling device 300, a student model 200, a knowledge distillation apparatus 400, and a student pre-training model 500;
the teacher model group 100 comprises a plurality of teacher models 110 and is used for training on the training data labeled with classification labels to obtain the likelihood estimation probability value corresponding to each teacher model 110; the teacher models 110 may be identical or different, and the pooled likelihood estimation probability value of a teacher model group 100 composed of different teacher models 110 helps the student model 200 absorb more knowledge and more ways of understanding that knowledge;
the student model 200 is used for training the training data labeled with the classification labels to obtain likelihood estimation probability values corresponding to the student model 200;
the likelihood estimation pooling device 300 is used for pooling the likelihood estimation probability values output by the teacher model group 100 and outputting the pooled likelihood estimation probability value; it gathers the prediction results of the several teacher models 110 in the teacher model group 100 and smooths the errors caused by any single model by calculating the average of the probabilities output by all teacher models 110, improving prediction accuracy;
the knowledge distilling device 400 is used for measuring the difference value between the likelihood estimation probability value of the teacher model group 100 after pooling and the likelihood estimation probability value of the student model 200, and updating the parameters of the student model 200 to obtain the student model 200 of which the likelihood estimation probability value is closest to the likelihood estimation probability value of the teacher model group 100 after pooling;
the student pre-training model 500 comprises the feature extractor 510 and the feature encoder 520 of the obtained student model 200, and is used for encoding data to be trained into data feature vectors. The data feature vectors produced by the student pre-training model can be applied to different processing tasks and reused after a single pass of processing, which reduces computational complexity: for example, they can be fed to a clustering device for clustering, to a classifier for classification, or to a matcher for matching.
The teacher model 110 includes a first feature extractor 111, a first feature encoder 112, and a first classifier 113.
The student model 200 includes a second feature extractor 210, a second feature encoder 220, and a second classifier 230.
Example one
With the rapid development of the e-commerce industry, online shopping has become part of most people's daily life. During online shopping, consumers often consult merchants about product performance, merchant services, suitable sizes and other questions, so merchants on the various e-commerce platforms need to recruit large numbers of customer service staff to answer customers' questions, and the growing consultation volume steadily increases merchants' demand for customer service robots. In the field of intelligent customer service, intention recognition is an important task whose goal is to understand the questions issued by buyers in a customer service scene, so that subsequent operations or replies can be made according to the recognized buyer intention.
The most common existing natural language understanding model is the long short-term memory network (LSTM), which performs sequential computation, has a low degree of parallelism, and whose computation is difficult to reduce through strategies such as pruning. Recently emerging pre-trained models for natural language understanding (BERT, XLNet, etc.) all have hundreds of millions of parameters, high computational complexity, high requirements on computing devices, and long response times, while simple models (CNN, Transformer, etc.) perform poorly after pre-training.
Meanwhile, as the consultation volume keeps growing, the computing demand of the intention recognition module on computing devices keeps expanding, and request response time becomes a bottleneck. It is therefore necessary to reduce the computation requirement, increase the response speed, reduce model complexity, and increase computational parallelism.
Fig. 3 is a schematic diagram of an intelligent customer service pre-training model integrated acceleration device to which the pre-training model integrated acceleration method of the present application is applied. The intelligent customer service pre-training model integrated accelerating device comprises a teacher model group 100, a student model 200, a likelihood estimation pooling device 300, a knowledge distilling device 400 and a student pre-training model 500.
The teacher model group 100 includes a plurality of teacher models 110. Each teacher model 110 comprises a long short-term memory network feature extractor 111A, a convolutional network feature extractor 111B, a linear feature encoder 112B and a first classifier 113. The long short-term memory network feature extractor 111A is a sequential-logic computing network that extracts the contextual relationship features of a text. The convolutional network feature extractor 111B extracts more local features of the text and obtains the relationship features between characters and the words they form. The linear feature encoder 112B further compresses, encodes and transforms the feature space of the upstream feature vectors so that subsequent operations are faster and better. The first classifier 113 contains a softmax function, gives scores for the different classes based on the upstream feature codes, and calculates the probability that the input customer question corresponds to each class. The teacher models 110 have large parameter counts and high computational complexity, and acquire excellent intention recognition capability through training on a large number of customer questions and corresponding intention labels from various industries (clothing, shoes and bags, electric appliances, daily necessities, and the like). The word vectors of the multi-industry customer questions first pass through the computationally complex, non-parallelizable long short-term memory network feature extractor 111A in the teacher model 110 to obtain sufficient contextual feature information; convolution then extracts more local text features and the relationship features between characters and the words they form; after being encoded by the linear feature encoder 112B, these features are judged by the first classifier 113 to obtain the probability that the input customer question corresponds to each class.
The likelihood estimation pooling device 300 obtains the probabilities output by each teacher model 110 in the teacher model group 100 and pools them, smoothing the errors caused by any single model by calculating the average of the probabilities output by all teacher models 110, thereby improving prediction accuracy.
The student model 200 comprises a convolutional network feature extractor 211, a linear feature encoder 221 and a second classifier 230. The student model 200 applies convolution to the word vectors of the multi-industry customer questions to extract local text features and the relationship features between characters and the words they form; after being encoded by the linear feature encoder 221, these features are judged by the second classifier 230 to obtain the probability that the input customer question corresponds to each class.
The knowledge distilling apparatus 400 is used for measuring the difference between the pooled likelihood estimation probability of the teacher model group 100, output by the likelihood estimation pooling device 300, and the likelihood estimation probability of the student model 200. The knowledge distilling apparatus 400 quantifies this difference as a numerical value and updates the parameters of the student model 200 through a common gradient descent algorithm until the distance between the two probabilities no longer decreases (a training-loop sketch follows).
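The following sketch illustrates one possible distillation loop for this embodiment, reusing the pooling and loss sketches above as pool_fn and loss_fn; the data-loader interface, patience counter and stopping threshold are assumptions standing in for "until the distance no longer decreases".

```python
import torch

def distill(teachers, student, loader, pool_fn, loss_fn, lr=0.01, patience=3):
    for t in teachers:
        t.eval()                                            # teachers are already trained and stay frozen
    opt = torch.optim.SGD(student.parameters(), lr=lr)
    best, stale = float("inf"), 0
    while stale < patience:                                 # stop when the distance no longer decreases
        epoch_loss = 0.0
        for token_ids, _labels in loader:                   # labels are not needed for distillation
            with torch.no_grad():
                pooled = pool_fn([t(token_ids) for t in teachers])   # pooled teacher probabilities
            loss = loss_fn(student(token_ids), pooled)               # distance to the student's probabilities
            opt.zero_grad(); loss.backward(); opt.step()             # gradient-descent update of the student
            epoch_loss += loss.item()
        best, stale = (epoch_loss, 0) if epoch_loss < best - 1e-4 else (best, stale + 1)
    return student
```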
The obtained convolutional network feature extractor 211 and linear feature encoder 221 of the student model 200 are used as the convolutional network feature extractor 511 and linear feature encoder 521 of the student pre-training model 500, so that the student pre-training model 500 is a lightweight text encoder that receives the character/word vectors of a customer question and encodes them into a customer question feature vector supplied to specific task applications. The student pre-training model 500 can be used as a pre-training model for any natural language understanding task.
The customer question feature vectors encoded by the student pre-training model 500 can be applied in the application layer, which may include a clustering device, a classifier and a matcher. According to task requirements, the clustering device can cluster the customer questions mapped into the high-dimensional vector space; the matcher maps the feature vectors of two customer questions to a similarity value and judges whether the two questions are similar according to a specific threshold setting (see the sketch after this paragraph); for classification tasks, such as intention recognition, a classifier can be attached after the student pre-training model 500 and trained on intelligent customer service data of the corresponding commodity categories, yielding an excellent intention recognition system.
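As an illustration of the matcher, the sketch below encodes two customer questions with the encode() helper from the earlier sketch, maps the pair to a cosine-similarity value and compares it against a threshold; the threshold value is an assumption.

```python
import torch

def is_similar(student_pretrained, q1_ids: torch.Tensor, q2_ids: torch.Tensor,
               threshold: float = 0.8) -> bool:
    v1 = encode(student_pretrained, q1_ids.unsqueeze(0)).squeeze(0)   # encode() from the earlier sketch
    v2 = encode(student_pretrained, q2_ids.unsqueeze(0)).squeeze(0)
    similarity = torch.cosine_similarity(v1, v2, dim=0)               # map the pair to one similarity value
    return bool(similarity.item() >= threshold)                       # judge against the threshold setting
```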
After the probability estimates output by the teacher model group 100 (which may contain models with various complex structures) pass through the likelihood estimation pooling device 300, the estimation results of the different teacher models 110 are summarized, so that the classification probability of the customer question is more accurate; this improved understanding of real customer questions is then taught to the student model 200. According to the experimental results, the F1 score improves by at least 1%.
When the intelligent customer service pre-training model integrated acceleration device is used, the teacher model group 100 and the student model 200 are first defined; the customer question data labeled with classification labels are then respectively input into the teacher model group 100 and the student model 200 for training, the student model 200 outputting a likelihood estimation probability value and the teacher model group 100 outputting the likelihood estimation probability value corresponding to each teacher model 110; the likelihood estimation pooling device 300 pools the likelihood estimation probability values output by the teacher model group 100 and outputs the pooled likelihood estimation probability value; the knowledge distilling device 400 measures the difference between the pooled likelihood estimation probability value of the teacher model group 100 and the likelihood estimation probability value of the student model 200; the parameters of the student model 200 are updated so that its likelihood estimation probability value iterates towards the pooled value, finally obtaining the student model 200 whose likelihood estimation probability value is closest to the pooled likelihood estimation probability value of the teacher model group; the obtained convolutional network feature extractor 211 and linear feature encoder 221 of the student model 200 are used as the student pre-training model 500; and the student pre-training model 500 predicts customer questions, encodes them into data feature vectors, and provides them to the application layer.
With the device of this embodiment, models with a huge amount of computation serve as the teacher models 110; the likelihood estimation probability values of all teacher models 110 in the teacher model group 100 are pooled and the estimation results of the different teacher models 110 are summarized, so that the classification probability of the customer question is more accurate and the understanding of real customer questions is further improved. The pooled likelihood estimation probability of the teacher model group 100 is compared with the likelihood estimation probability of the student model 200 to obtain a difference value; the student model 200 is updated according to that difference value until its likelihood estimation probability value is closest to the pooled value of the teacher model group 100, and the feature extractor 510 and feature encoder 520 of the obtained student model 200 are used as the student pre-training model 500. Through this updating process, the large amount of knowledge already learned by the teacher models 110 and their ways of understanding that knowledge are transferred to the student model 200, so that the effect of the complex teacher models 110 is retained while the speed of recognizing training data in a real scene is ensured. The obtained student pre-training model 500 encodes data to be trained into data feature vectors, which can be applied to different processing tasks and reused after a single pass of processing, reducing computational complexity.
The above embodiments only express specific implementations of the present invention, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the inventive concept, and these fall within the protection scope of the present invention.

Claims (8)

1. A pre-training model integration acceleration method based on knowledge distillation, characterized by comprising the following steps:
defining a teacher model group, and defining a student model;
respectively inputting the client consultation text training data labeled with the classification labels into a teacher model group and a student model for training, and outputting likelihood estimation probability values by the student model; the teacher model group outputs likelihood estimation probability values corresponding to all the teacher models;
the teacher model group comprises a plurality of teacher models, each teacher model comprises a long-short term memory network feature extractor, a first convolutional network feature extractor, a first linear feature encoder and a first classifier, and the long-short term memory network feature extractor is a sequential-logic computing network used for extracting the contextual relationship features of a text; the teacher model obtains the classification probability as follows: the word vectors of the customer questions pass through the long-short term memory network feature extractor in the teacher model to obtain contextual feature information, local features of the text and the relationship features between characters and the words they form are extracted through convolution, and after being encoded by the first linear feature encoder these features are judged by the first classifier to obtain the probability that the input customer question corresponds to each class;
the student model comprises a second convolutional network feature extractor, a second linear feature encoder and a second classifier, and the student model obtains the classification probability as follows: convolution is applied to the word vectors of the customer questions to extract local features of the text and the relationship features between characters and the words they form, and after being encoded by the second linear feature encoder these features are judged by the second classifier to obtain the probability that the input customer question corresponds to each class;
performing pooling operation on the likelihood estimation probability value output by the teacher model group, and outputting the pooled likelihood estimation probability value;
measuring a difference value between the likelihood estimation probability value of the teacher model group after pooling and the likelihood estimation probability value of the student model;
updating the parameters of the student model to ensure that the likelihood estimation probability value of the student model iterates to the likelihood estimation probability value of the teacher model group after pooling, and finally obtaining the student model of which the likelihood estimation probability value is closest to the likelihood estimation probability value of the teacher model group after pooling;
taking the feature extractor and the feature encoder of the obtained student model as a student pre-training model;
the student pre-training model predicts client consultation text data to be trained and encodes it into client consultation text data feature vectors.
2. The integrated acceleration method of pre-training models based on knowledge distillation of claim 1, characterized by: the teacher model group comprises a plurality of teacher models, and each teacher model comprises a first feature extractor, a first feature encoder and a first classifier.
3. The integrated acceleration method of pre-training models based on knowledge distillation of claim 2, characterized by: the first feature extractor comprises a convolution network feature extractor and a combination of a long-short term memory network feature extractor and a convolution network feature extractor.
4. The integrated acceleration method of pre-training models based on knowledge distillation of claim 1, characterized by: the pooling operation comprises an averaging operation and a weighted averaging operation; the averaging operation comprises: averaging likelihood estimation probability values corresponding to all teacher models output by the teacher model group; the weighted averaging operation comprises: and weighting likelihood estimation probability values corresponding to all teacher models output by the teacher model group and then averaging the likelihood estimation probability values.
5. The integrated acceleration method of pre-training models based on knowledge distillation of claim 1, characterized by: the student model comprises a second feature extractor, a second feature encoder and a second classifier.
6. The integrated acceleration method of pre-training models based on knowledge distillation of claim 1, characterized by: and the difference between the likelihood estimation probability value of the teacher model group after pooling and the likelihood estimation probability value of the student model is measured by adopting a cross entropy loss function or KL divergence.
7. The knowledge-distillation-based pre-training model integration acceleration method of claim 1, characterized in that: the parameters of the student model are updated and calculated by adopting a gradient descent algorithm.
8. An apparatus applying the knowledge distillation-based pre-training model integration acceleration method according to any one of claims 1-7, characterized in that: the apparatus comprises a teacher model group, a likelihood estimation pooling device, a student model, a knowledge distilling device and a student pre-training model;
the teacher model group comprises a plurality of teacher models and is used for training the client consultation text training data labeled with the classification labels to obtain likelihood estimation probability values corresponding to the teacher models;
the student model is used for training the client consultation text training data marked with the classification labels to obtain likelihood estimation probability values corresponding to the student model;
the likelihood estimation pooling device is used for pooling likelihood estimation probability values output by the teacher model group and outputting the pooled likelihood estimation probability values;
the knowledge distilling device is used for measuring a difference value between the likelihood estimation probability value of the teacher model group after pooling and the likelihood estimation probability value of the student model, and updating parameters of the student model to obtain the student model of which the likelihood estimation probability value is closest to the likelihood estimation probability value of the teacher model group after pooling;
the student pre-training model comprises a feature extractor and a feature encoder of the obtained student model, and is used for encoding the client consultation text data to be trained into a client consultation text data feature vector.
CN201911134079.6A 2019-11-19 2019-11-19 Pre-training model integration acceleration method and device based on knowledge distillation Active CN110852426B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911134079.6A CN110852426B (en) 2019-11-19 2019-11-19 Pre-training model integration acceleration method and device based on knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911134079.6A CN110852426B (en) 2019-11-19 2019-11-19 Pre-training model integration acceleration method and device based on knowledge distillation

Publications (2)

Publication Number Publication Date
CN110852426A CN110852426A (en) 2020-02-28
CN110852426B true CN110852426B (en) 2023-03-24

Family

ID=69602619

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911134079.6A Active CN110852426B (en) 2019-11-19 2019-11-19 Pre-training model integration acceleration method and device based on knowledge distillation

Country Status (1)

Country Link
CN (1) CN110852426B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523324B (en) * 2020-03-18 2024-01-26 大箴(杭州)科技有限公司 Named entity recognition model training method and device
CN111506702A (en) * 2020-03-25 2020-08-07 北京万里红科技股份有限公司 Knowledge distillation-based language model training method, text classification method and device
CN111611377B (en) * 2020-04-22 2021-10-29 淮阴工学院 Knowledge distillation-based multi-layer neural network language model training method and device
CN111967224A (en) * 2020-08-18 2020-11-20 深圳市欢太科技有限公司 Method and device for processing dialog text, electronic equipment and storage medium
CN112184508B (en) * 2020-10-13 2021-04-27 上海依图网络科技有限公司 Student model training method and device for image processing
CN112465138A (en) * 2020-11-20 2021-03-09 平安科技(深圳)有限公司 Model distillation method, device, storage medium and equipment
US20220188622A1 (en) * 2020-12-10 2022-06-16 International Business Machines Corporation Alternative soft label generation
CN112836762A (en) * 2021-02-26 2021-05-25 平安科技(深圳)有限公司 Model distillation method, device, equipment and storage medium
CN112949786B (en) * 2021-05-17 2021-08-06 腾讯科技(深圳)有限公司 Data classification identification method, device, equipment and readable storage medium
CN113469977B (en) * 2021-07-06 2024-01-12 浙江霖研精密科技有限公司 Flaw detection device, method and storage medium based on distillation learning mechanism
CN113836903B (en) * 2021-08-17 2023-07-18 淮阴工学院 Enterprise portrait tag extraction method and device based on situation embedding and knowledge distillation
CN113673254B (en) * 2021-08-23 2022-06-07 东北林业大学 Knowledge distillation position detection method based on similarity maintenance
CN113837308B (en) * 2021-09-29 2022-08-05 北京百度网讯科技有限公司 Knowledge distillation-based model training method and device and electronic equipment
CN114241282B (en) * 2021-11-04 2024-01-26 河南工业大学 Knowledge distillation-based edge equipment scene recognition method and device
WO2023212997A1 (en) * 2022-05-05 2023-11-09 五邑大学 Knowledge distillation based neural network training method, device, and storage medium
CN115064155A (en) * 2022-06-09 2022-09-16 福州大学 End-to-end voice recognition incremental learning method and system based on knowledge distillation
CN114841173B (en) * 2022-07-04 2022-11-18 北京邮电大学 Academic text semantic feature extraction method and system based on pre-training model and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11410029B2 (en) * 2018-01-02 2022-08-09 International Business Machines Corporation Soft label generation for knowledge distillation

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135574A (en) * 2018-02-09 2019-08-16 北京世纪好未来教育科技有限公司 Neural network training method, image generating method and computer storage medium
CN108921294A (en) * 2018-07-11 2018-11-30 浙江大学 A kind of gradual piece of knowledge distillating method accelerated for neural network
CN109616105A (en) * 2018-11-30 2019-04-12 江苏网进科技股份有限公司 A kind of noisy speech recognition methods based on transfer learning
CN109829038A (en) * 2018-12-11 2019-05-31 平安科技(深圳)有限公司 Question and answer feedback method, device, equipment and storage medium based on deep learning
CN109637546A (en) * 2018-12-29 2019-04-16 苏州思必驰信息科技有限公司 Knowledge distillating method and device
CN109871851A (en) * 2019-03-06 2019-06-11 长春理工大学 A kind of Chinese-character writing normalization determination method based on convolutional neural networks algorithm
CN110097178A (en) * 2019-05-15 2019-08-06 电科瑞达(成都)科技有限公司 It is a kind of paid attention to based on entropy neural network model compression and accelerated method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Improving the interpretability of deep neural networks with knowledge distillation; X. Liu et al.; 2018 IEEE International Conference on Data Mining Workshops (ICDMW); 2018-12-28; 1-8 *
An improved convolutional neural network based on a simulated annealing algorithm; Man Fenghuan et al.; Microelectronics & Computer; 2017-11-02; Vol. 34, No. 9; 58-62 *
Face recognition based on deep feature distillation; Ge Shiming et al.; Journal of Beijing Jiaotong University; 2017-12-15; No. 6; 32-38+46 *
Next-basket recommendation based on users' implicit feedback behavior; Li Yumeng et al.; Journal of Chinese Information Processing; 2017-09-15; No. 5; 220-227 *
A survey of deep neural network compression and acceleration; Ji Rongrong et al.; Journal of Computer Research and Development; 2018-09-15; Vol. 55, No. 9; 1871-1888 *

Also Published As

Publication number Publication date
CN110852426A (en) 2020-02-28

Similar Documents

Publication Publication Date Title
CN110852426B (en) Pre-training model integration acceleration method and device based on knowledge distillation
CN111695415B (en) Image recognition method and related equipment
CN110609891A (en) Visual dialog generation method based on context awareness graph neural network
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN113407660B (en) Unstructured text event extraction method
CN108537257B (en) Zero sample image classification method based on discriminant dictionary matrix pair
CN113822776B (en) Course recommendation method, device, equipment and storage medium
CN115422944A (en) Semantic recognition method, device, equipment and storage medium
CN110781686B (en) Statement similarity calculation method and device and computer equipment
Keren et al. Convolutional neural networks with data augmentation for classifying speakers' native language
CN112784778A (en) Method, apparatus, device and medium for generating model and identifying age and gender
Dai et al. Hybrid deep model for human behavior understanding on industrial internet of video things
CN113723238A (en) Human face lightweight network model construction method and human face recognition method
CN110110724A (en) The text authentication code recognition methods of function drive capsule neural network is squeezed based on exponential type
CN112988970A (en) Text matching algorithm serving intelligent question-answering system
CN113255366A (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN108805280B (en) Image retrieval method and device
CN114036298B (en) Node classification method based on graph convolution neural network and word vector
CN108496174B (en) Method and system for face recognition
CN114117039A (en) Small sample text classification method and model
CN116883746A (en) Graph node classification method based on partition pooling hypergraph neural network
CN116089605A (en) Text emotion analysis method based on transfer learning and improved word bag model
CN115687620A (en) User attribute detection method based on tri-modal characterization learning
CN114741487A (en) Image-text retrieval method and system based on image-text semantic embedding
CN114036947A (en) Small sample text classification method and system for semi-supervised learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant