CN110852426B - Pre-training model integration acceleration method and device based on knowledge distillation - Google Patents

Pre-training model integration acceleration method and device based on knowledge distillation

Info

Publication number
CN110852426B
CN110852426B (application CN201911134079.6A)
Authority
CN
China
Prior art keywords
model
likelihood estimation
teacher
student
estimation probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911134079.6A
Other languages
Chinese (zh)
Other versions
CN110852426A (en)
Inventor
宋子文晗
江岭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Xiaoduo Technology Co ltd
Original Assignee
Chengdu Xiaoduo Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Xiaoduo Technology Co ltd filed Critical Chengdu Xiaoduo Technology Co ltd
Priority to CN201911134079.6A priority Critical patent/CN110852426B/en
Publication of CN110852426A publication Critical patent/CN110852426A/en
Application granted granted Critical
Publication of CN110852426B publication Critical patent/CN110852426B/en
Legal status: Active (granted)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a pre-training model integration acceleration method based on knowledge distillation and a device applying the method. The method comprises: defining a teacher model group and a student model; inputting training data labeled with classification labels into the teacher model group and the student model for training, and outputting the likelihood estimation probability value corresponding to each teacher model as well as the likelihood estimation probability value of the student model; pooling the likelihood estimation probability values output by the teacher model group and outputting the pooled likelihood estimation probability value; measuring the difference between the pooled likelihood estimation probability value of the teacher model group and the likelihood estimation probability value of the student model; updating the parameters of the student model until the likelihood estimation probability value of the student model is closest to the pooled likelihood estimation probability value of the teacher model group; and using the feature extractor and feature encoder of the obtained student model as a student pre-training model that predicts the data to be trained and encodes it into data feature vectors.

Description

Pre-training model integration acceleration method and device based on knowledge distillation
Technical Field
The invention belongs to the technical field of neural network data processing, and particularly relates to a pre-training model integration acceleration method and device based on knowledge distillation.
Background
In recent years, convolutional neural networks have achieved tremendous success in tasks in computer vision and related fields, such as face detection, image classification and natural language processing. For example, Yann LeCun et al. of New York University proposed using a multi-layer convolutional neural network for handwriting recognition, and the Hinton team used deep neural networks to win the ImageNet image classification competition by an overwhelming margin.
With the development of convolutional neural networks, their hierarchical structures have become increasingly complex and their parameter counts ever larger; correspondingly, the training data sets required to train an excellent convolutional neural network have grown enormous. As a result, the time and space complexity and the storage cost during operation increase greatly, and existing large convolutional neural networks depend on high-performance processors and cluster servers with extremely strong computing power. The huge computation load, time consumption and energy consumption make convolutional neural networks difficult to deploy on mobile devices with limited computing resources and energy storage, such as mobile phones and intelligent wearable devices. Therefore, compressing the parameters of large neural networks and reducing their computational complexity are important research directions.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a pre-training model integration acceleration method and device based on knowledge distillation. In the method, models with a huge amount of computation are used as teacher models; the likelihood estimation probability values of all teacher models in a teacher model group are pooled, and the estimation results of the different teacher models are summarized, so that the classification probability of the data is more accurate and the understanding of the data is further improved. The pooled likelihood estimation probability of the teacher model group is compared with the likelihood estimation probability of the student model to obtain a difference value; the student model is updated according to this difference value until its likelihood estimation probability value is closest to the pooled value of the teacher model group, and the feature extractor and feature encoder of the obtained student model are used as a student pre-training model. Through this updating process, the large amount of knowledge learned by the teacher models and their ways of understanding that knowledge are transferred to the student model, so that the effect of the complex teacher models is retained while the speed of recognizing training data in a real scene is ensured. The obtained student pre-training model encodes data to be trained into data feature vectors, which can be applied to different processing tasks and reused after a single pass of processing, reducing computational complexity.
In order to achieve the above purpose, the solution adopted by the invention is as follows: the integrated acceleration method of the pre-training model based on knowledge distillation comprises the following steps:
defining a teacher model group, wherein the teacher model group comprises a plurality of teacher models, each teacher model comprises a first feature extractor, a first feature encoder and a first classifier, and the first feature extractor comprises a convolutional network feature extractor and a combination of a long-short term memory network feature extractor and a convolutional network feature extractor; defining a student model, wherein the student model comprises a second feature extractor, a second feature encoder and a second classifier; the teacher model group consists of teacher models that have already been trained and have excellent recognition capability, the first feature extractor and the first feature encoder of each teacher model having been trained and iterated on labeled training data, while the second feature extractor and the second feature encoder of the student model are original and untrained (an illustrative sketch of these model structures follows).
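By way of illustration only, the following PyTorch-style sketch shows one possible shape of a teacher model and of the student model described above; the class names, layer sizes and hyper-parameters are assumptions made for the sketch and are not the concrete implementation of the invention.

```python
# Minimal sketch of one teacher model and the student model (assumed shapes/sizes).
import torch
import torch.nn as nn

class TeacherModel(nn.Module):
    """LSTM + CNN feature extractor -> linear feature encoder -> classifier."""
    def __init__(self, vocab_size, emb_dim=128, hidden=256, n_classes=30):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)  # context features
        self.conv = nn.Conv1d(2 * hidden, hidden, kernel_size=3, padding=1)         # local n-gram features
        self.encoder = nn.Linear(hidden, hidden // 2)                                # first feature encoder
        self.classifier = nn.Linear(hidden // 2, n_classes)                          # first classifier

    def forward(self, token_ids):
        x = self.embed(token_ids)                      # (B, T, E)
        x, _ = self.lstm(x)                            # (B, T, 2H)
        x = torch.relu(self.conv(x.transpose(1, 2)))   # (B, H, T)
        x = x.max(dim=2).values                        # pool over time -> (B, H)
        feat = torch.relu(self.encoder(x))             # feature code
        return torch.softmax(self.classifier(feat), dim=-1)  # likelihood estimation probabilities

class StudentModel(nn.Module):
    """Lightweight CNN feature extractor -> linear feature encoder -> classifier."""
    def __init__(self, vocab_size, emb_dim=128, hidden=128, n_classes=30):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        self.encoder = nn.Linear(hidden, hidden // 2)
        self.classifier = nn.Linear(hidden // 2, n_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        x = torch.relu(self.conv(x.transpose(1, 2))).max(dim=2).values
        feat = torch.relu(self.encoder(x))
        return torch.softmax(self.classifier(feat), dim=-1)
```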
Respectively inputting the training data labeled with classification labels into the teacher model group and the student model for training, the student model outputting a likelihood estimation probability value and the teacher model group outputting the likelihood estimation probability value corresponding to each teacher model. Because the teacher model group outputs the result of every teacher model, these results complement and corroborate one another in the classification decision, which reduces judgment errors and improves prediction accuracy.
Performing a pooling operation on the likelihood estimation probability values output by the teacher model group, and outputting the pooled likelihood estimation probability value, wherein the pooling operation comprises an averaging operation and a weighted averaging operation. The averaging operation averages the likelihood estimation probability values corresponding to all teacher models output by the teacher model group; the weighted averaging operation weights the likelihood estimation probability values corresponding to all teacher models and then averages them. By computing the average or weighted average of the probabilities output by all teacher models, errors caused by any single teacher model are smoothed out, improving prediction accuracy (a minimal sketch of both pooling variants follows);
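A minimal sketch of the two pooling operations, assuming each teacher model returns a (batch, n_classes) tensor of likelihood estimation probabilities; the weighting scheme used in the weighted average is an illustrative assumption.

```python
import torch

def average_pooling(teacher_probs: list[torch.Tensor]) -> torch.Tensor:
    """Plain average of the likelihood estimation probabilities of all teachers."""
    return torch.stack(teacher_probs, dim=0).mean(dim=0)

def weighted_average_pooling(teacher_probs: list[torch.Tensor],
                             weights: list[float]) -> torch.Tensor:
    """Weighted average; weights are illustrative, e.g. per-teacher validation accuracy."""
    w = torch.tensor(weights, dtype=teacher_probs[0].dtype)
    w = w / w.sum()                                  # normalize so the result stays a distribution
    stacked = torch.stack(teacher_probs, dim=0)      # (n_teachers, batch, n_classes)
    return (w.view(-1, 1, 1) * stacked).sum(dim=0)
```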
measuring a difference value between the likelihood estimation probability value of the teacher model group after pooling and the likelihood estimation probability value of the student model;
updating the parameters of the student model by means of a gradient descent algorithm, so that the likelihood estimation probability value of the student model iterates towards the pooled likelihood estimation probability value of the teacher model group, finally obtaining the student model whose likelihood estimation probability value is closest to the pooled likelihood estimation probability value of the teacher model group;
taking the feature extractor and the feature encoder of the obtained student model as a student pre-training model;
the student pre-training model predicts the data to be trained, encodes the data into a data feature vector, and provides the data feature vector to downstream applications such as classification, clustering and matching.
The difference between the pooled likelihood estimation probability value of the teacher model group and the likelihood estimation probability value of the student model is measured with a cross-entropy loss function or with the KL divergence (which measures the distance between two probability distributions); a sketch of this measurement and of the gradient-descent update follows.
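As an illustration of the difference measurement and the gradient-descent update, the following sketch assumes the StudentModel of the earlier sketch, a pooled teacher probability tensor, and an SGD optimizer; the optimizer choice, learning rate and direction of the KL divergence are assumptions rather than requirements of the invention.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_probs: torch.Tensor,
                      pooled_teacher: torch.Tensor,
                      use_kl: bool = True) -> torch.Tensor:
    eps = 1e-8
    if use_kl:
        # KL(teacher || student): distance between the two probability distributions
        return F.kl_div((student_probs + eps).log(), pooled_teacher, reduction="batchmean")
    # Cross-entropy of the student against the teacher's soft labels
    return -(pooled_teacher * (student_probs + eps).log()).sum(dim=-1).mean()

# One update step (gradient descent on the student's parameters only):
# optimizer = torch.optim.SGD(student.parameters(), lr=0.01)
# loss = distillation_loss(student(token_ids), pooled_teacher)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```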
The device for applying the pre-training model integration acceleration method based on knowledge distillation comprises a teacher model group, a likelihood estimation pooling device, a student model, a knowledge distillation device and a student pre-training model;
the teacher model group comprises a plurality of teacher models and is used for training on the training data labeled with classification labels to obtain the likelihood estimation probability value corresponding to each teacher model; the teacher models may be identical or different, and the pooled likelihood estimation probability value of a teacher model group composed of different teacher models helps the student model absorb more knowledge and more ways of understanding that knowledge;
the student model is used for training the training data marked with the classification labels to obtain likelihood estimation probability values corresponding to the student model;
the likelihood estimation pooling device is used for pooling the likelihood estimation probability values output by the teacher model group and outputting the pooled likelihood estimation probability value; it gathers the prediction results of the several teacher models in the teacher model group and smooths the errors caused by any single model by calculating the average of the probabilities output by all teacher models, improving prediction accuracy;
the knowledge distilling device is used for measuring a difference value between the likelihood estimation probability value of the teacher model group after pooling and the likelihood estimation probability value of the student model, and updating parameters of the student model to obtain the student model of which the likelihood estimation probability value is closest to the likelihood estimation probability value of the teacher model group after pooling;
the student pre-training model is used for a feature extractor and a feature encoder which comprise the obtained student model, and is used for encoding data to be trained into data feature vectors. The data characteristic vectors processed by the student pre-training model can be applied to different processing tasks, and can be repeatedly applied after one-time processing, so that the operation complexity is reduced. For example, the clustering device is used for clustering; classifying the application in a classifier; the method is applied to a matcher for matching.
The teacher model comprises a first feature extractor, a first feature encoder and a first classifier.
The student model comprises a second feature extractor, a second feature encoder and a second classifier.
The invention has the beneficial effects that:
(1) The method uses models with a huge amount of computation as teacher models, pools the likelihood estimation probability values of all teacher models in the teacher model group, and summarizes the estimation results of the different teacher models, so that the classification probability of the data is more accurate and the understanding of the data is further improved. The pooled likelihood estimation probability of the teacher model group is compared with the likelihood estimation probability of the student model to obtain a difference value; the student model is updated according to this difference value until its likelihood estimation probability value is closest to the pooled value of the teacher model group, and the feature extractor and feature encoder of the obtained student model are used as a student pre-training model. Through this updating process, the large amount of knowledge learned by the teacher models and their ways of understanding that knowledge are transferred to the student model, so that the effect of the complex teacher models is retained while the speed of recognizing training data in a real scene is ensured. The obtained student pre-training model encodes data to be trained into data feature vectors, which can be applied to different processing tasks and reused after a single pass of processing, reducing computational complexity.
Drawings
FIG. 1 is a flow chart of a pre-training model integration acceleration method of the present invention;
FIG. 2 is a diagram of an integrated acceleration device for a pre-training model according to the present invention;
FIG. 3 is a diagram of an intelligent customer service pre-training model integrated accelerator according to an embodiment of the present invention;
in the figure: 100-teacher model group, 110-teacher model, 111-first feature extractor, 111A-long short-term memory network feature extractor, 111B-convolutional network feature extractor, 112-first feature encoder, 112B-linear feature encoder, 113-first classifier, 200-student model, 210-second feature extractor, 211-convolutional network feature extractor, 220-second feature encoder, 221-linear feature encoder, 230-second classifier, 300-likelihood estimation pooling device, 400-knowledge distilling device, 500-student pre-training model, 510-feature extractor, 511-convolutional network feature extractor, 520-feature encoder, 521-linear feature encoder.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
as shown in FIG. 1, the method for integrating and accelerating the pre-training model based on knowledge distillation comprises the following steps:
defining a teacher model group 100, wherein the teacher model group 100 comprises a plurality of teacher models 110, each teacher model 110 comprises a first feature extractor 111, a first feature encoder 112 and a first classifier 113, and the first feature extractor 111 comprises a convolutional network feature extractor and a combination of a long-short term memory network feature extractor and a convolutional network feature extractor; defining a student model 200, wherein the student model 200 comprises a second feature extractor 210, a second feature encoder 220 and a second classifier 230; the teacher model group 100 includes a plurality of teacher models 110 that have already been trained and have excellent recognition capability, the first feature extractor 111 and the first feature encoder 112 of each teacher model 110 having been trained and iterated on labeled training data, while the student model 200 uses an original, untrained second feature extractor 210 and second feature encoder 220.
Respectively inputting the training data labeled with classification labels into the teacher model group 100 and the student model 200 for training; the student model 200 directly calculates and outputs a likelihood estimation probability value from the original parameters of the second feature extractor 210 and the second feature encoder 220, without iteration; the teacher model group 100 outputs the likelihood estimation probability value corresponding to each teacher model 110, and the teacher models 110 likewise need no iteration, directly calculating the likelihood estimation probability values of the training data. Because the teacher model group 100 outputs the result of every teacher model 110, these results complement and corroborate one another in the classification decision, reducing judgment errors and improving prediction accuracy.
Pooling the likelihood estimation probability values output by the teacher model group 100 and outputting the pooled likelihood estimation probability value, wherein the pooling comprises an averaging operation and a weighted averaging operation. The averaging operation averages the likelihood estimation probability values corresponding to each teacher model 110 output by the teacher model group 100; the weighted averaging operation weights those values and then averages them. By calculating the average or weighted average of the probabilities output by all teacher models 110, errors caused by any single teacher model 110 are smoothed out, improving prediction accuracy;
measuring a difference value between the likelihood estimation probability value of the teacher model group 100 after pooling and the likelihood estimation probability value of the student model 200;
updating the parameters of the student model 200 by means of a gradient descent algorithm, so that the likelihood estimation probability value of the student model 200 iterates towards the pooled likelihood estimation probability value of the teacher model group 100, finally obtaining the student model 200 whose likelihood estimation probability value is closest to the pooled likelihood estimation probability value of the teacher model group 100;
taking the feature extractor 210 and the feature encoder 220 of the obtained student model 200 as a student pre-training model 500;
the student pre-training model 500 predicts the data to be trained, encodes the data into data feature vectors, and provides the data feature vectors to downstream applications such as classification, clustering, and matching.
The difference between the pooled likelihood estimation probability value of the teacher model group 100 and the likelihood estimation probability value of the student model 200 is measured with a cross-entropy loss function or with the KL divergence (which measures the distance between two probability distributions).
As shown in fig. 2, the apparatus for applying the knowledge distillation-based pre-training model integrated acceleration method includes a teacher model group 100, a likelihood estimation pooling device 300, a student model 200, a knowledge distillation apparatus 400, and a student pre-training model 500;
the teacher model group 100 comprises a plurality of teacher models 110 and is used for training on the training data labeled with classification labels to obtain the likelihood estimation probability value corresponding to each teacher model 110; the teacher models 110 may be identical or different, and the pooled likelihood estimation probability value of a teacher model group 100 composed of different teacher models 110 helps the student model 200 absorb more knowledge and more ways of understanding that knowledge;
the student model 200 is used for training the training data labeled with the classification labels to obtain likelihood estimation probability values corresponding to the student model 200;
the likelihood estimation pooling device 300 is used for pooling the likelihood estimation probability values output by the teacher model group 100 and outputting the pooled likelihood estimation probability value; it gathers the prediction results of the several teacher models 110 in the teacher model group 100 and smooths the errors caused by any single model by calculating the average of the probabilities output by all teacher models 110, improving prediction accuracy;
the knowledge distilling device 400 is used for measuring the difference value between the likelihood estimation probability value of the teacher model group 100 after pooling and the likelihood estimation probability value of the student model 200, and updating the parameters of the student model 200 to obtain the student model 200 of which the likelihood estimation probability value is closest to the likelihood estimation probability value of the teacher model group 100 after pooling;
the student pre-training model 500 comprises the feature extractor 510 and the feature encoder 520 of the obtained student model 200, and is used for encoding data to be trained into data feature vectors. The data feature vectors produced by the student pre-training model can be applied to different processing tasks and reused after a single pass of processing, which reduces computational complexity: for example, they can be fed to a clustering device for clustering, to a classifier for classification, or to a matcher for matching.
The teacher model 110 includes a first feature extractor 111, a first feature encoder 112, and a first classifier 113.
The student model 200 includes a second feature extractor 210, a second feature encoder 220, and a second classifier 230.
Example one
With the rapid development of the e-commerce industry, online shopping has become part of most people's daily life. During online shopping, consumers often consult merchants about product performance, merchant services, suitable sizes and other questions, so merchants on the various e-commerce platforms need to recruit large numbers of customer service staff to answer customers' questions, and the growing consultation volume steadily increases merchants' demand for customer service robots. In the field of intelligent customer service, intention recognition is an important task whose goal is to understand the questions issued by buyers in a customer service scene, so that subsequent operations or replies can be made according to the recognized buyer intention.
The most common existing natural language understanding model is the long short-term memory network (LSTM), which performs sequential computation, has a low degree of parallelism, and whose computation is difficult to reduce through strategies such as pruning. Recently emerging pre-trained models for natural language understanding (BERT, XLNet, etc.) all have hundreds of millions of parameters, high computational complexity, high requirements on computing devices, and long response times, while simple models (CNN, Transformer, etc.) perform poorly after pre-training.
Meanwhile, as the consultation volume keeps growing, the computing demand of the intention recognition module on computing devices keeps expanding, and request response time becomes a bottleneck. It is therefore necessary to reduce the computation requirement, increase the response speed, reduce model complexity, and increase computational parallelism.
Fig. 3 is a schematic diagram of an intelligent customer service pre-training model integrated acceleration device to which the pre-training model integrated acceleration method of the present application is applied. The intelligent customer service pre-training model integrated accelerating device comprises a teacher model group 100, a student model 200, a likelihood estimation pooling device 300, a knowledge distilling device 400 and a student pre-training model 500.
The teacher model group 100 includes a plurality of teacher models 110. Each teacher model 110 comprises a long short-term memory network feature extractor 111A, a convolutional network feature extractor 111B, a linear feature encoder 112B and a first classifier 113. The long short-term memory network feature extractor 111A is a sequential-logic computing network that extracts the contextual relationship features of a text. The convolutional network feature extractor 111B extracts more local features of the text and obtains the relationship features between characters and the words they form. The linear feature encoder 112B further compresses, encodes and transforms the feature space of the upstream feature vectors so that subsequent operations are faster and better. The first classifier 113 contains a softmax function, gives scores for the different classes based on the upstream feature codes, and calculates the probability that the input customer question corresponds to each class. The teacher models 110 have large parameter counts and high computational complexity, and acquire excellent intention recognition capability through training on a large number of customer questions and corresponding intention labels from various industries (clothing, shoes and bags, electric appliances, daily necessities, and the like). The word vectors of the multi-industry customer questions first pass through the computationally complex, non-parallelizable long short-term memory network feature extractor 111A in the teacher model 110 to obtain sufficient contextual feature information; convolution then extracts more local text features and the relationship features between characters and the words they form; after being encoded by the linear feature encoder 112B, these features are judged by the first classifier 113 to obtain the probability that the input customer question corresponds to each class.
The likelihood estimation pooling device 300 obtains the probabilities output by each teacher model 110 in the teacher model group 100 and pools them, smoothing the errors caused by any single model by calculating the average of the probabilities output by all teacher models 110, thereby improving prediction accuracy.
The student model 200 comprises a convolutional network feature extractor 211, a linear feature encoder 221 and a second classifier 230. The student model 200 applies convolution to the word vectors of the multi-industry customer questions to extract local text features and the relationship features between characters and the words they form; after being encoded by the linear feature encoder 221, these features are judged by the second classifier 230 to obtain the probability that the input customer question corresponds to each class.
The knowledge distilling apparatus 400 is used for measuring the difference between the pooled likelihood estimation probability of the teacher model group 100, output by the likelihood estimation pooling device 300, and the likelihood estimation probability of the student model 200. The knowledge distilling apparatus 400 quantifies this difference as a numerical value and updates the parameters of the student model 200 through a common gradient descent algorithm until the distance between the two probabilities no longer decreases (a training-loop sketch follows).
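The following sketch illustrates one possible distillation loop for this embodiment, reusing the pooling and loss sketches above as pool_fn and loss_fn; the data-loader interface, patience counter and stopping threshold are assumptions standing in for "until the distance no longer decreases".

```python
import torch

def distill(teachers, student, loader, pool_fn, loss_fn, lr=0.01, patience=3):
    for t in teachers:
        t.eval()                                            # teachers are already trained and stay frozen
    opt = torch.optim.SGD(student.parameters(), lr=lr)
    best, stale = float("inf"), 0
    while stale < patience:                                 # stop when the distance no longer decreases
        epoch_loss = 0.0
        for token_ids, _labels in loader:                   # labels are not needed for distillation
            with torch.no_grad():
                pooled = pool_fn([t(token_ids) for t in teachers])   # pooled teacher probabilities
            loss = loss_fn(student(token_ids), pooled)               # distance to the student's probabilities
            opt.zero_grad(); loss.backward(); opt.step()             # gradient-descent update of the student
            epoch_loss += loss.item()
        best, stale = (epoch_loss, 0) if epoch_loss < best - 1e-4 else (best, stale + 1)
    return student
```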
The obtained convolutional network feature extractor 211 and linear feature encoder 221 of the student model 200 are used as the convolutional network feature extractor 511 and linear feature encoder 521 of the student pre-training model 500, so that the student pre-training model 500 is a lightweight text encoder that receives the character/word vectors of a customer question and encodes them into a customer question feature vector supplied to specific task applications. The student pre-training model 500 can be used as a pre-training model for any natural language understanding task.
The customer question feature vectors encoded by the student pre-training model 500 can be applied in the application layer, which may include a clustering device, a classifier and a matcher. According to task requirements, the clustering device can cluster the customer questions mapped into the high-dimensional vector space; the matcher maps the feature vectors of two customer questions to a similarity value and judges whether the two questions are similar according to a specific threshold setting (see the sketch after this paragraph); for classification tasks, such as intention recognition, a classifier can be attached after the student pre-training model 500 and trained on intelligent customer service data of the corresponding commodity categories, yielding an excellent intention recognition system.
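As an illustration of the matcher, the sketch below encodes two customer questions with the encode() helper from the earlier sketch, maps the pair to a cosine-similarity value and compares it against a threshold; the threshold value is an assumption.

```python
import torch

def is_similar(student_pretrained, q1_ids: torch.Tensor, q2_ids: torch.Tensor,
               threshold: float = 0.8) -> bool:
    v1 = encode(student_pretrained, q1_ids.unsqueeze(0)).squeeze(0)   # encode() from the earlier sketch
    v2 = encode(student_pretrained, q2_ids.unsqueeze(0)).squeeze(0)
    similarity = torch.cosine_similarity(v1, v2, dim=0)               # map the pair to one similarity value
    return bool(similarity.item() >= threshold)                       # judge against the threshold setting
```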
After the probability estimates output by the teacher model group 100 (which may contain models with various complex structures) pass through the likelihood estimation pooling device 300, the estimation results of the different teacher models 110 are summarized, so that the classification probability of the customer question is more accurate; this improved understanding of real customer questions is then taught to the student model 200. According to the experimental results, the F1 score improves by at least 1%.
When the intelligent customer service pre-training model integrated acceleration device is used, the teacher model group 100 and the student model 200 are first defined; the customer question data labeled with classification labels are then respectively input into the teacher model group 100 and the student model 200 for training, the student model 200 outputting a likelihood estimation probability value and the teacher model group 100 outputting the likelihood estimation probability value corresponding to each teacher model 110; the likelihood estimation pooling device 300 pools the likelihood estimation probability values output by the teacher model group 100 and outputs the pooled likelihood estimation probability value; the knowledge distilling device 400 measures the difference between the pooled likelihood estimation probability value of the teacher model group 100 and the likelihood estimation probability value of the student model 200; the parameters of the student model 200 are updated so that its likelihood estimation probability value iterates towards the pooled value, finally obtaining the student model 200 whose likelihood estimation probability value is closest to the pooled likelihood estimation probability value of the teacher model group; the obtained convolutional network feature extractor 211 and linear feature encoder 221 of the student model 200 are used as the student pre-training model 500; and the student pre-training model 500 predicts customer questions, encodes them into data feature vectors, and provides them to the application layer.
With the device of this embodiment, models with a huge amount of computation serve as the teacher models 110; the likelihood estimation probability values of all teacher models 110 in the teacher model group 100 are pooled and the estimation results of the different teacher models 110 are summarized, so that the classification probability of the customer question is more accurate and the understanding of real customer questions is further improved. The pooled likelihood estimation probability of the teacher model group 100 is compared with the likelihood estimation probability of the student model 200 to obtain a difference value; the student model 200 is updated according to that difference value until its likelihood estimation probability value is closest to the pooled value of the teacher model group 100, and the feature extractor 510 and feature encoder 520 of the obtained student model 200 are used as the student pre-training model 500. Through this updating process, the large amount of knowledge already learned by the teacher models 110 and their ways of understanding that knowledge are transferred to the student model 200, so that the effect of the complex teacher models 110 is retained while the speed of recognizing training data in a real scene is ensured. The obtained student pre-training model 500 encodes data to be trained into data feature vectors, which can be applied to different processing tasks and reused after a single pass of processing, reducing computational complexity.
The above embodiments only express specific implementations of the present invention, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the inventive concept, and these fall within the protection scope of the present invention.

Claims (8)

1. A pre-training model integration acceleration method based on knowledge distillation, characterized by comprising the following steps:
defining a teacher model group, and defining a student model;
respectively inputting the client consultation text training data labeled with the classification labels into a teacher model group and a student model for training, and outputting likelihood estimation probability values by the student model; the teacher model group outputs likelihood estimation probability values corresponding to all the teacher models;
the teacher model group comprises a plurality of teacher models, each teacher model comprises a long-short term memory network feature extractor, a first convolutional network feature extractor, a first linear feature encoder and a first classifier, and the long-short term memory network feature extractor is a sequential-logic computing network used for extracting the contextual relationship features of a text; the teacher model obtains the classification probability as follows: the word vectors of the customer questions pass through the long-short term memory network feature extractor in the teacher model to obtain contextual feature information, local features of the text and the relationship features between characters and the words they form are extracted through convolution, and after being encoded by the first linear feature encoder these features are judged by the first classifier to obtain the probability that the input customer question corresponds to each class;
the student model comprises a second convolutional network feature extractor, a second linear feature encoder and a second classifier, and the student model obtains the classification probability as follows: convolution is applied to the word vectors of the customer questions to extract local features of the text and the relationship features between characters and the words they form, and after being encoded by the second linear feature encoder these features are judged by the second classifier to obtain the probability that the input customer question corresponds to each class;
performing pooling operation on the likelihood estimation probability value output by the teacher model group, and outputting the pooled likelihood estimation probability value;
measuring a difference value between the likelihood estimation probability value of the teacher model group after pooling and the likelihood estimation probability value of the student model;
updating the parameters of the student model to ensure that the likelihood estimation probability value of the student model iterates to the likelihood estimation probability value of the teacher model group after pooling, and finally obtaining the student model of which the likelihood estimation probability value is closest to the likelihood estimation probability value of the teacher model group after pooling;
taking the feature extractor and the feature encoder of the obtained student model as a student pre-training model;
the student pre-training model predicts client consultation text data to be trained and encodes it into client consultation text data feature vectors.
2. The integrated acceleration method of pre-training models based on knowledge distillation of claim 1, characterized by: the teacher model group comprises a plurality of teacher models, and each teacher model comprises a first feature extractor, a first feature encoder and a first classifier.
3. The integrated acceleration method of pre-training models based on knowledge distillation of claim 2, characterized by: the first feature extractor comprises a convolution network feature extractor and a combination of a long-short term memory network feature extractor and a convolution network feature extractor.
4. The integrated acceleration method of pre-training models based on knowledge distillation of claim 1, characterized by: the pooling operation comprises an averaging operation and a weighted averaging operation; the averaging operation comprises: averaging likelihood estimation probability values corresponding to all teacher models output by the teacher model group; the weighted averaging operation comprises: and weighting likelihood estimation probability values corresponding to all teacher models output by the teacher model group and then averaging the likelihood estimation probability values.
5. The integrated acceleration method of pre-training models based on knowledge distillation of claim 1, characterized by: the student model comprises a second feature extractor, a second feature encoder and a second classifier.
6. The integrated acceleration method of pre-training models based on knowledge distillation of claim 1, characterized by: and the difference between the likelihood estimation probability value of the teacher model group after pooling and the likelihood estimation probability value of the student model is measured by adopting a cross entropy loss function or KL divergence.
7. The knowledge-distillation-based pre-training model integration acceleration method of claim 1, characterized in that: the parameters of the student model are updated and calculated by adopting a gradient descent algorithm.
8. An apparatus applying the knowledge distillation-based pre-training model integration acceleration method according to any one of claims 1-7, characterized in that: the apparatus comprises a teacher model group, a likelihood estimation pooling device, a student model, a knowledge distilling device and a student pre-training model;
the teacher model group comprises a plurality of teacher models and is used for training the client consultation text training data labeled with the classification labels to obtain likelihood estimation probability values corresponding to the teacher models;
the student model is used for training the client consultation text training data marked with the classification labels to obtain likelihood estimation probability values corresponding to the student model;
the likelihood estimation pooling device is used for pooling likelihood estimation probability values output by the teacher model group and outputting the pooled likelihood estimation probability values;
the knowledge distilling device is used for measuring a difference value between the likelihood estimation probability value of the teacher model group after pooling and the likelihood estimation probability value of the student model, and updating parameters of the student model to obtain the student model of which the likelihood estimation probability value is closest to the likelihood estimation probability value of the teacher model group after pooling;
the student pre-training model comprises a feature extractor and a feature encoder of the obtained student model, and is used for encoding the client consultation text data to be trained into a client consultation text data feature vector.
CN201911134079.6A 2019-11-19 2019-11-19 Pre-training model integration acceleration method and device based on knowledge distillation Active CN110852426B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911134079.6A CN110852426B (en) 2019-11-19 2019-11-19 Pre-training model integration acceleration method and device based on knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911134079.6A CN110852426B (en) 2019-11-19 2019-11-19 Pre-training model integration acceleration method and device based on knowledge distillation

Publications (2)

Publication Number Publication Date
CN110852426A CN110852426A (en) 2020-02-28
CN110852426B true CN110852426B (en) 2023-03-24

Family

ID=69602619

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911134079.6A Active CN110852426B (en) 2019-11-19 2019-11-19 Pre-training model integration acceleration method and device based on knowledge distillation

Country Status (1)

Country Link
CN (1) CN110852426B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523324B (en) * 2020-03-18 2024-01-26 大箴(杭州)科技有限公司 Named entity recognition model training method and device
CN111506702A (en) * 2020-03-25 2020-08-07 北京万里红科技股份有限公司 Knowledge distillation-based language model training method, text classification method and device
CN111611377B (en) * 2020-04-22 2021-10-29 淮阴工学院 Knowledge distillation-based multi-layer neural network language model training method and device
CN111967224A (en) * 2020-08-18 2020-11-20 深圳市欢太科技有限公司 Method and device for processing dialog text, electronic equipment and storage medium
CN112184508B (en) * 2020-10-13 2021-04-27 上海依图网络科技有限公司 Student model training method and device for image processing
CN112465138A (en) * 2020-11-20 2021-03-09 平安科技(深圳)有限公司 Model distillation method, device, storage medium and equipment
US20220188622A1 (en) * 2020-12-10 2022-06-16 International Business Machines Corporation Alternative soft label generation
CN112836762A (en) * 2021-02-26 2021-05-25 平安科技(深圳)有限公司 Model distillation method, device, equipment and storage medium
CN112949786B (en) * 2021-05-17 2021-08-06 腾讯科技(深圳)有限公司 Data classification identification method, device, equipment and readable storage medium
CN113469977B (en) * 2021-07-06 2024-01-12 浙江霖研精密科技有限公司 Flaw detection device, method and storage medium based on distillation learning mechanism
CN113836903B (en) * 2021-08-17 2023-07-18 淮阴工学院 Enterprise portrait tag extraction method and device based on situation embedding and knowledge distillation
CN113673254B (en) * 2021-08-23 2022-06-07 东北林业大学 Knowledge distillation position detection method based on similarity maintenance
CN113837308B (en) * 2021-09-29 2022-08-05 北京百度网讯科技有限公司 Knowledge distillation-based model training method and device and electronic equipment
CN114241282B (en) * 2021-11-04 2024-01-26 河南工业大学 Knowledge distillation-based edge equipment scene recognition method and device
WO2023212997A1 (en) * 2022-05-05 2023-11-09 五邑大学 Knowledge distillation based neural network training method, device, and storage medium
CN115064155A (en) * 2022-06-09 2022-09-16 福州大学 End-to-end voice recognition incremental learning method and system based on knowledge distillation
CN114841173B (en) * 2022-07-04 2022-11-18 北京邮电大学 Academic text semantic feature extraction method and system based on pre-training model and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11410029B2 (en) * 2018-01-02 2022-08-09 International Business Machines Corporation Soft label generation for knowledge distillation

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135574A (en) * 2018-02-09 2019-08-16 北京世纪好未来教育科技有限公司 Neural network training method, image generating method and computer storage medium
CN108921294A (en) * 2018-07-11 2018-11-30 浙江大学 A kind of gradual piece of knowledge distillating method accelerated for neural network
CN109616105A (en) * 2018-11-30 2019-04-12 江苏网进科技股份有限公司 A kind of noisy speech recognition methods based on transfer learning
CN109829038A (en) * 2018-12-11 2019-05-31 平安科技(深圳)有限公司 Question and answer feedback method, device, equipment and storage medium based on deep learning
CN109637546A (en) * 2018-12-29 2019-04-16 苏州思必驰信息科技有限公司 Knowledge distillating method and device
CN109871851A (en) * 2019-03-06 2019-06-11 长春理工大学 A kind of Chinese-character writing normalization determination method based on convolutional neural networks algorithm
CN110097178A (en) * 2019-05-15 2019-08-06 电科瑞达(成都)科技有限公司 It is a kind of paid attention to based on entropy neural network model compression and accelerated method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Improving the interpretability of deep neural networks with knowledge distillation; X. Liu et al.; 2018 IEEE International Conference on Data Mining Workshops (ICDMW); 2018-12-28; 1-8 *
An improved convolutional neural network based on a simulated annealing algorithm; Man Fenghuan et al.; Microelectronics & Computer; 2017-11-02; Vol. 34, No. 9; 58-62 *
Face recognition based on deep feature distillation; Ge Shiming et al.; Journal of Beijing Jiaotong University; 2017-12-15; No. 6; 32-38+46 *
Next-basket recommendation based on users' implicit feedback behavior; Li Yumeng et al.; Journal of Chinese Information Processing; 2017-09-15; No. 5; 220-227 *
A survey of deep neural network compression and acceleration; Ji Rongrong et al.; Journal of Computer Research and Development; 2018-09-15; Vol. 55, No. 9; 1871-1888 *

Also Published As

Publication number Publication date
CN110852426A (en) 2020-02-28

Similar Documents

Publication Publication Date Title
CN110852426B (en) Pre-training model integration acceleration method and device based on knowledge distillation
CN111695415B (en) Image recognition method and related equipment
CN110609891A (en) Visual dialog generation method based on context awareness graph neural network
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN113407660B (en) Unstructured text event extraction method
CN108537257B (en) Zero sample image classification method based on discriminant dictionary matrix pair
CN113822776B (en) Course recommendation method, device, equipment and storage medium
CN115422944A (en) Semantic recognition method, device, equipment and storage medium
CN110781686B (en) Statement similarity calculation method and device and computer equipment
Keren et al. Convolutional neural networks with data augmentation for classifying speakers' native language
CN112784778A (en) Method, apparatus, device and medium for generating model and identifying age and gender
Dai et al. Hybrid deep model for human behavior understanding on industrial internet of video things
CN113723238A (en) Human face lightweight network model construction method and human face recognition method
CN110110724A (en) The text authentication code recognition methods of function drive capsule neural network is squeezed based on exponential type
CN112988970A (en) Text matching algorithm serving intelligent question-answering system
CN113255366A (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN108805280B (en) Image retrieval method and device
CN114036298B (en) Node classification method based on graph convolution neural network and word vector
CN108496174B (en) Method and system for face recognition
CN114117039A (en) Small sample text classification method and model
CN116883746A (en) Graph node classification method based on partition pooling hypergraph neural network
CN116089605A (en) Text emotion analysis method based on transfer learning and improved word bag model
CN115687620A (en) User attribute detection method based on tri-modal characterization learning
CN114741487A (en) Image-text retrieval method and system based on image-text semantic embedding
CN114036947A (en) Small sample text classification method and system for semi-supervised learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant