CN117786107A - Training method and device of text classification model, medium and electronic equipment

Training method and device of text classification model, medium and electronic equipment

Info

Publication number
CN117786107A
Authority
CN
China
Prior art keywords: model, result, text, determining, trained
Prior art date
Legal status
Pending
Application number
CN202311754776.8A
Other languages
Chinese (zh)
Inventor
陆金星
陈欢
赵智源
赵建英
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202311754776.8A
Publication of CN117786107A


Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The specification discloses a training method, device, medium and electronic equipment for a text classification model, wherein the method comprises the following steps: determining a text sample and determining a number of pre-trained teacher models; for each teacher model, in order of parameter amount from small to large, inputting the text sample into the teacher model to determine a pseudo-label result, inputting the text sample into a student model to be trained to determine a classification result, and training the student model to be trained at least according to the pseudo-label result obtained based on the teacher model and the classification result; and then using the trained student model as the text classification model. Because the student model is trained under the guidance of multiple teacher models, the text characterization capability and classification accuracy of the text classification model are improved.

Description

Training method and device of text classification model, medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a training method and apparatus for a text classification model, a medium, and an electronic device.
Background
With the development of information technology, text classification models are being applied more and more widely. At the same time, private data is also receiving increasing public attention.
Currently, text classification models are generally trained based on text data and the categories to which the text data correspond. However, because a text classification model typically has a large number of parameters, model training, deployment and inference are time-consuming, so a lightweight text classification model can be trained from an existing large-scale text classification model by means of knowledge distillation. How to train a text classification model by knowledge distillation is therefore an important issue.
Based on the above, a training method of a text classification model is provided in the specification.
Disclosure of Invention
The present disclosure provides a training method, device, medium and electronic device for a text classification model, so as to partially solve the above-mentioned problems in the prior art.
The technical scheme adopted in the specification is as follows:
the specification provides a training method of a text classification model, which comprises the following steps:
determining a text sample and determining a plurality of pre-trained teacher models; wherein the parameter amounts of the teacher models are different from one another;
and executing the following steps for each teacher model in sequence, in order of parameter amount from small to large: inputting the text sample into the teacher model and determining a pseudo-label result, inputting the text sample into a student model to be trained and determining a classification result, and training the student model to be trained at least according to the pseudo-label result obtained based on the teacher model and the classification result;
Taking the trained student model as a text classification model; the text classification model is used for determining a classification result of the text to be classified according to the text to be classified.
Optionally, the student model to be trained comprises a feature extraction layer and a classification layer;
inputting the text sample into a student model to be trained, and determining a classification result, wherein the method specifically comprises the following steps of:
inputting the text sample into a feature extraction layer of a student model to be trained, and determining a feature sequence corresponding to the text sample;
taking the characteristics of the position corresponding to the teacher model in the characteristic sequence as output characteristics;
and inputting the output characteristics into a classification layer of the student model to be trained, and determining classification results.
Optionally, training the student model to be trained at least according to the pseudo-label result obtained based on the teacher model and the classification result, which specifically includes:
taking the feature corresponding to the appointed position in the feature sequence as a first feature;
inputting the first characteristics into a classification layer of the student model to be trained, and determining a first result;
determining a label corresponding to the text sample;
and training the student model to be trained according to the pseudo-label result, the classification result, the first result and the label which are obtained based on the teacher model.
Optionally, training the student model to be trained according to the pseudo-label result, the classification result, the first result and the label obtained based on the teacher model, which specifically includes:
determining a first task loss according to the first result and the label;
determining a second task loss according to a pseudo-label result and the classification result obtained based on the teacher model;
and training the student model to be trained according to the first task loss and the second task loss.
Optionally, training the student model to be trained at least according to the pseudo-label result obtained based on the teacher model and the classification result, which specifically includes:
determining other teacher models according to the parameter quantity of the teacher model; wherein the parameter amount of the other teacher model is smaller than the parameter amount of the teacher model;
taking the features of the positions corresponding to the other teacher models in the feature sequence as second features;
inputting the second features into a classification layer of the student model to be trained, and determining a second result;
determining pseudo-label results corresponding to the other teacher models and taking the pseudo-label results as other results;
and training the student model to be trained at least according to the pseudo-label result obtained based on the teacher model, the classification result, the second result and the other results.
Optionally, training the student model to be trained at least according to the pseudo-label result, the classification result, the second result and the other results obtained based on the teacher model, specifically including:
taking the feature corresponding to the appointed position in the feature sequence as a first feature;
inputting the first characteristics into a classification layer of the student model to be trained, and determining a first result;
determining a label corresponding to the text sample;
and training the student model to be trained according to the pseudo-label result, the classification result, the second result, the other results, the first result and the label obtained based on the teacher model.
Optionally, training the student model to be trained according to the pseudo-label result, the classification result, the second result, the other results, the first result and the label obtained based on the teacher model, which specifically includes:
determining a first task loss according to the first result and the label;
Determining a second task loss according to a pseudo-label result and the classification result obtained based on the teacher model;
determining a third task loss according to the second result and the other results;
and training the student model to be trained according to the first task loss, the second task loss and the third task loss.
Optionally, training the student model to be trained according to the first task loss, the second task loss and the third task loss specifically includes:
respectively weighting the first task loss, the second task loss and the third task loss according to the assigned weight;
and training the student model to be trained according to the weighted first task loss, the weighted second task loss and the weighted third task loss.
Optionally, training a plurality of teacher models in advance specifically includes:
determining a label corresponding to the text sample;
and training each teacher model to be trained based on the text sample and the label.
Optionally, after determining the text sample, the method further comprises:
Determining an initial student model and determining a label corresponding to the text sample;
and training the initial student model based on the text sample and the label to obtain a student model to be trained.
Optionally, the method further comprises:
determining input text of a user in response to input operation of the user;
determining a pre-stored standard text;
taking the standard text and the input text as texts to be classified;
inputting the text to be classified into the text classification model, and determining a classification result of the text to be classified;
and when the classification result is similar, determining a reply text corresponding to the standard text, and displaying the reply text to the user.
The specification provides a training device of text classification model, including:
the first determining module is used for determining a text sample and determining a plurality of pre-trained teacher models; wherein, the parameters of each teacher model are different;
the first training module is configured to execute the following steps for each teacher model in sequence, in order of parameter amount from small to large: inputting the text sample into the teacher model and determining a pseudo-label result, inputting the text sample into a student model to be trained and determining a classification result, and training the student model to be trained at least according to the pseudo-label result obtained based on the teacher model and the classification result;
The second determining module is used for taking the trained student model as a text classification model; the text classification model is used for determining a classification result of the text to be classified according to the text to be classified.
Optionally, the student model to be trained comprises a feature extraction layer and a classification layer;
the first training module is specifically configured to input the text sample into a feature extraction layer of a student model to be trained, and determine a feature sequence corresponding to the text sample; taking the characteristics of the position corresponding to the teacher model in the characteristic sequence as output characteristics; and inputting the output characteristics into a classification layer of the student model to be trained, and determining classification results.
Optionally, the first training module is specifically configured to take a feature corresponding to a specified position in the feature sequence as a first feature; inputting the first characteristics into a classification layer of the student model to be trained, and determining a first result; determining a label corresponding to the text sample; and training the student model to be trained according to the pseudo-label result, the classification result, the first result and the label which are obtained based on the teacher model.
Optionally, the first training module is specifically configured to determine a first task loss according to the first result and the label; determining a second task loss according to a pseudo-label result and the classification result obtained based on the teacher model; and training the student model to be trained according to the first task loss and the second task loss.
Optionally, the first training module is specifically configured to determine other teacher models according to the parameter amounts of the teacher models; wherein the parameter amount of the other teacher model is smaller than the parameter amount of the teacher model; taking the features of the positions corresponding to the other teacher models in the feature sequence as second features; inputting the second features into a classification layer of the student model to be trained, and determining a second result; determining pseudo-label results corresponding to the other teacher models and taking the pseudo-label results as other results; and training the student model to be trained at least according to the pseudo-label result obtained based on the teacher model, the classification result, the second result and the other results.
Optionally, the first training module is specifically configured to take a feature corresponding to a specified position in the feature sequence as a first feature; inputting the first characteristics into a classification layer of the student model to be trained, and determining a first result; determining a label corresponding to the text sample; and training the student model to be trained according to the pseudo-label result, the classification result, the second result, the other results, the first result and the label obtained based on the teacher model.
Optionally, the first training module is specifically configured to determine a first task loss according to the first result and the label; determining a second task loss according to a pseudo-label result and the classification result obtained based on the teacher model; determining a third task loss according to the second result and the other results; and training the student model to be trained according to the first task loss, the second task loss and the third task loss.
Optionally, the first training module is specifically configured to weight the first task loss, the second task loss, and the third task loss according to an assigned weight; and training the student model to be trained according to the weighted first task loss, the weighted second task loss and the weighted third task loss.
Optionally, the apparatus further comprises:
the second training module is used for determining labels corresponding to the text samples; and training each teacher model to be trained based on the text sample and the label.
Optionally, the first determining module is further configured to determine an initial student model after determining the text sample, and determine a label corresponding to the text sample; and training the initial student model based on the text sample and the label to obtain a student model to be trained.
Optionally, the apparatus further comprises:
the application module is used for responding to the input operation of a user and determining the input text of the user; determining a pre-stored standard text; taking the standard text and the input text as texts to be classified; inputting the text to be classified into the text classification model, and determining a classification result of the text to be classified; and when the classification result is similar, determining a reply text corresponding to the standard text, and displaying the reply text to the user.
The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the training method of the text classification model described above.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the training method of the text classification model described above when executing the program.
At least one of the technical solutions adopted in this specification can achieve the following beneficial effects:
in the training method of the text classification model provided in this specification, a text sample and a plurality of pre-trained teacher models are determined. For each teacher model, in order of parameter amount from small to large, the text sample is input into the teacher model to determine a pseudo-label result, the text sample is input into a student model to be trained to determine a classification result, and the student model to be trained is trained at least according to the pseudo-label result obtained based on the teacher model and the classification result. The trained student model is then used as the text classification model.
As can be seen from the above method, when training the text classification model, a text sample is determined and a plurality of pre-trained teacher models are determined. For each teacher model, in order of parameter amount from small to large, the text sample is input into the teacher model to determine a pseudo-label result, the text sample is input into a student model to be trained to determine a classification result, and the student model to be trained is trained at least according to the pseudo-label result obtained based on the teacher model and the classification result. The trained student model is then used as the text classification model. Because the student model is trained under the guidance of multiple teacher models, the text characterization capability and classification accuracy of the text classification model are improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate the exemplary embodiments of the present specification and, together with their description, serve to explain the specification and are not intended to limit it unduly. In the drawings:
FIG. 1 is a flow chart of a training method of a text classification model provided in the present specification;
FIG. 2 is a schematic view of the structure of a student model provided in the present specification;
FIG. 3 is a schematic representation of the feature sequences provided in the present specification;
FIG. 4 is a schematic diagram of a training device for text classification models provided in the present specification;
fig. 5 is a schematic view of the electronic device corresponding to fig. 1 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The embodiments of the present disclosure provide a training method, device, medium and electronic device for a text classification model, and in the following, with reference to the drawings, the technical solutions provided by each embodiment of the present disclosure are described in detail.
Fig. 1 is a flow chart of a training method of a text classification model provided in the present specification, specifically including the following steps:
s100: determining a text sample and determining a plurality of pre-trained teacher models; wherein, the parameters of each teacher model are different.
In this specification, an apparatus for training text classification models may determine a text sample, and determine a number of teacher models that are pre-trained. The device for training the text classification model may be a server, or may be an electronic device such as a desktop computer, a notebook computer, or the like. For convenience of description, a training method of the text classification model provided in the present specification will be described below with only a server as an execution subject.
The text sample may be pre-collected text data, or may be sample data from any existing text data set. The labeling of the text sample depends on the purpose of the text classification model. When the text classification model is used for determining whether texts are similar, the text sample may be pre-collected text input by users in an intelligent customer service reply scenario. The text input by a user may be a question about a transaction object, such as a question about its size, color, thickness or usage (for example, the text input by the user may be "what size"), or a question about the transaction flow, such as a question about its entry, start, overall process, transaction tools or end (for example, the text input by the user may be "how to conduct a transaction"); such user questions can serve as text samples. The text sample comprises at least two sentences, at least one of which corresponds to a reply text, the reply text being text labeled in advance for the user's input text. The label corresponding to the text sample is either similar or dissimilar: when the label is similar, the sentences in the text sample are similar, and when the label is dissimilar, they are not. In addition, when the text classification model is used for determining the topic of the text sample, the labels corresponding to the text sample are the various topic types; this specification is not specifically limited in this respect. For ease of illustration, the following description takes determining whether the sentences in a text sample are similar as an example, with the label of the text sample being one of similar and dissimilar.
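For illustration only, a minimal sketch of how such sentence-pair samples might be organized is given below; the class name, field names and example sentences are assumptions made for this sketch and are not part of this specification.

```python
from dataclasses import dataclass

# Hypothetical container for the sentence-pair samples described above: a user input
# paired with a pre-collected standard text, labeled as similar (1) or dissimilar (0).
@dataclass
class TextPairSample:
    input_text: str      # text entered by the user, e.g. a question about a transaction object
    standard_text: str   # pre-collected standard text with a known reply text
    label: int           # 1 = similar, 0 = dissimilar

samples = [
    TextPairSample("what size", "what size is this item", 1),
    TextPairSample("how to conduct a transaction", "what size is this item", 0),
]
```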
In this specification, each teacher model is a model trained in advance, and may be any existing text classification model; this specification is not specifically limited. The parameter amounts of the teacher models differ from one another, but every teacher model serves the same purpose. For example, the teacher models may include a first model and a second model, where the first model comprises 12 network layers and the second model comprises 24 network layers, so the parameter amounts of the first model and the second model are different. Each teacher model may be a model with a BERT (Bidirectional Encoder Representation from Transformers) structure, or may be a model with another structure; this specification is not specifically limited.
S102: and executing the following steps for each teacher model in sequence from small to large according to the parameter quantity of each teacher model: inputting the text sample into the teacher model, determining a pseudo-standard result, inputting the text sample into a student model to be trained, determining a classification result, and training the student model to be trained at least according to the pseudo-standard result and the classification result obtained based on the teacher model.
The server may execute the following procedure for each teacher model, in order of parameter amount from small to large:
inputting the text sample into the teacher model and determining a pseudo-label result, inputting the text sample into the student model to be trained and determining a classification result, and training the student model to be trained at least according to the pseudo-label result obtained based on the teacher model and the classification result.
The student model may be a model with a BERT (Bidirectional Encoder Representation from Transformers) structure, among other possible structures. The number of parameters of the student model is smaller than that of every teacher model; for example, the student model may comprise 4 network layers. The classification result and the pseudo-label result both indicate whether the sentences in the text sample are similar, and each is one of similar and dissimilar.
When training the student model to be trained according to at least the pseudo-label result and the classification result obtained based on the teacher model, the server may train the student model to be trained with at least a minimum difference between the pseudo-label result and the classification result obtained based on the teacher model as a target.
In this specification, the server inputs the text sample into each teacher model in turn, in order of parameter amount from small to large, determines the corresponding pseudo-label result, inputs the text sample into the student model to be trained, determines the classification result, and trains the student model to be trained at least according to the pseudo-label result obtained based on that teacher model and the classification result. For example, suppose the teacher models include a first model and a second model, and the parameter amount of the first model is smaller than that of the second model. The server first inputs the text sample into the first model to determine a pseudo-label result, inputs the text sample into the student model to be trained to determine a classification result, and trains the student model to be trained at least according to the pseudo-label result obtained based on the first model and the classification result. The server then inputs the text sample into the second model to determine a pseudo-label result, inputs the text sample into the student model that has already been guided by the first model to determine a classification result, and trains that student model at least according to the pseudo-label result obtained based on the second model and the classification result.
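For illustration only, a minimal Python sketch of this progressive, small-to-large teacher schedule is given below. The function names, the assumption that each model maps a batch of texts to classification logits, and the `train_step` callback are illustrative and not part of this specification.

```python
import torch

def count_params(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

def progressive_distillation(teachers, student, data_loader, train_step):
    # Visit the teachers in ascending order of parameter amount; each one in turn
    # produces pseudo-label results that guide the student model to be trained.
    for teacher in sorted(teachers, key=count_params):
        teacher.eval()
        for text_batch, labels in data_loader:
            with torch.no_grad():
                pseudo_labels = teacher(text_batch)   # pseudo-label result of the current teacher
            class_logits = student(text_batch)        # classification result of the student
            # train_step updates the student so that its classification result approaches
            # the current teacher's pseudo-label result (and, optionally, the labels).
            train_step(student, class_logits, pseudo_labels, labels)
    return student
```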
In addition, in order to better train the student model and obtain the text classification model, the server can select a reference model from all teacher models, and initialize the model parameters of the student model to be trained according to the model parameters of the reference model. And training the initialized student model to be trained. The server may select the reference model from among the teacher models at random, or may select a model with the smallest parameter from among the teacher models as the reference model, which is not specifically limited in this specification. For example, when the reference model selected by the server includes 12 network layers and the student model to be trained includes 4 network layers, the server may initialize model parameters of the student model to be trained according to parameters of the first 4 network layers of the reference model.
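For illustration only, the following sketch shows one way the student model's parameters might be initialized from the first few network layers of a reference teacher model, as described above; the encoder class and layer sizes are assumptions for this sketch.

```python
import torch.nn as nn

class StackedEncoder(nn.Module):
    """Stand-in for a BERT-style encoder: a stack of identical Transformer layers."""
    def __init__(self, num_layers: int, dim: int = 768):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
            for _ in range(num_layers)
        )

def init_student_from_reference(student: StackedEncoder, reference: StackedEncoder) -> None:
    # Copy the parameters of the reference model's first K layers into the student's
    # K layers, as in the example above (first 4 of 12 layers).
    k = len(student.layers)
    for s_layer, r_layer in zip(student.layers, list(reference.layers)[:k]):
        s_layer.load_state_dict(r_layer.state_dict())

reference_teacher = StackedEncoder(num_layers=12)
student_to_train = StackedEncoder(num_layers=4)
init_student_from_reference(student_to_train, reference_teacher)
```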
S104: taking the trained student model as a text classification model; the text classification model is used for determining a classification result of the text to be classified according to the text to be classified.
The server may use the trained student model as the text classification model. The trained student model is the model obtained in step S102 after the student model to be trained has been guided, in turn, by the teacher models in order of parameter amount from small to large, that is, the model obtained after all teacher models have guided its training. The text classification model is used for determining a classification result of a text to be classified according to the text to be classified.
As can be seen from the above method, when training the text classification model, the server can determine a text sample and a number of pre-trained teacher models. For each teacher model, in order of parameter amount from small to large, the server inputs the text sample into the teacher model to determine a pseudo-label result, inputs the text sample into the student model to be trained to determine a classification result, and trains the student model to be trained at least according to the pseudo-label result obtained based on the teacher model and the classification result. The trained student model is then used as the text classification model. The student model is guided by multiple teachers and is guided progressively, in order of the teachers' parameter amounts from small to large, so that it can learn more of the teachers' text representations and learn them step by step. This improves the student model's ability to represent text and prevents it from forgetting the text representations of earlier teachers. Using the trained student model as the text classification model therefore improves the classification accuracy of the text classification model.
In this specification, the student model to be trained comprises a feature extraction layer and a classification layer. Therefore, in step S102, when the text sample is input into the student model to be trained and the classification result is determined, as shown in fig. 2, which is a schematic diagram of the structure of the student model provided in this specification, the server may input the text sample into the feature extraction layer of the student model to be trained to determine the features corresponding to the text sample, and then input those features into the classification layer of the student model to be trained to determine the classification result.
In addition, the server can flexibly align the student model with different teacher models through independent characterization without considering the differences of the plurality of teacher models in characterization. Therefore, in step S102, the text sample is input into the student model to be trained, and when determining the classification result, the server may input the text sample into the feature extraction layer of the student model to be trained, to determine the feature sequence corresponding to the text sample. And taking the characteristic of the position corresponding to the teacher model in the characteristic sequence as an output characteristic. And inputting the output characteristics into a classification layer of the student model to be trained, and determining classification results.
As shown in fig. 3, fig. 3 is a schematic diagram of a feature sequence provided in this specification. The feature sequence in fig. 3 contains features corresponding to the CLS position, the teacher positions and the text positions, respectively, where the text positions are the positions corresponding to each character or word in the text. The teacher positions correspond to the teacher models: the feature sequence contains as many teacher positions as there are teacher models guiding the training of the student model. Typically, the CLS position is the first position of the feature sequence, and the teacher positions may lie between the CLS position and the text positions (i.e., positions T1 to Tn in fig. 3) or at the end of the feature sequence, i.e., after the text positions; this specification is not specifically limited. Fig. 3 takes as an example the case in which the positions corresponding to two teacher models (i.e., the first teacher position and the second teacher position in fig. 3) lie between the CLS position and the text positions. In addition, the features corresponding to the CLS position and the teacher positions each represent the text as a whole. The classification result is determined based on the feature at the position corresponding to the teacher model in the feature sequence, so that the student model to be trained can subsequently be trained at least according to the classification result and the pseudo-label result obtained based on that teacher model.
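For illustration only, a sketch of selecting features from such a feature sequence is shown below, assuming the layout [CLS, T1, ..., Tn, text tokens ...] of fig. 3; the tensor shapes are assumptions for this sketch.

```python
import torch

def cls_feature(feature_seq: torch.Tensor) -> torch.Tensor:
    # Feature at the specified (CLS) position; feature_seq: [batch, seq_len, hidden].
    return feature_seq[:, 0, :]

def teacher_position_feature(feature_seq: torch.Tensor, teacher_idx: int) -> torch.Tensor:
    # Output feature at the position corresponding to the (teacher_idx)-th teacher model,
    # assuming the teacher positions directly follow the CLS position.
    return feature_seq[:, 1 + teacher_idx, :]

feature_seq = torch.randn(2, 10, 768)                                   # dummy feature sequences
output_feature = teacher_position_feature(feature_seq, teacher_idx=0)   # fed to the classification layer
first_feature = cls_feature(feature_seq)                                # used to determine the first result
```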
In addition, in the step S102, when the text sample is input into the teacher model and the pseudo-label result is determined, the server may input the text sample into the teacher model, determine a feature sequence corresponding to the text sample, and use a feature corresponding to a specified position in the feature sequence as a text feature. And determining a pseudo-label result based on the text characteristics. The feature sequence determined based on the teacher model only comprises the CLS position and the feature corresponding to the CLS position, so that the designated position is the CLS position, and the text feature is the feature corresponding to the CLS position.
Based on this, in step S102, when training the student model to be trained at least according to the pseudo-label result obtained based on the teacher model and the classification result, the server may, in addition, determine a first result based on the feature corresponding to a specified position in the feature sequence and train the student model to be trained according to the first result and the label corresponding to the text sample. Specifically, the server may take the feature corresponding to the specified position in the feature sequence as a first feature, input the first feature into the classification layer of the student model to be trained, and determine the first result. The server also determines the label corresponding to the text sample, and then trains the student model to be trained according to the pseudo-label result obtained based on the teacher model, the classification result, the first result and the label. The specified position is the CLS position, and the first result indicates whether the sentences in the text sample are similar, being one of similar and dissimilar. The label corresponding to the text sample is likewise one of similar and dissimilar.
When training the student model to be trained according to the pseudo-label result obtained based on the teacher model, the classification result, the first result and the label, the server can train the student model to be trained with the goals of minimizing the difference between the pseudo-label result obtained based on the teacher model and the classification result, and minimizing the difference between the first result and the label. The server may also determine a first task loss according to the first result and the label, determine a second task loss according to the pseudo-label result obtained based on the teacher model and the classification result, and train the student model to be trained according to the first task loss and the second task loss.
Further, in order to better balance the first task loss and the second task loss, so that the student model can learn the text representation of the teacher model progressively, the server can weight the first task loss and the second task loss according to the designated weight when training the student model to be trained according to the first task loss and the second task loss. And training the student model to be trained according to the weighted first task loss and the weighted second task loss. The assigned weight may be a weight corresponding to each task loss set in advance, for example, the weights corresponding to the first task loss and the second task loss may be 1.
When determining the first task loss according to the first result and the label, the server can calculate the cross-entropy loss between the first result and the label and use it as the first task loss. When determining the second task loss according to the pseudo-label result obtained based on the teacher model and the classification result, the server may calculate the relative entropy, that is, the KL divergence (Kullback-Leibler divergence), between the pseudo-label result obtained based on the teacher model and the classification result, and use the calculated relative entropy as the second task loss.
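For illustration only, the two losses described above might be computed as in the following sketch; the logits and the weight values of 1 are placeholders, not values fixed by this specification.

```python
import torch
import torch.nn.functional as F

def first_task_loss(first_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Cross-entropy loss between the first result (from the CLS-position feature) and the label.
    return F.cross_entropy(first_logits, labels)

def second_task_loss(class_logits: torch.Tensor, pseudo_logits: torch.Tensor) -> torch.Tensor:
    # Relative entropy (KL divergence) between the classification result and the
    # current teacher's pseudo-label result.
    return F.kl_div(F.log_softmax(class_logits, dim=-1),
                    F.softmax(pseudo_logits, dim=-1),
                    reduction="batchmean")

first_logits = torch.randn(2, 2)    # first result (similar / dissimilar)
class_logits = torch.randn(2, 2)    # classification result at the teacher position
pseudo_logits = torch.randn(2, 2)   # pseudo-label result from the current teacher
labels = torch.tensor([1, 0])

# Weighted sum using the assigned weights (both 1 in the example above).
loss = 1.0 * first_task_loss(first_logits, labels) + 1.0 * second_task_loss(class_logits, pseudo_logits)
```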
In this specification, if the teacher model is the one with the smallest parameter amount among the teacher models, then when this teacher model is used to guide the student model, the server may train the student model to be trained according to the pseudo-label result obtained based on the teacher model and the classification result, or according to the pseudo-label result obtained based on the teacher model, the classification result, the first result and the label; the specific process is the same as that in step S102 and is not repeated here.
However, if the teacher model is not the one with the smallest parameter amount among the teacher models, then there are other teacher models whose parameter amounts are smaller than that of this teacher model, and those other teacher models have already guided the training of the student model to be trained before this teacher model. Therefore, in step S102, when training the student model to be trained at least according to the pseudo-label result obtained based on the teacher model and the classification result, the server may determine the other teacher models according to the parameter amount of the teacher model, take the features at the positions corresponding to the other teacher models in the feature sequence as second features, input the second features into the classification layer of the student model to be trained, and determine a second result. The server also determines the pseudo-label results corresponding to the other teacher models and takes them as other results, and then trains the student model to be trained at least according to the pseudo-label result obtained based on the teacher model, the classification result, the second result and the other results, so as to prevent the student model from forgetting the text representations learned from the other teacher models. The parameter amounts of the other teacher models are smaller than that of the teacher model, so the other teacher models guided the training of the student model to be trained before the teacher model; the server can therefore directly determine the pseudo-label results of the other teacher models and take them as the other results, where the pseudo-label result of another teacher model is the pseudo-label result obtained when that teacher model guided the training of the student model to be trained. The second result indicates whether the sentences in the text sample are similar and may be one of similar and dissimilar.
When training the student model to be trained at least according to the pseudo-label result obtained based on the teacher model, the classification result, the second result and the other results, the server may train the student model to be trained with the goals of at least minimizing the difference between the pseudo-label result obtained based on the teacher model and the classification result and minimizing the difference between the second result and the other results. The server may also determine the second task loss according to the pseudo-label result obtained based on the teacher model and the classification result, determine a third task loss according to the second result and the other results, and train the student model to be trained at least according to the second task loss and the third task loss. The process of determining the third task loss according to the second result and the other results is similar to the process of determining the second task loss according to the pseudo-label result obtained based on the teacher model and the classification result, and is not repeated here.
Further, to better balance the second task loss and the third task loss, so that the student model can learn the text representation of the teacher model progressively, the server can weight the second task loss and the third task loss according to the specified weights, respectively, when training the student model to be trained according to at least the second task loss and the third task loss. And training the student model to be trained at least according to the weighted second task loss and the weighted third task loss.
In addition, when training the student model to be trained at least according to the pseudo-label result obtained based on the teacher model, the classification result, the second result and the other results, the server may take the feature corresponding to the specified position in the feature sequence as the first feature, input the first feature into the classification layer of the student model to be trained, and determine the first result. The server also determines the label corresponding to the text sample, and then trains the student model to be trained according to the pseudo-label result obtained based on the teacher model, the classification result, the second result, the other results, the first result and the label.
When training the student model to be trained according to the pseudo-label result obtained based on the teacher model, the classification result, the second result, the other results, the first result and the label, the server can train the student model to be trained with the goals of minimizing the difference between the pseudo-label result obtained based on the teacher model and the classification result, minimizing the difference between the second result and the other results, and minimizing the difference between the first result and the label. The server may also determine the first task loss according to the first result and the label, determine the second task loss according to the pseudo-label result obtained based on the teacher model and the classification result, determine the third task loss according to the second result and the other results, and train the student model to be trained according to the first task loss, the second task loss and the third task loss.
Further, in order to better balance the first task loss, the second task loss and the third task loss, the student model may learn the text representation of each teacher model progressively, where the server may weight the first task loss, the second task loss and the third task loss according to the specified weights when training the student model to be trained according to the first task loss, the second task loss and the third task loss. And training the student model to be trained according to the weighted first task loss, the weighted second task loss and the weighted third task loss.
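For illustration only, the weighted combination of the three task losses might look like the following sketch; the third task loss is computed in the same way as the second task loss, and the weights shown are placeholders rather than values fixed by this specification.

```python
import torch
import torch.nn.functional as F

def third_task_loss(second_logits: torch.Tensor, other_logits: torch.Tensor) -> torch.Tensor:
    # KL divergence between the second result (feature at a previous teacher's position)
    # and that teacher's stored pseudo-label result, guarding against forgetting.
    return F.kl_div(F.log_softmax(second_logits, dim=-1),
                    F.softmax(other_logits, dim=-1),
                    reduction="batchmean")

def total_loss(loss_1, loss_2, loss_3, w1=1.0, w2=1.0, w3=1.0):
    # Weighted sum of the first, second and third task losses with the assigned weights.
    return w1 * loss_1 + w2 * loss_2 + w3 * loss_3

second_logits = torch.randn(2, 2)   # second result
other_logits = torch.randn(2, 2)    # previous teacher's pseudo-label result
loss_3 = third_task_loss(second_logits, other_logits)
loss = total_loss(torch.tensor(0.7), torch.tensor(0.4), loss_3)  # placeholder values for loss_1, loss_2
```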
In this specification, each teacher model is a model trained with text samples and the labels corresponding to the text samples. Therefore, when training the several teacher models in advance, the server determines the label corresponding to the text sample and then, for each teacher model to be trained, trains that teacher model based on the text sample and the label. Specifically, when training a teacher model to be trained based on the text sample and the label, the server may input the text sample into the teacher model to be trained, determine an output result, and train the teacher model to be trained with the goal of minimizing the difference between the output result and the label.
In this specification, after determining the text sample in step S100, the server may determine an initial student model and determine the label corresponding to the text sample, and then train the initial student model based on the text sample and the label to obtain the student model to be trained. The model parameters of the initial student model may be initialized according to the model parameters of a teacher model, as described in step S102, which is not repeated here. The student model to be trained may thus be a model trained with the text sample and the label; however, when training the initial student model with the text sample and the label, the server takes an initial student model that has not fully converged as the student model to be trained, that is, an initial student model that has not been fully trained. Specifically, the server can train the initial student model according to its output result and the label until a specified number of iterations is reached, and use the initial student model obtained from the last training iteration as the student model to be trained.
In addition, in step S102, for every teacher model except the one with the largest parameter amount, the server need not train the student model to be trained to full convergence while that teacher model is guiding the training. Specifically, the server may let a smaller teacher model guide the training of the student model to be trained only for a preset number of iterations, and once this preset number is reached, switch to the next larger teacher model to guide the training, where the preset number is a training count set in advance by the server. However, when the teacher model with the largest parameter amount is used to train the student model to be trained, the server needs to train the student model to be trained to full convergence. As to how to determine when the student model to be trained has fully converged, the server may set an end condition and, when the student model to be trained meets the end condition, determine that it has fully converged; then, in step S104, the server may use the trained student model (i.e., the fully converged student model to be trained) as the text classification model. The end condition may be that the number of training iterations of the student model to be trained reaches a preset threshold, where the preset threshold may be a value set in advance by the server. The end condition may also be that the output results of two consecutive training iterations of the student model to be trained are similar. Of course, the end condition may be any other existing condition for determining that a model has fully converged; this specification is not specifically limited.
In this specification, after obtaining the text classification model, the server may determine the text to be classified. Inputting the text to be classified into a text classification model, and determining the classification result of the text to be classified. And classifying the text to be classified according to the classification result. The text to be classified can comprise at least two sentences, and the classification result is one of similarity and dissimilarity.
In this specification, in the intelligent customer service reply scenario, after obtaining the text classification model, the server may determine the input text of a user in response to the user's input operation, determine a pre-stored standard text, and take the standard text and the input text as the text to be classified. The server then inputs the text to be classified into the text classification model, determines the classification result of the text to be classified, and, when the classification result is similar, determines the reply text corresponding to the standard text and displays the reply text to the user. The standard text has a corresponding reply text; for example, if the standard text is "what size", the reply text corresponding to the standard text may be "size 37". The standard text may be historical user input text collected by the server in advance, and the reply text corresponding to the standard text may be labeled in advance by an operator or collected in advance by the server; this specification is not specifically limited. When taking the standard text and the input text as the text to be classified, the server can splice the standard text and the input text and take the spliced text as the text to be classified. The classification result indicates whether the input text and the standard text are similar, being one of similar and dissimilar.
When inputting the text to be classified into the text classification model and determining the classification result of the text to be classified, the server can input the text to be classified into the feature extraction layer of the text classification model and determine the feature sequence. The server then determines the feature corresponding to the specified position in the feature sequence and takes it as the specified feature, determines the features at the positions corresponding to the teacher models in the feature sequence and takes them as the teacher features, fuses the teacher features and the specified feature to obtain a classification feature, and inputs the classification feature into the classification layer of the text classification model to determine the classification result.
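For illustration only, the inference step described above might be sketched as follows; since this specification does not fix a particular fusion operation, the sketch assumes the specified feature and the teacher features are fused by averaging, and the tensor shapes are likewise assumptions.

```python
import torch
import torch.nn as nn

def classify_text_pair(feature_seq: torch.Tensor, classifier: nn.Module, num_teachers: int) -> torch.Tensor:
    cls_feat = feature_seq[:, 0, :]                           # specified (CLS) feature
    teacher_feats = feature_seq[:, 1:1 + num_teachers, :]     # one feature per teacher position
    # Fuse the specified feature and the teacher features (averaging is an assumption here).
    fused = torch.mean(torch.cat([cls_feat.unsqueeze(1), teacher_feats], dim=1), dim=1)
    return classifier(fused)                                  # similar / dissimilar logits

classifier = nn.Linear(768, 2)
feature_seq = torch.randn(1, 12, 768)   # from the text classification model's feature extraction layer
logits = classify_text_pair(feature_seq, classifier, num_teachers=2)
```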
When the classification result is dissimilar, the server can re-determine a standard text, splice the re-determined standard text with the input text to obtain a new text to be classified, and continue determining classification results until a classification result of similar is obtained.
In this specification, the initial student model, the student model to be trained and the text classification model share the same model structure; they are simply the models obtained at different training stages. In addition to the feature extraction layer and the classification layer, these student models may include a global average pooling layer, i.e., a Pooler layer, which performs global average pooling over features. Specifically, taking the process in step S102 of inputting the output features into the classification layer of the student model to be trained and determining the classification result as an example, the server may input the output features into the Pooler layer of the student model to be trained to obtain pooled features, and then input the pooled features into the classification layer of the student model to be trained to determine the classification result.
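For illustration only, a Pooler layer performing global average pooling before the classification layer might look like the following sketch; its exact placement and the input shape are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class GlobalAveragePooler(nn.Module):
    # Pooler layer: global average pooling over the sequence of features described above.
    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: [batch, seq_len, hidden] -> [batch, hidden]
        return features.mean(dim=1)

pooler = GlobalAveragePooler()
classifier = nn.Linear(768, 2)
pooled_features = pooler(torch.randn(2, 10, 768))
classification_logits = classifier(pooled_features)
```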
Based on the same idea as the training method for the text classification model provided above for one or more embodiments of this specification, this specification further provides a corresponding training device for a text classification model, as shown in fig. 4.
Fig. 4 is a schematic diagram of a training device for a text classification model provided in the present specification, specifically including:
a first determining module 200 for determining a text sample and determining a number of teacher models trained in advance; wherein, the parameters of each teacher model are different;
the first training module 202 is configured to execute the following steps for each teacher model in sequence, in order of parameter amount from small to large: inputting the text sample into the teacher model and determining a pseudo-label result, inputting the text sample into a student model to be trained and determining a classification result, and training the student model to be trained at least according to the pseudo-label result obtained based on the teacher model and the classification result;
a second determining module 204, configured to use the trained student model as a text classification model; the text classification model is used for determining a classification result of the text to be classified according to the text to be classified.
Optionally, the student model to be trained comprises a feature extraction layer and a classification layer;
the first training module 202 is specifically configured to input the text sample into a feature extraction layer of a student model to be trained, and determine a feature sequence corresponding to the text sample; taking the characteristics of the position corresponding to the teacher model in the characteristic sequence as output characteristics; and inputting the output characteristics into a classification layer of the student model to be trained, and determining classification results.
Optionally, the first training module 202 is specifically configured to take a feature corresponding to a specified position in the feature sequence as a first feature; inputting the first characteristics into a classification layer of the student model to be trained, and determining a first result; determining a label corresponding to the text sample; and training the student model to be trained according to the pseudo-label result, the classification result, the first result and the label which are obtained based on the teacher model.
Optionally, the first training module 202 is specifically configured to determine a first task loss according to the first result and the label; determining a second task loss according to a pseudo-label result and the classification result obtained based on the teacher model; and training the student model to be trained according to the first task loss and the second task loss.
Optionally, the first training module 202 is specifically configured to determine other teacher models according to the parameter amounts of the teacher models; wherein the parameter amount of the other teacher model is smaller than the parameter amount of the teacher model; taking the features of the positions corresponding to the other teacher models in the feature sequence as second features; inputting the second features into a classification layer of the student model to be trained, and determining a second result; determining pseudo-label results corresponding to the other teacher models and taking the pseudo-label results as other results; and training the student model to be trained at least according to the pseudo-label result obtained based on the teacher model, the classification result, the second result and the other results.
Optionally, the first training module 202 is specifically configured to take a feature corresponding to a specified position in the feature sequence as a first feature; inputting the first characteristics into a classification layer of the student model to be trained, and determining a first result; determining a label corresponding to the text sample; and training the student model to be trained according to the pseudo-label result, the classification result, the second result, the other results, the first result and the label obtained based on the teacher model.
Optionally, the first training module 202 is specifically configured to determine a first task loss according to the first result and the label; determine a second task loss according to the pseudo-label result obtained based on the teacher model and the classification result; determine a third task loss according to the second result and the other results; and train the student model to be trained according to the first task loss, the second task loss and the third task loss.
Optionally, the first training module 202 is specifically configured to weight the first task loss, the second task loss and the third task loss according to assigned weights, respectively; and train the student model to be trained according to the weighted first task loss, the weighted second task loss and the weighted third task loss.
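Putting the optional refinements together, the weighted objective described above might look like the sketch below, reusing the loss helpers sketched earlier; the weight values are illustrative assumptions, not prescribed by this specification.

def total_loss(first_result, label, classification_result, pseudo_label_result,
               second_results, other_results, alpha=1.0, beta=1.0, gamma=0.5):
    # alpha, beta, gamma: assigned weights for the three task losses
    l1 = first_task_loss(first_result, label)                           # first task loss
    l2 = second_task_loss(classification_result, pseudo_label_result)   # second task loss
    l3 = sum(second_task_loss(s, o)                                     # third task loss
             for s, o in zip(second_results, other_results))
    return alpha * l1 + beta * l2 + gamma * l3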
Optionally, the apparatus further comprises:
a second training module 206, configured to determine a label corresponding to the text sample, and train each teacher model to be trained based on the text sample and the label.
Optionally, the first determining module 200 is further configured to, after determining the text sample, determine an initial student model and a label corresponding to the text sample, and train the initial student model based on the text sample and the label to obtain the student model to be trained.
Optionally, the apparatus further comprises:
an application module 208, configured to determine an input text of a user in response to an input operation of the user; determine a pre-stored standard text; take the standard text and the input text as a text to be classified; input the text to be classified into the text classification model, and determine a classification result of the text to be classified; and when the classification result indicates that the standard text and the input text are similar, determine a reply text corresponding to the standard text, and display the reply text to the user.
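For the application scenario above, a minimal inference sketch is given below, assuming the trained text classification model outputs logits over two classes (dissimilar/similar), a pre-stored knowledge base of (standard text, reply text) pairs, and a tokenization function producing a [1, L] tensor of token ids; all of these, including the convention that class 1 means "similar", are assumptions for illustration.

import torch

def answer_user_query(input_text, knowledge_base, text_classification_model, tokenize):
    # knowledge_base: assumed list of (standard_text, reply_text) pairs stored in advance
    # tokenize: assumed function mapping a text pair to a [1, L] tensor of token ids
    text_classification_model.eval()
    with torch.no_grad():
        for standard_text, reply_text in knowledge_base:
            input_ids = tokenize(standard_text, input_text)   # text to be classified
            logits = text_classification_model(input_ids)     # classification result
            if logits.argmax(dim=-1).item() == 1:             # assumed: class 1 means "similar"
                return reply_text                              # reply text displayed to the user
    return None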
The present specification also provides a computer readable storage medium storing a computer program operable to perform the above-described training method of the text classification model shown in fig. 1.
The present specification also provides a schematic diagram of an electronic device, as shown in fig. 5. Fig. 5 is a schematic diagram of an electronic device corresponding to fig. 1 provided in the present specification; the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and may also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it to implement the training method of the text classification model shown in fig. 1.
Of course, this specification does not exclude other implementations, such as a logic device or a combination of hardware and software; that is, the execution subject of the processing flows is not limited to the logical units and may also be hardware or a logic device.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, as technology develops, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (for example, a field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can be readily obtained merely by slightly logically programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer readable program code, it is entirely possible to logically program the method steps so that the controller implements the same functionality in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for performing the various functions may also be regarded as structures within the hardware component. Or even the means for performing the various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The present specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts of the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiment is described relatively briefly since it is substantially similar to the method embodiment; for relevant parts, reference may be made to the corresponding description of the method embodiment.
The foregoing is merely an embodiment of the present specification and is not intended to limit the present specification. Various modifications and alterations to the present specification will be apparent to those skilled in the art. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the present specification shall be included within the scope of the claims of the present specification.

Claims (24)

1. A method of training a text classification model, comprising:
determining a text sample and determining a plurality of pre-trained teacher models; wherein, the parameters of each teacher model are different;
and executing the following steps for each teacher model in sequence, in order of the parameter quantity of each teacher model from small to large: inputting the text sample into the teacher model, determining a pseudo-label result, inputting the text sample into a student model to be trained, determining a classification result, and training the student model to be trained at least according to the pseudo-label result and the classification result obtained based on the teacher model;
taking the trained student model as a text classification model; the text classification model is used for determining a classification result of the text to be classified according to the text to be classified.
2. The method of claim 1, the student model to be trained comprising a feature extraction layer and a classification layer;
inputting the text sample into a student model to be trained, and determining a classification result, wherein the method specifically comprises the following steps of:
inputting the text sample into a feature extraction layer of a student model to be trained, and determining a feature sequence corresponding to the text sample;
taking the characteristics of the position corresponding to the teacher model in the characteristic sequence as output characteristics;
and inputting the output characteristics into a classification layer of the student model to be trained, and determining classification results.
3. The method of claim 2, training the student model to be trained based at least on the pseudo-label result obtained based on the teacher model and the classification result, specifically comprising:
taking the feature corresponding to the appointed position in the feature sequence as a first feature;
inputting the first characteristics into a classification layer of the student model to be trained, and determining a first result;
determining a label corresponding to the text sample;
and training the student model to be trained according to the pseudo-label result, the classification result, the first result and the label obtained based on the teacher model.
4. The method of claim 3, training the student model to be trained according to the pseudo-label result, the classification result, the first result and the label obtained based on the teacher model, specifically comprising:
determining a first task loss according to the first result and the label;
determining a second task loss according to a pseudo-label result and the classification result obtained based on the teacher model;
and training the student model to be trained according to the first task loss and the second task loss.
5. The method of claim 2, training the student model to be trained based at least on the pseudo-label result obtained based on the teacher model and the classification result, specifically comprising:
determining other teacher models according to the parameter quantity of the teacher model; wherein the parameter amount of the other teacher model is smaller than the parameter amount of the teacher model;
taking the features of the positions corresponding to the other teacher models in the feature sequence as second features;
inputting the second features into a classification layer of the student model to be trained, and determining a second result;
determining pseudo-label results corresponding to the other teacher models and taking the pseudo-label results as other results;
and training the student model to be trained at least according to the pseudo-label result, the classification result, the second result and the other results obtained based on the teacher model.
6. The method according to claim 5, wherein training the student model to be trained at least according to the pseudo-label result, the classification result, the second result and the other results obtained based on the teacher model specifically comprises:
taking the feature corresponding to the appointed position in the feature sequence as a first feature;
inputting the first characteristics into a classification layer of the student model to be trained, and determining a first result;
determining a label corresponding to the text sample;
and training the student model to be trained according to the pseudo-label result, the classification result, the second result, the other results, the first result and the label obtained based on the teacher model.
7. The method of claim 6, training the student model to be trained according to the pseudo-label result, the classification result, the second result, the other results, the first result and the label obtained based on the teacher model, specifically comprising:
determining a first task loss according to the first result and the label;
determining a second task loss according to a pseudo-label result and the classification result obtained based on the teacher model;
determining a third task loss according to the second result and the other results;
and training the student model to be trained according to the first task loss, the second task loss and the third task loss.
8. The method according to claim 7, training the student model to be trained according to the first task loss, the second task loss and the third task loss, specifically comprising:
respectively weighting the first task loss, the second task loss and the third task loss according to the assigned weight;
and training the student model to be trained according to the weighted first task loss, the weighted second task loss and the weighted third task loss.
9. The method of claim 1, wherein pre-training the plurality of teacher models specifically comprises:
determining a label corresponding to the text sample;
and training each teacher model to be trained based on the text sample and the label.
10. The method of claim 1, after determining the text sample, the method further comprising:
determining an initial student model and determining a label corresponding to the text sample;
and training the initial student model based on the text sample and the label to obtain a student model to be trained.
11. The method of claim 1, the method further comprising:
determining input text of a user in response to input operation of the user;
determining a pre-stored standard text;
taking the standard text and the input text as texts to be classified;
inputting the text to be classified into the text classification model, and determining a classification result of the text to be classified;
and when the classification result indicates that the standard text and the input text are similar, determining a reply text corresponding to the standard text, and displaying the reply text to the user.
12. A training device for a text classification model, comprising:
the first determining module is used for determining a text sample and determining a plurality of pre-trained teacher models; wherein, the parameters of each teacher model are different;
the first training module is configured to execute the following steps for each teacher model in sequence, in order of the parameter quantity of each teacher model from small to large: inputting the text sample into the teacher model, determining a pseudo-label result, inputting the text sample into a student model to be trained, determining a classification result, and training the student model to be trained at least according to the pseudo-label result and the classification result obtained based on the teacher model;
the second determining module is used for taking the trained student model as a text classification model; the text classification model is used for determining a classification result of the text to be classified according to the text to be classified.
13. The apparatus of claim 12, the student model to be trained comprising a feature extraction layer and a classification layer;
the first training module is specifically configured to input the text sample into a feature extraction layer of a student model to be trained, and determine a feature sequence corresponding to the text sample; taking the characteristics of the position corresponding to the teacher model in the characteristic sequence as output characteristics; and inputting the output characteristics into a classification layer of the student model to be trained, and determining classification results.
14. The apparatus of claim 13, wherein the first training module is specifically configured to take a feature corresponding to a specified position in the feature sequence as a second feature; inputting the second features into a classification layer of the student model to be trained, and determining a second result; determining a label corresponding to the text sample; and training the student model to be trained according to the pseudo-label result, the classification result, the second result and the label obtained based on the teacher model.
15. The apparatus of claim 14, the first training module being specifically configured to determine a first loss of task based on the second result and the annotation; determining a second task loss according to a pseudo-label result and the classification result obtained based on the teacher model; and training the student model to be trained according to the first task loss and the second task loss.
16. The apparatus of claim 13, wherein the first training module is specifically configured to determine other teacher models according to the parameter amounts of the teacher models; wherein the parameter amount of the other teacher model is smaller than the parameter amount of the teacher model; taking the features of the positions corresponding to the other teacher models in the feature sequence as first features; inputting the first features into a classification layer of the student model to be trained, and determining a first result; determining pseudo-label results corresponding to the other teacher models and taking the pseudo-label results as other results; and training the student model to be trained at least according to the pseudo-label result, the classification result, the first result and the other results obtained based on the teacher model.
17. The apparatus of claim 16, wherein the first training module is specifically configured to take a feature corresponding to a specified position in the feature sequence as a second feature; inputting the second features into a classification layer of the student model to be trained, and determining a second result; determining a label corresponding to the text sample; and training the student model to be trained according to the pseudo-label result, the classification result, the first result, the other results, the second result and the label obtained based on the teacher model.
18. The apparatus of claim 17, the first training module being specifically configured to determine a first loss of task based on the second result and the annotation; determining a second task loss according to the first result and the other results; determining a third task loss according to a pseudo-label result and the classification result obtained based on the teacher model; and training the student model to be trained according to the first task loss, the second task loss and the third task loss.
19. The apparatus of claim 18, the first training module being specifically configured to weight the first task loss, the second task loss, and the third task loss according to assigned weights, respectively; and training the student model to be trained according to the weighted first task loss, the weighted second task loss and the weighted third task loss.
20. The apparatus of claim 12, the apparatus further comprising:
the second training module is used for determining labels corresponding to the text samples; and training each teacher model to be trained based on the text sample and the label.
21. The apparatus of claim 12, the first determining module, after determining a text sample, further to determine an initial student model, and to determine a callout corresponding to the text sample; and training the initial student model based on the text sample and the label to obtain a student model to be trained.
22. The apparatus of claim 12, the apparatus further comprising:
the application module is used for responding to the input operation of a user and determining the input text of the user; determining a pre-stored standard text; taking the standard text and the input text as texts to be classified; inputting the text to be classified into the text classification model, and determining a classification result of the text to be classified; and when the classification result indicates that the standard text and the input text are similar, determining a reply text corresponding to the standard text, and displaying the reply text to the user.
23. A computer readable storage medium storing a computer program which, when executed by a processor, implements the method of any one of the preceding claims 1 to 11.
24. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of the preceding claims 1-11 when the program is executed.
CN202311754776.8A 2023-12-18 2023-12-18 Training method and device of text classification model, medium and electronic equipment Pending CN117786107A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311754776.8A CN117786107A (en) 2023-12-18 2023-12-18 Training method and device of text classification model, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311754776.8A CN117786107A (en) 2023-12-18 2023-12-18 Training method and device of text classification model, medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN117786107A true CN117786107A (en) 2024-03-29

Family

ID=90399361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311754776.8A Pending CN117786107A (en) 2023-12-18 2023-12-18 Training method and device of text classification model, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117786107A (en)

Similar Documents

Publication Publication Date Title
CN113221555B (en) Keyword recognition method, device and equipment based on multitasking model
CN116227474B (en) Method and device for generating countermeasure text, storage medium and electronic equipment
CN112417093B (en) Model training method and device
CN116127305A (en) Model training method and device, storage medium and electronic equipment
CN115618964B (en) Model training method and device, storage medium and electronic equipment
CN117194992A (en) Model training and task execution method and device, storage medium and equipment
CN117369783B (en) Training method and device for security code generation model
CN117828360A (en) Model training method, model training device, model code generating device, storage medium and storage medium
CN117591622A (en) Model training and service executing method, device, storage medium and equipment
CN113343085A (en) Information recommendation method and device, storage medium and electronic equipment
CN117216271A (en) Article text processing method, device and equipment
CN116662657A (en) Model training and information recommending method, device, storage medium and equipment
CN115017915B (en) Model training and task execution method and device
CN116861976A (en) Training method, device, equipment and storage medium of anomaly detection model
CN117786107A (en) Training method and device of text classification model, medium and electronic equipment
CN115017905A (en) Model training and information recommendation method and device
CN114912513A (en) Model training method, information identification method and device
CN113344590A (en) Method and device for model training and complaint rate estimation
CN116795972B (en) Model training method and device, storage medium and electronic equipment
CN116501852B (en) Controllable dialogue model training method and device, storage medium and electronic equipment
CN117807961B (en) Training method and device of text generation model, medium and electronic equipment
CN113821437B (en) Page test method, device, equipment and medium
CN116434787B (en) Voice emotion recognition method and device, storage medium and electronic equipment
CN115658891B (en) Method and device for identifying intention, storage medium and electronic equipment
CN117171346A (en) Entity linking method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination