CN111382271B - Training method and device of text classification model, text classification method and device - Google Patents

Training method and device of text classification model, text classification method and device

Info

Publication number
CN111382271B
Authority
CN
China
Prior art keywords
text
texts
text classification
comprehensive
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010156375.2A
Other languages
Chinese (zh)
Other versions
CN111382271A (en)
Inventor
刘俊宏 (Liu Junhong)
马良庄 (Ma Liangzhuang)
张望舒 (Zhang Wangshu)
温祖杰 (Wen Zujie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202010156375.2A
Publication of CN111382271A
Application granted
Publication of CN111382271B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Abstract

The embodiments of this specification provide a training method for a text classification model, comprising the following steps: first, obtaining N original texts and N corresponding text category labels, where N is a positive integer greater than 1; then splicing the N original texts to obtain a spliced text; performing one-hot encoding on the N text category labels to obtain N category label vectors; averaging the N category label vectors to obtain a comprehensive label vector; inputting the spliced text into a text classification model to obtain a comprehensive classification result; and training the text classification model based on the comprehensive classification result and the comprehensive label vector. In addition, the embodiments of this specification provide a text classification method, comprising: obtaining a target text to be classified, copying the target text to obtain N target texts, splicing the N target texts, and inputting the result into a text classification model obtained by the above training method, to obtain a text classification result for the target text.

Description

Training method and device of text classification model, text classification method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of natural language processing, and in particular, to a training method and apparatus for a text classification model, and a text classification method and apparatus.
Background
Many scenarios involve text classification. For example, in a web forum, posts published by users need to be categorized so that they can be presented in the forum sections of the corresponding categories (e.g., family and relationships). With the development of machine learning, training machine learning models to classify texts has become a research hotspot.
However, the accuracy of text classification results currently obtained using machine learning models is limited. A reasonable scheme is therefore needed that can effectively improve the accuracy of text classification results.
Disclosure of Invention
One or more embodiments of the present disclosure describe a text classification method and apparatus, which introduce the idea of mixup (a data enhancement scheme applied in the field of image processing) into text classification, so as to achieve enhancement of text data and thereby improve the accuracy of text classification results.
According to a first aspect, there is provided a training method for a text classification model, comprising: obtaining N original texts and N corresponding text category labels, where N is a positive integer greater than 1; splicing the N original texts to obtain a spliced text; performing one-hot encoding on the N text category labels to obtain N category label vectors; averaging the N category label vectors to obtain a comprehensive label vector; inputting the spliced text into a text classification model to obtain a comprehensive classification result for the N original texts; and training the text classification model based on the comprehensive classification result and the comprehensive label vector.
In one embodiment, training the text classification model based on the comprehensive classification result and the comprehensive label vector includes: determining a cross entropy loss based on the comprehensive classification result and the comprehensive label vector; and adjusting model parameters in the text classification model by utilizing the cross entropy loss.
In one embodiment, the N original texts are historical user session texts collected in a customer service scene, the N text category labels are standard question categories or standard question category identifications, and the text classification model is a question prediction model.
In one embodiment, the text classification model is based on a deep neural network (DNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), a Transformer model, or a BERT model.
According to a second aspect, there is provided a text classification method, comprising: obtaining a target text to be classified; copying the target text to obtain N target texts; and splicing the N target texts and inputting the result into a text classification model trained by the method provided in the first aspect, to obtain a text classification result for the target text.
According to a third aspect, there is provided a training device for a text classification model, comprising: an obtaining unit configured to obtain N original texts and N corresponding text category labels, where N is a positive integer greater than 1; a splicing unit configured to splice the N original texts to obtain a spliced text; an encoding unit configured to perform one-hot encoding on the N text category labels respectively to obtain N category label vectors; an averaging unit configured to average the N category label vectors to obtain a comprehensive label vector; a prediction unit configured to input the spliced text into a text classification model to obtain a comprehensive classification result for the N original texts; and a training unit configured to train the text classification model based on the comprehensive classification result and the comprehensive label vector.
According to a fourth aspect, there is provided a text classification apparatus comprising: an acquisition unit configured to acquire a target text to be classified; a copying unit configured to copy the target text to obtain N target texts; and a prediction unit configured to splice the N target texts and then input the result into a text classification model trained by the device according to the third aspect, to obtain a text classification result for the target text.
According to a fifth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first or second aspect.
According to a sixth aspect, there is provided a computing device comprising a memory having executable code stored therein and a processor which when executing the executable code implements the method of the first or second aspect.
In summary, with the training method and text classification method provided by the embodiments of this specification, massive training data can be constructed without modifying the literal text content, so that the information of the original texts is preserved and the problems faced by synonym-substitution-based text enhancement are avoided; the model performance of the text classification model can thereby be effectively improved, improving the accuracy, reliability, and usability of prediction results.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates a training process block diagram of a text classification model according to one embodiment;
FIG. 2 illustrates a block diagram of a process of using a text classification model according to an embodiment;
FIG. 3 illustrates a training method flow diagram for a text classification model according to one embodiment;
FIG. 4 illustrates a text classification method flow diagram according to an embodiment;
FIG. 5 illustrates an example diagram of a text classification method embodiment according to an example;
FIG. 6 illustrates a training device architecture diagram of a text classification model according to one embodiment;
FIG. 7 illustrates a block diagram of a text classification device according to an embodiment.
Description of the embodiments
The following describes the scheme provided in the present specification with reference to the drawings.
In the field of machine learning, data enhancement techniques may be employed to improve the model performance of a machine learning model (or prediction model). Specifically, based on the collected training data, virtual samples can be constructed through data enhancement and used to enrich the training data, thereby improving the prediction performance of the model as well as the accuracy and usability of its predictions.
At present, enhancement of text data is mainly synonym-replacement-based text enhancement. This approach requires a manually specified synonym table and suffers from replacement errors caused by errors in the synonym table or by polysemy, so the improvement in text classification model performance obtained from data enhanced this way is limited.
Furthermore, the mixup image data enhancement technique has achieved good results in image classification tasks, but mixup has not yet been applied to text. The central obstacle is that text cannot be linearly interpolated the way an image can: the average of two pixels is still a pixel, but there is no average word representing the midpoint of two words. This is a necessary consequence of the discreteness of text.
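For reference, a minimal formulation of image mixup (standard notation, not taken from this patent): a virtual sample is built from two labeled samples $(x_i, y_i)$ and $(x_j, y_j)$ by linear interpolation,

```latex
\tilde{x} = \lambda x_i + (1-\lambda)\, x_j, \qquad
\tilde{y} = \lambda y_i + (1-\lambda)\, y_j, \qquad \lambda \in [0,1].
```

Such interpolation is well defined for pixel values but, as noted above, has no direct counterpart for discrete words.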
Based on the above observation and analysis, the inventors draw on the idea of the mixup technique and propose a text classification method. In this method, massive training data can be constructed without modifying the literal text content, so that the information of the original texts is preserved and the problems faced by synonym-substitution-based text enhancement are avoided; the model performance of the text classification model can thereby be effectively improved, improving the accuracy, reliability, and usability of prediction results.
Specifically, the above text classification method involves a training phase and/or a use phase of the text classification model. For the training phase, in one embodiment, FIG. 1 shows a block diagram of the training process of a text classification model. As shown in FIG. 1, in the model training phase, first, according to a predetermined number N (a positive integer greater than or equal to 2), N original annotated samples are selected from an original annotated dataset, where each annotated sample includes a corresponding original text and a text category label. Then, on the one hand, the N original texts included in the N annotated samples are spliced to obtain a spliced text, which is input into the text classification model to obtain a prediction result; on the other hand, the N category labels are each one-hot encoded to obtain N label vectors, which are averaged to obtain an average vector. Model parameters in the text classification model are then adjusted based on the prediction result and the average vector. By repeating the flow shown in FIG. 1, multiple training iterations can be performed on the text classification model until the model converges, yielding the text classification model that is finally used.
Text classification can then be achieved based on the trained text classification model. Specifically, for the model use phase, in one embodiment, FIG. 2 shows a block diagram of the process of using a text classification model. As shown in FIG. 2, in the model use phase, first, a target text to be classified is obtained; then, according to the predetermined number N, N copies of the target text are spliced into a target spliced text; the target spliced text is then input into the trained text classification model to obtain a text classification result for the target text. In this way, a more accurate classification result for the target text can be obtained.
The specific implementation steps of the above model training method and use method are described below in connection with specific embodiments. In particular, FIG. 3 shows a flowchart of a training method of a text classification model according to an embodiment; the method may be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. As shown in FIG. 3, the method comprises the following steps:
step S310, obtaining N original texts and N corresponding text category labels, where N is a positive integer greater than 1; step S320, splicing the N original texts to obtain a spliced text; step S330, performing one-hot encoding on the N text category labels to obtain N category label vectors; step S340, averaging the N category label vectors to obtain a comprehensive label vector; step S350, inputting the spliced text into a text classification model to obtain a comprehensive classification result for the N original texts; step S360, training the text classification model based on the comprehensive classification result and the comprehensive label vector.
The steps are as follows:
First, in step S310, N original texts and the corresponding N text category labels are obtained, where N is a predetermined positive integer greater than 1, for example 2 or 3.
It should be noted that the original texts and text category labels may come from any text classification scenario. In one embodiment, the original text may be a historical user session collected in a customer service scenario, and the text category label may be a standard question category or a standard question category identifier, obtained through online collection (e.g., user feedback) or manual labeling. It should be understood that standard questions generally refer to standard questions summarized from users' high-frequency questions, referred to simply as questions. In a specific embodiment, the standard question category may be the text corresponding to the standard question; for example, the original text "how do I activate it" may correspond to the standard question category "how to activate Huabei". In a specific embodiment, the standard question category identifier, used to uniquely identify the standard question category, may be composed of numbers and/or letters. In one example, the standard question category identifier may be a number: say there are 3 standard question categories, then the 3 categories may be numbered 1, 2, and 3 respectively.
In another embodiment, the original text may be a content information text, and the corresponding text category label may be an information category or information category identifier. In a specific embodiment, the original texts and information categories may be collected from a news website or content recommendation platform. In one example, the original text may be the lead of a news story, and the corresponding information category may be "social news".
The sources and contents of the original texts and text category labels are described above. Accordingly, they may be obtained as follows: in one embodiment, according to the predetermined number N, N annotated samples comprising N original texts and N text category labels may be selected from an original annotated dataset that includes a plurality of original texts and text category labels.
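As a minimal illustrative sketch (the function name and the tiny dataset below are hypothetical, not from the patent), selecting N annotated samples might look like:

```python
import random

def sample_batch(dataset, n=2):
    """Randomly select n (original_text, category_label) pairs from the
    annotated dataset; n corresponds to the predetermined number N (N > 1)."""
    return random.sample(dataset, n)

# Hypothetical annotated dataset: customer-service texts with category numbers.
dataset = [("how do I activate it", 2),
           ("why was my payment rejected", 4),
           ("how to change my password", 1)]
texts_and_labels = sample_batch(dataset, n=2)
```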
In this way, N original texts and the corresponding N text category labels can be obtained. Next, through steps S320 to S340, the obtained N original texts and N text category labels are processed into one training sample for the text classification model.
Specifically, on the one hand, in step S320, the N original texts are spliced to obtain a spliced text. It should be noted that the N original texts may be spliced in any order; that is, there is no requirement on the splicing order.
In one embodiment, this step may include: determining the text vectors corresponding to the N original texts to obtain N text vectors, and then splicing the N text vectors to obtain the spliced vector corresponding to the spliced text. In a specific embodiment, determining the text vector corresponding to an original text may be implemented through word segmentation, word embedding, and the like; for details, refer to the related art, which is not repeated here.
In another embodiment, this step may include: first preprocessing each (call it a first) original text among the N original texts to obtain a first preprocessed text with a preset character count (e.g., 20 or 30), and then splicing the N preprocessed texts obtained by preprocessing the N original texts to obtain the spliced text. In a specific embodiment, the preprocessing may include: when the character count of the first original text is smaller than the preset character count, padding with a preset character (e.g., 0) to obtain the first preprocessed text; when the character count of the first original text is larger than the preset character count, truncating the first original text and keeping only the text of the preset character count as the first preprocessed text.
In the above manner, the N original texts are spliced to obtain the spliced text.
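The pad-or-truncate embodiment above can be sketched in a few lines of Python; the padding character, preset length, and function names are illustrative assumptions, not prescribed by the patent:

```python
PAD_CHAR = "0"    # the preset padding character from the embodiment
PRESET_LEN = 20   # the preset character count (e.g., 20 or 30)

def preprocess(text: str, preset_len: int = PRESET_LEN) -> str:
    """Pad with the preset character or truncate so that every
    original text has exactly preset_len characters."""
    if len(text) < preset_len:
        return text + PAD_CHAR * (preset_len - len(text))
    return text[:preset_len]

def splice(texts):
    """Concatenate the N preprocessed texts into one spliced text.
    The splicing order is arbitrary, per the embodiment."""
    return "".join(preprocess(t) for t in texts)

spliced_text = splice(["how do I activate it", "why was my payment rejected"])
```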
On the other hand, in step S330, the N text category labels are separately subjected to one-hot encoding, so as to obtain N category label vectors. In step S340, the N class label vectors are averaged to obtain a comprehensive label vector.
It should be appreciated that one-hot encoding, also known as one-bit-effective encoding, uses an M-bit status register to encode M states; each state has its own independent register bit, and only one bit is valid at any time. In one embodiment, where the text category labels span L categories in total, each text category label may be encoded as an L-dimensional vector in which the value of one dimension differs from the values of the remaining dimensions. In one example, assuming the text category labels are category numbers 1, 2, and 3, these 3 category labels may be encoded as (1, 0, 0), (0, 1, 0), and (0, 0, 1) in sequence. In this way, the N text category labels are each one-hot encoded, yielding N category label vectors.
Further, the N category label vectors may be averaged to obtain the comprehensive label vector. Averaging is chosen here because the original texts have equal status within the spliced text; therefore, the text category labels of the different original texts are assigned the same weight in the comprehensive label vector. It should be appreciated that the comprehensive label vector may indicate a classification result for the spliced text, or equivalently a comprehensive classification result for the N original texts.
According to a specific example, assume the text category labels span 4 categories in total, N is 2, and the 2 text category labels are 2 and 4 respectively; the 2 text category labels may then be encoded as (0, 1, 0, 0) and (0, 0, 0, 1) respectively. Averaging the two category label vectors gives the comprehensive label vector (0, 0.5, 0, 0.5).
According to another specific example, assume the text category labels span 3 categories in total, N is 2, and the 2 text category labels are 1 and 1 respectively; the 2 text category labels may then be encoded as (1, 0, 0) and (1, 0, 0) respectively. Averaging the two category label vectors gives the comprehensive label vector (1, 0, 0).
In this way, the comprehensive label vector corresponding to the N text category labels can be obtained.
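A minimal sketch of steps S330 and S340 (the helper names are assumptions; the example reproduces the 4-category case above):

```python
import numpy as np

def one_hot(label: int, num_classes: int) -> np.ndarray:
    """Encode a category label (1-based, as in the examples above)
    as an L-dimensional one-hot vector."""
    vec = np.zeros(num_classes)
    vec[label - 1] = 1.0
    return vec

def comprehensive_label(labels, num_classes: int) -> np.ndarray:
    """Average the N one-hot category label vectors into the
    comprehensive label vector."""
    return np.mean([one_hot(l, num_classes) for l in labels], axis=0)

# 4 categories, labels 2 and 4, as in the first example above.
print(comprehensive_label([2, 4], num_classes=4))  # [0.  0.5 0.  0.5]
```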
The spliced text and the comprehensive label vector obtained above together form one training sample. On this basis, in step S350, the spliced text is input into the text classification model to obtain a comprehensive classification result for the N original texts.
In one embodiment, the original texts are historical user sessions collected in a customer service scenario and the text category labels are question categories; accordingly, the text classification model may be a question prediction model. In another embodiment, the original texts are content information texts and the text category labels are information categories; accordingly, the text classification model may be an information category prediction model.
On the other hand, in one embodiment, the text classification model may be based on an artificial neural network, a decision tree algorithm, a Bayesian algorithm, or the like. In a specific embodiment, the text classification model may be based on a DNN (Deep Neural Network), a Transformer model, a BERT (Bidirectional Encoder Representations from Transformers) model, an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory network), and so on.
The spliced text is thus input into the text classification model, and a comprehensive classification result for the N original texts can be obtained. In one embodiment, the comprehensive classification result may be a classification result vector whose elements are the probabilities that the spliced text belongs to each text category. In one example, assuming 5 text categories in total, the comprehensive classification result may be (0.1, 0.6, 0.1, 0.1, 0.1), from which it can be seen that the probabilities of the spliced text belonging to categories 1 to 5 are 0.1, 0.6, 0.1, 0.1, and 0.1 respectively. In another embodiment, the comprehensive classification result may be the category to which the spliced text belongs, namely the category corresponding to the maximum of the per-category probabilities. In one example, assuming the probabilities of the spliced text belonging to categories 1 to 5 are 0.8, 0.05, 0.04, 0.05, and 0.06 respectively, it can be determined that the spliced text belongs to category 1, and (1, 0, 0, 0, 0) is determined as the comprehensive classification result.
The comprehensive label vector and the comprehensive classification result can thus be obtained. Then, in step S360, the text classification model is trained based on the comprehensive classification result and the comprehensive label vector, for later classification of target texts.
In one embodiment, model parameters of the text classification model may be adjusted using the comprehensive classification result, the comprehensive label vector, and a pre-selected loss function. In a specific embodiment where the loss function is a cross entropy loss function, this step may include: determining the cross entropy loss based on the comprehensive classification result and the comprehensive label vector, and adjusting the model parameters in the text classification model using the cross entropy loss. For specific parameter adjustment methods, refer to the prior art; they are not detailed here.
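Since the patent gives no code, the following PyTorch sketch shows one plausible training step; because the comprehensive label vector is a probability distribution rather than a hard class index, the cross entropy against this soft target is computed explicitly:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, spliced_input, soft_target):
    """One parameter update. `spliced_input` is the encoded spliced text
    (e.g., a tensor of token ids); `soft_target` is the comprehensive
    label vector, shape (batch, num_classes)."""
    logits = model(spliced_input)            # comprehensive classification result (logits)
    log_probs = F.log_softmax(logits, dim=-1)
    # Cross entropy against a soft target: H(p, q) = -sum_i p_i * log q_i
    loss = -(soft_target * log_probs).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                         # adjust model parameters
    return loss.item()
```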
By executing steps S310 to S360 as above, one training iteration of the text classification model is achieved; by repeating steps S310 to S360 multiple times until the model converges, the finally trained text classification model can be obtained.
In summary, with the training method for a text classification model provided by the embodiments of this specification, massive training data can be constructed without modifying the literal text content, so that the information of the original texts is preserved, the problems faced by synonym-substitution-based text enhancement are avoided, and the model performance of the text classification model can be effectively improved.
It should be noted that, after the trained text classification model is obtained as above, it may be used to classify target texts. In particular, FIG. 4 shows a flowchart of a text classification method according to an embodiment; the method may be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. As shown in FIG. 4, the method comprises the following steps:
step S410, obtaining a target text to be classified; step S420, copying the target text to obtain N target texts; step S430, splicing the N target texts and inputting the result into a text classification model trained by the method shown in FIG. 3, to obtain a text classification result for the target text.
Regarding the above steps, in one embodiment, for the target text to be classified, the N text positions at the input of the text classification model may all be set to the target text simultaneously, thereby achieving classification prediction for the target text.
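A minimal inference sketch under the same assumptions as the training sketches above; `encode` is a hypothetical placeholder for the tokenization the chosen model would actually use:

```python
import torch

N = 2  # must match the predetermined number used during training

def encode(text: str) -> torch.Tensor:
    """Placeholder tokenizer: maps characters to integer ids. A real
    system would use the model's own tokenizer instead."""
    return torch.tensor([[ord(c) % 1000 for c in text]])

def classify(model, target_text: str) -> int:
    """Duplicate the target text N times, splice the copies, and feed
    the result to the trained model; reuses the splice() sketch above."""
    spliced = splice([target_text] * N)
    with torch.no_grad():
        probs = torch.softmax(model(encode(spliced)), dim=-1)
    return int(probs.argmax(dim=-1))  # predicted category index (0-based)
```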
As can be seen from the foregoing, the text classification model used here has excellent model performance; therefore, the text classification result obtained with the above text classification method has higher accuracy and reliability.
The text classification method, covering both training and use of the model, is described below with a specific example. FIG. 5 shows an example diagram of a text classification method according to an example. As shown in FIG. 5, there are 5 text categories in total, and the table takes a combination of 2 samples (N=2) as an example. In the training phase of the classifier (i.e., the text classification model), text position one and text position two may be set to the texts of sample 1 and sample 2 respectively, to obtain a text classification result comprising the probabilities that the spliced text corresponding to the sample combination (1, 2) belongs to each category; the parameters of the classifier are then adjusted using this text classification result and the comprehensive label vector (0, 0.5, 0.5, 0, 0) corresponding to the sample combination (1, 2). Alternatively, after obtaining the text classification results corresponding to the sample combinations (1, 2) and (1, 1), the classifier parameters may be adjusted using these two text classification results and the two comprehensive label vectors corresponding to the two combinations. Training of the classifier can be achieved in this way. Further, in the use phase of the classifier, for a text to be predicted (i.e., a target text to be classified), text position one and text position two at the input of the classifier are simultaneously set to the text to be predicted, and a classification result for that text is obtained.
In summary, with the training method and text classification method for a text classification model provided by the embodiments of this specification, massive training data can be constructed without modifying the literal text content, so that the information of the original texts is preserved, the problems faced by synonym-substitution-based text enhancement are avoided, the model performance of the text classification model can be effectively improved, and the accuracy, reliability, and usability of prediction results are improved.
Further, in essence, the above text classification method adds a regularization effect to the model and improves its generalization ability, thereby markedly improving the text classification effect. Regarding regularization, take N=2 as an example: the model's behavior in the space between two samples A and B could otherwise take a very complex, irregular shape, whereas the text classification method that introduces the mixup idea stipulates that the midpoint of the two samples A and B must be mapped to the label 1/2 A + 1/2 B, which adds a constraint to the training of the model.
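Written out for general N, this constraint on the model f can be stated as

```latex
f\left(x_1 \oplus x_2 \oplus \cdots \oplus x_N\right) \;\approx\; \frac{1}{N}\sum_{k=1}^{N} y_k ,
```

where \oplus denotes text splicing and y_k is the one-hot label vector of the k-th original text; for N = 2 this is exactly the 1/2 A + 1/2 B midpoint constraint just described.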
An effectiveness test of the above text classification method was actually carried out, using 10000 labeled customer-service-robot samples as training data and 25000 as test data. Scheme 1: adopt the text classification method that introduces the mixup idea on the training set; scheme 2: the traditional classification method. The model trained with scheme 1 achieved 5.2% higher accuracy on the test set than the model trained with scheme 2, a substantial improvement.
Corresponding to the above training method and classification method, the embodiments of this specification also disclose a training apparatus and a classification apparatus, as follows:
FIG. 6 illustrates a training apparatus architecture diagram of a text classification model according to one embodiment, as illustrated in FIG. 6, the training apparatus 600 includes:
an obtaining unit 610, configured to obtain N original texts and N corresponding text category labels, where N is a positive integer greater than 1; a splicing unit 620, configured to splice the N original texts to obtain a spliced text; an encoding unit 630, configured to perform one-hot encoding on the N text category labels respectively to obtain N category label vectors; an averaging unit 640, configured to average the N category label vectors to obtain a comprehensive label vector; a prediction unit 650, configured to input the spliced text into a text classification model to obtain a comprehensive classification result for the N original texts; and a training unit 660, configured to train the text classification model based on the comprehensive classification result and the comprehensive label vector.
In one embodiment, the training unit 660 is specifically configured to: determining a cross entropy loss based on the comprehensive classification result and the comprehensive label vector; and adjusting model parameters in the text classification model by utilizing the cross entropy loss.
In one embodiment, the N original texts are historical user session texts collected in a customer service scene, the N text category labels are standard question categories or standard question category identifications, and the text classification model is a question prediction model.
In one embodiment, the text classification model is based on a deep neural network (DNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), a Transformer model, or a BERT model.
Fig. 7 shows a block diagram of a text classification apparatus according to an embodiment, and as shown in fig. 7, the classification apparatus 700 includes:
an acquisition unit 710, configured to acquire a target text to be classified; a copying unit 720, configured to copy the target text to obtain N target texts; and a prediction unit 730, configured to splice the N target texts and input the result into a text classification model trained by the apparatus shown in FIG. 6, to obtain a text classification result for the target text.
According to an embodiment of a further aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 3 or fig. 4.
According to an embodiment of yet another aspect, there is also provided a computing device including a memory having executable code stored therein and a processor that, when executing the executable code, implements the method described in connection with fig. 3 or 4.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The foregoing specific embodiments further describe the objectives, technical solutions, and advantages of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, improvement, or the like made on the basis of the technical solutions of the present invention shall be included in its scope of protection.

Claims (12)

1. A method of training a text classification model, comprising:
acquiring N original texts and N corresponding text class labels, wherein N is a positive integer greater than 1;
splicing the N original texts to obtain a spliced text;
performing one-hot encoding on the N text category labels to obtain N category label vectors;
averaging the N category label vectors to obtain a comprehensive label vector;
inputting the spliced text into a text classification model to obtain a comprehensive classification result for the N original texts;
training the text classification model based on the comprehensive classification result and the comprehensive label vector.
2. The method of claim 1, wherein training the text classification model based on the comprehensive classification result and the comprehensive label vector comprises:
determining a cross entropy loss based on the comprehensive classification result and the comprehensive label vector;
and adjusting model parameters in the text classification model by utilizing the cross entropy loss.
3. The method of claim 1, wherein the N original texts are historical user session texts collected in a customer service scene, the N text category labels are standard question categories or standard question category identifications, and the text classification model is a question prediction model.
4. The method of claim 1, wherein the text classification model is based on a deep neural network (DNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), a Transformer model, or a BERT model.
5. A text classification method, comprising:
acquiring a target text to be classified;
copying the target text to obtain N target texts;
splicing the N target texts and inputting the result into a text classification model trained by the method of claim 1, to obtain a text classification result for the target text.
6. A training device for a text classification model, comprising:
the acquiring unit is configured to acquire N original texts and N corresponding text category labels, wherein N is a positive integer greater than 1;
the splicing unit is configured to splice the N original texts to obtain spliced texts;
the encoding unit is configured to perform one-hot encoding on the N text category labels respectively to obtain N category label vectors;
the averaging unit is configured to average the N category label vectors to obtain a comprehensive label vector;
the prediction unit is configured to input the spliced text into a text classification model to obtain comprehensive classification results aiming at the N original texts;
and the training unit is configured to train the text classification model based on the comprehensive classification result and the comprehensive label vector.
7. The apparatus of claim 6, wherein the training unit is specifically configured to:
determining a cross entropy loss based on the comprehensive classification result and the comprehensive label vector;
and adjusting model parameters in the text classification model by utilizing the cross entropy loss.
8. The apparatus of claim 6, wherein the N original texts are historical user session texts collected in a customer service scene, the N text category labels are standard question categories or standard question category identifications, and the text classification model is a question prediction model.
9. The apparatus of claim 6, wherein the text classification model is based on a deep neural network (DNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), a Transformer model, or a BERT model.
10. A text classification device, comprising:
an acquisition unit configured to acquire a target text to be classified;
the copying unit is configured to copy the target text to obtain N target texts;
and the prediction unit is configured to splice the N target texts and then input the result into a text classification model trained by the device of claim 6, to obtain a text classification result for the target text.
11. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the method of any of claims 1-5.
12. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-5.
CN202010156375.2A 2020-03-09 2020-03-09 Training method and device of text classification model, text classification method and device Active CN111382271B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010156375.2A CN111382271B (en) 2020-03-09 2020-03-09 Training method and device of text classification model, text classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010156375.2A CN111382271B (en) 2020-03-09 2020-03-09 Training method and device of text classification model, text classification method and device

Publications (2)

Publication Number Publication Date
CN111382271A CN111382271A (en) 2020-07-07
CN111382271B true CN111382271B (en) 2023-05-23

Family

ID=71219960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010156375.2A Active CN111382271B (en) 2020-03-09 2020-03-09 Training method and device of text classification model, text classification method and device

Country Status (1)

Country Link
CN (1) CN111382271B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418278A (en) * 2020-11-05 2021-02-26 中保车服科技服务股份有限公司 Multi-class object detection method, terminal device and storage medium
CN112990443B (en) * 2021-05-06 2021-08-27 北京芯盾时代科技有限公司 Neural network evaluation method and device, electronic device, and storage medium
CN113434685A (en) * 2021-07-06 2021-09-24 中国银行股份有限公司 Information classification processing method and system
CN113836303A (en) * 2021-09-26 2021-12-24 平安科技(深圳)有限公司 Text type identification method and device, computer equipment and medium
CN114491040B (en) * 2022-01-28 2022-12-02 北京百度网讯科技有限公司 Information mining method and device


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5675710A (en) * 1995-06-07 1997-10-07 Lucent Technologies, Inc. Method and apparatus for training a text classifier
KR20190072823A (en) * 2017-12-18 2019-06-26 한국과학기술원 Domain specific dialogue acts classification for customer counseling of banking services using rnn sentence embedding and elm algorithm
CN109344403A (en) * 2018-09-20 2019-02-15 中南大学 A kind of document representation method of enhancing semantic feature insertion
CN110008342A (en) * 2019-04-12 2019-07-12 智慧芽信息科技(苏州)有限公司 Document classification method, apparatus, equipment and storage medium
CN110210513A (en) * 2019-04-23 2019-09-06 深圳信息职业技术学院 Data classification method, device and terminal device
CN110717039A (en) * 2019-09-17 2020-01-21 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium

Also Published As

Publication number Publication date
CN111382271A (en) 2020-07-07

Similar Documents

Publication Publication Date Title
CN111382271B (en) Training method and device of text classification model, text classification method and device
WO2022088672A1 (en) Machine reading comprehension method and apparatus based on bert, and device and storage medium
CN110717325B (en) Text emotion analysis method and device, electronic equipment and storage medium
US20190147336A1 (en) Method and apparatus of open set recognition and a computer readable storage medium
CN110705301A (en) Entity relationship extraction method and device, storage medium and electronic equipment
CN110580308B (en) Information auditing method and device, electronic equipment and storage medium
CN113807074A (en) Similar statement generation method and device based on pre-training language model
CN115146488A (en) Variable business process intelligent modeling system and method based on big data
CN112632993A (en) Electric power measurement entity recognition model classification method based on convolution attention network
CN113420122A (en) Method, device and equipment for analyzing text and storage medium
CN114972823A (en) Data processing method, device, equipment and computer medium
CN115984874A (en) Text generation method and device, electronic equipment and storage medium
CN113761845A (en) Text generation method and device, storage medium and electronic equipment
CN113408507A (en) Named entity identification method and device based on resume file and electronic equipment
CN112052329A (en) Text abstract generation method and device, computer equipment and readable storage medium
CN116958738A (en) Training method and device of picture recognition model, storage medium and electronic equipment
CN116311322A (en) Document layout element detection method, device, storage medium and equipment
CN116433934A (en) Multi-mode pre-training method for generating CT image representation and image report
CN112528674B (en) Text processing method, training device, training equipment and training equipment for model and storage medium
CN115496134A (en) Traffic scene video description generation method and device based on multi-modal feature fusion
WO2020058120A1 (en) Method and apparatus for determining existence of dependence violation, electronic device, medium and program
CN110232328A (en) A kind of reference report analytic method, device and computer readable storage medium
Singh et al. Application of error level analysis in image spam classification using deep learning model
CN116341554B (en) Training method of named entity recognition model for biomedical text
CN113657103B (en) Non-standard Chinese express mail information identification method and system based on NER

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant