WO2021143396A1 - Method and apparatus for carrying out classification prediction by using text classification model - Google Patents

Method and apparatus for carrying out classification prediction by using text classification model

Info

Publication number
WO2021143396A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
attention
sequence
label
text
Application number
PCT/CN2020/134518
Other languages
French (fr)
Chinese (zh)
Inventor
Xiong Tao (熊涛)
Original Assignee
Alipay (Hangzhou) Information Technology Co., Ltd. (支付宝(杭州)信息技术有限公司)
Application filed by Alipay (Hangzhou) Information Technology Co., Ltd.
Publication of WO2021143396A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G06F 16/353 - Clustering; Classification into predefined classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks

Definitions

  • One or more embodiments of the present specification relate to the field of machine learning, and in particular, to a method and device for classification prediction using a text classification model.
  • Text classification is a common and typical natural language processing task performed by computers and is widely used in a variety of business scenarios.
  • For example, in an intelligent customer service scenario, the questions raised by users need to be classified as input text, for user intention recognition, automatic question answering, or dispatch to human customer service.
  • In such a scenario, the classification categories can correspond to standard questions organized in advance.
  • In this way, the standard question corresponding to a user's casual, colloquial question description can be determined, and the answer to the question can then be determined and pushed to the user.
  • Alternatively, the classification categories can correspond to human customer service skill groups trained for different knowledge fields.
  • Text classification is also used in various other application scenarios, such as document classification, public opinion analysis, spam identification, and so on.
  • In these scenarios, the accuracy of text classification is the core concern. An improved solution that further increases classification accuracy is therefore desired.
  • One or more embodiments of this specification describe a method and device for text classification prediction using a text classification model.
  • During prediction, the text classification model comprehensively considers the semantic information of text fragments of different lengths and their correlation with the label description texts, thereby improving the accuracy and efficiency of classification prediction.
  • According to a first aspect, a method for classification prediction using a text classification model is provided, where the model is used to predict, among K predetermined categories, the category corresponding to an input text;
  • the text classification model includes an embedding layer, a convolutional layer, an attention layer, and a classifier,
  • the attention layer includes a first attention module, and the method includes: obtaining K label vectors corresponding to the K categories, where each label vector is obtained by word embedding the label description text of the corresponding category; using the embedding layer to perform word embedding on the input text to obtain a word vector sequence; inputting the word vector sequence into the convolutional layer, where the convolutional layer uses several convolution windows corresponding to text fragments of different lengths to perform convolution processing on the word vector sequence, obtaining several fragment vector sequences, the word vector sequence and the fragment vector sequences constituting a vector sequence set; inputting each vector sequence in the vector sequence set into the first attention module for first attention processing, obtaining the first sequence vector corresponding to each vector sequence, where the first attention processing includes determining, for each vector element in the input vector sequence, a first weighting factor according to the similarity between that vector element and the K label vectors, and using the first weighting factors to perform a weighted summation of the vector elements; obtaining a first attention representation of the input text according to the respective first sequence vectors; determining a characterization vector of the input text at least according to the first attention representation; and inputting the characterization vector into the classifier to obtain a category prediction result of the input text among the K categories.
  • the input text is a user question; correspondingly, the label description text corresponding to each of the K categories includes a standard question description text.
  • In different embodiments, the K label vectors are predetermined in the following manner: for each of the K categories, the label description text corresponding to the category is obtained; word embedding is performed on the label description text to obtain the word vector of each description word contained in it; and the word vectors of the description words are synthesized to obtain the label vector corresponding to the category.
  • According to one embodiment, the first weighting factor corresponding to each vector element is determined as follows: for each vector element in the input vector sequence, the K similarities between the vector element and the K label vectors are calculated; based on the maximum of the K similarities, the first weighting factor corresponding to the vector element is determined.
  • In different embodiments, calculating the K similarities between the vector element and the K label vectors may include: calculating the cosine similarity between the vector element and each label vector; or determining the similarity based on the Euclidean distance between the vector element and each label vector; or determining the similarity based on the dot product of the vector element and each label vector.
  • According to one embodiment, determining the first weighting factor corresponding to the vector element specifically includes: determining the mutual attention score of the vector element based on the maximum of the K similarities; and normalizing that mutual attention score over the mutual attention scores of all vector elements to obtain the first weighting factor corresponding to the vector element.
  • In different embodiments, obtaining the first attention representation of the input text according to the respective first sequence vectors may specifically include synthesizing the first sequence vectors to obtain the first attention representation, where the synthesis includes one of the following: summation, weighted summation, and averaging.
  • According to one implementation, the attention layer may further include a second attention module; correspondingly, the method further includes inputting each vector sequence in the vector sequence set into the second attention module for second attention processing, obtaining the second sequence vector corresponding to each vector sequence, where the second attention processing includes, for each vector element in the input vector sequence, determining the second weighting factor corresponding to the vector element according to the similarity between that element and each other vector element in the input vector sequence, and using the second weighting factors to weight and sum the vector elements of the input sequence; and obtaining a second attention representation of the input text according to the respective second sequence vectors.
  • the characterization vector may be determined according to the first attention representation and the second attention representation.
  • Further, in one embodiment, the second weighting factor corresponding to the vector element may be determined as follows: calculating the similarities between the vector element and each of the other vector elements; and determining the second weighting factor corresponding to the vector element based on the average of those similarities.
  • According to one implementation, the attention layer further includes a third attention module in which an attention vector is maintained; the method then further includes forming a total sequence based at least on splicing the vector sequences in the vector sequence set, and using the third attention module to perform third attention processing on the total sequence, where the third attention processing includes, for each vector element in the total sequence, determining the third weighting factor corresponding to the vector element according to the similarity between that element and the attention vector, and using the third weighting factors to weight and sum the vector elements of the total sequence, obtaining the third attention representation of the input text.
  • the characterization vector may be determined according to the first attention representation and the third attention representation.
  • In the case that the attention layer includes a first attention module, a second attention module, and a third attention module,
  • the characterization vector may be determined based on the first attention representation, the second attention representation, and the third attention representation.
  • Specifically, the first attention representation, the second attention representation, and the third attention representation may be weighted and summed based on predetermined weight coefficients to obtain the characterization vector.
  • According to one embodiment, the attention layer further includes a fusion module; before forming the total sequence input to the third attention module, the method further includes: inputting each vector sequence in the vector sequence set into the fusion module for fusion conversion processing, obtaining the fusion sequence corresponding to each vector sequence, where the fusion conversion processing includes, for each vector element in the input vector sequence, determining the label weight factor corresponding to each label vector according to the similarity between that element and each of the K label vectors, and converting the vector element into a fusion vector given by the weighted sum of the K label vectors based on the label weight factors, thereby converting the input vector sequence into the corresponding fusion sequence.
  • each vector sequence and each fusion sequence may be spliced to obtain the total sequence, which is input into the third attention module.
  • the input text is training text
  • the training text corresponds to a category label indicating its true category
  • In this case, the method further includes: obtaining a text prediction loss according to the category prediction result and the category label; determining a total prediction loss at least according to the text prediction loss; and updating the text classification model in the direction that reduces the total prediction loss, thereby training the text classification model.
  • Further, in one embodiment, the method further includes: inputting the K label vectors corresponding to the K categories into the classifier to obtain the corresponding K prediction results; and comparing each of the K categories with its corresponding prediction result, obtaining a label prediction loss based on the comparison results. In this case, the total loss can be determined according to the text prediction loss and the label prediction loss, so as to perform model training.
  • In another aspect, a device for classification prediction using a text classification model is provided, where the model is used to predict, among K predetermined categories, the category corresponding to an input text;
  • the text classification model includes an embedding layer, a convolutional layer, an attention layer, and a classifier,
  • the attention layer includes a first attention module,
  • the device includes: a label vector acquisition unit configured to acquire K label vectors corresponding to the K categories, where each label vector is obtained by word embedding the label description text of the corresponding category;
  • a word sequence obtaining unit configured to use the embedding layer to perform word embedding on the input text to obtain a word vector sequence;
  • a fragment sequence obtaining unit configured to input the word vector sequence into the convolutional layer, where the convolutional layer uses several convolution windows corresponding to text fragments of different lengths to perform convolution processing on the word vector sequence, obtaining several fragment vector sequences;
  • the word vector sequence and the fragment vector sequences constitute a vector sequence set;
  • and a first attention unit configured to input each vector sequence in the vector sequence set into the first attention module for first attention processing.
  • In another aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to execute the method of the first aspect.
  • In yet another aspect, a computing device is provided, including a memory and a processor, where the memory stores executable code; when the processor executes the executable code, the method of the first aspect is implemented.
  • Through the embodiments of this specification, the convolutional layer and the attention layer of the text classification model comprehensively consider text fragments of different lengths and their similarity to the label vectors to obtain the characterization vector, so that text classification based on this vector takes more account of contextual semantic information at different lengths and of relevance to the label description texts, yielding more accurate category prediction results.
  • Fig. 1 is a schematic diagram of a text classification model according to an embodiment disclosed in this specification;
  • Fig. 2 shows a flowchart of a method for text classification using a text classification model according to an embodiment;
  • Fig. 3 shows a schematic diagram of performing convolution processing on a word vector sequence in an embodiment;
  • Fig. 4 shows a schematic diagram of performing first attention processing on an input vector sequence in an embodiment;
  • Fig. 5 shows a schematic diagram of performing second attention processing on an input vector sequence in an embodiment;
  • Fig. 6 shows a schematic diagram of performing fusion conversion processing on an input vector sequence in an embodiment;
  • Fig. 7 shows a schematic diagram of the attention processing of the attention layer in an embodiment;
  • Fig. 8 shows the method steps further included in the model training stage;
  • Fig. 9 shows a schematic block diagram of a text classification prediction device according to an embodiment.
  • In the embodiments of this specification, a new text classification model is proposed, which further improves text classification and prediction by comprehensively considering the information of text fragments and of the label description texts.
  • FIG. 1 is a schematic diagram of a text classification model according to an embodiment disclosed in this specification.
  • the text classification model includes an embedding layer 11, a convolutional layer 12, an attention layer 13, and a classifier 14.
  • the embedding layer 11 uses a specific word embedding algorithm to convert each input word into a word vector. Using the embedding layer 11, the label description texts corresponding to the K categories as classification targets can be converted into K label vectors in advance. When classifying and predicting the input text, the embedding layer 11 embeds the input text and converts it into a sequence of word vectors.
  • the convolution layer 12 is used to perform convolution processing on the word vector sequence.
  • Specifically, the convolutional layer 12 uses multiple convolution kernels, or convolution windows, of different widths to perform convolution processing, thereby obtaining multiple fragment vector sequences, which represent the input text at the level of text fragments of different lengths.
  • the attention layer 13 adopts an attention mechanism and combines label vectors to process the above-mentioned vector sequences.
  • the attention layer 13 may include a first attention module 131 for performing first attention processing on the input vector sequence.
  • The first attention processing includes synthesizing the vector elements according to the similarity between each vector element in the input vector sequence and the aforementioned K label vectors, so as to obtain a sequence vector corresponding to the input vector sequence. The first attention processing can therefore also be referred to as label attention processing, and the first attention module as a mutual attention module (with labels).
  • the attention layer 13 may further include a second attention module 132 and/or a third attention module 133.
  • the second attention module 132 may be called an intra-attention module, which is used to synthesize each vector element according to the similarity between each vector element and other vector elements in the input vector sequence.
  • the third attention module 133 may be called a self-attention module, which is used to synthesize each vector element according to the similarity between each vector element in the input vector sequence and the attention vector.
  • the characterization vector of the input text can be obtained and input into the classifier 14.
  • the classifier 14 determines the classification corresponding to the input text based on the characterization vector, and realizes the classification prediction of the text.
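As a concrete illustration of this last step, the following is a minimal sketch (in Python/NumPy) of how a classifier might map the characterization vector to category probabilities. The patent does not specify the classifier's internal form; the linear-plus-softmax structure and all names here are assumptions for illustration.

```python
import numpy as np

def classify(S, W, b):
    """Map the characterization vector S (h,) to probabilities over K categories.
    W (h, K) and b (K,) are trainable classifier parameters (assumed linear form)."""
    z = S @ W + b                  # K category scores
    e = np.exp(z - z.max())        # numerically stable softmax
    return e / e.sum()             # predicted probability per category
```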
  • the text classification model shown in Figure 1 has at least the following characteristics.
  • the text classification model characterizes the input text at the level of text fragments of different lengths, and obtains multiple fragment-level vector sequences, so as to better explore the semantic information of contexts of different lengths.
  • In addition, the text classification model in this embodiment also embeds the label description text of each category, obtaining a label vector representation that carries semantic information.
  • In the attention layer, the sequence representation of each sequence is synthesized based on the similarity between the elements of the word vector sequence and fragment vector sequences on the one hand and the label vectors on the other.
  • In this way, the final representation vector of the input text contains similarity information between the vector sequences of different levels (the word level, and the levels of text fragments of different lengths) and the label vectors, so that both the contextual information of the input text and its semantic similarity to the label description texts are better used to classify the text, thereby improving classification accuracy.
  • Fig. 2 shows a flowchart of a method for text classification using a text classification model according to an embodiment. It can be understood that the method can be executed by any device, device, platform, or device cluster with computing and processing capabilities. As shown in Figure 2, the text classification process includes at least the following steps.
  • First, in step 21, K label vectors corresponding to the K categories serving as classification targets are obtained, where each label vector is obtained by word embedding the label description text of the corresponding category.
  • In a classification task, labels are generally used to represent the K categories.
  • The labels are, for example, the numbers 1 to K, the id numbers of the categories, or the one-hot codes of the K categories, and so on.
  • In these forms, the label itself often does not contain semantic information, but is just a code for the category.
  • On the other hand, each category often has corresponding description information describing the characteristics of its content, which can be used as the description information for the label, that is, the label description text.
  • The label description text therefore often contains semantic information related to the corresponding category.
  • For example, in an intelligent customer service scenario, the K categories serving as classification targets correspond to K predetermined standard questions.
  • In this case, the label description text of each category is the standard question description text corresponding to that category.
  • For example, the label description text of category 1 is the standard question 1 "How to repay Huabei" under this category,
  • and the label description text of category 2 is the standard question 2 "How much money can I borrow" under that category.
  • In another example, the classification targets are K categories corresponding to K predetermined human customer service skill groups.
  • In this case, the label description text of each category may be a description of the corresponding skill group, for example including the knowledge field of the skill group.
  • Similarly, in other text classification scenarios, the label description text corresponding to each category can be obtained.
  • By performing word embedding on the label description text, the label vector corresponding to each category can be obtained.
  • Specifically, the process of converting the label description text of a category into a label vector may include the following steps.
  • a specific word embedding algorithm is used to embed each descriptive word contained in the label description text to obtain the word vector of each descriptive word.
  • the aforementioned specific word embedding algorithm may be an algorithm in an existing word embedding tool, such as word2vec, or a pre-trained word embedding algorithm for a specific text scene. Assuming that the specific word embedding algorithm used converts each word into an h-dimensional vector, and the label description text contains m words, in this step, m h-dimensional vectors corresponding to the label description text are obtained.
  • Then, the word vectors of the description words are synthesized to obtain the label vector l_j corresponding to the category C_j.
  • For example, the m h-dimensional vectors obtained in the previous step may be synthesized, and the h-dimensional vector obtained after synthesis used as the label vector l_j.
  • The above synthesis can be averaging, summation, weighted summation, and so on, as sketched below.
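The following is a minimal sketch of this label-vector construction, using averaging as the synthesis step. The `embedding` lookup is a hypothetical stand-in for whatever word embedding algorithm (e.g. word2vec) is used.

```python
import numpy as np

def label_vector(description_words, embedding):
    """Embed each description word of a label description text and average the
    resulting word vectors (averaging is one of the synthesis options above)."""
    word_vecs = np.stack([embedding[w] for w in description_words])  # (m, h)
    return word_vecs.mean(axis=0)                                    # label vector l_j, (h,)

# One label vector per category, e.g. from K standard-question descriptions:
# L = np.stack([label_vector(words, embedding) for words in label_texts])   # (K, h)
```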
  • the embedding layer 11 may convert the label description texts of the K categories into label vectors in advance, and store the obtained K label vectors in the memory for use in classification prediction.
  • In this case, in step 21, the K pre-stored label vectors are read.
  • In another embodiment, in step 21, the respective label description texts of the K categories may be input into the embedding layer, and word embedding performed, to obtain the label vector of each category.
  • In step 22, using the embedding layer 11, word embedding is performed on the input text to obtain a word vector sequence.
  • Specifically, the embedding layer 11 adopts the aforementioned word embedding algorithm to perform word embedding on each word in the input text, so as to obtain the word vector sequence corresponding to the input text. Assuming that the input text contains N words {w_1, w_2, ..., w_N} arranged in sequence, the word vector sequence X_W can be obtained: X_W = (x_1, x_2, ..., x_N), where x_i is the h-dimensional word vector of word w_i.
  • steps 21 and 22 can be executed in parallel or in any order, which is not limited here.
  • Next, in step 23, the above word vector sequence is input into the convolutional layer 12, where several convolution kernels, or convolution windows, of different widths are used to perform convolution processing on the word vector sequence.
  • Fig. 3 shows a schematic diagram of performing convolution processing on a sequence of word vectors in an embodiment.
  • In the example of Fig. 3, a convolution window with a width of 5 (radius 2) is used for convolution processing.
  • For position i, the convolution window covers 5 consecutive word vectors centered on the current word, formed by that word vector and the two word vectors before and after it, namely (x_{i-2}, x_{i-1}, x_i, x_{i+1}, x_{i+2}); a convolution operation is performed on these 5 word vectors to obtain the fragment vector s_i corresponding to position i.
  • The aforementioned convolution operation may be a combination operation of word vectors defined by an activation function.
  • Sliding the window forward, a convolution operation is then performed on the 5 word vectors centered on the next word to obtain the fragment vector s_{i+1} corresponding to position i+1.
  • In this way, fragment vectors corresponding to all N positions are obtained, forming the fragment vector sequence corresponding to this convolution window.
  • To represent text fragments of different lengths, the convolutional layer uses several convolution windows of different widths for processing. For example, in a specific example, four convolution windows with widths of 3, 5, 9, and 15 are used to process the word vector sequence X_W separately, obtaining four fragment vector sequences X_S1, X_S2, X_S3, X_S4, which respectively represent the input text at the level of text fragments with lengths of 3, 5, 9, and 15 words.
  • In practice, the number of convolution windows used and the width of each convolution window can be determined according to factors such as the length of the input text and the lengths of the text fragments to be considered, so that several fragment vector sequences are obtained.
  • The above word vector sequence X_W and the several fragment vector sequences X_S together form a vector sequence set.
  • Each vector sequence in the set contains N h-dimensional vector elements and can be uniformly denoted as a vector sequence X; a convolution sketch follows below.
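A minimal sketch of this convolution step follows. The concatenate-then-project form of the convolution operation and the tanh activation are assumptions; the patent only specifies windows of several widths sliding over the word vector sequence.

```python
import numpy as np

def fragment_sequence(X_w, width, W, b):
    """Convolve the word vector sequence X_w (N, h) with a window of odd `width`,
    zero-padding the ends so every position i gets a fragment vector."""
    N, h = X_w.shape
    r = width // 2
    padded = np.pad(X_w, ((r, r), (0, 0)))         # pad r word vectors on each side
    frags = []
    for i in range(N):
        window = padded[i:i + width].reshape(-1)   # `width` consecutive word vectors
        frags.append(np.tanh(window @ W + b))      # combine into one fragment vector
    return np.stack(frags)                         # fragment vector sequence, (N, h)

# widths = [3, 5, 9, 15]
# vector_sequence_set = [X_w] + [fragment_sequence(X_w, w, Ws[w], bs[w]) for w in widths]
```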
  • Next, in step 24, each vector sequence X in the above vector sequence set is input into the first attention module of the attention layer for first attention processing, obtaining the first sequence vector corresponding to each vector sequence X.
  • As mentioned earlier, the first attention module is also called the mutual attention module (with labels), and correspondingly the first attention processing can also be called label attention processing: the vector elements are synthesized according to their similarity to the label vectors, yielding the corresponding sequence vector.
  • Specifically, the first attention processing may include: for each vector element x_i in the input vector sequence X, determining the first weighting factor corresponding to x_i according to the similarity between x_i and the K label vectors obtained in step 21, and using the first weighting factors to weight and sum the vector elements of the input sequence, obtaining the first sequence vector V1(X) corresponding to the input vector sequence X.
  • Determining the first weighting factor corresponding to the vector element x_i can be performed in the following manner.
  • In one embodiment, the similarity a_ij between the vector element x_i and the label vector l_j can be calculated as the cosine similarity, as shown in formula (1): a_ij = (x_i · l_j) / (||x_i|| ||l_j||)  (1)
  • In another embodiment, the similarity a_ij between x_i and l_j can be determined based on the Euclidean distance between the two: the greater the distance, the smaller the similarity.
  • In yet another embodiment, the similarity a_ij can be directly determined as the dot product (inner product) of x_i and l_j. In more examples, the similarity can also be determined in other ways.
  • Among the K similarities thus obtained for x_i, the maximum value can be determined, and the first weighting factor corresponding to x_i can be determined based on this maximum value.
  • Since the K categories differ from one another, the corresponding K label vectors are usually far apart in the vector space.
  • If the vector element x_i has a high similarity with any label vector l_j, the word or text fragment corresponding to that element is likely to be strongly related to the corresponding category j. The vector element x_i should therefore be given more attention, that is, a higher weight. Hence, in the above step, the first weighting factor of a vector element is determined from the maximum of its similarities.
  • In one embodiment, the maximum of the K similarities is directly used as the first weighting factor corresponding to x_i.
  • In another embodiment, the maximum of the K similarities of x_i is determined as the mutual attention score a_i of the vector element x_i; similarly, the mutual attention scores of all vector elements in the input vector sequence are obtained. Then, the mutual attention score a_i of x_i is normalized over the mutual attention scores of all vector elements, obtaining the first weighting factor α_i corresponding to x_i.
  • In a specific example, the above normalization is realized by the softmax function, as shown in formula (2): α_i = exp(a_i) / Σ_{k=1}^{N} exp(a_k)  (2)
  • Based on the first weighting factors obtained for the vector elements, the first attention module can weight and sum the vector elements to obtain the first sequence vector V1(X) of the input vector sequence X, namely: V1(X) = Σ_{i=1}^{N} α_i x_i. A sketch of this processing follows below.
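A minimal sketch of the first attention processing follows, using cosine similarity, max pooling over the K labels, and softmax normalization per formulas (1) and (2). Names are illustrative.

```python
import numpy as np

def label_attention(X, L):
    """First (label/mutual) attention: X is a vector sequence (N, h), L holds the
    K label vectors (K, h). Returns the first sequence vector V1(X)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Ln = L / np.linalg.norm(L, axis=1, keepdims=True)
    A = Xn @ Ln.T                       # cosine similarities a_ij, (N, K)
    a = A.max(axis=1)                   # mutual attention score a_i: max over labels
    e = np.exp(a - a.max())
    alpha = e / e.sum()                 # first weighting factors, formula (2)
    return alpha @ X                    # V1(X) = sum_i alpha_i * x_i, shape (h,)

# S_label can then be obtained by summing or averaging V1 over all sequences in the set.
```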
  • Fig. 4 shows a schematic diagram of performing first attention processing on an input vector sequence in an embodiment.
  • The K*N-dimensional similarity matrix formed in this way is called the label attention matrix.
  • A max-pooling operation is performed on the label attention matrix, that is, the maximum value in the column corresponding to each vector element is selected to obtain the mutual attention score of that element; the weighting factor of each element is then obtained from its mutual attention score, and the vector elements are weighted and summed based on these weighting factors to obtain the first sequence vector representation V1 of the input vector sequence.
  • the corresponding first sequence vectors can be obtained respectively.
  • For example, the word vector sequence X_W yields the corresponding first sequence vector V1(X_W),
  • and the several fragment vector sequences X_S yield the corresponding first sequence vectors V1(X_S).
  • Next, in step 25, the first attention representation S_label of the input text is obtained from the first sequence vectors corresponding to the above vector sequences.
  • Specifically, the first sequence vectors, including V1(X_W) and the several V1(X_S), can be synthesized.
  • The synthesis method can include summation, weighted summation, averaging, etc., so as to obtain the first attention representation S_label.
  • Then, in step 26, the characterization vector S of the input text is determined at least according to the above first attention representation S_label.
  • In one embodiment, the first attention representation can be directly used as the characterization vector S.
  • Finally, in step 27, the characterization vector S is input into the classifier 14, and through the operation of the classifier, the category prediction result of the input text among the K categories is obtained.
  • According to one embodiment, the attention layer 13 may further include a second attention module 132 and/or a third attention module 133.
  • the processing procedures of the second attention module and the third attention module are described below.
  • As mentioned earlier, the second attention module 132 is also known as the intra-attention module, which synthesizes the vector elements according to the similarity between each vector element in the input vector sequence and the other vector elements.
  • Specifically, when a vector sequence X is input into the second attention module 132, the module performs second attention processing, also called internal attention processing, on the input sequence.
  • The internal attention processing specifically includes: for each vector element x_i in the input vector sequence X, determining the second weighting factor corresponding to x_i according to the similarity between that element and each other vector element x_j in X, and using the second weighting factors to weight and sum the vector elements of the input sequence, obtaining the second sequence vector V2(X) corresponding to the input vector sequence X.
  • Determining the second weighting factor corresponding to the vector element x_i can be performed in the following manner.
  • First, each similarity a_ij between the vector element x_i and each other vector element x_j is calculated.
  • the calculation of the similarity can adopt the cosine similarity, or it can be determined based on other methods such as the vector distance, the vector dot multiplication result, etc., which will not be repeated here.
  • In one embodiment, the mean of the aforementioned similarities is directly used as the second weighting factor corresponding to x_i.
  • In another embodiment, the mean similarity of x_i is determined as the internal attention score a_i of the vector element x_i, and based on the internal attention scores of all vector elements, normalization processing, for example by the softmax function, is performed to obtain the second weighting factor corresponding to x_i.
  • Based on the second weighting factors obtained for the vector elements, the second attention module can weight and sum the vector elements to obtain the second sequence vector V2(X) of the input vector sequence X, namely: V2(X) = Σ_{i=1}^{N} b_i x_i, where b_i is the second weighting factor of x_i. A sketch follows below.
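A minimal sketch of the internal attention processing follows; whether the self-similarity of an element is included in the average is not specified in the text, and this sketch excludes it.

```python
import numpy as np

def intra_attention(X):
    """Second (internal) attention: score each element of X (N, h) by its mean
    cosine similarity to the other elements, then softmax and weighted-sum."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    A = Xn @ Xn.T                       # internal attention matrix, (N, N)
    np.fill_diagonal(A, 0.0)            # drop self-similarity (assumption)
    a = A.sum(axis=1) / (len(X) - 1)    # mean similarity to the other elements
    e = np.exp(a - a.max())
    b = e / e.sum()                     # second weighting factors
    return b @ X                        # V2(X), shape (h,)
```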
  • Fig. 5 shows a schematic diagram of performing second attention processing on an input vector sequence in an embodiment.
  • In the schematic of Fig. 5, the N vector elements of the input vector sequence are arranged as rows and columns respectively, and the similarity between each pair of vector elements x_i and x_j is calculated, forming an N*N-dimensional similarity matrix, called the internal attention matrix.
  • An average-pooling operation is performed on the internal attention matrix, that is, the average of the column of similarity values corresponding to each vector element is calculated to obtain the internal attention score of that element; the weighting factor of each element is then obtained from its internal attention score, and the vector elements are weighted and summed based on these weighting factors to obtain the second sequence vector representation V2 of the input vector sequence.
  • Each vector sequence X in the aforementioned vector sequence set can be input into the second attention module 132 for the above internal attention processing, so as to obtain the respectively corresponding second sequence vectors V2(X), including V2(X_W) corresponding to the word vector sequence X_W and a number of second sequence vectors V2(X_S) corresponding to the fragment vector sequences X_S.
  • Furthermore, the second sequence vectors V2(X) corresponding to the above vector sequences can be synthesized to obtain the second attention representation S_intra of the input text.
  • In this case, the step 26 of determining the characterization vector S in Fig. 2 may include determining the characterization vector S based on the first attention representation S_label and the second attention representation S_intra.
  • Specifically, the first attention representation S_label and the second attention representation S_intra can be synthesized through a variety of methods, such as summation, weighted summation, or averaging, to obtain the characterization vector S.
  • the attention layer 13 may further include a third attention module 133.
  • the third attention module 133 may be referred to as a self-attention module, which is used to perform self-attention processing, that is, to synthesize each vector element according to the similarity between each vector element in the input vector sequence and the attention vector.
  • Specifically, the self-attention module 133 maintains an attention vector v, which has the same dimension, h, as the vectors obtained by word embedding.
  • the parameters contained in the attention vector v can be determined through training.
  • Unlike the first and second attention modules, the third attention module 133 operates on a total sequence X′ formed from the vector sequences in the vector sequence set.
  • The third attention module 133 performs third attention processing, that is, self-attention processing, on the total sequence X′. This specifically includes: for each vector element x_i in the total sequence X′, determining the third weighting factor corresponding to the vector element according to the similarity between x_i and the attention vector v, and using the third weighting factors to weight and sum the vector elements of the total sequence, obtaining the third attention representation of the input text.
  • Determining the third weighting factor corresponding to the vector element x_i can be performed in the following manner.
  • the calculation of similarity can adopt cosine similarity, or it can be determined based on other methods such as vector distance, vector dot multiplication result, etc., which will not be repeated here.
  • Based on the similarity between the vector element x_i and the attention vector v, which serves as the self-attention score of x_i, the third weighting factor corresponding to x_i is determined.
  • In one embodiment, the above self-attention score is directly used as the third weighting factor corresponding to x_i.
  • In another embodiment, the self-attention scores are normalized, and the third weighting factor corresponding to the vector element x_i is thus obtained.
  • In a specific example, the similarity between the vector element x_i and the attention vector v is calculated as the vector dot product, and the normalization is performed by the softmax function, so that the following third weighting factor is obtained: γ_i = exp(v^T x_i) / Σ_{j=1}^{M} exp(v^T x_j), where v^T is the transpose of the attention vector v and M is the number of vector elements contained in the total sequence X′.
  • Based on the third weighting factors, the third attention module can weight and sum the vector elements. Since the total sequence already contains the information of each vector sequence, the result of processing the total sequence can be directly used as the third attention representation S_self of the input text, namely: S_self = Σ_{i=1}^{M} γ_i x_i. A sketch follows below.
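A minimal sketch of the self-attention processing follows, using the dot-product score and softmax normalization of the formula above. The attention vector v would be a trainable parameter.

```python
import numpy as np

def self_attention(X_total, v):
    """Third (self) attention over the total sequence X_total (M, h) with a
    trainable attention vector v (h,). Returns S_self."""
    scores = X_total @ v                 # v^T x_i for each of the M elements
    e = np.exp(scores - scores.max())
    gamma = e / e.sum()                  # third weighting factors
    return gamma @ X_total               # S_self = sum_i gamma_i * x_i, shape (h,)
```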
  • In the above description, the third attention module 133 performs self-attention processing on the total sequence X′ formed by splicing the vector sequences together, obtaining the third attention representation.
  • In another embodiment, each vector sequence can additionally be fused and transformed into a corresponding fusion sequence, and the fusion sequences and the vector sequences spliced together to form a more comprehensive total sequence X′.
  • To this end, in one embodiment, the attention layer 13 further includes a fusion module, which performs fusion conversion processing on an input vector sequence X and converts it into a corresponding fusion sequence Q.
  • The fusion conversion processing may specifically include: for each vector element x_i in the input vector sequence X, determining the label weight factor corresponding to each label vector l_j according to the similarity between x_i and each of the aforementioned K label vectors, and converting x_i, based on the label weight factors, into the fusion vector q_i given by the weighted sum of the K label vectors, thereby converting the input vector sequence X into the corresponding fusion sequence Q.
  • Specifically, the process of converting the vector element x_i into the fusion vector q_i can be performed in the following manner.
  • First, the similarity a_ij between x_i and each label vector l_j is calculated; the similarity calculation can be realized, for example, by formula (1), or determined based on vector distance, dot product, etc., which will not be repeated here.
  • Then, based on the similarities a_ij, the label weight factor corresponding to each label vector l_j is determined.
  • In one embodiment, the similarity a_ij is directly used as the label weight factor w_ij corresponding to the label vector l_j.
  • In another embodiment, the similarities a_ij are normalized, and the normalized values are used as the label weight factors w_ij corresponding to the label vectors l_j.
  • In a specific example, the label weight factor can be determined by the following formula: w_ij = exp(a_ij) / Σ_{k=1}^{K} exp(a_ik).
  • Then, the label vectors can be weighted and summed based on the label weight factors, thereby converting the vector element x_i into the fusion vector q_i: q_i = Σ_{j=1}^{K} w_ij l_j.
  • Fig. 6 shows a schematic diagram of performing fusion conversion processing on an input vector sequence in an embodiment.
  • In the schematic of Fig. 6, with the N vector elements of the input vector sequence X as columns and the K label vectors as rows, the similarity between each vector element x_i and each label vector l_j is calculated, forming a similarity matrix.
  • For each vector element x_i, based on the similarities corresponding to that element in the similarity matrix, the label weight factor corresponding to each label vector is determined, and the label vectors are weighted and summed based on these factors to obtain the fusion vector q_i corresponding to x_i.
  • By thus converting each vector element x_i in the input vector sequence X into the corresponding fusion vector q_i,
  • the vector sequence X can be converted into a fusion sequence Q, as sketched below.
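A minimal sketch of the fusion conversion follows, using dot-product similarity and softmax-normalized label weight factors (one of the options described above).

```python
import numpy as np

def fuse_sequence(X, L):
    """Fusion conversion: replace each element x_i of X (N, h) by the weighted
    sum q_i of the K label vectors L (K, h), weighted by normalized similarity."""
    A = X @ L.T                                    # similarities a_ij, (N, K)
    e = np.exp(A - A.max(axis=1, keepdims=True))
    W = e / e.sum(axis=1, keepdims=True)           # label weight factors per element
    return W @ L                                   # fusion sequence Q, (N, h)

# X_total = np.concatenate(sequences + [fuse_sequence(X, L) for X in sequences])
```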
  • By performing such fusion conversion processing on each vector sequence in the vector sequence set, the corresponding fusion sequences can be obtained, for example, the fusion sequence Q_W corresponding to the word vector sequence X_W and the fusion sequences Q_S corresponding to the fragment vector sequences X_S.
  • Then, each original vector sequence (X_W, X_S1, X_S2, ...) and each fusion sequence (Q_W, Q_S1, Q_S2, ...) obtained above can be spliced to obtain the total sequence X′.
  • The third attention module 133 is then used to process the total sequence X′ to obtain the third attention representation S_self.
  • In the case that the attention layer includes the first and third attention modules, the step 26 of determining the characterization vector S in Fig. 2 may include determining the characterization vector S based on the first attention representation S_label and the third attention representation S_self.
  • For example, the first attention representation S_label and the third attention representation S_self can be synthesized in a variety of ways to obtain the characterization vector S.
  • In the case that the attention layer includes the first, second, and third attention modules, the step 26 of determining the characterization vector S in Fig. 2 may include determining the characterization vector S based on the first attention representation S_label, the second attention representation S_intra, and the third attention representation S_self.
  • Specifically, the first, second, and third attention representations can be weighted and summed based on predetermined weight coefficients to obtain the characterization vector S, as shown in the following formula: S = λ_1 S_label + λ_2 S_intra + λ_3 S_self,
  • where λ_1, λ_2, and λ_3 are weight coefficients, which may be preset hyperparameters.
  • FIG. 7 shows a schematic diagram of attention processing of the attention layer in an embodiment.
  • the schematic diagram shows the input and output of each attention module when the attention layer contains the first, second and third attention modules.
  • As shown in Fig. 7, the input of the first attention module includes the vector sequence set consisting of the word vector sequence X_W and the fragment vector sequences X_S, together with the K label vectors.
  • For each vector sequence X in the set, the first attention module obtains the first sequence vector of that sequence according to the similarity between its vector elements and the K label vectors. By synthesizing the first sequence vectors, the first attention representation S_label of the input text is obtained.
  • The input of the second attention module includes the aforementioned vector sequence set. For each vector sequence X in the set, the second attention module obtains the second sequence vector of that sequence according to the similarities among its vector elements. By synthesizing the second sequence vectors, the second attention representation S_intra of the input text is obtained.
  • the input of the fusion module includes the aforementioned vector sequence set and K label vectors.
  • Through fusion conversion processing, the fusion module converts each vector sequence X in the vector sequence set into a fusion sequence Q, and outputs the fusion sequence corresponding to each vector sequence.
  • The input of the third attention module is the total sequence formed by splicing the vector sequences in the vector sequence set with the corresponding fusion sequences.
  • The third attention module performs self-attention processing on the total sequence and obtains the third attention representation S_self of the input text.
  • The final characterization vector of the input text can then be synthesized based on the outputs of the first, second, and third attention modules.
  • More generally, the attention layer includes the first attention module, and may further include the second attention module and/or the third attention module.
  • the process of classifying and predicting the input text is not only applicable to the training phase of the text classification model, but also applicable to the use phase after the model training is completed.
  • the input text input to the model is training text
  • the training text corresponds to a category label y indicating its true category.
  • the model needs to be trained based on the foregoing category prediction result.
  • the training process is shown in FIG. 8.
  • FIG. 8 shows the method steps further included in the model training stage.
  • First, in step 81, the text prediction loss L_text is obtained according to the category prediction result y′ for the training text and the category label y of the training text.
  • Specifically, the category prediction result y′ is obtained by the classifier 14 applying a predetermined classification function to the characterization vector S of the input text. The category prediction result can therefore be expressed as: y′ = f_c(S),
  • where f_c is the classification function.
  • Generally, the category prediction result y′ includes the predicted probabilities that the current training text belongs to each of the K predetermined categories. The text prediction loss L_text can therefore be obtained, through a loss function in the form of cross entropy, from the probability distribution indicated by y′ and the true classification indicated by the category label y. In other embodiments, other known loss function forms can also be used to obtain the text prediction loss L_text.
  • Then, in step 82, the total prediction loss L is determined based at least on the aforementioned text prediction loss L_text.
  • In one embodiment, the text prediction loss is directly determined as the total prediction loss L.
  • Next, in step 83, the text classification model is updated in the direction that reduces the total prediction loss L.
  • gradient descent, back propagation and other methods can be used to adjust the model parameters in the text classification model, so that the total prediction loss L is reduced until a predetermined convergence condition is reached, thereby realizing the training of the model.
  • Specifically, the K label vectors l_j (j from 1 to K) corresponding to the K categories can be input into the classifier 14 respectively, so that the classifier performs classification prediction based on each input label vector, obtaining the corresponding K label prediction results,
  • where the label prediction result y″_j corresponding to label vector l_j can be expressed as: y″_j = f_c(l_j).
  • Then, the K categories and their corresponding label prediction results are compared respectively, and the label prediction loss L_label is obtained based on the comparison results.
  • Specifically, for each category, a cross-entropy loss function can be used to obtain the label prediction loss under that category; the label prediction losses of the categories are then summed to obtain the total label prediction loss L_label.
  • In this case, the step 82 of determining the total loss in Fig. 8 may include determining the total loss L according to the text prediction loss L_text and the label prediction loss L_label.
  • For example, the total loss L may be determined as: L = L_text + η · L_label,
  • where η is a hyperparameter. A sketch of this training objective follows below.
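A minimal sketch of this training objective follows; the classifier output is taken as a probability vector, and the hyperparameter value is purely illustrative.

```python
import numpy as np

def cross_entropy(probs, true_idx):
    """Cross-entropy loss of a predicted probability vector against a true class."""
    return -np.log(probs[true_idx] + 1e-12)

def total_loss(y_pred, y_true, label_preds, eta=0.1):
    """Total loss L = L_text + eta * L_label: the j-th label vector, fed through
    the classifier, should itself be predicted as category j."""
    L_text = cross_entropy(y_pred, y_true)
    L_label = sum(cross_entropy(label_preds[j], j) for j in range(len(label_preds)))
    return L_text + eta * L_label
```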
  • In this way, the classifier can be trained in a more targeted manner, achieving a better training effect.
  • After training is completed in the above manner, the text classification model can be used to classify and predict input text of unknown category.
  • Since the classification prediction model combines semantic information at the level of text fragments of different lengths with the semantic information of the label description texts, classification prediction of text can be realized with higher accuracy.
  • a device for classification prediction using a text classification model is provided.
  • the device is used to predict the category corresponding to the input text in the predetermined K categories.
  • The text classification model used includes an embedding layer, a convolutional layer, an attention layer, and a classifier, where the attention layer includes the first attention module, as shown in Fig. 1.
  • the above classification prediction device can be deployed in any device, platform or device cluster with computing and processing capabilities.
  • Fig. 9 shows a schematic block diagram of a text classification prediction device according to an embodiment. As shown in FIG. 9, the prediction device 900 includes the following units.
  • the label vector obtaining unit 901 is configured to obtain K label vectors respectively corresponding to the K categories, where each label vector is obtained by word embedding the label description text of the corresponding category;
  • the word sequence obtaining unit 902 is configured to use the embedding layer to perform word embedding on the input text to obtain a word vector sequence;
  • The segment sequence acquiring unit 903 is configured to input the word vector sequence into the convolutional layer, where the convolutional layer uses a number of convolution windows corresponding to text segments of different lengths to perform convolution processing on the word vector sequence, obtaining several fragment vector sequences; the word vector sequence and the fragment vector sequences constitute a vector sequence set;
  • the first attention unit 904 is configured to input each vector sequence in the vector sequence set into the first attention module for first attention processing, obtaining the first sequence vector corresponding to each vector sequence, where the first attention processing includes determining the first weighting factor corresponding to each vector element according to the similarity between each vector element in the input vector sequence and the K label vectors, and using the first weighting factors to perform a weighted summation of the vector elements;
  • the first representation obtaining unit 905 is configured to obtain the first attention representation of the input text according to the respective first sequence vectors;
  • the characterization vector determining unit 906 is configured to determine the characterization vector of the input text at least according to the first attention representation;
  • the prediction result obtaining unit 907 is configured to input the characterization vector into the classifier to obtain category prediction results of the input text in the K categories.
  • the input text is a user question; correspondingly, the label description text corresponding to each of the K categories includes a standard question description text.
  • In different embodiments, the label vector obtaining unit 901 is configured to predetermine the K label vectors in the following manner: for each of the K categories, obtaining the label description text corresponding to the category; performing word embedding on the description text to obtain the word vector of each description word contained in it; and synthesizing the word vectors of the description words to obtain the label vector corresponding to the category.
  • According to one embodiment, the first weighting factor corresponding to each vector element is determined by the following method: for each vector element in the input vector sequence, the K similarities between the vector element and the K label vectors are calculated; based on the maximum of the K similarities, the first weighting factor corresponding to the vector element is determined.
  • the K similarities between the vector element and the K label vectors can be calculated by: calculating the cosine similarity between the vector element and each label vector; or, based on the vector element and each label The Euclidean distance between the vectors determines the similarity; or, based on the dot product result of the vector element and each label vector, the similarity is determined.
  • determining the first weighting factor corresponding to the vector element based on the maximum value of the K similarities may include: determining the mutual attention of the vector element based on the maximum value of the K similarities Force score; according to each mutual attention score corresponding to each vector element, normalize the mutual attention score of the vector element to obtain the first weighting factor corresponding to the vector element.
  • According to an embodiment, the first representation obtaining unit 905 obtains the first attention representation of the input text by synthesizing the respective first sequence vectors, where the synthesis includes one of the following: summation, weighted summation, and averaging.
  • the attention layer of the text classification model further includes a second attention module.
  • In this case, the device 900 further includes (not shown in the figure) a second attention unit and a second representation acquisition unit, where: the second attention unit is configured to input each vector sequence in the vector sequence set into the second attention module for second attention processing, obtaining the second sequence vector corresponding to each vector sequence, where the second attention processing includes, for each vector element in the input vector sequence, determining the second weighting factor corresponding to the vector element according to the similarity between that element and each other vector element in the input vector sequence, and using the second weighting factors to weight and sum the vector elements of the input sequence; and the second representation acquisition unit is configured to obtain the second attention representation of the input text according to the respective second sequence vectors.
  • In this case, the characterization vector determining unit 906 in Fig. 9 is configured to determine the characterization vector according to the first attention representation and the second attention representation.
  • Further, in one embodiment, the second weighting factor corresponding to the vector element can be determined in the following manner: calculating the respective similarities between the vector element and the other vector elements; and determining the second weighting factor corresponding to the vector element based on the average of those similarities.
  • the attention layer further includes a third attention module in which attention vectors are maintained.
  • the device 900 further includes (not shown in the figure) a total sequence forming unit and a third attention unit, wherein,
  • the total sequence forming unit is configured to form a total sequence based at least on the splicing of each vector sequence in the vector sequence set; the third attention unit is configured to use the third attention module to perform a third operation on the total sequence Attention processing, the third attention processing includes, for each vector element in the total sequence, determining a third weight corresponding to the vector element according to the similarity between the vector element and the attention vector Factor, and use the third weighting factor to weight and sum each vector element in the total sequence to obtain the third attention representation of the input text.
  • In this case, the aforementioned characterization vector determining unit 906 is configured to determine the characterization vector according to the first attention representation and the third attention representation.
  • In an embodiment in which the attention layer includes the first, second, and third attention modules, the aforementioned characterization vector determining unit 906 is configured to determine the characterization vector according to the first attention representation, the second attention representation, and the third attention representation.
  • Specifically, the characterization vector determining unit 906 may perform a weighted summation of the first, second, and third attention representations based on predetermined weight coefficients to obtain the characterization vector.
  • the attention layer further includes a fusion module.
  • the device 900 further includes a fusion unit (not shown) configured to input each vector sequence of the vector sequence set into the fusion module for fusion conversion processing, obtaining a fusion sequence for each vector sequence.
  • the fusion conversion processing includes, for each vector element of the input vector sequence, determining a label weighting factor for each of the K label vectors according to the similarity between the vector element and that label vector, and, based on the label weighting factors, converting the vector element into a fusion vector that is the weighted sum of the K label vectors, thereby converting the input vector sequence into a corresponding fusion sequence.
  • the total sequence forming unit may be configured to concatenate the respective vector sequences and the respective fusion sequences to obtain the total sequence.
  • the input text is a training text that has a category label indicating its true category.
  • the device 900 further includes a training unit (not shown) configured to obtain a text prediction loss according to the category prediction result and the category label; determine a total prediction loss at least according to the text prediction loss; and update the text classification model in the direction that reduces the total prediction loss.
  • the training unit is further configured to: input the K label vectors corresponding to the K categories into the classifier to obtain K corresponding prediction results; compare each of the K categories with its corresponding prediction result and obtain a label prediction loss based on the comparison; and determine the total loss according to the text prediction loss and the label prediction loss.
  • the text classification model is used to achieve accurate classification of the input text.
  • a computer-readable storage medium having a computer program stored thereon; when the computer program is executed in a computer, the computer is caused to execute the method described in conjunction with FIG. 2.
  • a computing device including a memory and a processor, the memory storing executable code; when the processor executes the executable code, it implements the method described in conjunction with FIG. 2.

Abstract

A method and apparatus for carrying out classification prediction by using a text classification model. The text classification model comprises an embedding layer, a convolutional layer, an attention layer and a classifier. The method comprises: carrying out word embedding in advance on the label description texts corresponding to K categories to obtain K label vectors; during prediction, carrying out word embedding on the input text by using the embedding layer to obtain a word vector sequence; at the convolutional layer, carrying out convolution processing on the word vector sequence by using convolution windows of different widths to obtain fragment vector sequences; then, at the attention layer, carrying out first attention processing on each vector sequence, the first attention processing comprising determining a weighting factor for each vector element of the vector sequence according to the similarity between that element and the K label vectors, and then carrying out weighted summation to obtain a first sequence vector; obtaining a characterization vector of the input text on the basis of the first sequence vectors of the sequences; and the classifier obtaining a category prediction result of the input text on the basis of the characterization vector.

Description

Summary of the Invention
One or more embodiments of this specification describe a method and device for text classification prediction using a text classification model, in which the model jointly considers the semantic information of text fragments of different lengths and the relevance to the label description texts when making its prediction, thereby improving the accuracy and efficiency of classification prediction.
According to a first aspect, a method for classification prediction using a text classification model is provided, for predicting, among K predetermined categories, the category corresponding to an input text. The text classification model includes an embedding layer, a convolutional layer, an attention layer, and a classifier, the attention layer including a first attention module. The method includes: obtaining K label vectors corresponding to the K categories, each label vector being obtained by word embedding of the label description text of the corresponding category; performing word embedding on the input text using the embedding layer to obtain a word vector sequence; inputting the word vector sequence into the convolutional layer, which performs convolution processing on it using several convolution windows corresponding to text fragments of several different lengths, obtaining several fragment vector sequences, where the word vector sequence and the fragment vector sequences form a vector sequence set; inputting each vector sequence of the vector sequence set into the first attention module for first attention processing, obtaining a first sequence vector for each vector sequence, the first attention processing including determining, according to the similarity between each vector element of the input vector sequence and the K label vectors, a first weighting factor for each vector element, and using the first weighting factors to compute a weighted sum of the vector elements; obtaining a first attention representation of the input text according to the respective first sequence vectors; determining a characterization vector of the input text at least according to the first attention representation; and inputting the characterization vector into the classifier to obtain a category prediction result of the input text among the K categories.
In one embodiment, the input text is a user question; correspondingly, the label description text of each of the K categories includes a standard question description text.
In one implementation, the K label vectors are predetermined in the following manner: for each of the K categories, obtain the label description text corresponding to that category; perform word embedding on the label description text to obtain the word vectors of the description words it contains; and combine the word vectors of the description words to obtain the label vector corresponding to that category.
According to one embodiment, in the first attention processing, the first weighting factor corresponding to each vector element is determined in the following manner: for each vector element of the input vector sequence, calculate the K similarities between that element and the K label vectors; based on the maximum of the K similarities, determine the first weighting factor corresponding to that element.
More specifically, in different embodiments, calculating the K similarities between the vector element and the K label vectors may include: calculating the cosine similarity between the vector element and each label vector; or determining the similarity based on the Euclidean distance between the vector element and each label vector; or determining the similarity based on the dot product of the vector element and each label vector.
Furthermore, in one embodiment, determining the first weighting factor based on the maximum of the K similarities specifically includes: determining the co-attention score of the vector element based on that maximum, and normalizing the co-attention score of the vector element over the co-attention scores of all vector elements to obtain the first weighting factor corresponding to that element.
In one embodiment, obtaining the first attention representation of the input text according to the respective first sequence vectors specifically includes combining the respective first sequence vectors to obtain the first attention representation, the combination being one of: summation, weighted summation, or averaging.
According to one implementation, the attention layer may further include a second attention module; correspondingly, the method further includes: inputting each vector sequence of the vector sequence set into the second attention module for second attention processing, obtaining a second sequence vector for each vector sequence, the second attention processing including, for each vector element of the input vector sequence, determining a second weighting factor corresponding to that element according to the similarity between it and each other vector element of the input vector sequence, and using the second weighting factors to compute a weighted sum of the vector elements of the input sequence; and obtaining a second attention representation of the input text according to the respective second sequence vectors.
When the attention layer includes the first attention module and the second attention module, the characterization vector may be determined according to the first attention representation and the second attention representation.
Further, in the second attention processing, the second weighting factor corresponding to a vector element may be determined by calculating the respective similarities between that element and the other vector elements, and determining the second weighting factor based on the average of those similarities.
According to yet another implementation, the attention layer further includes a third attention module, in which an attention vector is maintained; the method further includes: forming a total sequence based at least on the concatenation of the vector sequences in the vector sequence set; and using the third attention module to perform third attention processing on the total sequence, the third attention processing including, for each vector element of the total sequence, determining a third weighting factor corresponding to that element according to the similarity between it and the attention vector, and using the third weighting factors to compute a weighted sum of the vector elements of the total sequence, obtaining a third attention representation of the input text.
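For concreteness, the following is a minimal numpy sketch of such third attention processing; the function and variable names are illustrative, and the use of cosine similarity against the maintained attention vector plus softmax normalization is an assumption, since the text only requires a similarity-based third weighting factor.

```python
import numpy as np

def self_attention(X_total, u, eps=1e-12):
    """Third attention processing over the total (concatenated) sequence.

    X_total: (M, h) concatenation of the vector sequences.
    u:       (h,) attention vector maintained in the third attention module.
    Returns the third attention representation of shape (h,).
    """
    Xn = X_total / (np.linalg.norm(X_total, axis=1, keepdims=True) + eps)
    a = Xn @ (u / (np.linalg.norm(u) + eps))   # similarity of each element to u (assumed cosine)
    a_hat = np.exp(a) / np.exp(a).sum()        # normalized third weighting factors
    return a_hat @ X_total                     # weighted sum over the total sequence
```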
When the attention layer includes the first attention module and the third attention module, the characterization vector may be determined according to the first attention representation and the third attention representation.
When the attention layer includes the first, second, and third attention modules, the characterization vector may be determined according to the first attention representation, the second attention representation, and the third attention representation.
Further, in one example, the first, second, and third attention representations may be weighted and summed based on predetermined weight coefficients to obtain the characterization vector.
In one embodiment, the attention layer further includes a fusion module; before the total sequence input to the third attention module is formed, the method further includes: inputting each vector sequence of the vector sequence set into the fusion module for fusion conversion processing, obtaining a fusion sequence for each vector sequence, the fusion conversion processing including, for each vector element of the input vector sequence, determining a label weighting factor for each of the K label vectors according to the similarity between the vector element and that label vector, and, based on the label weighting factors, converting the vector element into a fusion vector that is the weighted sum of the K label vectors, thereby converting the input vector sequence into a corresponding fusion sequence.
Correspondingly, in one embodiment, the respective vector sequences and the respective fusion sequences may be concatenated to obtain the total sequence, which is input into the third attention module.
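A minimal sketch of the fusion conversion processing follows, under the assumption that the label weighting factors come from dot-product similarities normalized by softmax (the text leaves the exact similarity and normalization open); `fuse_with_labels` is an illustrative name.

```python
import numpy as np

def fuse_with_labels(X, L):
    """Fusion conversion: replace each vector element by a weighted sum
    of the K label vectors, weighted by its similarity to each label.

    X: (N, h) input vector sequence; L: (K, h) label vectors.
    Returns the (N, h) fusion sequence.
    """
    A = X @ L.T                                           # (N, K) element-to-label similarities (assumed dot product)
    W = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # label weighting factors per element
    return W @ L                                          # each row is a fusion vector

# The total sequence for the third attention module can then be formed by
# concatenating each vector sequence with its fusion sequence, e.g.:
# X_total = np.vstack([X, fuse_with_labels(X, L)])
```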
According to one implementation, the input text is a training text that has a category label indicating its true category; the method further includes: obtaining a text prediction loss according to the category prediction result and the category label; determining a total prediction loss at least according to the text prediction loss; and updating the text classification model in the direction that reduces the total prediction loss, thereby training the model.
Further, in one embodiment under this implementation, the method also includes: inputting the K label vectors corresponding to the K categories into the classifier to obtain K corresponding prediction results; and comparing each of the K categories with its corresponding prediction result, obtaining a label prediction loss based on the comparison. In this case, the total loss can be determined according to the text prediction loss and the label prediction loss for model training.
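As an illustration of this training objective, the sketch below assumes cross-entropy for both the text prediction loss and the label prediction loss, a simple weighted sum for combining them, and a hypothetical `classifier` callable returning K-way probabilities; none of these specifics are fixed by the text.

```python
import numpy as np

def total_loss(classifier, S, y, L, alpha=1.0):
    """S: characterization vector of one training text; y: its true category id.
    L: (K, h) label vectors; alpha: assumed weight of the label prediction loss."""
    text_loss = -np.log(classifier(S)[y] + 1e-12)          # text prediction loss
    # each label vector, fed to the classifier, should predict its own category
    label_loss = -sum(np.log(classifier(L[k])[k] + 1e-12)
                      for k in range(len(L)))              # label prediction loss
    return text_loss + alpha * label_loss                  # total loss to minimize
```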
According to a second aspect, a device for classification prediction using a text classification model is provided, for predicting, among K predetermined categories, the category corresponding to an input text. The text classification model includes an embedding layer, a convolutional layer, an attention layer, and a classifier, the attention layer including a first attention module. The device includes: a label vector acquisition unit, configured to obtain K label vectors corresponding to the K categories, each label vector being obtained by word embedding of the label description text of the corresponding category; a word sequence acquisition unit, configured to perform word embedding on the input text using the embedding layer to obtain a word vector sequence; a fragment sequence acquisition unit, configured to input the word vector sequence into the convolutional layer, which performs convolution processing on it using several convolution windows corresponding to text fragments of several different lengths, obtaining several fragment vector sequences, where the word vector sequence and the fragment vector sequences form a vector sequence set; a first attention unit, configured to input each vector sequence of the vector sequence set into the first attention module for first attention processing, obtaining a first sequence vector for each vector sequence, the first attention processing including determining, according to the similarity between each vector element of the input vector sequence and the K label vectors, a first weighting factor for each vector element, and using the first weighting factors to compute a weighted sum of the vector elements; a first representation acquisition unit, configured to obtain a first attention representation of the input text according to the respective first sequence vectors; a characterization vector determining unit, configured to determine a characterization vector of the input text at least according to the first attention representation; and a prediction result acquisition unit, configured to input the characterization vector into the classifier to obtain a category prediction result of the input text among the K categories.
According to a third aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to execute the method of the first aspect.
According to a fourth aspect, a computing device is provided, including a memory and a processor; the memory stores executable code, and when the processor executes the executable code, the method of the first aspect is implemented.
According to the method and device provided in the embodiments of this specification, the convolutional layer and the attention layer of the text classification model are used to obtain a characterization vector that jointly reflects text fragments of different lengths and their similarity to the label vectors. Text classification based on this characterization vector therefore takes fuller account of contextual semantic information over different lengths and of the relevance to the label description texts, yielding more accurate category prediction results.
Description of the Drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of a text classification model according to an embodiment disclosed in this specification;
Fig. 2 shows a flowchart of a method for text classification using a text classification model according to an embodiment;
Fig. 3 shows a schematic diagram of performing convolution processing on a word vector sequence in an embodiment;
Fig. 4 shows a schematic diagram of performing first attention processing on an input vector sequence in an embodiment;
Fig. 5 shows a schematic diagram of performing second attention processing on an input vector sequence in an embodiment;
Fig. 6 shows a schematic diagram of performing fusion conversion processing on an input vector sequence in an embodiment;
Fig. 7 shows a schematic diagram of the attention processing of the attention layer in an embodiment;
Fig. 8 shows the method steps further included in the model training stage;
Fig. 9 shows a schematic block diagram of a text classification prediction device according to an embodiment.
Detailed Description
The solutions provided in this specification are described below with reference to the accompanying drawings.
As mentioned above, in many application scenarios such as intelligent customer service robots, the input text needs to be classified accurately. Neural network models of various structures and algorithms have been proposed for text classification tasks; however, some existing models are overly complex, while others are too generic and insufficiently accurate, so shortcomings remain.
Considering the characteristics of text classification tasks, the embodiments of this specification propose a new text classification model, which further improves the classification prediction of text by jointly considering the information of text fragments and the information of the label description texts.
Fig. 1 is a schematic diagram of a text classification model according to an embodiment disclosed in this specification. As shown in Fig. 1, the text classification model includes an embedding layer 11, a convolutional layer 12, an attention layer 13, and a classifier 14.
The embedding layer 11 applies a specific word embedding algorithm to convert each input word into a word vector. Using this embedding layer 11, the label description texts corresponding to the K target categories can be converted into K label vectors in advance. When classifying an input text, the embedding layer 11 performs word embedding on the input text and converts it into a word vector sequence.
The convolutional layer 12 performs convolution processing on the word vector sequence. In the embodiments of this specification, in order to account for the influence of text spans of different lengths on the semantic understanding of the input text, the convolutional layer 12 applies multiple convolution kernels, or convolution windows, of different widths, obtaining multiple fragment vector sequences that represent the input text at the level of text fragments of different lengths.
The attention layer 13 applies an attention mechanism, combined with the label vectors, to process the above vector sequences. In particular, the attention layer 13 may include a first attention module 131 for performing first attention processing on an input vector sequence. The first attention processing includes combining the vector elements of the input vector sequence according to the similarity between each vector element and the aforementioned K label vectors, obtaining a sequence vector corresponding to the input vector sequence. The first attention processing may therefore also be called label attention processing, and the first attention module a co-attention module (with respect to the labels).
Optionally, the attention layer 13 may further include a second attention module 132 and/or a third attention module 133. The second attention module 132, which may be called an intra-attention module, combines the vector elements according to the similarity between each vector element of the input vector sequence and the other vector elements. The third attention module 133, which may be called a self-attention module, combines the vector elements according to the similarity between each vector element of the input vector sequence and an attention vector.
By combining the sequence vectors obtained by the attention modules, the characterization vector of the input text is obtained and input into the classifier 14. The classifier 14 determines the category of the input text based on this characterization vector, realizing classification prediction of the text.
It can thus be seen that the text classification model shown in Fig. 1 has at least the following characteristics. First, the model represents the input text at the level of text fragments of different lengths, obtaining multiple fragment-level vector sequences and thereby better exploiting the semantic information of contexts of different lengths. In addition, for the categories to be classified, unlike conventional techniques that represent a category only by a meaningless label (e.g. a number), the text classification model of this embodiment also performs word embedding on the label description text of each category, obtaining label vector representations that carry semantic information. Moreover, through the co-attention module, the representation of each sequence is computed from the similarity between the elements of the word vector sequence and fragment vector sequences and the label vectors. The final characterization vector of the input text therefore contains the similarity information between the vector sequences at different levels (the word level, and the levels of text fragments of different lengths) and the label vectors, making better use of the contextual information of the input text and of its semantic similarity to the label description texts, thereby improving classification accuracy.
The process of text classification using the above text classification model is described in detail below.
Fig. 2 shows a flowchart of a method for text classification using a text classification model according to an embodiment. It should be understood that the method can be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. As shown in Fig. 2, the text classification process includes at least the following steps.
In step 21, K label vectors corresponding to the K target categories are obtained, where each label vector is obtained by word embedding of the label description text of the corresponding category.
It should be understood that, for a text classification task, the K target categories are predetermined. In conventional techniques, labels are generally used to represent these K categories, for example the numbers 1 to K, category id numbers, or one-hot encodings of the K categories. Generally, the label itself carries no semantic information; it is merely a code representing the category. However, each category often has corresponding description information describing the characteristics of its content, which can serve as the description information for the label, i.e. the label description text. The label description text often contains semantic information related to the corresponding category.
For example, in the automatic question answering scenario of an intelligent customer service robot, the K target categories correspond to K predetermined standard questions. Correspondingly, the label description text of each category is the standard question description text of that category. For instance, the label description text of category 1 is standard question 1 under that category, 'How to repay Huabei', and the label description text of category 2 is standard question 2 under that category, 'How much money can I borrow with Jiebei'.
For another example, in the scenario of automatic dispatch to manual customer service, the classification targets are the K categories corresponding to K predetermined manual customer service skill groups. Correspondingly, the label description text of each category may be a description of the corresponding skill group, including, for example, the knowledge domain of the skill group. In other scenarios, the label description text corresponding to each category can be obtained in a corresponding manner.
By performing word embedding on the label description texts, the label vector corresponding to each category can be obtained. The process of converting the label description text of each category into a label vector may include the following steps.
First, for each category Cj among the K categories, the label description text corresponding to category Cj is obtained, e.g. 'How to repay Huabei'. Then, a specific word embedding algorithm is applied to embed each description word contained in the label description text, obtaining the word vector of each description word. The specific word embedding algorithm may be one from an existing word embedding tool, such as word2vec, or a word embedding algorithm pre-trained for the particular text scene. Assuming the algorithm converts each word into an h-dimensional vector and the label description text contains m words, this step yields m h-dimensional vectors corresponding to the label description text.
Next, the word vectors of the description words are combined to obtain the label vector $l_j$ corresponding to category Cj. Specifically, the m h-dimensional vectors obtained in the previous step are combined, and the resulting h-dimensional vector is used as the label vector $l_j$. More specifically, the combination may be averaging, summation, weighted summation, and so on. When the label description texts contain different numbers of words, the label vector is preferably obtained by averaging.
The above word embedding of the label description texts can be performed by the embedding layer 11 of Fig. 1. In one embodiment, the embedding layer 11 converts the label description texts of the K categories into label vectors in advance and stores the resulting K label vectors in memory for use at prediction time; correspondingly, in step 21, the K pre-stored label vectors are read. In another example, the label description texts of the K categories may instead be input into the embedding layer at prediction time for word embedding, obtaining the label vector of each category.
Thus, in the above manner, the K label vectors corresponding to the K categories are obtained.
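As an illustration, the following minimal sketch (in Python with numpy, which the specification does not prescribe) builds a label vector by averaging word vectors; the `embed` lookup stands in for whatever word embedding algorithm the embedding layer 11 uses.

```python
import numpy as np

def label_vector(description_words, embed):
    """Average the word vectors of one label description text.

    description_words: list of the m description words of a category.
    embed: callable mapping a word to its h-dimensional word vector.
    """
    word_vectors = np.stack([embed(w) for w in description_words])  # (m, h)
    return word_vectors.mean(axis=0)                                # (h,) label vector l_j

# K label vectors, one per category description, stacked as rows:
# L = np.stack([label_vector(words, embed) for words in label_texts])  # (K, h)
```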
In addition, in step 22, the embedding layer 11 is used to perform word embedding on the input text, obtaining a word vector sequence. As mentioned above, the embedding layer 11 applies the aforementioned specific word embedding algorithm to each word in the input text, obtaining the word vector sequence corresponding to the input text. Assuming the input text contains N sequentially arranged words $\{w_1, w_2, \ldots, w_N\}$, the word vector sequence $X_W$ is obtained:
$$X_W = \{x_1^w, x_2^w, \ldots, x_N^w\}$$
where $x_i^w$ denotes the word vector corresponding to the i-th word $w_i$.
It should be understood that steps 21 and 22 can be executed in parallel or in either order, which is not limited here.
Next, in step 23, the above word vector sequence is input into the convolutional layer 12, and convolution processing is performed on it using several convolution kernels, or convolution windows, of different widths. The reason is that, in text classification, context is critical to understanding text semantics; yet for different words in different texts, the helpful contextual semantic information may be hidden in context at different distances from the current word. The inventor therefore proposes representing the input text at the level of text spans of different lengths. Accordingly, in the embodiments of this specification, the convolutional layer 12 applies several convolution windows of different widths, corresponding to text fragments of several different lengths, to the word vector sequence, obtaining several fragment vector sequences.
Specifically, the width W of a convolution window can be expressed as W = 2r + 1, where r is the coverage radius. Performing convolution on the word vector sequence $X_W$ with a window of width W = 2r + 1 may proceed as follows: the position of each word vector $x_i^w$ in the sequence is taken in turn as the current position, and a convolution operation is performed on the word vectors within radius r of the current position, yielding the fragment vector $x_i^s$ of the text fragment corresponding to the current position. The fragment vectors of the successive positions, arranged in order, form a fragment vector sequence.
Fig. 3 shows a schematic diagram of performing convolution processing on a word vector sequence in one embodiment. In the example of Fig. 3, a convolution window of width 5 (radius 2) is used. As shown in Fig. 3, when the word vector $x_i^w$ is the current word, the convolution window covers the five consecutive word vectors centered on it, namely $\{x_{i-2}^w, x_{i-1}^w, x_i^w, x_{i+1}^w, x_{i+2}^w\}$. A convolution operation on these five word vectors yields the fragment vector $x_i^s$ corresponding to position i, where the convolution operation may be a combination of the word vectors defined by an activation function. When the window slides on, taking the word vector $x_{i+1}^w$ as the current word, the convolution operation is applied to the five word vectors centered on $x_{i+1}^w$, yielding the fragment vector $x_{i+1}^s$ corresponding to position i+1. By performing the convolution centered in turn on each of the N word vectors, the fragment vectors of the N positions are obtained, forming the fragment vector sequence $X_S = \{x_1^s, x_2^s, \ldots, x_N^s\}$ corresponding to this convolution window.
The above describes the process of performing convolution on the word vector sequence with a window of one particular width. As mentioned above, in step 23 the convolutional layer uses several convolution windows of different widths. For example, in one concrete example, four convolution windows of widths 3, 5, 9, and 15 are applied to the word vector sequence $X_W$, yielding four fragment vector sequences $X_{S1}, X_{S2}, X_{S3}, X_{S4}$, which represent the input text at the level of text fragments of 3, 5, 9, and 15 words respectively.
In different embodiments, the number of convolution windows used, and the width of each window, can be decided according to factors such as the length of the input text and the lengths of the text fragments to be considered, thereby obtaining several fragment vector sequences.
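The following sketch illustrates one way such a windowed convolution could be realized; the zero-padding at the sequence boundaries, the tanh activation, and the kernel shape are assumptions, since the text only requires an activation-defined combination of the word vectors in each window.

```python
import numpy as np

def fragment_sequence(X_w, r, W_conv, b):
    """Produce a fragment vector sequence with a window of width 2r+1.

    X_w:    (N, h) word vector sequence.
    W_conv: (h, (2r+1)*h) learnable kernel for this window width.
    b:      (h,) bias.
    """
    N, h = X_w.shape
    padded = np.vstack([np.zeros((r, h)), X_w, np.zeros((r, h))])  # assumed zero-padding
    out = np.empty((N, h))
    for i in range(N):
        window = padded[i:i + 2 * r + 1].reshape(-1)  # word vectors within radius r
        out[i] = np.tanh(W_conv @ window + b)         # activation-defined combination
    return out

# e.g. four fragment-level views with window widths 3, 5, 9, 15 (radii 1, 2, 4, 7):
# X_s_list = [fragment_sequence(X_w, r, kernels[r], biases[r]) for r in (1, 2, 4, 7)]
```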
The above word vector sequence $X_W$ and the several fragment vector sequences $X_S$ form a vector sequence set. Every vector sequence in the set contains N h-dimensional vector elements and can simply be denoted uniformly as a vector sequence X.
Then, in step 24, each vector sequence X in the above vector sequence set is input into the first attention module of the attention layer for first attention processing, obtaining a first sequence vector for each vector sequence X. As mentioned above, the first attention module is also called the co-attention module (with respect to the labels); correspondingly, the first attention processing may also be called label attention processing, in which the sequence vector is obtained according to the similarity between the input vector sequence and the label vectors. Specifically, the first attention processing may include: for each vector element $x_i$ in the input vector sequence X, determining the first weighting factor corresponding to $x_i$ according to the similarity between $x_i$ and the K label vectors obtained in step 21, and using the first weighting factors to compute a weighted sum of the vector elements of the input sequence, obtaining the first sequence vector V1(X) corresponding to the input vector sequence X.
In a specific embodiment, the first weighting factor corresponding to a vector element $x_i$ can be determined in the following manner.
First, the similarity $a_{ij}$ between the vector element $x_i$ and each label vector $l_j$ is calculated, for j from 1 to K, yielding K similarities.
In one example, the similarity $a_{ij}$ between the vector element $x_i$ and the label vector $l_j$ can be computed as the cosine similarity, as shown in formula (1):
$$a_{ij} = \frac{x_i^T l_j}{\|x_i\| \, \|l_j\|} \tag{1}$$
where $x_i^T$ denotes the transpose of $x_i$, $\|x_i\|$ denotes the norm (i.e. vector length) of $x_i$, and $\|l_j\|$ denotes the norm of $l_j$.
In another example, the similarity $a_{ij}$ between the vector element $x_i$ and the label vector $l_j$ can be determined based on the Euclidean distance between the two: the larger the distance, the smaller the similarity. In yet another example, the similarity $a_{ij}$ can be determined directly as the dot product (inner product) $x_i^T l_j$ of the vector element $x_i$ and the label vector $l_j$. In further examples, the similarity can also be determined in other ways.
Then, from the K similarities determined between the vector element $x_i$ and the K label vectors, the maximum value can be identified, and the first weighting factor $\hat{a}_i$ corresponding to the vector element $x_i$ determined based on that maximum.
It should be understood here that, as the targets of classification, the contents of the K categories differ considerably from one another; correspondingly, the K label vectors are usually far apart from one another in the vector space. As long as a vector element $x_i$ is highly similar to any one label vector $l_j$, the word or text fragment corresponding to that element is likely to be strongly related to the corresponding category j; the vector element $x_i$ should therefore receive more attention, i.e. be assigned a higher weight. Hence, in the above step, the first weighting factor of a vector element is determined according to the maximum of its similarities.
In one embodiment, the maximum of the K similarities is used directly as the first weighting factor $\hat{a}_i$ corresponding to the vector element $x_i$.
In another embodiment, the maximum of the K similarities corresponding to the vector element $x_i$ is determined as the co-attention score $a_i$ of that element, and, similarly, the co-attention scores of all vector elements of the input sequence are obtained. Then, according to the co-attention scores of the respective vector elements, the co-attention score $a_i$ of the vector element $x_i$ is normalized, obtaining the first weighting factor $\hat{a}_i$ corresponding to that element.
In a specific example, the above normalization is implemented by the softmax function, as shown in formula (2):
$$\hat{a}_i = \frac{\exp(a_i)}{\sum_{j=1}^{N} \exp(a_j)} \tag{2}$$
Having determined the first weighting factor corresponding to each vector element of the input vector sequence X, the first attention module can compute the weighted sum of the vector elements based on the first weighting factors, obtaining the first sequence vector V1(X) of the input vector sequence X:
$$V1(X) = \sum_{i=1}^{N} \hat{a}_i x_i$$
Fig. 4 shows a schematic diagram of performing first attention processing on an input vector sequence in one embodiment. As shown in Fig. 4, with the N vector elements of the input vector sequence as rows and the K label vectors as columns, the similarity between each vector element $x_i$ and each label vector $l_j$ is calculated, forming an N*K similarity matrix called the label attention matrix. A max-pooling operation is applied to the label attention matrix, that is, for each vector element the maximum of its K similarities is selected, yielding the co-attention score of each vector element; the weighting factors are then obtained from the co-attention scores, and the weighted sum of the vector elements based on these factors gives the first sequence vector representation V1 of the input vector sequence.
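Putting formula (1), the max-pooling of Fig. 4, and the softmax of formula (2) together, a minimal numpy sketch of the first attention processing for one vector sequence might look as follows (function and variable names are illustrative):

```python
import numpy as np

def label_co_attention(X, L, eps=1e-12):
    """First attention processing for one vector sequence.

    X: (N, h) input vector sequence (word-level or fragment-level).
    L: (K, h) label vectors.
    Returns the first sequence vector V1(X) of shape (h,).
    """
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    Ln = L / (np.linalg.norm(L, axis=1, keepdims=True) + eps)
    A = Xn @ Ln.T                        # (N, K) label attention matrix, formula (1)
    a = A.max(axis=1)                    # max-pooling over labels: co-attention scores
    a_hat = np.exp(a) / np.exp(a).sum()  # softmax normalization, formula (2)
    return a_hat @ X                     # weighted sum of the vector elements
```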
By applying the above first attention processing to each vector sequence X in the aforementioned vector sequence set, the corresponding first sequence vectors are obtained respectively: the word vector sequence $X_W$ yields the corresponding first sequence vector $V1(X_W)$, and the several fragment vector sequences $X_S$ yield the corresponding first sequence vectors $V1(X_S)$.
Then, in step 25, the first attention representation $S_{label}$ of the input text is obtained according to the first sequence vectors of the above vector sequences. Specifically, the first sequence vectors, including $V1(X_W)$ and the several $V1(X_S)$, can be combined, where the combination may be summation, weighted summation, averaging, and so on, yielding the first attention representation $S_{label}$.
Then, in step 26, the characterization vector S of the input text is determined at least according to the above first attention representation $S_{label}$. In one example, the first attention representation is used directly as the characterization vector S.
Next, in step 27, the characterization vector S is input into the classifier 14, and the category prediction result of the input text among the K categories is obtained through the classifier's computation.
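Continuing the sketch above, steps 25 through 27 can be illustrated as follows, assuming averaging as the combination in step 25 and a linear softmax classifier, neither of which the text fixes:

```python
import numpy as np

def predict(sequences, L, W_cls, b_cls):
    """sequences: list of (N, h) vector sequences (word- and fragment-level).
    L: (K, h) label vectors; W_cls: (K, h) and b_cls: (K,) classifier parameters.
    Reuses label_co_attention from the sketch above."""
    V1s = [label_co_attention(X, L) for X in sequences]  # step 24, per sequence
    S = np.mean(V1s, axis=0)          # step 25: first attention representation S_label
    logits = W_cls @ S + b_cls        # steps 26/27: S as characterization vector, classified
    return np.exp(logits) / np.exp(logits).sum()  # K-way category prediction
```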
As the above process shows, by means of the convolutional layer and the first attention module, the characterization vector combines the semantic information of text fragments of different lengths with the similarity information relative to the label vectors. Text classification based on the characterization vector therefore takes fuller account of contextual semantic information over different lengths and of the relevance to the label description texts, yielding more accurate category prediction results.
According to one implementation, as shown by the dashed boxes in Fig. 1, the attention layer 13 may further include a second attention module 132 and/or a third attention module 133. The processing performed by the second and third attention modules is described below.
As mentioned above, the second attention module 132 is also called the intra-attention module, which combines the vector elements according to the similarity between each vector element of the input vector sequence and the other vector elements.
Specifically, when a vector sequence X is input into the second attention module 132, the module 132 performs second attention processing, also called intra-attention processing, on the input vector sequence X. The intra-attention processing specifically includes: for each vector element $x_i$ in the input vector sequence X, determining the second weighting factor corresponding to $x_i$ according to the similarity between that element and each other vector element $x_j$ of the input vector sequence X, and using the second weighting factors to compute a weighted sum of the vector elements of the input sequence, obtaining the second sequence vector V2(X) corresponding to the input vector sequence X.
In a specific embodiment, the second weighting factor corresponding to a vector element $x_i$ can be determined in the following manner.
First, the respective similarities $a_{ij}$ between the vector element $x_i$ and each other vector element $x_j$ are calculated. The similarity can be computed as the cosine similarity, or determined in other ways such as from the vector distance or the dot product, which is not repeated here.
Then, based on the average of the above similarities, the second weighting factor $\hat{a}_i$ corresponding to the vector element $x_i$ is determined.
It should be understood here that the second weighting factor is intended to measure how relevant a vector element is to the overall semantics of the entire vector sequence. If a vector element $x_i$ has relatively high similarity to the other vector elements of the sequence, the word or text fragment corresponding to that element is strongly related to the core semantics of the entire sequence; the vector element $x_i$ should therefore receive more attention, i.e. be assigned a higher weight. Moreover, in practical computation, for convenience, the N similarities between each vector element $x_i$ and the N vector elements of the sequence are computed, including, when j = i, the similarity of the element to itself, and this self-similarity is a constant corresponding to the maximum possible similarity. Therefore, when determining the second weighting factor, it is preferable to use the average of the similarities rather than the maximum.
In one embodiment, the mean of the above similarities is directly used as the second weight factor corresponding to the vector element x_i:

β_i = (1/N) Σ_{j=1..N} a_ij

where N is the number of vector elements in the input sequence.
In another embodiment, the mean similarity of the vector element x_i is taken as its intra-attention score a_i, and the intra-attention scores of the vector elements are then normalized, for example with a softmax function, to obtain the second weight factor corresponding to x_i:

β_i = exp(a_i) / Σ_{k=1..N} exp(a_k)
After the second weight factor of each vector element in the input vector sequence X has been determined, the second attention module can compute a weighted sum of the vector elements based on the second weight factors, obtaining the second sequence vector V2(X) of the input vector sequence X, namely:

V2(X) = Σ_{i=1..N} β_i x_i
FIG. 5 is a schematic diagram of second attention processing of an input vector sequence in one embodiment. As shown in FIG. 5, the N vector elements of the input sequence are arranged as rows and columns, and the pairwise similarities between vector elements x_i and x_j are computed, forming an N×N similarity matrix, called the intra-attention matrix. An average pooling operation is applied to the intra-attention matrix, that is, for each vector element the average of its column of similarity values is computed, yielding the intra-attention score of each vector element. Weight factors are then derived from the intra-attention scores, and a weighted sum of the vector elements is computed based on the weight factors, giving the second sequence vector representation V2 of the input vector sequence.
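For illustration, the following is a minimal sketch of this intra-attention computation, assuming the cosine-similarity and softmax-normalization variants described above; all function and variable names are illustrative, not part of the patent.

```python
import numpy as np

def intra_attention(X):
    """Second attention (intra-attention) sketch.

    X: (N, h) array of N vector elements of dimension h.
    Returns the second sequence vector V2(X) of shape (h,).
    """
    # Pairwise cosine similarities -> N x N intra-attention matrix
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    A = Xn @ Xn.T                                 # A[i, j] = a_ij
    # Average pooling: intra-attention score of each element
    scores = A.mean(axis=1)                       # a_i
    # Softmax normalization -> second weight factors beta_i
    w = np.exp(scores) / np.exp(scores).sum()
    # Weighted sum of the vector elements -> V2(X)
    return w @ X
```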
Each vector sequence X in the aforementioned vector sequence set can be input to the second attention module 132 for the above intra-attention processing, so as to obtain its corresponding second sequence vector V2(X), including V2(X_W) corresponding to the word vector sequence X_W and several second sequence vectors V2(X_S) corresponding to the several fragment vector sequences X_S.
Then, the second sequence vectors V2(X) corresponding to the above vector sequences can be combined to obtain the second attention representation S_intra of the input text.
Thus, in the case where the attention layer includes the first attention module 131 and the second attention module 132, step 26 of determining the characterization vector S in FIG. 2 may include determining the characterization vector S based on the first attention representation S_label and the second attention representation S_intra. Specifically, S_label and S_intra can be combined in various ways, such as summation, weighted summation, or averaging, to obtain the characterization vector S.
According to one embodiment, the attention layer 13 may further include a third attention module 133. The third attention module 133 may be referred to as a self-attention module, and is used to perform self-attention processing, that is, to synthesize the vector elements of an input sequence according to the similarity between each vector element and an attention vector.
Specifically, the self-attention module 133 maintains an attention vector v, which has the same dimension h as the vectors obtained by word embedding. The parameters of the attention vector v can be determined through training.
In addition, unlike the first and second attention modules, which process each vector sequence in the aforementioned set separately, the third attention module 133 processes a total sequence X' formed from the vector sequences in the set. In one embodiment, the total sequence X' can be formed by concatenating the vector sequences in the set one after another, that is, X' = X_W X_S1 X_S2 ….
The third attention module 133 then performs third attention processing, that is, self-attention processing, on the total sequence X'. This processing includes: for each vector element x_i in the total sequence X', determining a third weight factor corresponding to x_i according to the similarity between x_i and the attention vector v, and using the third weight factors to compute a weighted sum of the vector elements of the total sequence, obtaining a third attention representation of the input text.
In a specific embodiment, the third weight factor corresponding to a vector element x_i can be determined as follows.
First, the similarity a_i between the vector element x_i and the attention vector v is computed as its self-attention score. As before, the similarity can be a cosine similarity, or can be determined based on a vector distance, a dot product, or other measures, which are not repeated here.
Then, based on this self-attention score, the third weight factor β_i corresponding to the vector element x_i is determined.

In one embodiment, the self-attention score is directly used as the third weight factor corresponding to the vector element x_i:

β_i = a_i

In another embodiment, the third weight factor corresponding to x_i is obtained by normalizing the self-attention scores of all vector elements, for example:

β_i = exp(a_i) / Σ_{j=1..M} exp(a_j)
In a specific example, the similarity between the vector element x_i and the attention vector v is computed as a dot product, and the normalization uses a softmax function, giving the following third weight factor:

β_i = exp(v^T x_i) / Σ_{j=1..M} exp(v^T x_j)

where v^T is the transpose of the attention vector v, and M is the number of vector elements contained in the total sequence X'.
After the third weight factor of each vector element in the total sequence X' has been determined, the third attention module can compute a weighted sum of the vector elements based on the third weight factors. Since the total sequence already contains the information of all the vector sequences, the result of processing it can directly serve as the third attention representation S_self of the input text, namely:

S_self = Σ_{i=1..M} β_i x_i
As described above, the third attention module 133 performs self-attention processing on the total sequence X' formed by concatenating the vector sequences, obtaining the third attention representation.
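A corresponding sketch of this self-attention processing, assuming the dot-product similarity and softmax normalization of the example above (names are illustrative):

```python
import numpy as np

def self_attention(X_total, v):
    """Third attention (self-attention) sketch.

    X_total: (M, h) array, the total sequence X' of M vector elements.
    v: (h,) attention vector, learned during training.
    Returns the third attention representation S_self of shape (h,).
    """
    scores = X_total @ v                          # a_i = v^T x_i
    w = np.exp(scores) / np.exp(scores).sum()     # third weight factors
    return w @ X_total                            # S_self
```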
Further, in one embodiment, each vector sequence can also undergo fusion conversion to obtain a corresponding fusion sequence, and the fusion sequences can be concatenated with the vector sequences to form a more comprehensive total sequence X'.
In this embodiment, the attention layer 13 further includes a fusion module for performing fusion conversion processing on an input vector sequence X, converting it into a corresponding fusion sequence Q. The fusion conversion processing may include: for each vector element x_i in the input vector sequence X, determining, according to the similarity between x_i and each label vector l_j among the aforementioned K label vectors, a label weight factor corresponding to each label vector l_j, and converting x_i, based on the label weight factors, into a fusion vector q_i that is a weighted sum of the K label vectors, thereby converting the input vector sequence X into a corresponding fusion sequence Q.
In a specific embodiment, converting a vector element x_i into its corresponding fusion vector q_i can be performed as follows.
First, the similarities a_ij between the vector element x_i and the label vectors l_j are computed, where j ranges from 1 to K. The similarity can be computed as in formula (1), for example, or determined based on a vector distance, a dot product, or other measures, which are not repeated here.
Then, according to the similarities a_ij between the vector element x_i and the label vectors l_j, the label weight factor β_j corresponding to each label vector l_j is determined.

In one example, the similarity a_ij is directly used as the label weight factor β_j corresponding to the label vector l_j. In another embodiment, a_ij is normalized over the similarities between the vector element x_i and all the label vectors, giving the label weight factor β_j corresponding to l_j. For example, the label weight factor can be determined by the following formula:

β_j = exp(a_ij) / Σ_{k=1..K} exp(a_ik)
After the label weight factors β_j of the label vectors l_j have been determined for the vector element x_i, a weighted sum of the label vectors can be computed based on these factors, converting the vector element x_i into the fusion vector q_i:

q_i = Σ_{j=1..K} β_j l_j
FIG. 6 is a schematic diagram of fusion conversion processing of an input vector sequence in one embodiment. As shown in FIG. 6, with the N vector elements of the input vector sequence X on one axis and the K label vectors on the other, the similarity between each vector element x_i and each label vector l_j is computed, forming a similarity matrix. For each vector element x_i, the label weight factor of each label vector is determined from the similarities associated with that element in the matrix, and a weighted sum of the label vectors is computed based on the label weight factors, giving the fusion vector q_i corresponding to x_i.
It can be understood that by converting each vector element x_i in the input vector sequence X into its corresponding fusion vector q_i, the vector sequence X is converted into a fusion sequence Q. Further, by inputting each vector sequence in the aforementioned vector sequence set into the fusion module, the corresponding fusion sequences can be obtained, for example, the fusion sequence Q_W corresponding to the word vector sequence X_W and the fusion sequences Q_S corresponding to the fragment vector sequences X_S.
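A sketch of the fusion conversion, assuming dot-product similarity and softmax-normalized label weight factors consistent with the example formula above (names are illustrative):

```python
import numpy as np

def fuse_with_labels(X, L):
    """Fusion conversion sketch: vector sequence X -> fusion sequence Q.

    X: (N, h) input vector sequence.
    L: (K, h) label vectors.
    Returns Q of shape (N, h); row i is the fusion vector q_i,
    a weighted sum of the K label vectors.
    """
    A = X @ L.T                                           # a_ij
    W = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # beta_j per element
    return W @ L                                          # Q
```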
In one embodiment, the original vector sequences (X_W, X_S1, X_S2, …) and the fusion sequences obtained as above (Q_W, Q_S1, Q_S2, …) can be concatenated to obtain the total sequence X'. The third attention module 133 then processes this total sequence X' to obtain the third attention representation S_self.
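Building on the sketches above, the total sequence and the third attention representation could then be assembled as follows; all names here are hypothetical placeholders for the sequences and vectors described above.

```python
import numpy as np

# X_W, X_S1, X_S2: original word/fragment vector sequences, each (n_i, h)
# L: (K, h) label vectors; v: (h,) attention vector (as in earlier sketches)
Q_W, Q_S1, Q_S2 = (fuse_with_labels(X, L) for X in (X_W, X_S1, X_S2))
X_total = np.concatenate([X_W, X_S1, X_S2, Q_W, Q_S1, Q_S2], axis=0)
S_self = self_attention(X_total, v)   # third attention representation
```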
It can be understood that in the case where the attention layer includes the first attention module 131 and the third attention module 133, step 26 of determining the characterization vector S in FIG. 2 may include determining the characterization vector S based on the first attention representation S_label and the third attention representation S_self. Specifically, S_label and S_self can be combined in various ways to obtain the characterization vector S.
In the case where the attention layer includes all three of the first attention module 131, the second attention module 132, and the third attention module 133, step 26 of determining the characterization vector S in FIG. 2 may include determining the characterization vector S based on the first attention representation S_label, the second attention representation S_intra, and the third attention representation S_self. Specifically, the first, second, and third attention representations can be weighted and summed based on predetermined weight coefficients to obtain the characterization vector S, as shown in the following formula:

S = ω_1 S_label + ω_2 S_intra + ω_3 S_self      (9)

where ω_1, ω_2, and ω_3 are weight coefficients, which may be preset hyperparameters.
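For concreteness, equation (9) amounts to the following one-liner; the coefficient values are assumed for illustration only, since in practice they are preset hyperparameters.

```python
# Assumed hyperparameter values, for illustration only.
w1, w2, w3 = 0.4, 0.3, 0.3
S = w1 * S_label + w2 * S_intra + w3 * S_self   # characterization vector
```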
FIG. 7 is a schematic diagram of the attention processing of the attention layer in one embodiment, showing the inputs and outputs of each attention module in the case where the attention layer includes the first, second, and third attention modules.
As shown in the figure, the input of the first attention module includes the vector sequence set composed of the word vector sequence X_W and the fragment vector sequences X_S, together with the K label vectors. For each vector sequence X in the set, the first attention module obtains the first sequence vector of that sequence according to the similarities between its vector elements and the K label vectors. By combining the first sequence vectors, the first attention representation S_label of the input text is obtained.
The input of the second attention module includes the aforementioned vector sequence set. For each vector sequence X in the set, the second attention module obtains the second sequence vector of that sequence according to the similarities between its vector elements. By combining the second sequence vectors, the second attention representation S_intra of the input text is obtained.
The input of the fusion module includes the aforementioned vector sequence set and the K label vectors. The fusion module converts each vector sequence X in the set into a fusion sequence Q through fusion conversion processing, and outputs the fusion sequences corresponding to the vector sequences in the set.
The input of the third attention module is the total sequence formed from the vector sequences in the aforementioned set together with the fusion sequences. The third attention module performs self-attention processing on this total sequence, obtaining the third attention representation S_self of the input text.
The final characterization vector of the input text can be obtained by combining the outputs of the first, second, and third attention modules.
On the basis of FIG. 1 and FIG. 2, the classification prediction process for input text has been described above, both for the case where the attention layer includes the first attention module and for the cases where it further includes the second attention module and/or the third attention module. It should be understood that this classification prediction process is applicable both in the training phase of the text classification model and in the use phase after training is completed.
In the training phase of the text classification model, the text input to the model is a training text, which has a category label y indicating its true category. In the training phase, after the category prediction result y' of the training text is obtained through the method steps of FIG. 2, the model further needs to be trained based on this category prediction result. The training process is shown in FIG. 8.
Specifically, FIG. 8 shows the method steps further included in the model training phase. As shown in FIG. 8, in step 81, a text prediction loss L_text is obtained according to the category prediction result y' for the training text and the category label y of the training text.
It can be understood that the category prediction result y' is obtained by the classifier 14 applying a predetermined classification function to the characterization vector S of the input text. The category prediction result can therefore be expressed as:

y' = f_c(S)      (10)

where f_c is the classification function. Generally, the category prediction result y' includes the predicted probabilities that the current training text belongs to each of the K predetermined categories. The text prediction loss L_text can then be obtained through a loss function in cross-entropy form, based on the probability distribution indicated by y' and the true category indicated by the label y. In other embodiments, other known forms of loss function can also be used to obtain the text prediction loss L_text.
In step 82, a total prediction loss L is determined at least according to the above text prediction loss L_text. In one example, the text prediction loss is taken directly as the total prediction loss L.
Next, in step 83, the text classification model is updated in the direction that reduces the total prediction loss L. Specifically, gradient descent, back propagation, or similar methods can be used to adjust the model parameters of the text classification model so that the total prediction loss L decreases until a predetermined convergence condition is reached, thereby training the model.
Further, in one embodiment, the aforementioned K label vectors are used once more when computing the total prediction loss. Specifically, the K label vectors l_j (j from 1 to K) corresponding to the K categories can be input to the classifier 14 respectively, so that the classifier 14 performs classification prediction based on each input label vector, obtaining K corresponding label prediction results, where the label prediction result y''_j corresponding to the label vector l_j can be expressed as:

y''_j = f_c(l_j)      (11)
Then, the K categories are compared with their corresponding label prediction results, and a label prediction loss L_label is obtained based on the comparison results. Specifically, for each category, a cross-entropy loss function can be used to obtain the label prediction loss under that category, and the label prediction losses of the categories are then summed to obtain the total label prediction loss L_label.
In the case where the label prediction loss is obtained using the label vectors, step 82 of determining the total loss in FIG. 8 may include determining the total loss L according to the text prediction loss L_text and the label prediction loss L_label. Specifically, in one embodiment, the total loss L may be determined as:

L = L_text + γ L_label      (12)

where γ is a hyperparameter.
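As a sketch, the training objective of equations (10) to (12) might be computed as follows, assuming cross-entropy for both loss terms and an assumed value of γ; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def total_loss(text_logits, y, label_logits, gamma=0.1):
    """Total prediction loss L = L_text + gamma * L_label (equation (12)).

    text_logits: (K,) classifier output f_c(S) for the training text.
    y: () long tensor, true category index of the training text.
    label_logits: (K, K) classifier outputs f_c(l_j) for the K label
                  vectors; label vector l_j should be predicted as class j.
    gamma: hyperparameter; the value here is assumed for illustration.
    """
    L_text = F.cross_entropy(text_logits.unsqueeze(0), y.unsqueeze(0))
    targets = torch.arange(label_logits.size(0))   # class j for label l_j
    L_label = F.cross_entropy(label_logits, targets)
    return L_text + gamma * L_label
```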
By introducing into the total loss a label prediction loss determined based on the label vectors, the classifier can be trained better and in a more targeted manner.
After the text classification model has been trained using a large number of training texts, the model can be used to perform classification prediction on input texts of unknown category. As described above, since the classification prediction model fuses semantic information at the level of text fragments of different lengths with the semantic information of the label description texts, it can classify text with higher accuracy.
According to an embodiment of another aspect, an apparatus for classification prediction using a text classification model is provided. The apparatus is used to predict, among K predetermined categories, the category corresponding to an input text. The text classification model used includes an embedding layer, a convolutional layer, an attention layer, and a classifier, the attention layer further including a first attention module, as shown in FIG. 1. The above classification prediction apparatus can be deployed in any device, platform, or device cluster having computing and processing capabilities. FIG. 9 is a schematic block diagram of a text classification prediction apparatus according to one embodiment. As shown in FIG. 9, the prediction apparatus 900 includes the following units.
The label vector obtaining unit 901 is configured to obtain K label vectors respectively corresponding to the K categories, where each label vector is obtained by performing word embedding on the label description text of the corresponding category;

the word sequence obtaining unit 902 is configured to perform word embedding on the input text using the embedding layer, to obtain a word vector sequence;

the fragment sequence obtaining unit 903 is configured to input the word vector sequence into the convolutional layer, where the convolutional layer performs convolution processing on the word vector sequence using several convolution windows corresponding to text fragments of several different lengths, obtaining several fragment vector sequences; the word vector sequence and the several fragment vector sequences constitute a vector sequence set;

the first attention unit 904 is configured to input each vector sequence in the vector sequence set into the first attention module for first attention processing, obtaining a first sequence vector corresponding to each vector sequence; the first attention processing includes determining, according to the similarity between each vector element in the input vector sequence and the K label vectors, a first weight factor corresponding to each vector element, and computing a weighted sum of the vector elements using the first weight factors;

the first representation obtaining unit 905 is configured to obtain the first attention representation of the input text according to the first sequence vectors;

the characterization vector determining unit 906 is configured to determine the characterization vector of the input text at least according to the first attention representation;

the prediction result obtaining unit 907 is configured to input the characterization vector into the classifier, to obtain the category prediction result of the input text among the K categories.
In one embodiment, the input text is a user question; correspondingly, the label description text corresponding to each of the K categories includes a standard question description text.
In one example, the label vector obtaining unit 901 is configured to predetermine the K label vectors in the following manner: for each of the K categories, obtaining the label description text corresponding to the category; performing word embedding on the label description text to obtain the word vectors of the description words contained in it; and combining the word vectors of the description words to obtain the label vector corresponding to the category.
According to one embodiment, in the first attention processing involved in the first attention unit 904, the first weight factor corresponding to each vector element is determined in the following manner: for each vector element in the input vector sequence, computing K similarities between the vector element and the K label vectors; and determining the first weight factor corresponding to the vector element based on the maximum of the K similarities.
Further, the K similarities between the vector element and the K label vectors can be computed in the following manner: computing the cosine similarity between the vector element and each label vector; or determining the similarity based on the Euclidean distance between the vector element and each label vector; or determining the similarity based on the dot product of the vector element and each label vector.
In one example, determining the first weight factor corresponding to the vector element based on the maximum of the K similarities may include: determining the mutual attention score of the vector element based on the maximum of the K similarities; and normalizing the mutual attention score of the vector element according to the mutual attention scores of the vector elements, to obtain the first weight factor corresponding to the vector element.
According to one embodiment, the first attention representation of the input text is obtained by combining the first sequence vectors, where the combining includes one of the following: summation, weighted summation, or averaging.
According to one embodiment, the attention layer of the text classification model further includes a second attention module. Correspondingly, the apparatus 900 further includes (not shown in the figure) a second attention unit and a second representation obtaining unit. The second attention unit is configured to input each vector sequence in the vector sequence set into the second attention module for second attention processing, obtaining a second sequence vector corresponding to each vector sequence; the second attention processing includes, for each vector element in the input vector sequence, determining a second weight factor corresponding to the vector element according to the similarity between the vector element and each other vector element in the input vector sequence, and computing a weighted sum of the vector elements of the input sequence using the second weight factors. The second representation obtaining unit is configured to obtain the second attention representation of the input text according to the second sequence vectors.
In this case, the characterization vector determining unit 906 in FIG. 9 is configured to determine the characterization vector according to the first attention representation and the second attention representation.
More specifically, in the second attention processing involved in the second attention unit, the second weight factor corresponding to a vector element can be determined by computing the similarities between the vector element and each of the other vector elements, and determining the second weight factor corresponding to the vector element based on the average of these similarities.
According to another embodiment, the attention layer further includes a third attention module in which an attention vector is maintained. Correspondingly, the apparatus 900 further includes (not shown in the figure) a total sequence forming unit and a third attention unit. The total sequence forming unit is configured to form a total sequence at least based on the concatenation of the vector sequences in the vector sequence set. The third attention unit is configured to perform third attention processing on the total sequence using the third attention module; the third attention processing includes, for each vector element in the total sequence, determining a third weight factor corresponding to the vector element according to the similarity between the vector element and the attention vector, and computing a weighted sum of the vector elements of the total sequence using the third weight factors, to obtain the third attention representation of the input text.
In the case where the attention layer includes the first attention module and the third attention module, the aforementioned characterization vector determining unit 906 is configured to determine the characterization vector according to the first attention representation and the third attention representation.
In the case where the attention layer includes the first attention module, the second attention module, and the third attention module, the aforementioned characterization vector determining unit 906 is configured to determine the characterization vector according to the first attention representation, the second attention representation, and the third attention representation.
Specifically, in one example, the characterization vector determining unit 906 may compute a weighted sum of the first attention representation, the second attention representation, and the third attention representation based on predetermined weight coefficients, to obtain the characterization vector.
In one embodiment, the attention layer further includes a fusion module. Correspondingly, the apparatus 900 further includes a fusion unit (not shown), configured to input each vector sequence in the vector sequence set into the fusion module for fusion conversion processing, obtaining a fusion sequence corresponding to each vector sequence. The fusion conversion processing includes, for each vector element in the input vector sequence, determining, according to the similarity between the vector element and each of the K label vectors, a label weight factor corresponding to each label vector, and converting the vector element, based on the label weight factors, into a fusion vector that is a weighted sum of the K label vectors, thereby converting the input vector sequence into a corresponding fusion sequence.
In this case, the total sequence forming unit may be configured to concatenate the vector sequences and the fusion sequences to obtain the total sequence.
In one embodiment, the input text is a training text having a category label indicating its true category. The apparatus 900 further includes a training unit (not shown), configured to obtain a text prediction loss according to the category prediction result and the category label; determine a total prediction loss at least according to the text prediction loss; and update the text classification model in a direction that reduces the total prediction loss.
In yet another embodiment, the training unit is further configured to: input the K label vectors corresponding to the K categories into the classifier respectively, obtaining K corresponding prediction results; compare the K categories with their corresponding prediction results, obtaining a label prediction loss based on the comparison results; and determine the total prediction loss according to the text prediction loss and the label prediction loss.
In this way, the above apparatus uses the text classification model to achieve accurate classification of input text.
According to an embodiment of another aspect, a computer-readable storage medium is further provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method described in conjunction with FIG. 2.
According to an embodiment of yet another aspect, a computing device is further provided, including a memory and a processor; executable code is stored in the memory, and when the processor executes the executable code, the method described in conjunction with FIG. 2 is implemented.
Those skilled in the art should be aware that, in one or more of the above examples, the functions described in this application can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific implementations described above further explain the purpose, technical solutions, and beneficial effects of this application in detail. It should be understood that the above are only specific implementations of this application and are not intended to limit its scope of protection; any modification, equivalent replacement, improvement, or the like made on the basis of the technical solutions of this application shall fall within its scope of protection.

Claims (18)

  1. A method for classification prediction using a text classification model, for predicting, among K predetermined categories, the category corresponding to an input text, the text classification model comprising an embedding layer, a convolutional layer, an attention layer, and a classifier, the attention layer comprising a first attention module, the method comprising:
    obtaining K label vectors respectively corresponding to the K categories, wherein each label vector is obtained by performing word embedding on the label description text of the corresponding category;
    performing word embedding on the input text using the embedding layer, to obtain a word vector sequence;
    inputting the word vector sequence into the convolutional layer, the convolutional layer performing convolution processing on the word vector sequence using several convolution windows corresponding to text fragments of several different lengths, to obtain several fragment vector sequences, the word vector sequence and the several fragment vector sequences constituting a vector sequence set;
    inputting each vector sequence in the vector sequence set into the first attention module for first attention processing, to obtain a first sequence vector corresponding to each vector sequence, wherein the first attention processing comprises determining, according to the similarity between each vector element in the input vector sequence and the K label vectors, a first weight factor corresponding to each vector element, and computing a weighted sum of the vector elements using the first weight factors;
    obtaining a first attention representation of the input text according to the first sequence vectors;
    determining a characterization vector of the input text at least according to the first attention representation; and
    inputting the characterization vector into the classifier, to obtain a category prediction result of the input text among the K categories.
  2. The method according to claim 1, wherein the input text is a user question, and the label description text corresponding to each of the K categories comprises a standard question description text.
  3. The method according to claim 1 or 2, wherein the K label vectors are predetermined by:
    for each of the K categories, obtaining the label description text corresponding to the category;
    performing word embedding on the label description text, to obtain word vectors of the description words contained in the label description text; and
    combining the word vectors of the description words, to obtain the label vector corresponding to the category.
  4. The method according to claim 1, wherein determining the first weight factor corresponding to each vector element according to the similarity between each vector element in the input vector sequence and the K label vectors comprises:
    for each vector element in the input vector sequence, computing K similarities between the vector element and the K label vectors; and
    determining the first weight factor corresponding to the vector element based on the maximum of the K similarities.
  5. The method according to claim 4, wherein computing the K similarities between the vector element and the K label vectors comprises:
    computing the cosine similarity between the vector element and each label vector; or
    determining the similarity based on the Euclidean distance between the vector element and each label vector; or
    determining the similarity based on the dot product of the vector element and each label vector.
  6. The method according to claim 4, wherein determining the first weight factor corresponding to the vector element based on the maximum of the K similarities comprises:
    determining a mutual attention score of the vector element based on the maximum of the K similarities; and
    normalizing the mutual attention score of the vector element according to the mutual attention scores of the vector elements, to obtain the first weight factor corresponding to the vector element.
  7. The method according to claim 1, wherein obtaining the first attention representation of the input text according to the first sequence vectors comprises:
    combining the first sequence vectors to obtain the first attention representation, the combining comprising one of the following: summation, weighted summation, and averaging.
  8. The method according to claim 1, wherein the attention layer further comprises a second attention module, and the method further comprises:
    inputting each vector sequence in the vector sequence set into the second attention module for second attention processing, to obtain a second sequence vector corresponding to each vector sequence, wherein the second attention processing comprises, for each vector element in the input vector sequence, determining a second weight factor corresponding to the vector element according to the similarity between the vector element and each other vector element in the input vector sequence, and computing a weighted sum of the vector elements of the input sequence using the second weight factors; and
    obtaining a second attention representation of the input text according to the second sequence vectors;
    wherein determining the characterization vector of the input text at least according to the first attention representation comprises determining the characterization vector according to the first attention representation and the second attention representation.
  9. The method according to claim 8, wherein determining the second weight factor corresponding to the vector element according to the similarity between the vector element and each other vector element in the input vector sequence comprises:
    computing the similarities between the vector element and each of the other vector elements; and
    determining the second weight factor corresponding to the vector element based on the average of these similarities.
  10. The method according to claim 1, wherein the attention layer further comprises a third attention module in which an attention vector is maintained, and the method further comprises:
    forming a total sequence at least based on the concatenation of the vector sequences in the vector sequence set;
    performing third attention processing on the total sequence using the third attention module, the third attention processing comprising, for each vector element in the total sequence, determining a third weight factor corresponding to the vector element according to the similarity between the vector element and the attention vector, and computing a weighted sum of the vector elements of the total sequence using the third weight factors, to obtain a third attention representation of the input text;
    wherein determining the characterization vector of the input text at least according to the first attention representation comprises determining the characterization vector according to the first attention representation and the third attention representation.
  11. The method according to claim 8, wherein the attention layer further comprises a third attention module in which an attention vector is maintained, and the method further comprises:
    forming a total sequence at least based on the concatenation of the vector sequences in the vector sequence set;
    performing third attention processing on the total sequence using the third attention module, the third attention processing comprising, for each vector element in the total sequence, determining a third weight factor corresponding to the vector element according to the similarity between the vector element and the attention vector, and computing a weighted sum of the vector elements of the total sequence using the third weight factors, to obtain a third attention representation of the input text;
    wherein determining the characterization vector of the input text at least according to the first attention representation comprises determining the characterization vector according to the first attention representation, the second attention representation, and the third attention representation.
  12. The method according to claim 10 or 11, wherein the attention layer further comprises a fusion module, and before the forming of the total sequence, the method further comprises:
    inputting each vector sequence in the vector sequence set into the fusion module for fusion conversion processing, to obtain a fusion sequence corresponding to each vector sequence, wherein the fusion conversion processing comprises, for each vector element in the input vector sequence, determining, according to the similarity between the vector element and each of the K label vectors, a label weight factor corresponding to each label vector, and converting the vector element, based on the label weight factors, into a fusion vector that is a weighted sum of the K label vectors, thereby converting the input vector sequence into a corresponding fusion sequence;
    wherein the forming of the total sequence comprises concatenating the vector sequences and the fusion sequences to obtain the total sequence.
  13. The method according to claim 11, wherein determining the characterization vector comprises:
    computing a weighted sum of the first attention representation, the second attention representation, and the third attention representation based on predetermined weight coefficients, to obtain the characterization vector.
  14. The method according to claim 1, wherein the input text is a training text having a category label indicating its true category, and the method further comprises:
    obtaining a text prediction loss according to the category prediction result and the category label;
    determining a total prediction loss at least according to the text prediction loss; and
    updating the text classification model in a direction that reduces the total prediction loss.
  15. The method according to claim 14, further comprising:
    inputting the K label vectors corresponding to the K categories into the classifier respectively, to obtain K corresponding prediction results; and
    comparing the K categories with their corresponding prediction results, and obtaining a label prediction loss based on the comparison results;
    wherein determining the total prediction loss comprises determining the total prediction loss according to the text prediction loss and the label prediction loss.
  16. An apparatus for classification prediction using a text classification model, for predicting, among K predetermined categories, the category corresponding to an input text, the text classification model comprising an embedding layer, a convolutional layer, an attention layer, and a classifier, the attention layer comprising a first attention module, the apparatus comprising:
    a label vector obtaining unit, configured to obtain K label vectors respectively corresponding to the K categories, wherein each label vector is obtained by performing word embedding on the label description text of the corresponding category;
    a word sequence obtaining unit, configured to perform word embedding on the input text using the embedding layer, to obtain a word vector sequence;
    a fragment sequence obtaining unit, configured to input the word vector sequence into the convolutional layer, the convolutional layer performing convolution processing on the word vector sequence using several convolution windows corresponding to text fragments of several different lengths, to obtain several fragment vector sequences, the word vector sequence and the several fragment vector sequences constituting a vector sequence set;
    a first attention unit, configured to input each vector sequence in the vector sequence set into the first attention module for first attention processing, to obtain a first sequence vector corresponding to each vector sequence, wherein the first attention processing comprises determining, according to the similarity between each vector element in the input vector sequence and the K label vectors, a first weight factor corresponding to each vector element, and computing a weighted sum of the vector elements using the first weight factors;
    a first representation obtaining unit, configured to obtain a first attention representation of the input text according to the first sequence vectors;
    a characterization vector determining unit, configured to determine a characterization vector of the input text at least according to the first attention representation; and
    a prediction result obtaining unit, configured to input the characterization vector into the classifier, to obtain a category prediction result of the input text among the K categories.
  17. A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed in a computer, the computer is caused to perform the method of any one of claims 1-15.
  18. A computing device, comprising a memory and a processor, wherein executable code is stored in the memory, and when the processor executes the executable code, the method of any one of claims 1-15 is implemented.
PCT/CN2020/134518 2020-01-16 2020-12-08 Method and apparatus for carrying out classification prediction by using text classification model WO2021143396A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010049397.9A CN111291183B (en) 2020-01-16 2020-01-16 Method and device for carrying out classification prediction by using text classification model
CN202010049397.9 2020-01-16

Publications (1)

Publication Number Publication Date
WO2021143396A1

Family

ID=71025468

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/134518 WO2021143396A1 (en) 2020-01-16 2020-12-08 Method and apparatus for carrying out classification prediction by using text classification model

Country Status (2)

Country Link
CN (1) CN111291183B (en)
WO (1) WO2021143396A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291183B (en) * 2020-01-16 2021-08-03 支付宝(杭州)信息技术有限公司 Method and device for carrying out classification prediction by using text classification model
CN111340605B (en) * 2020-05-22 2020-11-24 支付宝(杭州)信息技术有限公司 Method and device for training user behavior prediction model and user behavior prediction
CN112395419B (en) * 2021-01-18 2021-04-23 北京金山数字娱乐科技有限公司 Training method and device of text classification model and text classification method and device
CN113806545B (en) * 2021-09-24 2022-06-17 重庆理工大学 Comment text emotion classification method based on label description generation
CN113838468A (en) * 2021-09-24 2021-12-24 中移(杭州)信息技术有限公司 Streaming voice recognition method, terminal device and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090248394A1 (en) * 2008-03-25 2009-10-01 Ruhi Sarikaya Machine translation in continuous space
CN110046248A (en) * 2019-03-08 2019-07-23 阿里巴巴集团控股有限公司 Model training method, file classification method and device for text analyzing
CN110163220A (en) * 2019-04-26 2019-08-23 腾讯科技(深圳)有限公司 Picture feature extracts model training method, device and computer equipment
CN110347839A (en) * 2019-07-18 2019-10-18 湖南数定智能科技有限公司 A kind of file classification method based on production multi-task learning model
CN110609897A (en) * 2019-08-12 2019-12-24 北京化工大学 Multi-category Chinese text classification method fusing global and local features
CN111291183A (en) * 2020-01-16 2020-06-16 支付宝(杭州)信息技术有限公司 Method and device for carrying out classification prediction by using text classification model

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710800B (en) * 2018-11-08 2021-05-25 北京奇艺世纪科技有限公司 Model generation method, video classification method, device, terminal and storage medium
CN111428520B (en) * 2018-11-30 2021-11-23 腾讯科技(深圳)有限公司 Text translation method and device
CN110134789B (en) * 2019-05-17 2021-05-25 电子科技大学 Multi-label long text classification method introducing multi-path selection fusion mechanism
CN110442707B (en) * 2019-06-21 2022-06-17 电子科技大学 Seq2 seq-based multi-label text classification method
CN110362684B (en) * 2019-06-27 2022-10-25 腾讯科技(深圳)有限公司 Text classification method and device and computer equipment

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761935B (en) * 2021-08-04 2024-02-27 厦门快商通科技股份有限公司 Short text semantic similarity measurement method, system and device
CN113761935A (en) * 2021-08-04 2021-12-07 厦门快商通科技股份有限公司 Short text semantic similarity measurement method, system and device
CN113554241A (en) * 2021-09-02 2021-10-26 国网山东省电力公司泰安供电公司 User layering method and prediction method based on user electricity complaint behaviors
CN113554241B (en) * 2021-09-02 2024-04-26 国网山东省电力公司泰安供电公司 User layering method and prediction method based on user electricity complaint behaviors
CN114092949A (en) * 2021-11-23 2022-02-25 支付宝(杭州)信息技术有限公司 Method and device for training class prediction model and identifying interface element class
CN115795037A (en) * 2022-12-26 2023-03-14 淮阴工学院 Multi-label text classification method based on label perception
CN115795037B (en) * 2022-12-26 2023-10-20 淮阴工学院 Multi-label text classification method based on label perception
CN116561314A (en) * 2023-05-16 2023-08-08 中国人民解放军国防科技大学 Text classification method for selecting self-attention based on self-adaptive threshold
CN116561314B (en) * 2023-05-16 2023-10-13 中国人民解放军国防科技大学 Text classification method for selecting self-attention based on self-adaptive threshold
CN116611057A (en) * 2023-06-13 2023-08-18 北京中科网芯科技有限公司 Data security detection method and system thereof
CN116611057B (en) * 2023-06-13 2023-11-03 北京中科网芯科技有限公司 Data security detection method and system thereof
CN116662556A (en) * 2023-08-02 2023-08-29 天河超级计算淮海分中心 Text data processing method integrating user attributes
CN116662556B (en) * 2023-08-02 2023-10-20 天河超级计算淮海分中心 Text data processing method integrating user attributes

Also Published As

Publication number Publication date
CN111291183B (en) 2021-08-03
CN111291183A (en) 2020-06-16

Similar Documents

Publication Publication Date Title
WO2021143396A1 (en) Method and apparatus for carrying out classification prediction by using text classification model
US11270225B1 (en) Methods and apparatus for asynchronous and interactive machine learning using word embedding within text-based documents and multimodal documents
CN109101537B (en) Multi-turn dialogue data classification method and device based on deep learning and electronic equipment
US20220019745A1 (en) Methods and apparatuses for training service model and determining text classification category
CN110046248B (en) Model training method for text analysis, text classification method and device
US8331655B2 (en) Learning apparatus for pattern detector, learning method and computer-readable storage medium
US20160140425A1 (en) Method and apparatus for image classification with joint feature adaptation and classifier learning
CN111191791A (en) Application method, training method, device, equipment and medium of machine learning model
CN112015868A (en) Question-answering method based on knowledge graph completion
CN112785441B (en) Data processing method, device, terminal equipment and storage medium
CN113268609A (en) Dialog content recommendation method, device, equipment and medium based on knowledge graph
US20180137410A1 (en) Pattern recognition apparatus, pattern recognition method, and computer program product
CN114936623A (en) Multi-modal data fused aspect-level emotion analysis method
CN115222998B (en) Image classification method
US10733483B2 (en) Method and system for classification of data
CN112988970A (en) Text matching algorithm serving intelligent question-answering system
CN110543566B (en) Intention classification method based on self-attention neighbor relation coding
CN111950647A (en) Classification model training method and device
CN111339734A (en) Method for generating image based on text
CN116258938A (en) Image retrieval and identification method based on autonomous evolution loss
CN115293818A (en) Advertisement putting and selecting method and device, equipment and medium thereof
CN114970882A (en) Model prediction method and model system suitable for multiple scenes and multiple tasks
CN115017321A (en) Knowledge point prediction method and device, storage medium and computer equipment
CN111339303A (en) Text intention induction method and device based on clustering and automatic summarization
CN113343666B (en) Method, device, equipment and storage medium for determining confidence of score

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20913983

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20913983

Country of ref document: EP

Kind code of ref document: A1