WO2021143396A1 - Method and apparatus for carrying out classification prediction by using text classification model - Google Patents

Method and apparatus for carrying out classification prediction by using text classification model

Info

Publication number
WO2021143396A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
attention
sequence
label
text
Application number
PCT/CN2020/134518
Other languages
French (fr)
Chinese (zh)
Inventor
Xiong Tao (熊涛)
Original Assignee
Alipay (Hangzhou) Information Technology Co., Ltd. (支付宝(杭州)信息技术有限公司)
Application filed by Alipay (Hangzhou) Information Technology Co., Ltd.
Publication of WO2021143396A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G06F 16/353 - Clustering; Classification into predefined classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks

Definitions

  • One or more embodiments of the present specification relate to the field of machine learning, and in particular, to a method and device for classification prediction using a text classification model.
  • Text classification is a common and typical natural language processing task performed by computers and is widely used in a variety of business scenarios.
  • For example, in an intelligent customer service scenario, the questions raised by users need to be classified as input text, for user intention recognition, automatic question answering, or dispatch to human customer service.
  • In such a scenario, the classification categories can correspond to standard questions organized in advance.
  • In this way, the standard question corresponding to a user's casual, colloquial question description can be determined, and the answer to the question can then be determined and pushed to the user.
  • Alternatively, the classification categories can correspond to human customer service skill groups trained for different knowledge fields.
  • Text classification is also used in various other application scenarios, such as document classification, public opinion analysis, spam identification, and so on.
  • In these scenarios, the accuracy of text classification is the core concern. An improved solution that further increases classification accuracy is therefore desired.
  • One or more embodiments of this specification describe a method and device for text classification prediction using a text classification model.
  • During prediction, the text classification model comprehensively considers the semantic information of text fragments of different lengths and their correlation with the label description texts, thereby improving the accuracy and efficiency of classification prediction.
  • According to a first aspect, a method for classification prediction using a text classification model is provided, where the model is used to predict, among K predetermined categories, the category corresponding to an input text;
  • the text classification model includes an embedding layer, a convolutional layer, an attention layer, and a classifier,
  • the attention layer includes a first attention module, and the method includes: obtaining K label vectors corresponding to the K categories, where each label vector is obtained by word embedding the label description text of the corresponding category; using the embedding layer to perform word embedding on the input text to obtain a word vector sequence; inputting the word vector sequence into the convolutional layer, where the convolutional layer uses several convolution windows corresponding to text fragments of different lengths to perform convolution processing on the word vector sequence, obtaining several fragment vector sequences, the word vector sequence and the fragment vector sequences constituting a vector sequence set; inputting each vector sequence in the vector sequence set into the first attention module for first attention processing, obtaining the first sequence vector corresponding to each vector sequence, where the first attention processing includes determining, for each vector element in the input vector sequence, a first weighting factor according to the similarity between that vector element and the K label vectors, and using the first weighting factors to perform a weighted summation of the vector elements; obtaining a first attention representation of the input text according to the respective first sequence vectors; determining a characterization vector of the input text at least according to the first attention representation; and inputting the characterization vector into the classifier to obtain a category prediction result of the input text among the K categories.
  • the input text is a user question; correspondingly, the label description text corresponding to each of the K categories includes a standard question description text.
  • In different embodiments, the K label vectors are predetermined in the following manner: for each of the K categories, the label description text corresponding to the category is obtained; word embedding is performed on the label description text to obtain the word vector of each description word contained in it; and the word vectors of the description words are synthesized to obtain the label vector corresponding to the category.
  • According to one embodiment, the first weighting factor corresponding to each vector element is determined as follows: for each vector element in the input vector sequence, the K similarities between the vector element and the K label vectors are calculated; based on the maximum of the K similarities, the first weighting factor corresponding to the vector element is determined.
  • In different embodiments, calculating the K similarities between the vector element and the K label vectors may include: calculating the cosine similarity between the vector element and each label vector; or determining the similarity based on the Euclidean distance between the vector element and each label vector; or determining the similarity based on the dot product of the vector element and each label vector.
  • According to one embodiment, determining the first weighting factor corresponding to the vector element specifically includes: determining the mutual attention score of the vector element based on the maximum of the K similarities; and normalizing that mutual attention score over the mutual attention scores of all vector elements to obtain the first weighting factor corresponding to the vector element.
  • In different embodiments, obtaining the first attention representation of the input text according to the respective first sequence vectors may specifically include synthesizing the first sequence vectors to obtain the first attention representation, where the synthesis includes one of the following: summation, weighted summation, and averaging.
  • According to one implementation, the attention layer may further include a second attention module; correspondingly, the method further includes inputting each vector sequence in the vector sequence set into the second attention module for second attention processing, obtaining the second sequence vector corresponding to each vector sequence, where the second attention processing includes, for each vector element in the input vector sequence, determining the second weighting factor corresponding to the vector element according to the similarity between that element and each other vector element in the input vector sequence, and using the second weighting factors to weight and sum the vector elements of the input sequence; and obtaining a second attention representation of the input text according to the respective second sequence vectors.
  • the characterization vector may be determined according to the first attention representation and the second attention representation.
  • Further, in one embodiment, the second weighting factor corresponding to the vector element may be determined as follows: calculating the similarities between the vector element and each of the other vector elements; and determining the second weighting factor corresponding to the vector element based on the average of those similarities.
  • According to one implementation, the attention layer further includes a third attention module in which an attention vector is maintained; the method then further includes forming a total sequence based at least on splicing the vector sequences in the vector sequence set, and using the third attention module to perform third attention processing on the total sequence, where the third attention processing includes, for each vector element in the total sequence, determining the third weighting factor corresponding to the vector element according to the similarity between that element and the attention vector, and using the third weighting factors to weight and sum the vector elements of the total sequence, obtaining the third attention representation of the input text.
  • the characterization vector may be determined according to the first attention representation and the third attention representation.
  • In the case that the attention layer includes a first attention module, a second attention module, and a third attention module,
  • the characterization vector may be determined based on the first attention representation, the second attention representation, and the third attention representation.
  • Specifically, the first attention representation, the second attention representation, and the third attention representation may be weighted and summed based on predetermined weight coefficients to obtain the characterization vector.
  • According to one embodiment, the attention layer further includes a fusion module; before forming the total sequence input to the third attention module, the method further includes: inputting each vector sequence in the vector sequence set into the fusion module for fusion conversion processing, obtaining the fusion sequence corresponding to each vector sequence, where the fusion conversion processing includes, for each vector element in the input vector sequence, determining the label weight factor corresponding to each label vector according to the similarity between that element and each of the K label vectors, and converting the vector element into a fusion vector given by the weighted sum of the K label vectors based on the label weight factors, thereby converting the input vector sequence into the corresponding fusion sequence.
  • each vector sequence and each fusion sequence may be spliced to obtain the total sequence, which is input into the third attention module.
  • the input text is training text
  • the training text corresponds to a category label indicating its true category
  • In this case, the method further includes: obtaining a text prediction loss according to the category prediction result and the category label; determining a total prediction loss at least according to the text prediction loss; and updating the text classification model in the direction that reduces the total prediction loss, thereby training the text classification model.
  • Further, in one embodiment, the method further includes: inputting the K label vectors corresponding to the K categories into the classifier to obtain the corresponding K prediction results; and comparing each of the K categories with its corresponding prediction result, obtaining a label prediction loss based on the comparison results. In this case, the total loss can be determined according to the text prediction loss and the label prediction loss, so as to perform model training.
  • In another aspect, a device for classification prediction using a text classification model is provided, where the model is used to predict, among K predetermined categories, the category corresponding to an input text;
  • the text classification model includes an embedding layer, a convolutional layer, an attention layer, and a classifier,
  • the attention layer includes a first attention module,
  • the device includes: a label vector acquisition unit configured to acquire K label vectors corresponding to the K categories, where each label vector is obtained by word embedding the label description text of the corresponding category;
  • a word sequence obtaining unit configured to use the embedding layer to perform word embedding on the input text to obtain a word vector sequence;
  • a fragment sequence obtaining unit configured to input the word vector sequence into the convolutional layer, where the convolutional layer uses several convolution windows corresponding to text fragments of different lengths to perform convolution processing on the word vector sequence, obtaining several fragment vector sequences;
  • the word vector sequence and the fragment vector sequences constitute a vector sequence set;
  • and a first attention unit configured to input each vector sequence in the vector sequence set into the first attention module for first attention processing.
  • In another aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to execute the method of the first aspect.
  • In yet another aspect, a computing device is provided, including a memory and a processor, where the memory stores executable code; when the processor executes the executable code, the method of the first aspect is implemented.
  • Through the embodiments of this specification, the convolutional layer and the attention layer of the text classification model comprehensively consider text fragments of different lengths and their similarity to the label vectors to obtain the characterization vector, so that text classification based on this vector takes more account of contextual semantic information at different lengths and of relevance to the label description texts, yielding more accurate category prediction results.
  • Fig. 1 is a schematic diagram of a text classification model according to an embodiment disclosed in this specification;
  • Fig. 2 shows a flowchart of a method for text classification using a text classification model according to an embodiment;
  • Fig. 3 shows a schematic diagram of performing convolution processing on a word vector sequence in an embodiment;
  • Fig. 4 shows a schematic diagram of performing first attention processing on an input vector sequence in an embodiment;
  • Fig. 5 shows a schematic diagram of performing second attention processing on an input vector sequence in an embodiment;
  • Fig. 6 shows a schematic diagram of performing fusion conversion processing on an input vector sequence in an embodiment;
  • Fig. 7 shows a schematic diagram of the attention processing of the attention layer in an embodiment;
  • Fig. 8 shows the method steps further included in the model training stage;
  • Fig. 9 shows a schematic block diagram of a text classification prediction device according to an embodiment.
  • In the embodiments of this specification, a new text classification model is proposed, which further improves text classification and prediction by comprehensively considering the information of text fragments and of the label description texts.
  • FIG. 1 is a schematic diagram of a text classification model according to an embodiment disclosed in this specification.
  • the text classification model includes an embedding layer 11, a convolutional layer 12, an attention layer 13, and a classifier 14.
  • the embedding layer 11 uses a specific word embedding algorithm to convert each input word into a word vector. Using the embedding layer 11, the label description texts corresponding to the K categories as classification targets can be converted into K label vectors in advance. When classifying and predicting the input text, the embedding layer 11 embeds the input text and converts it into a sequence of word vectors.
  • the convolution layer 12 is used to perform convolution processing on the word vector sequence.
  • Specifically, the convolutional layer 12 uses multiple convolution kernels, or convolution windows, of different widths to perform convolution processing, thereby obtaining multiple fragment vector sequences, which represent the input text at the level of text fragments of different lengths.
  • the attention layer 13 adopts an attention mechanism and combines label vectors to process the above-mentioned vector sequences.
  • the attention layer 13 may include a first attention module 131 for performing first attention processing on the input vector sequence.
  • The first attention processing includes synthesizing the vector elements according to the similarity between each vector element in the input vector sequence and the aforementioned K label vectors, so as to obtain a sequence vector corresponding to the input vector sequence. The first attention processing can therefore also be referred to as label attention processing, and the first attention module as a mutual attention module (with labels).
  • the attention layer 13 may further include a second attention module 132 and/or a third attention module 133.
  • the second attention module 132 may be called an intra-attention module, which is used to synthesize each vector element according to the similarity between each vector element and other vector elements in the input vector sequence.
  • the third attention module 133 may be called a self-attention module, which is used to synthesize each vector element according to the similarity between each vector element in the input vector sequence and the attention vector.
  • the characterization vector of the input text can be obtained and input into the classifier 14.
  • the classifier 14 determines the classification corresponding to the input text based on the characterization vector, and realizes the classification prediction of the text.
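As a concrete illustration of this last step, the following is a minimal sketch (in Python/NumPy) of how a classifier might map the characterization vector to category probabilities. The patent does not specify the classifier's internal form; the linear-plus-softmax structure and all names here are assumptions for illustration.

```python
import numpy as np

def classify(S, W, b):
    """Map the characterization vector S (h,) to probabilities over K categories.
    W (h, K) and b (K,) are trainable classifier parameters (assumed linear form)."""
    z = S @ W + b                  # K category scores
    e = np.exp(z - z.max())        # numerically stable softmax
    return e / e.sum()             # predicted probability per category
```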
  • the text classification model shown in Figure 1 has at least the following characteristics.
  • the text classification model characterizes the input text at the level of text fragments of different lengths, and obtains multiple fragment-level vector sequences, so as to better explore the semantic information of contexts of different lengths.
  • In addition, the text classification model in this embodiment also embeds the label description text of each category, obtaining a label vector representation that carries semantic information.
  • In the attention layer, the sequence representation of each sequence is synthesized based on the similarity between the elements of the word vector sequence and fragment vector sequences on the one hand and the label vectors on the other.
  • In this way, the final representation vector of the input text contains similarity information between the vector sequences of different levels (the word level, and the levels of text fragments of different lengths) and the label vectors, so that both the contextual information of the input text and its semantic similarity to the label description texts are better used to classify the text, thereby improving classification accuracy.
  • Fig. 2 shows a flowchart of a method for text classification using a text classification model according to an embodiment. It can be understood that the method can be executed by any device, device, platform, or device cluster with computing and processing capabilities. As shown in Figure 2, the text classification process includes at least the following steps.
  • First, in step 21, K label vectors corresponding to the K categories serving as classification targets are obtained, where each label vector is obtained by word embedding the label description text of the corresponding category.
  • In a classification task, labels are generally used to represent the K categories.
  • The labels are, for example, the numbers 1 to K, the id numbers of the categories, or the one-hot codes of the K categories, and so on.
  • In these forms, the label itself often does not contain semantic information, but is just a code for the category.
  • On the other hand, each category often has corresponding description information describing the characteristics of its content, which can be used as the description information for the label, that is, the label description text.
  • The label description text therefore often contains semantic information related to the corresponding category.
  • For example, in an intelligent customer service scenario, the K categories serving as classification targets correspond to K predetermined standard questions.
  • In this case, the label description text of each category is the standard question description text corresponding to that category.
  • For example, the label description text of category 1 is the standard question 1 "How to repay Huabei" under this category,
  • and the label description text of category 2 is the standard question 2 "How much money can I borrow" under that category.
  • In another example, the classification targets are K categories corresponding to K predetermined human customer service skill groups.
  • In this case, the label description text of each category may be a description of the corresponding skill group, for example including the knowledge field of the skill group.
  • Similarly, in other text classification scenarios, the label description text corresponding to each category can be obtained.
  • By performing word embedding on the label description text, the label vector corresponding to each category can be obtained.
  • Specifically, the process of converting the label description text of a category into a label vector may include the following steps.
  • a specific word embedding algorithm is used to embed each descriptive word contained in the label description text to obtain the word vector of each descriptive word.
  • the aforementioned specific word embedding algorithm may be an algorithm in an existing word embedding tool, such as word2vec, or a pre-trained word embedding algorithm for a specific text scene. Assuming that the specific word embedding algorithm used converts each word into an h-dimensional vector, and the label description text contains m words, in this step, m h-dimensional vectors corresponding to the label description text are obtained.
  • Then, the word vectors of the description words are synthesized to obtain the label vector l_j corresponding to the category C_j.
  • For example, the m h-dimensional vectors obtained in the previous step may be synthesized, and the h-dimensional vector obtained after synthesis used as the label vector l_j.
  • The above synthesis can be averaging, summation, weighted summation, and so on, as sketched below.
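The following is a minimal sketch of this label-vector construction, using averaging as the synthesis step. The `embedding` lookup is a hypothetical stand-in for whatever word embedding algorithm (e.g. word2vec) is used.

```python
import numpy as np

def label_vector(description_words, embedding):
    """Embed each description word of a label description text and average the
    resulting word vectors (averaging is one of the synthesis options above)."""
    word_vecs = np.stack([embedding[w] for w in description_words])  # (m, h)
    return word_vecs.mean(axis=0)                                    # label vector l_j, (h,)

# One label vector per category, e.g. from K standard-question descriptions:
# L = np.stack([label_vector(words, embedding) for words in label_texts])   # (K, h)
```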
  • the embedding layer 11 may convert the label description texts of the K categories into label vectors in advance, and store the obtained K label vectors in the memory for use in classification prediction.
  • In this case, in step 21, the K pre-stored label vectors are read.
  • In another embodiment, in step 21, the respective label description texts of the K categories may be input into the embedding layer, and word embedding performed, to obtain the label vector of each category.
  • In step 22, using the embedding layer 11, word embedding is performed on the input text to obtain a word vector sequence.
  • Specifically, the embedding layer 11 adopts the aforementioned word embedding algorithm to perform word embedding on each word in the input text, so as to obtain the word vector sequence corresponding to the input text. Assuming that the input text contains N words {w_1, w_2, ..., w_N} arranged in sequence, the word vector sequence X_W can be obtained: X_W = (x_1, x_2, ..., x_N), where x_i is the h-dimensional word vector of word w_i.
  • steps 21 and 22 can be executed in parallel or in any order, which is not limited here.
  • Next, in step 23, the above word vector sequence is input into the convolutional layer 12, where several convolution kernels, or convolution windows, of different widths are used to perform convolution processing on the word vector sequence.
  • Fig. 3 shows a schematic diagram of performing convolution processing on a sequence of word vectors in an embodiment.
  • In the example of Fig. 3, a convolution window with a width of 5 (radius 2) is used for convolution processing.
  • For position i, the convolution window covers 5 consecutive word vectors centered on the current word, formed by that word vector and the two word vectors before and after it, namely (x_{i-2}, x_{i-1}, x_i, x_{i+1}, x_{i+2}); a convolution operation is performed on these 5 word vectors to obtain the fragment vector s_i corresponding to position i.
  • The aforementioned convolution operation may be a combination operation of word vectors defined by an activation function.
  • Sliding the window forward, a convolution operation is then performed on the 5 word vectors centered on the next word to obtain the fragment vector s_{i+1} corresponding to position i+1.
  • In this way, fragment vectors corresponding to all N positions are obtained, forming the fragment vector sequence corresponding to this convolution window.
  • To represent text fragments of different lengths, the convolutional layer uses several convolution windows of different widths for processing. For example, in a specific example, four convolution windows with widths of 3, 5, 9, and 15 are used to process the word vector sequence X_W separately, obtaining four fragment vector sequences X_S1, X_S2, X_S3, X_S4, which respectively represent the input text at the level of text fragments with lengths of 3, 5, 9, and 15 words.
  • In practice, the number of convolution windows used and the width of each convolution window can be determined according to factors such as the length of the input text and the lengths of the text fragments to be considered, so that several fragment vector sequences are obtained.
  • The above word vector sequence X_W and the several fragment vector sequences X_S together form a vector sequence set.
  • Each vector sequence in the set contains N h-dimensional vector elements and can be uniformly denoted as a vector sequence X; a convolution sketch follows below.
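A minimal sketch of this convolution step follows. The concatenate-then-project form of the convolution operation and the tanh activation are assumptions; the patent only specifies windows of several widths sliding over the word vector sequence.

```python
import numpy as np

def fragment_sequence(X_w, width, W, b):
    """Convolve the word vector sequence X_w (N, h) with a window of odd `width`,
    zero-padding the ends so every position i gets a fragment vector."""
    N, h = X_w.shape
    r = width // 2
    padded = np.pad(X_w, ((r, r), (0, 0)))         # pad r word vectors on each side
    frags = []
    for i in range(N):
        window = padded[i:i + width].reshape(-1)   # `width` consecutive word vectors
        frags.append(np.tanh(window @ W + b))      # combine into one fragment vector
    return np.stack(frags)                         # fragment vector sequence, (N, h)

# widths = [3, 5, 9, 15]
# vector_sequence_set = [X_w] + [fragment_sequence(X_w, w, Ws[w], bs[w]) for w in widths]
```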
  • Next, in step 24, each vector sequence X in the above vector sequence set is input into the first attention module of the attention layer for first attention processing, obtaining the first sequence vector corresponding to each vector sequence X.
  • As mentioned earlier, the first attention module is also called the mutual attention module (with labels), and correspondingly the first attention processing can also be called label attention processing: the vector elements are synthesized according to their similarity to the label vectors, yielding the corresponding sequence vector.
  • Specifically, the first attention processing may include: for each vector element x_i in the input vector sequence X, determining the first weighting factor corresponding to x_i according to the similarity between x_i and the K label vectors obtained in step 21, and using the first weighting factors to weight and sum the vector elements of the input sequence, obtaining the first sequence vector V1(X) corresponding to the input vector sequence X.
  • Determining the first weighting factor corresponding to the vector element x_i can be performed in the following manner.
  • In one embodiment, the similarity a_ij between the vector element x_i and the label vector l_j can be calculated as the cosine similarity, as shown in formula (1): a_ij = (x_i · l_j) / (||x_i|| ||l_j||)  (1)
  • In another embodiment, the similarity a_ij between x_i and l_j can be determined based on the Euclidean distance between the two: the greater the distance, the smaller the similarity.
  • In yet another embodiment, the similarity a_ij can be directly determined as the dot product (inner product) of x_i and l_j. In more examples, the similarity can also be determined in other ways.
  • Among the K similarities thus obtained for x_i, the maximum value can be determined, and the first weighting factor corresponding to x_i can be determined based on this maximum value.
  • Since the K categories differ from one another, the corresponding K label vectors are usually far apart in the vector space.
  • If the vector element x_i has a high similarity with any label vector l_j, the word or text fragment corresponding to that element is likely to be strongly related to the corresponding category j. The vector element x_i should therefore be given more attention, that is, a higher weight. Hence, in the above step, the first weighting factor of a vector element is determined from the maximum of its similarities.
  • In one embodiment, the maximum of the K similarities is directly used as the first weighting factor corresponding to x_i.
  • In another embodiment, the maximum of the K similarities of x_i is determined as the mutual attention score a_i of the vector element x_i; similarly, the mutual attention scores of all vector elements in the input vector sequence are obtained. Then, the mutual attention score a_i of x_i is normalized over the mutual attention scores of all vector elements, obtaining the first weighting factor α_i corresponding to x_i.
  • In a specific example, the above normalization is realized by the softmax function, as shown in formula (2): α_i = exp(a_i) / Σ_{k=1}^{N} exp(a_k)  (2)
  • Based on the first weighting factors obtained for the vector elements, the first attention module can weight and sum the vector elements to obtain the first sequence vector V1(X) of the input vector sequence X, namely: V1(X) = Σ_{i=1}^{N} α_i x_i. A sketch of this processing follows below.
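A minimal sketch of the first attention processing follows, using cosine similarity, max pooling over the K labels, and softmax normalization per formulas (1) and (2). Names are illustrative.

```python
import numpy as np

def label_attention(X, L):
    """First (label/mutual) attention: X is a vector sequence (N, h), L holds the
    K label vectors (K, h). Returns the first sequence vector V1(X)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Ln = L / np.linalg.norm(L, axis=1, keepdims=True)
    A = Xn @ Ln.T                       # cosine similarities a_ij, (N, K)
    a = A.max(axis=1)                   # mutual attention score a_i: max over labels
    e = np.exp(a - a.max())
    alpha = e / e.sum()                 # first weighting factors, formula (2)
    return alpha @ X                    # V1(X) = sum_i alpha_i * x_i, shape (h,)

# S_label can then be obtained by summing or averaging V1 over all sequences in the set.
```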
  • Fig. 4 shows a schematic diagram of performing first attention processing on an input vector sequence in an embodiment.
  • The K*N-dimensional similarity matrix formed in this way is called the label attention matrix.
  • A max-pooling operation is performed on the label attention matrix, that is, the maximum value in the column corresponding to each vector element is selected to obtain the mutual attention score of that element; the weighting factor of each element is then obtained from its mutual attention score, and the vector elements are weighted and summed based on these weighting factors to obtain the first sequence vector representation V1 of the input vector sequence.
  • the corresponding first sequence vectors can be obtained respectively.
  • For example, the word vector sequence X_W yields the corresponding first sequence vector V1(X_W),
  • and the several fragment vector sequences X_S yield the corresponding first sequence vectors V1(X_S).
  • Next, in step 25, the first attention representation S_label of the input text is obtained from the first sequence vectors corresponding to the above vector sequences.
  • Specifically, the first sequence vectors, including V1(X_W) and the several V1(X_S), can be synthesized.
  • The synthesis method can include summation, weighted summation, averaging, etc., so as to obtain the first attention representation S_label.
  • Then, in step 26, the characterization vector S of the input text is determined at least according to the above first attention representation S_label.
  • In one embodiment, the first attention representation can be directly used as the characterization vector S.
  • Finally, in step 27, the characterization vector S is input into the classifier 14, and through the operation of the classifier, the category prediction result of the input text among the K categories is obtained.
  • According to one embodiment, the attention layer 13 may further include a second attention module 132 and/or a third attention module 133.
  • the processing procedures of the second attention module and the third attention module are described below.
  • As mentioned earlier, the second attention module 132 is also known as the intra-attention module, which synthesizes the vector elements according to the similarity between each vector element in the input vector sequence and the other vector elements.
  • Specifically, when a vector sequence X is input into the second attention module 132, the module performs second attention processing, also called internal attention processing, on the input sequence.
  • The internal attention processing specifically includes: for each vector element x_i in the input vector sequence X, determining the second weighting factor corresponding to x_i according to the similarity between that element and each other vector element x_j in X, and using the second weighting factors to weight and sum the vector elements of the input sequence, obtaining the second sequence vector V2(X) corresponding to the input vector sequence X.
  • Determining the second weighting factor corresponding to the vector element x_i can be performed in the following manner.
  • First, each similarity a_ij between the vector element x_i and each other vector element x_j is calculated.
  • the calculation of the similarity can adopt the cosine similarity, or it can be determined based on other methods such as the vector distance, the vector dot multiplication result, etc., which will not be repeated here.
  • In one embodiment, the mean of the aforementioned similarities is directly used as the second weighting factor corresponding to x_i.
  • In another embodiment, the mean similarity of x_i is determined as the internal attention score a_i of the vector element x_i, and based on the internal attention scores of all vector elements, normalization processing, for example by the softmax function, is performed to obtain the second weighting factor corresponding to x_i.
  • Based on the second weighting factors obtained for the vector elements, the second attention module can weight and sum the vector elements to obtain the second sequence vector V2(X) of the input vector sequence X, namely: V2(X) = Σ_{i=1}^{N} b_i x_i, where b_i is the second weighting factor of x_i. A sketch follows below.
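A minimal sketch of the internal attention processing follows; whether the self-similarity of an element is included in the average is not specified in the text, and this sketch excludes it.

```python
import numpy as np

def intra_attention(X):
    """Second (internal) attention: score each element of X (N, h) by its mean
    cosine similarity to the other elements, then softmax and weighted-sum."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    A = Xn @ Xn.T                       # internal attention matrix, (N, N)
    np.fill_diagonal(A, 0.0)            # drop self-similarity (assumption)
    a = A.sum(axis=1) / (len(X) - 1)    # mean similarity to the other elements
    e = np.exp(a - a.max())
    b = e / e.sum()                     # second weighting factors
    return b @ X                        # V2(X), shape (h,)
```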
  • Fig. 5 shows a schematic diagram of performing second attention processing on an input vector sequence in an embodiment.
  • In the schematic of Fig. 5, the N vector elements of the input vector sequence are arranged as rows and columns respectively, and the similarity between each pair of vector elements x_i and x_j is calculated, forming an N*N-dimensional similarity matrix, called the internal attention matrix.
  • An average-pooling operation is performed on the internal attention matrix, that is, the average of the column of similarity values corresponding to each vector element is calculated to obtain the internal attention score of that element; the weighting factor of each element is then obtained from its internal attention score, and the vector elements are weighted and summed based on these weighting factors to obtain the second sequence vector representation V2 of the input vector sequence.
  • Each vector sequence X in the aforementioned vector sequence set can be input into the second attention module 132 for the above internal attention processing, so as to obtain the respectively corresponding second sequence vectors V2(X), including V2(X_W) corresponding to the word vector sequence X_W and a number of second sequence vectors V2(X_S) corresponding to the fragment vector sequences X_S.
  • Furthermore, the second sequence vectors V2(X) corresponding to the above vector sequences can be synthesized to obtain the second attention representation S_intra of the input text.
  • In this case, the step 26 of determining the characterization vector S in Fig. 2 may include determining the characterization vector S based on the first attention representation S_label and the second attention representation S_intra.
  • Specifically, the first attention representation S_label and the second attention representation S_intra can be synthesized through a variety of methods, such as summation, weighted summation, or averaging, to obtain the characterization vector S.
  • the attention layer 13 may further include a third attention module 133.
  • the third attention module 133 may be referred to as a self-attention module, which is used to perform self-attention processing, that is, to synthesize each vector element according to the similarity between each vector element in the input vector sequence and the attention vector.
  • Specifically, the self-attention module 133 maintains an attention vector v, which has the same dimension, h, as the vectors obtained by word embedding.
  • the parameters contained in the attention vector v can be determined through training.
  • Unlike the first and second attention modules, the third attention module 133 operates on a total sequence X′ formed from the vector sequences in the vector sequence set.
  • The third attention module 133 performs third attention processing, that is, self-attention processing, on the total sequence X′. This specifically includes: for each vector element x_i in the total sequence X′, determining the third weighting factor corresponding to the vector element according to the similarity between x_i and the attention vector v, and using the third weighting factors to weight and sum the vector elements of the total sequence, obtaining the third attention representation of the input text.
  • Determining the third weighting factor corresponding to the vector element x_i can be performed in the following manner.
  • the calculation of similarity can adopt cosine similarity, or it can be determined based on other methods such as vector distance, vector dot multiplication result, etc., which will not be repeated here.
  • Based on the similarity between the vector element x_i and the attention vector v, which serves as the self-attention score of x_i, the third weighting factor corresponding to x_i is determined.
  • In one embodiment, the above self-attention score is directly used as the third weighting factor corresponding to x_i.
  • In another embodiment, the self-attention scores are normalized, and the third weighting factor corresponding to the vector element x_i is thus obtained.
  • In a specific example, the similarity between the vector element x_i and the attention vector v is calculated as the vector dot product, and the normalization is performed by the softmax function, so that the following third weighting factor is obtained: γ_i = exp(v^T x_i) / Σ_{j=1}^{M} exp(v^T x_j), where v^T is the transpose of the attention vector v and M is the number of vector elements contained in the total sequence X′.
  • Based on the third weighting factors, the third attention module can weight and sum the vector elements. Since the total sequence already contains the information of each vector sequence, the result of processing the total sequence can be directly used as the third attention representation S_self of the input text, namely: S_self = Σ_{i=1}^{M} γ_i x_i. A sketch follows below.
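A minimal sketch of the self-attention processing follows, using the dot-product score and softmax normalization of the formula above. The attention vector v would be a trainable parameter.

```python
import numpy as np

def self_attention(X_total, v):
    """Third (self) attention over the total sequence X_total (M, h) with a
    trainable attention vector v (h,). Returns S_self."""
    scores = X_total @ v                 # v^T x_i for each of the M elements
    e = np.exp(scores - scores.max())
    gamma = e / e.sum()                  # third weighting factors
    return gamma @ X_total               # S_self = sum_i gamma_i * x_i, shape (h,)
```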
  • In the above description, the third attention module 133 performs self-attention processing on the total sequence X′ formed by splicing the vector sequences together, obtaining the third attention representation.
  • In another embodiment, each vector sequence can additionally be fused and transformed into a corresponding fusion sequence, and the fusion sequences and the vector sequences spliced together to form a more comprehensive total sequence X′.
  • To this end, in one embodiment, the attention layer 13 further includes a fusion module, which performs fusion conversion processing on an input vector sequence X and converts it into a corresponding fusion sequence Q.
  • The fusion conversion processing may specifically include: for each vector element x_i in the input vector sequence X, determining the label weight factor corresponding to each label vector l_j according to the similarity between x_i and each of the aforementioned K label vectors, and converting x_i, based on the label weight factors, into the fusion vector q_i given by the weighted sum of the K label vectors, thereby converting the input vector sequence X into the corresponding fusion sequence Q.
  • Specifically, the process of converting the vector element x_i into the fusion vector q_i can be performed in the following manner.
  • First, the similarity a_ij between x_i and each label vector l_j is calculated; the similarity calculation can be realized, for example, by formula (1), or determined based on vector distance, dot product, etc., which will not be repeated here.
  • Then, based on the similarities a_ij, the label weight factor corresponding to each label vector l_j is determined.
  • In one embodiment, the similarity a_ij is directly used as the label weight factor w_ij corresponding to the label vector l_j.
  • In another embodiment, the similarities a_ij are normalized, and the normalized values are used as the label weight factors w_ij corresponding to the label vectors l_j.
  • In a specific example, the label weight factor can be determined by the following formula: w_ij = exp(a_ij) / Σ_{k=1}^{K} exp(a_ik).
  • Then, the label vectors can be weighted and summed based on the label weight factors, thereby converting the vector element x_i into the fusion vector q_i: q_i = Σ_{j=1}^{K} w_ij l_j.
  • Fig. 6 shows a schematic diagram of performing fusion conversion processing on an input vector sequence in an embodiment.
  • In the schematic of Fig. 6, with the N vector elements of the input vector sequence X as columns and the K label vectors as rows, the similarity between each vector element x_i and each label vector l_j is calculated, forming a similarity matrix.
  • For each vector element x_i, based on the similarities corresponding to that element in the similarity matrix, the label weight factor corresponding to each label vector is determined, and the label vectors are weighted and summed based on these factors to obtain the fusion vector q_i corresponding to x_i.
  • By thus converting each vector element x_i in the input vector sequence X into the corresponding fusion vector q_i,
  • the vector sequence X can be converted into a fusion sequence Q, as sketched below.
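A minimal sketch of the fusion conversion follows, using dot-product similarity and softmax-normalized label weight factors (one of the options described above).

```python
import numpy as np

def fuse_sequence(X, L):
    """Fusion conversion: replace each element x_i of X (N, h) by the weighted
    sum q_i of the K label vectors L (K, h), weighted by normalized similarity."""
    A = X @ L.T                                    # similarities a_ij, (N, K)
    e = np.exp(A - A.max(axis=1, keepdims=True))
    W = e / e.sum(axis=1, keepdims=True)           # label weight factors per element
    return W @ L                                   # fusion sequence Q, (N, h)

# X_total = np.concatenate(sequences + [fuse_sequence(X, L) for X in sequences])
```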
  • By performing such fusion conversion processing on each vector sequence in the vector sequence set, the corresponding fusion sequences can be obtained, for example, the fusion sequence Q_W corresponding to the word vector sequence X_W and the fusion sequences Q_S corresponding to the fragment vector sequences X_S.
  • Then, each original vector sequence (X_W, X_S1, X_S2, ...) and each fusion sequence (Q_W, Q_S1, Q_S2, ...) obtained above can be spliced to obtain the total sequence X′.
  • The third attention module 133 is then used to process the total sequence X′ to obtain the third attention representation S_self.
  • In the case that the attention layer includes the first and third attention modules, the step 26 of determining the characterization vector S in Fig. 2 may include determining the characterization vector S based on the first attention representation S_label and the third attention representation S_self.
  • For example, the first attention representation S_label and the third attention representation S_self can be synthesized in a variety of ways to obtain the characterization vector S.
  • In the case that the attention layer includes the first, second, and third attention modules, the step 26 of determining the characterization vector S in Fig. 2 may include determining the characterization vector S based on the first attention representation S_label, the second attention representation S_intra, and the third attention representation S_self.
  • Specifically, the first, second, and third attention representations can be weighted and summed based on predetermined weight coefficients to obtain the characterization vector S, as shown in the following formula: S = λ_1 S_label + λ_2 S_intra + λ_3 S_self,
  • where λ_1, λ_2, and λ_3 are weight coefficients, which may be preset hyperparameters.
  • FIG. 7 shows a schematic diagram of attention processing of the attention layer in an embodiment.
  • the schematic diagram shows the input and output of each attention module when the attention layer contains the first, second and third attention modules.
  • As shown in Fig. 7, the input of the first attention module includes the vector sequence set consisting of the word vector sequence X_W and the fragment vector sequences X_S, together with the K label vectors.
  • For each vector sequence X in the set, the first attention module obtains the first sequence vector of that sequence according to the similarity between its vector elements and the K label vectors. By synthesizing the first sequence vectors, the first attention representation S_label of the input text is obtained.
  • The input of the second attention module includes the aforementioned vector sequence set. For each vector sequence X in the set, the second attention module obtains the second sequence vector of that sequence according to the similarities among its vector elements. By synthesizing the second sequence vectors, the second attention representation S_intra of the input text is obtained.
  • the input of the fusion module includes the aforementioned vector sequence set and K label vectors.
  • Through fusion conversion processing, the fusion module converts each vector sequence X in the vector sequence set into a fusion sequence Q, and outputs the fusion sequence corresponding to each vector sequence.
  • The input of the third attention module is the total sequence formed by splicing the vector sequences in the vector sequence set with the corresponding fusion sequences.
  • The third attention module performs self-attention processing on the total sequence and obtains the third attention representation S_self of the input text.
  • The final characterization vector of the input text can then be synthesized based on the outputs of the first, second, and third attention modules.
  • More generally, the attention layer includes the first attention module, and may further include the second attention module and/or the third attention module.
  • the process of classifying and predicting the input text is not only applicable to the training phase of the text classification model, but also applicable to the use phase after the model training is completed.
  • the input text input to the model is training text
  • the training text corresponds to a category label y indicating its true category.
  • the model needs to be trained based on the foregoing category prediction result.
  • the training process is shown in FIG. 8.
  • FIG. 8 shows the method steps further included in the model training stage.
  • First, in step 81, the text prediction loss L_text is obtained according to the category prediction result y′ for the training text and the category label y of the training text.
  • Specifically, the category prediction result y′ is obtained by the classifier 14 applying a predetermined classification function to the characterization vector S of the input text. The category prediction result can therefore be expressed as: y′ = f_c(S),
  • where f_c is the classification function.
  • Generally, the category prediction result y′ includes the predicted probabilities that the current training text belongs to each of the K predetermined categories. The text prediction loss L_text can therefore be obtained, through a loss function in the form of cross entropy, from the probability distribution indicated by y′ and the true classification indicated by the category label y. In other embodiments, other known loss function forms can also be used to obtain the text prediction loss L_text.
  • Then, in step 82, the total prediction loss L is determined based at least on the aforementioned text prediction loss L_text.
  • In one embodiment, the text prediction loss is directly determined as the total prediction loss L.
  • Next, in step 83, the text classification model is updated in the direction that reduces the total prediction loss L.
  • gradient descent, back propagation and other methods can be used to adjust the model parameters in the text classification model, so that the total prediction loss L is reduced until a predetermined convergence condition is reached, thereby realizing the training of the model.
  • Specifically, the K label vectors l_j (j from 1 to K) corresponding to the K categories can be input into the classifier 14 respectively, so that the classifier performs classification prediction based on each input label vector, obtaining the corresponding K label prediction results,
  • where the label prediction result y″_j corresponding to label vector l_j can be expressed as: y″_j = f_c(l_j).
  • Then, the K categories and their corresponding label prediction results are compared respectively, and the label prediction loss L_label is obtained based on the comparison results.
  • Specifically, for each category, a cross-entropy loss function can be used to obtain the label prediction loss under that category; the label prediction losses of the categories are then summed to obtain the total label prediction loss L_label.
  • In this case, the step 82 of determining the total loss in Fig. 8 may include determining the total loss L according to the text prediction loss L_text and the label prediction loss L_label.
  • For example, the total loss L may be determined as: L = L_text + η · L_label,
  • where η is a hyperparameter. A sketch of this training objective follows below.
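A minimal sketch of this training objective follows; the classifier output is taken as a probability vector, and the hyperparameter value is purely illustrative.

```python
import numpy as np

def cross_entropy(probs, true_idx):
    """Cross-entropy loss of a predicted probability vector against a true class."""
    return -np.log(probs[true_idx] + 1e-12)

def total_loss(y_pred, y_true, label_preds, eta=0.1):
    """Total loss L = L_text + eta * L_label: the j-th label vector, fed through
    the classifier, should itself be predicted as category j."""
    L_text = cross_entropy(y_pred, y_true)
    L_label = sum(cross_entropy(label_preds[j], j) for j in range(len(label_preds)))
    return L_text + eta * L_label
```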
  • In this way, the classifier can be trained in a more targeted manner, achieving a better training effect.
  • After training is completed in the above manner, the text classification model can be used to classify and predict input text of unknown category.
  • Since the classification prediction model combines semantic information at the level of text fragments of different lengths with the semantic information of the label description texts, classification prediction of text can be realized with higher accuracy.
  • a device for classification prediction using a text classification model is provided.
  • the device is used to predict the category corresponding to the input text in the predetermined K categories.
  • The text classification model used includes an embedding layer, a convolutional layer, an attention layer, and a classifier, where the attention layer includes the first attention module, as shown in Fig. 1.
  • the above classification prediction device can be deployed in any device, platform or device cluster with computing and processing capabilities.
  • Fig. 9 shows a schematic block diagram of a text classification prediction device according to an embodiment. As shown in FIG. 9, the prediction device 900 includes the following units.
  • the label vector obtaining unit 901 is configured to obtain K label vectors respectively corresponding to the K categories, where each label vector is obtained by word embedding the label description text of the corresponding category;
  • the word sequence obtaining unit 902 is configured to use the embedding layer to perform word embedding on the input text to obtain a word vector sequence;
  • The segment sequence acquiring unit 903 is configured to input the word vector sequence into the convolutional layer, where the convolutional layer uses a number of convolution windows corresponding to text segments of different lengths to perform convolution processing on the word vector sequence, obtaining several fragment vector sequences; the word vector sequence and the fragment vector sequences constitute a vector sequence set;
  • the first attention unit 904 is configured to input each vector sequence in the vector sequence set into the first attention module for first attention processing, obtaining the first sequence vector corresponding to each vector sequence, where the first attention processing includes determining the first weighting factor corresponding to each vector element according to the similarity between each vector element in the input vector sequence and the K label vectors, and using the first weighting factors to perform a weighted summation of the vector elements;
  • the first representation obtaining unit 905 is configured to obtain the first attention representation of the input text according to the respective first sequence vectors;
  • the characterization vector determining unit 906 is configured to determine the characterization vector of the input text at least according to the first attention representation;
  • the prediction result obtaining unit 907 is configured to input the characterization vector into the classifier to obtain category prediction results of the input text in the K categories.
  • the input text is a user question; correspondingly, the label description text corresponding to each of the K categories includes a standard question description text.
  • In different embodiments, the label vector obtaining unit 901 is configured to predetermine the K label vectors in the following manner: for each of the K categories, obtaining the label description text corresponding to the category; performing word embedding on the description text to obtain the word vector of each description word contained in it; and synthesizing the word vectors of the description words to obtain the label vector corresponding to the category.
  • According to one embodiment, the first weighting factor corresponding to each vector element is determined by the following method: for each vector element in the input vector sequence, the K similarities between the vector element and the K label vectors are calculated; based on the maximum of the K similarities, the first weighting factor corresponding to the vector element is determined.
  • the K similarities between the vector element and the K label vectors can be calculated by: calculating the cosine similarity between the vector element and each label vector; or, based on the vector element and each label The Euclidean distance between the vectors determines the similarity; or, based on the dot product result of the vector element and each label vector, the similarity is determined.
  • determining the first weighting factor corresponding to the vector element based on the maximum value of the K similarities may include: determining the mutual attention of the vector element based on the maximum value of the K similarities Force score; according to each mutual attention score corresponding to each vector element, normalize the mutual attention score of the vector element to obtain the first weighting factor corresponding to the vector element.
  • According to an embodiment, the first representation obtaining unit 905 obtains the first attention representation of the input text by synthesizing the respective first sequence vectors, where the synthesis includes one of the following: summation, weighted summation, and averaging.
  • the attention layer of the text classification model further includes a second attention module.
  • In this case, the device 900 further includes (not shown in the figure) a second attention unit and a second representation acquisition unit, where: the second attention unit is configured to input each vector sequence in the vector sequence set into the second attention module for second attention processing, obtaining the second sequence vector corresponding to each vector sequence, where the second attention processing includes, for each vector element in the input vector sequence, determining the second weighting factor corresponding to the vector element according to the similarity between that element and each other vector element in the input vector sequence, and using the second weighting factors to weight and sum the vector elements of the input sequence; and the second representation acquisition unit is configured to obtain the second attention representation of the input text according to the respective second sequence vectors.
  • In this case, the characterization vector determining unit 906 in Fig. 9 is configured to determine the characterization vector according to the first attention representation and the second attention representation.
  • Further, in one embodiment, the second weighting factor corresponding to the vector element can be determined in the following manner: calculating the respective similarities between the vector element and the other vector elements; and determining the second weighting factor corresponding to the vector element based on the average of those similarities.
  • the attention layer further includes a third attention module in which attention vectors are maintained.
  • the device 900 further includes (not shown in the figure) a total sequence forming unit and a third attention unit, wherein,
  • the total sequence forming unit is configured to form a total sequence based at least on the splicing of each vector sequence in the vector sequence set; the third attention unit is configured to use the third attention module to perform a third operation on the total sequence Attention processing, the third attention processing includes, for each vector element in the total sequence, determining a third weight corresponding to the vector element according to the similarity between the vector element and the attention vector Factor, and use the third weighting factor to weight and sum each vector element in the total sequence to obtain the third attention representation of the input text.
  • In this case, the aforementioned characterization vector determining unit 906 is configured to determine the characterization vector according to the first attention representation and the third attention representation.
  • In an embodiment in which the attention layer includes the first, second, and third attention modules, the aforementioned characterization vector determining unit 906 is configured to determine the characterization vector according to the first attention representation, the second attention representation, and the third attention representation.
  • Specifically, the characterization vector determining unit 906 may perform a weighted summation of the first, second, and third attention representations based on predetermined weight coefficients to obtain the characterization vector.
  • the attention layer further includes a fusion module.
  • the device 900 further includes a fusion unit (not shown) configured to input each vector sequence of the vector sequence set into the fusion module for fusion conversion processing, obtaining a fusion sequence for each vector sequence.
  • the fusion conversion processing includes, for each vector element of the input vector sequence, determining a label weighting factor for each of the K label vectors according to the similarity between the vector element and that label vector, and, based on the label weighting factors, converting the vector element into a fusion vector that is the weighted sum of the K label vectors, thereby converting the input vector sequence into a corresponding fusion sequence.
  • the total sequence forming unit may be configured to concatenate the respective vector sequences and the respective fusion sequences to obtain the total sequence.
  • the input text is a training text that has a category label indicating its true category.
  • the device 900 further includes a training unit (not shown) configured to obtain a text prediction loss according to the category prediction result and the category label; determine a total prediction loss at least according to the text prediction loss; and update the text classification model in the direction that reduces the total prediction loss.
  • the training unit is further configured to: input the K label vectors corresponding to the K categories into the classifier to obtain K corresponding prediction results; compare each of the K categories with its corresponding prediction result and obtain a label prediction loss based on the comparison; and determine the total loss according to the text prediction loss and the label prediction loss.
  • the text classification model is used to achieve accurate classification of the input text.
  • a computer-readable storage medium having a computer program stored thereon; when the computer program is executed in a computer, the computer is caused to execute the method described in conjunction with FIG. 2.
  • a computing device including a memory and a processor, the memory storing executable code; when the processor executes the executable code, it implements the method described in conjunction with FIG. 2.

Abstract

A method and apparatus for carrying out classification prediction by using a text classification model. The text classification model comprises an embedding layer, a convolutional layer, an attention layer and a classifier. The method comprises: carrying out word embedding in advance on the label description texts corresponding to K categories to obtain K label vectors; during prediction, carrying out word embedding on the input text by using the embedding layer to obtain a word vector sequence; at the convolutional layer, carrying out convolution processing on the word vector sequence by using convolution windows of different widths to obtain fragment vector sequences; then, at the attention layer, carrying out first attention processing on each vector sequence, the first attention processing comprising determining a weighting factor for each vector element of the vector sequence according to the similarity between that element and the K label vectors, and then carrying out weighted summation to obtain a first sequence vector; obtaining a characterization vector of the input text on the basis of the first sequence vectors of the sequences; and the classifier obtaining a category prediction result of the input text on the basis of the characterization vector.

Description

Summary of the Invention
One or more embodiments of this specification describe a method and device for text classification prediction using a text classification model, in which the model jointly considers the semantic information of text fragments of different lengths and the relevance to the label description texts when making its prediction, thereby improving the accuracy and efficiency of classification prediction.
According to a first aspect, a method for classification prediction using a text classification model is provided, for predicting, among K predetermined categories, the category corresponding to an input text. The text classification model includes an embedding layer, a convolutional layer, an attention layer, and a classifier, the attention layer including a first attention module. The method includes: obtaining K label vectors corresponding to the K categories, each label vector being obtained by word embedding of the label description text of the corresponding category; performing word embedding on the input text using the embedding layer to obtain a word vector sequence; inputting the word vector sequence into the convolutional layer, which performs convolution processing on it using several convolution windows corresponding to text fragments of several different lengths, obtaining several fragment vector sequences, where the word vector sequence and the fragment vector sequences form a vector sequence set; inputting each vector sequence of the vector sequence set into the first attention module for first attention processing, obtaining a first sequence vector for each vector sequence, the first attention processing including determining, according to the similarity between each vector element of the input vector sequence and the K label vectors, a first weighting factor for each vector element, and using the first weighting factors to compute a weighted sum of the vector elements; obtaining a first attention representation of the input text according to the respective first sequence vectors; determining a characterization vector of the input text at least according to the first attention representation; and inputting the characterization vector into the classifier to obtain a category prediction result of the input text among the K categories.
In one embodiment, the input text is a user question; correspondingly, the label description text of each of the K categories includes a standard question description text.
In one implementation, the K label vectors are predetermined in the following manner: for each of the K categories, obtain the label description text corresponding to that category; perform word embedding on the label description text to obtain the word vectors of the description words it contains; and combine the word vectors of the description words to obtain the label vector corresponding to that category.
According to one embodiment, in the first attention processing, the first weighting factor corresponding to each vector element is determined in the following manner: for each vector element of the input vector sequence, calculate the K similarities between that element and the K label vectors; based on the maximum of the K similarities, determine the first weighting factor corresponding to that element.
More specifically, in different embodiments, calculating the K similarities between the vector element and the K label vectors may include: calculating the cosine similarity between the vector element and each label vector; or determining the similarity based on the Euclidean distance between the vector element and each label vector; or determining the similarity based on the dot product of the vector element and each label vector.
Furthermore, in one embodiment, determining the first weighting factor based on the maximum of the K similarities specifically includes: determining the co-attention score of the vector element based on that maximum, and normalizing the co-attention score of the vector element over the co-attention scores of all vector elements to obtain the first weighting factor corresponding to that element.
In one embodiment, obtaining the first attention representation of the input text according to the respective first sequence vectors specifically includes combining the respective first sequence vectors to obtain the first attention representation, the combination being one of: summation, weighted summation, or averaging.
According to one implementation, the attention layer may further include a second attention module; correspondingly, the method further includes: inputting each vector sequence of the vector sequence set into the second attention module for second attention processing, obtaining a second sequence vector for each vector sequence, the second attention processing including, for each vector element of the input vector sequence, determining a second weighting factor corresponding to that element according to the similarity between it and each other vector element of the input vector sequence, and using the second weighting factors to compute a weighted sum of the vector elements of the input sequence; and obtaining a second attention representation of the input text according to the respective second sequence vectors.
When the attention layer includes the first attention module and the second attention module, the characterization vector may be determined according to the first attention representation and the second attention representation.
Further, in the second attention processing, the second weighting factor corresponding to a vector element may be determined by calculating the respective similarities between that element and the other vector elements, and determining the second weighting factor based on the average of those similarities.
According to yet another implementation, the attention layer further includes a third attention module, in which an attention vector is maintained; the method further includes: forming a total sequence based at least on the concatenation of the vector sequences in the vector sequence set; and using the third attention module to perform third attention processing on the total sequence, the third attention processing including, for each vector element of the total sequence, determining a third weighting factor corresponding to that element according to the similarity between it and the attention vector, and using the third weighting factors to compute a weighted sum of the vector elements of the total sequence, obtaining a third attention representation of the input text.
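For concreteness, the following is a minimal numpy sketch of such third attention processing; the function and variable names are illustrative, and the use of cosine similarity against the maintained attention vector plus softmax normalization is an assumption, since the text only requires a similarity-based third weighting factor.

```python
import numpy as np

def self_attention(X_total, u, eps=1e-12):
    """Third attention processing over the total (concatenated) sequence.

    X_total: (M, h) concatenation of the vector sequences.
    u:       (h,) attention vector maintained in the third attention module.
    Returns the third attention representation of shape (h,).
    """
    Xn = X_total / (np.linalg.norm(X_total, axis=1, keepdims=True) + eps)
    a = Xn @ (u / (np.linalg.norm(u) + eps))   # similarity of each element to u (assumed cosine)
    a_hat = np.exp(a) / np.exp(a).sum()        # normalized third weighting factors
    return a_hat @ X_total                     # weighted sum over the total sequence
```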
When the attention layer includes the first attention module and the third attention module, the characterization vector may be determined according to the first attention representation and the third attention representation.
When the attention layer includes the first, second, and third attention modules, the characterization vector may be determined according to the first attention representation, the second attention representation, and the third attention representation.
Further, in one example, the first, second, and third attention representations may be weighted and summed based on predetermined weight coefficients to obtain the characterization vector.
In one embodiment, the attention layer further includes a fusion module; before the total sequence input to the third attention module is formed, the method further includes: inputting each vector sequence of the vector sequence set into the fusion module for fusion conversion processing, obtaining a fusion sequence for each vector sequence, the fusion conversion processing including, for each vector element of the input vector sequence, determining a label weighting factor for each of the K label vectors according to the similarity between the vector element and that label vector, and, based on the label weighting factors, converting the vector element into a fusion vector that is the weighted sum of the K label vectors, thereby converting the input vector sequence into a corresponding fusion sequence.
Correspondingly, in one embodiment, the respective vector sequences and the respective fusion sequences may be concatenated to obtain the total sequence, which is input into the third attention module.
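A minimal sketch of the fusion conversion processing follows, under the assumption that the label weighting factors come from dot-product similarities normalized by softmax (the text leaves the exact similarity and normalization open); `fuse_with_labels` is an illustrative name.

```python
import numpy as np

def fuse_with_labels(X, L):
    """Fusion conversion: replace each vector element by a weighted sum
    of the K label vectors, weighted by its similarity to each label.

    X: (N, h) input vector sequence; L: (K, h) label vectors.
    Returns the (N, h) fusion sequence.
    """
    A = X @ L.T                                           # (N, K) element-to-label similarities (assumed dot product)
    W = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # label weighting factors per element
    return W @ L                                          # each row is a fusion vector

# The total sequence for the third attention module can then be formed by
# concatenating each vector sequence with its fusion sequence, e.g.:
# X_total = np.vstack([X, fuse_with_labels(X, L)])
```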
According to one implementation, the input text is a training text that has a category label indicating its true category; the method further includes: obtaining a text prediction loss according to the category prediction result and the category label; determining a total prediction loss at least according to the text prediction loss; and updating the text classification model in the direction that reduces the total prediction loss, thereby training the model.
Further, in one embodiment under this implementation, the method also includes: inputting the K label vectors corresponding to the K categories into the classifier to obtain K corresponding prediction results; and comparing each of the K categories with its corresponding prediction result, obtaining a label prediction loss based on the comparison. In this case, the total loss can be determined according to the text prediction loss and the label prediction loss for model training.
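As an illustration of this training objective, the sketch below assumes cross-entropy for both the text prediction loss and the label prediction loss, a simple weighted sum for combining them, and a hypothetical `classifier` callable returning K-way probabilities; none of these specifics are fixed by the text.

```python
import numpy as np

def total_loss(classifier, S, y, L, alpha=1.0):
    """S: characterization vector of one training text; y: its true category id.
    L: (K, h) label vectors; alpha: assumed weight of the label prediction loss."""
    text_loss = -np.log(classifier(S)[y] + 1e-12)          # text prediction loss
    # each label vector, fed to the classifier, should predict its own category
    label_loss = -sum(np.log(classifier(L[k])[k] + 1e-12)
                      for k in range(len(L)))              # label prediction loss
    return text_loss + alpha * label_loss                  # total loss to minimize
```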
According to a second aspect, a device for classification prediction using a text classification model is provided, for predicting, among K predetermined categories, the category corresponding to an input text. The text classification model includes an embedding layer, a convolutional layer, an attention layer, and a classifier, the attention layer including a first attention module. The device includes: a label vector acquisition unit, configured to obtain K label vectors corresponding to the K categories, each label vector being obtained by word embedding of the label description text of the corresponding category; a word sequence acquisition unit, configured to perform word embedding on the input text using the embedding layer to obtain a word vector sequence; a fragment sequence acquisition unit, configured to input the word vector sequence into the convolutional layer, which performs convolution processing on it using several convolution windows corresponding to text fragments of several different lengths, obtaining several fragment vector sequences, where the word vector sequence and the fragment vector sequences form a vector sequence set; a first attention unit, configured to input each vector sequence of the vector sequence set into the first attention module for first attention processing, obtaining a first sequence vector for each vector sequence, the first attention processing including determining, according to the similarity between each vector element of the input vector sequence and the K label vectors, a first weighting factor for each vector element, and using the first weighting factors to compute a weighted sum of the vector elements; a first representation acquisition unit, configured to obtain a first attention representation of the input text according to the respective first sequence vectors; a characterization vector determining unit, configured to determine a characterization vector of the input text at least according to the first attention representation; and a prediction result acquisition unit, configured to input the characterization vector into the classifier to obtain a category prediction result of the input text among the K categories.
According to a third aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to execute the method of the first aspect.
According to a fourth aspect, a computing device is provided, including a memory and a processor; the memory stores executable code, and when the processor executes the executable code, the method of the first aspect is implemented.
According to the method and device provided in the embodiments of this specification, the convolutional layer and the attention layer of the text classification model are used to obtain a characterization vector that jointly reflects text fragments of different lengths and their similarity to the label vectors. Text classification based on this characterization vector therefore takes fuller account of contextual semantic information over different lengths and of the relevance to the label description texts, yielding more accurate category prediction results.
Description of the Drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of a text classification model according to an embodiment disclosed in this specification;
Fig. 2 shows a flowchart of a method for text classification using a text classification model according to an embodiment;
Fig. 3 shows a schematic diagram of performing convolution processing on a word vector sequence in an embodiment;
Fig. 4 shows a schematic diagram of performing first attention processing on an input vector sequence in an embodiment;
Fig. 5 shows a schematic diagram of performing second attention processing on an input vector sequence in an embodiment;
Fig. 6 shows a schematic diagram of performing fusion conversion processing on an input vector sequence in an embodiment;
Fig. 7 shows a schematic diagram of the attention processing of the attention layer in an embodiment;
Fig. 8 shows the method steps further included in the model training stage;
Fig. 9 shows a schematic block diagram of a text classification prediction device according to an embodiment.
Detailed Description
The solutions provided in this specification are described below with reference to the accompanying drawings.
As mentioned above, in many application scenarios such as intelligent customer service robots, the input text needs to be classified accurately. Neural network models of various structures and algorithms have been proposed for text classification tasks; however, some existing models are overly complex, while others are too generic and insufficiently accurate, so shortcomings remain.
Considering the characteristics of text classification tasks, the embodiments of this specification propose a new text classification model, which further improves the classification prediction of text by jointly considering the information of text fragments and the information of the label description texts.
Fig. 1 is a schematic diagram of a text classification model according to an embodiment disclosed in this specification. As shown in Fig. 1, the text classification model includes an embedding layer 11, a convolutional layer 12, an attention layer 13, and a classifier 14.
The embedding layer 11 applies a specific word embedding algorithm to convert each input word into a word vector. Using this embedding layer 11, the label description texts corresponding to the K target categories can be converted into K label vectors in advance. When classifying an input text, the embedding layer 11 performs word embedding on the input text and converts it into a word vector sequence.
The convolutional layer 12 performs convolution processing on the word vector sequence. In the embodiments of this specification, in order to account for the influence of text spans of different lengths on the semantic understanding of the input text, the convolutional layer 12 applies multiple convolution kernels, or convolution windows, of different widths, obtaining multiple fragment vector sequences that represent the input text at the level of text fragments of different lengths.
The attention layer 13 applies an attention mechanism, combined with the label vectors, to process the above vector sequences. In particular, the attention layer 13 may include a first attention module 131 for performing first attention processing on an input vector sequence. The first attention processing includes combining the vector elements of the input vector sequence according to the similarity between each vector element and the aforementioned K label vectors, obtaining a sequence vector corresponding to the input vector sequence. The first attention processing may therefore also be called label attention processing, and the first attention module a co-attention module (with respect to the labels).
Optionally, the attention layer 13 may further include a second attention module 132 and/or a third attention module 133. The second attention module 132, which may be called an intra-attention module, combines the vector elements according to the similarity between each vector element of the input vector sequence and the other vector elements. The third attention module 133, which may be called a self-attention module, combines the vector elements according to the similarity between each vector element of the input vector sequence and an attention vector.
By combining the sequence vectors obtained by the attention modules, the characterization vector of the input text is obtained and input into the classifier 14. The classifier 14 determines the category of the input text based on this characterization vector, realizing classification prediction of the text.
It can thus be seen that the text classification model shown in Fig. 1 has at least the following characteristics. First, the model represents the input text at the level of text fragments of different lengths, obtaining multiple fragment-level vector sequences and thereby better exploiting the semantic information of contexts of different lengths. In addition, for the categories to be classified, unlike conventional techniques that represent a category only by a meaningless label (e.g. a number), the text classification model of this embodiment also performs word embedding on the label description text of each category, obtaining label vector representations that carry semantic information. Moreover, through the co-attention module, the representation of each sequence is computed from the similarity between the elements of the word vector sequence and fragment vector sequences and the label vectors. The final characterization vector of the input text therefore contains the similarity information between the vector sequences at different levels (the word level, and the levels of text fragments of different lengths) and the label vectors, making better use of the contextual information of the input text and of its semantic similarity to the label description texts, thereby improving classification accuracy.
The process of text classification using the above text classification model is described in detail below.
Fig. 2 shows a flowchart of a method for text classification using a text classification model according to an embodiment. It should be understood that the method can be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. As shown in Fig. 2, the text classification process includes at least the following steps.
In step 21, K label vectors corresponding to the K target categories are obtained, where each label vector is obtained by word embedding of the label description text of the corresponding category.
It should be understood that, for a text classification task, the K target categories are predetermined. In conventional techniques, labels are generally used to represent these K categories, for example the numbers 1 to K, category id numbers, or one-hot encodings of the K categories. Generally, the label itself carries no semantic information; it is merely a code representing the category. However, each category often has corresponding description information describing the characteristics of its content, which can serve as the description information for the label, i.e. the label description text. The label description text often contains semantic information related to the corresponding category.
For example, in the automatic question answering scenario of an intelligent customer service robot, the K target categories correspond to K predetermined standard questions. Correspondingly, the label description text of each category is the standard question description text of that category. For instance, the label description text of category 1 is standard question 1 under that category, 'How to repay Huabei', and the label description text of category 2 is standard question 2 under that category, 'How much money can I borrow with Jiebei'.
For another example, in the scenario of automatic dispatch to manual customer service, the classification targets are the K categories corresponding to K predetermined manual customer service skill groups. Correspondingly, the label description text of each category may be a description of the corresponding skill group, including, for example, the knowledge domain of the skill group. In other scenarios, the label description text corresponding to each category can be obtained in a corresponding manner.
By performing word embedding on the label description texts, the label vector corresponding to each category can be obtained. The process of converting the label description text of each category into a label vector may include the following steps.
First, for each category Cj among the K categories, the label description text corresponding to category Cj is obtained, e.g. 'How to repay Huabei'. Then, a specific word embedding algorithm is applied to embed each description word contained in the label description text, obtaining the word vector of each description word. The specific word embedding algorithm may be one from an existing word embedding tool, such as word2vec, or a word embedding algorithm pre-trained for the particular text scene. Assuming the algorithm converts each word into an h-dimensional vector and the label description text contains m words, this step yields m h-dimensional vectors corresponding to the label description text.
Next, the word vectors of the description words are combined to obtain the label vector $l_j$ corresponding to category Cj. Specifically, the m h-dimensional vectors obtained in the previous step are combined, and the resulting h-dimensional vector is used as the label vector $l_j$. More specifically, the combination may be averaging, summation, weighted summation, and so on. When the label description texts contain different numbers of words, the label vector is preferably obtained by averaging.
The above word embedding of the label description texts can be performed by the embedding layer 11 of Fig. 1. In one embodiment, the embedding layer 11 converts the label description texts of the K categories into label vectors in advance and stores the resulting K label vectors in memory for use at prediction time; correspondingly, in step 21, the K pre-stored label vectors are read. In another example, the label description texts of the K categories may instead be input into the embedding layer at prediction time for word embedding, obtaining the label vector of each category.
Thus, in the above manner, the K label vectors corresponding to the K categories are obtained.
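As an illustration, the following minimal sketch (in Python with numpy, which the specification does not prescribe) builds a label vector by averaging word vectors; the `embed` lookup stands in for whatever word embedding algorithm the embedding layer 11 uses.

```python
import numpy as np

def label_vector(description_words, embed):
    """Average the word vectors of one label description text.

    description_words: list of the m description words of a category.
    embed: callable mapping a word to its h-dimensional word vector.
    """
    word_vectors = np.stack([embed(w) for w in description_words])  # (m, h)
    return word_vectors.mean(axis=0)                                # (h,) label vector l_j

# K label vectors, one per category description, stacked as rows:
# L = np.stack([label_vector(words, embed) for words in label_texts])  # (K, h)
```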
In addition, in step 22, the embedding layer 11 is used to perform word embedding on the input text, obtaining a word vector sequence. As mentioned above, the embedding layer 11 applies the aforementioned specific word embedding algorithm to each word in the input text, obtaining the word vector sequence corresponding to the input text. Assuming the input text contains N sequentially arranged words $\{w_1, w_2, \ldots, w_N\}$, the word vector sequence $X_W$ is obtained:
$$X_W = \{x_1^w, x_2^w, \ldots, x_N^w\}$$
where $x_i^w$ denotes the word vector corresponding to the i-th word $w_i$.
It should be understood that steps 21 and 22 can be executed in parallel or in either order, which is not limited here.
Next, in step 23, the above word vector sequence is input into the convolutional layer 12, and convolution processing is performed on it using several convolution kernels, or convolution windows, of different widths. The reason is that, in text classification, context is critical to understanding text semantics; yet for different words in different texts, the helpful contextual semantic information may be hidden in context at different distances from the current word. The inventor therefore proposes representing the input text at the level of text spans of different lengths. Accordingly, in the embodiments of this specification, the convolutional layer 12 applies several convolution windows of different widths, corresponding to text fragments of several different lengths, to the word vector sequence, obtaining several fragment vector sequences.
Specifically, the width W of a convolution window can be expressed as W = 2r + 1, where r is the coverage radius. Performing convolution on the word vector sequence $X_W$ with a window of width W = 2r + 1 may proceed as follows: the position of each word vector $x_i^w$ in the sequence is taken in turn as the current position, and a convolution operation is performed on the word vectors within radius r of the current position, yielding the fragment vector $x_i^s$ of the text fragment corresponding to the current position. The fragment vectors of the successive positions, arranged in order, form a fragment vector sequence.
Fig. 3 shows a schematic diagram of performing convolution processing on a word vector sequence in one embodiment. In the example of Fig. 3, a convolution window of width 5 (radius 2) is used. As shown in Fig. 3, when the word vector $x_i^w$ is the current word, the convolution window covers the five consecutive word vectors centered on it, namely $\{x_{i-2}^w, x_{i-1}^w, x_i^w, x_{i+1}^w, x_{i+2}^w\}$. A convolution operation on these five word vectors yields the fragment vector $x_i^s$ corresponding to position i, where the convolution operation may be a combination of the word vectors defined by an activation function. When the window slides on, taking the word vector $x_{i+1}^w$ as the current word, the convolution operation is applied to the five word vectors centered on $x_{i+1}^w$, yielding the fragment vector $x_{i+1}^s$ corresponding to position i+1. By performing the convolution centered in turn on each of the N word vectors, the fragment vectors of the N positions are obtained, forming the fragment vector sequence $X_S = \{x_1^s, x_2^s, \ldots, x_N^s\}$ corresponding to this convolution window.
The above describes the process of performing convolution on the word vector sequence with a window of one particular width. As mentioned above, in step 23 the convolutional layer uses several convolution windows of different widths. For example, in one concrete example, four convolution windows of widths 3, 5, 9, and 15 are applied to the word vector sequence $X_W$, yielding four fragment vector sequences $X_{S1}, X_{S2}, X_{S3}, X_{S4}$, which represent the input text at the level of text fragments of 3, 5, 9, and 15 words respectively.
In different embodiments, the number of convolution windows used, and the width of each window, can be decided according to factors such as the length of the input text and the lengths of the text fragments to be considered, thereby obtaining several fragment vector sequences.
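The following sketch illustrates one way such a windowed convolution could be realized; the zero-padding at the sequence boundaries, the tanh activation, and the kernel shape are assumptions, since the text only requires an activation-defined combination of the word vectors in each window.

```python
import numpy as np

def fragment_sequence(X_w, r, W_conv, b):
    """Produce a fragment vector sequence with a window of width 2r+1.

    X_w:    (N, h) word vector sequence.
    W_conv: (h, (2r+1)*h) learnable kernel for this window width.
    b:      (h,) bias.
    """
    N, h = X_w.shape
    padded = np.vstack([np.zeros((r, h)), X_w, np.zeros((r, h))])  # assumed zero-padding
    out = np.empty((N, h))
    for i in range(N):
        window = padded[i:i + 2 * r + 1].reshape(-1)  # word vectors within radius r
        out[i] = np.tanh(W_conv @ window + b)         # activation-defined combination
    return out

# e.g. four fragment-level views with window widths 3, 5, 9, 15 (radii 1, 2, 4, 7):
# X_s_list = [fragment_sequence(X_w, r, kernels[r], biases[r]) for r in (1, 2, 4, 7)]
```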
The above word vector sequence $X_W$ and the several fragment vector sequences $X_S$ form a vector sequence set. Every vector sequence in the set contains N h-dimensional vector elements and can simply be denoted uniformly as a vector sequence X.
Then, in step 24, each vector sequence X in the above vector sequence set is input into the first attention module of the attention layer for first attention processing, obtaining a first sequence vector for each vector sequence X. As mentioned above, the first attention module is also called the co-attention module (with respect to the labels); correspondingly, the first attention processing may also be called label attention processing, in which the sequence vector is obtained according to the similarity between the input vector sequence and the label vectors. Specifically, the first attention processing may include: for each vector element $x_i$ in the input vector sequence X, determining the first weighting factor corresponding to $x_i$ according to the similarity between $x_i$ and the K label vectors obtained in step 21, and using the first weighting factors to compute a weighted sum of the vector elements of the input sequence, obtaining the first sequence vector V1(X) corresponding to the input vector sequence X.
In a specific embodiment, the first weighting factor corresponding to a vector element $x_i$ can be determined in the following manner.
First, the similarity $a_{ij}$ between the vector element $x_i$ and each label vector $l_j$ is calculated, for j from 1 to K, yielding K similarities.
In one example, the similarity $a_{ij}$ between the vector element $x_i$ and the label vector $l_j$ can be computed as the cosine similarity, as shown in formula (1):
$$a_{ij} = \frac{x_i^T l_j}{\|x_i\| \, \|l_j\|} \tag{1}$$
where $x_i^T$ denotes the transpose of $x_i$, $\|x_i\|$ denotes the norm (i.e. vector length) of $x_i$, and $\|l_j\|$ denotes the norm of $l_j$.
In another example, the similarity $a_{ij}$ between the vector element $x_i$ and the label vector $l_j$ can be determined based on the Euclidean distance between the two: the larger the distance, the smaller the similarity. In yet another example, the similarity $a_{ij}$ can be determined directly as the dot product (inner product) $x_i^T l_j$ of the vector element $x_i$ and the label vector $l_j$. In further examples, the similarity can also be determined in other ways.
Then, from the K similarities determined between the vector element $x_i$ and the K label vectors, the maximum value can be identified, and the first weighting factor $\hat{a}_i$ corresponding to the vector element $x_i$ determined based on that maximum.
It should be understood here that, as the targets of classification, the contents of the K categories differ considerably from one another; correspondingly, the K label vectors are usually far apart from one another in the vector space. As long as a vector element $x_i$ is highly similar to any one label vector $l_j$, the word or text fragment corresponding to that element is likely to be strongly related to the corresponding category j; the vector element $x_i$ should therefore receive more attention, i.e. be assigned a higher weight. Hence, in the above step, the first weighting factor of a vector element is determined according to the maximum of its similarities.
In one embodiment, the maximum of the K similarities is used directly as the first weighting factor $\hat{a}_i$ corresponding to the vector element $x_i$.
In another embodiment, the maximum of the K similarities corresponding to the vector element $x_i$ is determined as the co-attention score $a_i$ of that element, and, similarly, the co-attention scores of all vector elements of the input sequence are obtained. Then, according to the co-attention scores of the respective vector elements, the co-attention score $a_i$ of the vector element $x_i$ is normalized, obtaining the first weighting factor $\hat{a}_i$ corresponding to that element.
In a specific example, the above normalization is implemented by the softmax function, as shown in formula (2):
$$\hat{a}_i = \frac{\exp(a_i)}{\sum_{j=1}^{N} \exp(a_j)} \tag{2}$$
Having determined the first weighting factor corresponding to each vector element of the input vector sequence X, the first attention module can compute the weighted sum of the vector elements based on the first weighting factors, obtaining the first sequence vector V1(X) of the input vector sequence X:
$$V1(X) = \sum_{i=1}^{N} \hat{a}_i x_i$$
Fig. 4 shows a schematic diagram of performing first attention processing on an input vector sequence in one embodiment. As shown in Fig. 4, with the N vector elements of the input vector sequence as rows and the K label vectors as columns, the similarity between each vector element $x_i$ and each label vector $l_j$ is calculated, forming an N*K similarity matrix called the label attention matrix. A max-pooling operation is applied to the label attention matrix, that is, for each vector element the maximum of its K similarities is selected, yielding the co-attention score of each vector element; the weighting factors are then obtained from the co-attention scores, and the weighted sum of the vector elements based on these factors gives the first sequence vector representation V1 of the input vector sequence.
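Putting formula (1), the max-pooling of Fig. 4, and the softmax of formula (2) together, a minimal numpy sketch of the first attention processing for one vector sequence might look as follows (function and variable names are illustrative):

```python
import numpy as np

def label_co_attention(X, L, eps=1e-12):
    """First attention processing for one vector sequence.

    X: (N, h) input vector sequence (word-level or fragment-level).
    L: (K, h) label vectors.
    Returns the first sequence vector V1(X) of shape (h,).
    """
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    Ln = L / (np.linalg.norm(L, axis=1, keepdims=True) + eps)
    A = Xn @ Ln.T                        # (N, K) label attention matrix, formula (1)
    a = A.max(axis=1)                    # max-pooling over labels: co-attention scores
    a_hat = np.exp(a) / np.exp(a).sum()  # softmax normalization, formula (2)
    return a_hat @ X                     # weighted sum of the vector elements
```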
By applying the above first attention processing to each vector sequence X in the aforementioned vector sequence set, the corresponding first sequence vectors are obtained respectively: the word vector sequence $X_W$ yields the corresponding first sequence vector $V1(X_W)$, and the several fragment vector sequences $X_S$ yield the corresponding first sequence vectors $V1(X_S)$.
Then, in step 25, the first attention representation $S_{label}$ of the input text is obtained according to the first sequence vectors of the above vector sequences. Specifically, the first sequence vectors, including $V1(X_W)$ and the several $V1(X_S)$, can be combined, where the combination may be summation, weighted summation, averaging, and so on, yielding the first attention representation $S_{label}$.
Then, in step 26, the characterization vector S of the input text is determined at least according to the above first attention representation $S_{label}$. In one example, the first attention representation is used directly as the characterization vector S.
Next, in step 27, the characterization vector S is input into the classifier 14, and the category prediction result of the input text among the K categories is obtained through the classifier's computation.
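Continuing the sketch above, steps 25 through 27 can be illustrated as follows, assuming averaging as the combination in step 25 and a linear softmax classifier, neither of which the text fixes:

```python
import numpy as np

def predict(sequences, L, W_cls, b_cls):
    """sequences: list of (N, h) vector sequences (word- and fragment-level).
    L: (K, h) label vectors; W_cls: (K, h) and b_cls: (K,) classifier parameters.
    Reuses label_co_attention from the sketch above."""
    V1s = [label_co_attention(X, L) for X in sequences]  # step 24, per sequence
    S = np.mean(V1s, axis=0)          # step 25: first attention representation S_label
    logits = W_cls @ S + b_cls        # steps 26/27: S as characterization vector, classified
    return np.exp(logits) / np.exp(logits).sum()  # K-way category prediction
```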
As the above process shows, by means of the convolutional layer and the first attention module, the characterization vector combines the semantic information of text fragments of different lengths with the similarity information relative to the label vectors. Text classification based on the characterization vector therefore takes fuller account of contextual semantic information over different lengths and of the relevance to the label description texts, yielding more accurate category prediction results.
According to one implementation, as shown by the dashed boxes in Fig. 1, the attention layer 13 may further include a second attention module 132 and/or a third attention module 133. The processing performed by the second and third attention modules is described below.
As mentioned above, the second attention module 132 is also called the intra-attention module, which combines the vector elements according to the similarity between each vector element of the input vector sequence and the other vector elements.
Specifically, when a vector sequence X is input into the second attention module 132, the module 132 performs second attention processing, also called intra-attention processing, on the input vector sequence X. The intra-attention processing specifically includes: for each vector element $x_i$ in the input vector sequence X, determining the second weighting factor corresponding to $x_i$ according to the similarity between that element and each other vector element $x_j$ of the input vector sequence X, and using the second weighting factors to compute a weighted sum of the vector elements of the input sequence, obtaining the second sequence vector V2(X) corresponding to the input vector sequence X.
In a specific embodiment, the second weighting factor corresponding to a vector element $x_i$ can be determined in the following manner.
First, the respective similarities $a_{ij}$ between the vector element $x_i$ and each other vector element $x_j$ are calculated. The similarity can be computed as the cosine similarity, or determined in other ways such as from the vector distance or the dot product, which is not repeated here.
Then, based on the average of the above similarities, the second weighting factor $\hat{a}_i$ corresponding to the vector element $x_i$ is determined.
It should be understood here that the second weighting factor is intended to measure how relevant a vector element is to the overall semantics of the entire vector sequence. If a vector element $x_i$ has relatively high similarity to the other vector elements of the sequence, the word or text fragment corresponding to that element is strongly related to the core semantics of the entire sequence; the vector element $x_i$ should therefore receive more attention, i.e. be assigned a higher weight. Moreover, in practical computation, for convenience, the N similarities between each vector element $x_i$ and the N vector elements of the sequence are computed, including, when j = i, the similarity of the element to itself, and this self-similarity is a constant corresponding to the maximum possible similarity. Therefore, when determining the second weighting factor, it is preferable to use the average of the similarities rather than the maximum.
In one embodiment, the mean of the above similarities is directly used as the second weight factor corresponding to the vector element x_i:

β_i = (1/N) Σ_{j=1..N} a_ij

where N is the number of vector elements in the input sequence.
In another embodiment, the mean similarity of the vector element x_i is taken as its intra-attention score a_i, and the intra-attention scores of the vector elements are then normalized, for example with a softmax function, to obtain the second weight factor corresponding to x_i:

β_i = exp(a_i) / Σ_{k=1..N} exp(a_k)
After the second weight factor of each vector element in the input vector sequence X has been determined, the second attention module can compute a weighted sum of the vector elements based on the second weight factors, obtaining the second sequence vector V2(X) of the input vector sequence X, namely:

V2(X) = Σ_{i=1..N} β_i x_i
FIG. 5 is a schematic diagram of second attention processing of an input vector sequence in one embodiment. As shown in FIG. 5, the N vector elements of the input sequence are arranged as rows and columns, and the pairwise similarities between vector elements x_i and x_j are computed, forming an N×N similarity matrix, called the intra-attention matrix. An average pooling operation is applied to the intra-attention matrix, that is, for each vector element the average of its column of similarity values is computed, yielding the intra-attention score of each vector element. Weight factors are then derived from the intra-attention scores, and a weighted sum of the vector elements is computed based on the weight factors, giving the second sequence vector representation V2 of the input vector sequence.
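For illustration, the following is a minimal sketch of this intra-attention computation, assuming the cosine-similarity and softmax-normalization variants described above; all function and variable names are illustrative, not part of the patent.

```python
import numpy as np

def intra_attention(X):
    """Second attention (intra-attention) sketch.

    X: (N, h) array of N vector elements of dimension h.
    Returns the second sequence vector V2(X) of shape (h,).
    """
    # Pairwise cosine similarities -> N x N intra-attention matrix
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    A = Xn @ Xn.T                                 # A[i, j] = a_ij
    # Average pooling: intra-attention score of each element
    scores = A.mean(axis=1)                       # a_i
    # Softmax normalization -> second weight factors beta_i
    w = np.exp(scores) / np.exp(scores).sum()
    # Weighted sum of the vector elements -> V2(X)
    return w @ X
```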
Each vector sequence X in the aforementioned vector sequence set can be input to the second attention module 132 for the above intra-attention processing, so as to obtain its corresponding second sequence vector V2(X), including V2(X_W) corresponding to the word vector sequence X_W and several second sequence vectors V2(X_S) corresponding to the several fragment vector sequences X_S.
Then, the second sequence vectors V2(X) corresponding to the above vector sequences can be combined to obtain the second attention representation S_intra of the input text.
Thus, in the case where the attention layer includes the first attention module 131 and the second attention module 132, step 26 of determining the characterization vector S in FIG. 2 may include determining the characterization vector S based on the first attention representation S_label and the second attention representation S_intra. Specifically, S_label and S_intra can be combined in various ways, such as summation, weighted summation, or averaging, to obtain the characterization vector S.
According to one embodiment, the attention layer 13 may further include a third attention module 133. The third attention module 133 may be referred to as a self-attention module, and is used to perform self-attention processing, that is, to synthesize the vector elements of an input sequence according to the similarity between each vector element and an attention vector.
Specifically, the self-attention module 133 maintains an attention vector v, which has the same dimension h as the vectors obtained by word embedding. The parameters of the attention vector v can be determined through training.
In addition, unlike the first and second attention modules, which process each vector sequence in the aforementioned set separately, the third attention module 133 processes a total sequence X' formed from the vector sequences in the set. In one embodiment, the total sequence X' can be formed by concatenating the vector sequences in the set one after another, that is, X' = X_W X_S1 X_S2 ….
The third attention module 133 then performs third attention processing, that is, self-attention processing, on the total sequence X'. This processing includes: for each vector element x_i in the total sequence X', determining a third weight factor corresponding to x_i according to the similarity between x_i and the attention vector v, and using the third weight factors to compute a weighted sum of the vector elements of the total sequence, obtaining a third attention representation of the input text.
In a specific embodiment, the third weight factor corresponding to a vector element x_i can be determined as follows.
First, the similarity a_i between the vector element x_i and the attention vector v is computed as its self-attention score. As before, the similarity can be a cosine similarity, or can be determined based on a vector distance, a dot product, or other measures, which are not repeated here.
Then, based on this self-attention score, the third weight factor β_i corresponding to the vector element x_i is determined.

In one embodiment, the self-attention score is directly used as the third weight factor corresponding to the vector element x_i:

β_i = a_i

In another embodiment, the third weight factor corresponding to x_i is obtained by normalizing the self-attention scores of all vector elements, for example:

β_i = exp(a_i) / Σ_{j=1..M} exp(a_j)
In a specific example, the similarity between the vector element x_i and the attention vector v is computed as a dot product, and the normalization uses a softmax function, giving the following third weight factor:

β_i = exp(v^T x_i) / Σ_{j=1..M} exp(v^T x_j)

where v^T is the transpose of the attention vector v, and M is the number of vector elements contained in the total sequence X'.
After the third weight factor of each vector element in the total sequence X' has been determined, the third attention module can compute a weighted sum of the vector elements based on the third weight factors. Since the total sequence already contains the information of all the vector sequences, the result of processing it can directly serve as the third attention representation S_self of the input text, namely:

S_self = Σ_{i=1..M} β_i x_i
As described above, the third attention module 133 performs self-attention processing on the total sequence X' formed by concatenating the vector sequences, obtaining the third attention representation.
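A corresponding sketch of this self-attention processing, assuming the dot-product similarity and softmax normalization of the example above (names are illustrative):

```python
import numpy as np

def self_attention(X_total, v):
    """Third attention (self-attention) sketch.

    X_total: (M, h) array, the total sequence X' of M vector elements.
    v: (h,) attention vector, learned during training.
    Returns the third attention representation S_self of shape (h,).
    """
    scores = X_total @ v                          # a_i = v^T x_i
    w = np.exp(scores) / np.exp(scores).sum()     # third weight factors
    return w @ X_total                            # S_self
```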
Further, in one embodiment, each vector sequence can also undergo fusion conversion to obtain a corresponding fusion sequence, and the fusion sequences can be concatenated with the vector sequences to form a more comprehensive total sequence X'.
In this embodiment, the attention layer 13 further includes a fusion module for performing fusion conversion processing on an input vector sequence X, converting it into a corresponding fusion sequence Q. The fusion conversion processing may include: for each vector element x_i in the input vector sequence X, determining, according to the similarity between x_i and each label vector l_j among the aforementioned K label vectors, a label weight factor corresponding to each label vector l_j, and converting x_i, based on the label weight factors, into a fusion vector q_i that is a weighted sum of the K label vectors, thereby converting the input vector sequence X into a corresponding fusion sequence Q.
In a specific embodiment, converting a vector element x_i into its corresponding fusion vector q_i can be performed as follows.
First, the similarities a_ij between the vector element x_i and the label vectors l_j are computed, where j ranges from 1 to K. The similarity can be computed as in formula (1), for example, or determined based on a vector distance, a dot product, or other measures, which are not repeated here.
Then, according to the similarities a_ij between the vector element x_i and the label vectors l_j, the label weight factor β_j corresponding to each label vector l_j is determined.

In one example, the similarity a_ij is directly used as the label weight factor β_j corresponding to the label vector l_j. In another embodiment, a_ij is normalized over the similarities between the vector element x_i and all the label vectors, giving the label weight factor β_j corresponding to l_j. For example, the label weight factor can be determined by the following formula:

β_j = exp(a_ij) / Σ_{k=1..K} exp(a_ik)
After the label weight factors β_j of the label vectors l_j have been determined for the vector element x_i, a weighted sum of the label vectors can be computed based on these factors, converting the vector element x_i into the fusion vector q_i:

q_i = Σ_{j=1..K} β_j l_j
FIG. 6 is a schematic diagram of fusion conversion processing of an input vector sequence in one embodiment. As shown in FIG. 6, with the N vector elements of the input vector sequence X on one axis and the K label vectors on the other, the similarity between each vector element x_i and each label vector l_j is computed, forming a similarity matrix. For each vector element x_i, the label weight factor of each label vector is determined from the similarities associated with that element in the matrix, and a weighted sum of the label vectors is computed based on the label weight factors, giving the fusion vector q_i corresponding to x_i.
It can be understood that by converting each vector element x_i in the input vector sequence X into its corresponding fusion vector q_i, the vector sequence X is converted into a fusion sequence Q. Further, by inputting each vector sequence in the aforementioned vector sequence set into the fusion module, the corresponding fusion sequences can be obtained, for example, the fusion sequence Q_W corresponding to the word vector sequence X_W and the fusion sequences Q_S corresponding to the fragment vector sequences X_S.
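A sketch of the fusion conversion, assuming dot-product similarity and softmax-normalized label weight factors consistent with the example formula above (names are illustrative):

```python
import numpy as np

def fuse_with_labels(X, L):
    """Fusion conversion sketch: vector sequence X -> fusion sequence Q.

    X: (N, h) input vector sequence.
    L: (K, h) label vectors.
    Returns Q of shape (N, h); row i is the fusion vector q_i,
    a weighted sum of the K label vectors.
    """
    A = X @ L.T                                           # a_ij
    W = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # beta_j per element
    return W @ L                                          # Q
```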
In one embodiment, the original vector sequences (X_W, X_S1, X_S2, …) and the fusion sequences obtained as above (Q_W, Q_S1, Q_S2, …) can be concatenated to obtain the total sequence X'. The third attention module 133 then processes this total sequence X' to obtain the third attention representation S_self.
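Building on the sketches above, the total sequence and the third attention representation could then be assembled as follows; all names here are hypothetical placeholders for the sequences and vectors described above.

```python
import numpy as np

# X_W, X_S1, X_S2: original word/fragment vector sequences, each (n_i, h)
# L: (K, h) label vectors; v: (h,) attention vector (as in earlier sketches)
Q_W, Q_S1, Q_S2 = (fuse_with_labels(X, L) for X in (X_W, X_S1, X_S2))
X_total = np.concatenate([X_W, X_S1, X_S2, Q_W, Q_S1, Q_S2], axis=0)
S_self = self_attention(X_total, v)   # third attention representation
```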
It can be understood that in the case where the attention layer includes the first attention module 131 and the third attention module 133, step 26 of determining the characterization vector S in FIG. 2 may include determining the characterization vector S based on the first attention representation S_label and the third attention representation S_self. Specifically, S_label and S_self can be combined in various ways to obtain the characterization vector S.
In the case where the attention layer includes all three of the first attention module 131, the second attention module 132, and the third attention module 133, step 26 of determining the characterization vector S in FIG. 2 may include determining the characterization vector S based on the first attention representation S_label, the second attention representation S_intra, and the third attention representation S_self. Specifically, the first, second, and third attention representations can be weighted and summed based on predetermined weight coefficients to obtain the characterization vector S, as shown in the following formula:

S = ω_1 S_label + ω_2 S_intra + ω_3 S_self      (9)

where ω_1, ω_2, and ω_3 are weight coefficients, which may be preset hyperparameters.
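For concreteness, equation (9) amounts to the following one-liner; the coefficient values are assumed for illustration only, since in practice they are preset hyperparameters.

```python
# Assumed hyperparameter values, for illustration only.
w1, w2, w3 = 0.4, 0.3, 0.3
S = w1 * S_label + w2 * S_intra + w3 * S_self   # characterization vector
```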
FIG. 7 is a schematic diagram of the attention processing of the attention layer in one embodiment, showing the inputs and outputs of each attention module in the case where the attention layer includes the first, second, and third attention modules.
As shown in the figure, the input of the first attention module includes the vector sequence set composed of the word vector sequence X_W and the fragment vector sequences X_S, together with the K label vectors. For each vector sequence X in the set, the first attention module obtains the first sequence vector of that sequence according to the similarities between its vector elements and the K label vectors. By combining the first sequence vectors, the first attention representation S_label of the input text is obtained.
The input of the second attention module includes the aforementioned vector sequence set. For each vector sequence X in the set, the second attention module obtains the second sequence vector of that sequence according to the similarities between its vector elements. By combining the second sequence vectors, the second attention representation S_intra of the input text is obtained.
The input of the fusion module includes the aforementioned vector sequence set and the K label vectors. The fusion module converts each vector sequence X in the set into a fusion sequence Q through fusion conversion processing, and outputs the fusion sequences corresponding to the vector sequences in the set.
The input of the third attention module is the total sequence formed from the vector sequences in the aforementioned set together with the fusion sequences. The third attention module performs self-attention processing on this total sequence, obtaining the third attention representation S_self of the input text.
The final characterization vector of the input text can be obtained by combining the outputs of the first, second, and third attention modules.
On the basis of FIG. 1 and FIG. 2, the classification prediction process for input text has been described above, both for the case where the attention layer includes the first attention module and for the cases where it further includes the second attention module and/or the third attention module. It should be understood that this classification prediction process is applicable both in the training phase of the text classification model and in the use phase after training is completed.
In the training phase of the text classification model, the text input to the model is a training text, which has a category label y indicating its true category. In the training phase, after the category prediction result y' of the training text is obtained through the method steps of FIG. 2, the model further needs to be trained based on this category prediction result. The training process is shown in FIG. 8.
Specifically, FIG. 8 shows the method steps further included in the model training phase. As shown in FIG. 8, in step 81, a text prediction loss L_text is obtained according to the category prediction result y' for the training text and the category label y of the training text.
It can be understood that the category prediction result y' is obtained by the classifier 14 applying a predetermined classification function to the characterization vector S of the input text. The category prediction result can therefore be expressed as:

y' = f_c(S)      (10)

where f_c is the classification function. Generally, the category prediction result y' includes the predicted probabilities that the current training text belongs to each of the K predetermined categories. The text prediction loss L_text can then be obtained through a loss function in cross-entropy form, based on the probability distribution indicated by y' and the true category indicated by the label y. In other embodiments, other known forms of loss function can also be used to obtain the text prediction loss L_text.
In step 82, a total prediction loss L is determined at least according to the above text prediction loss L_text. In one example, the text prediction loss is taken directly as the total prediction loss L.
Next, in step 83, the text classification model is updated in the direction that reduces the total prediction loss L. Specifically, gradient descent, back propagation, or similar methods can be used to adjust the model parameters of the text classification model so that the total prediction loss L decreases until a predetermined convergence condition is reached, thereby training the model.
Further, in one embodiment, the aforementioned K label vectors are used once more when computing the total prediction loss. Specifically, the K label vectors l_j (j from 1 to K) corresponding to the K categories can be input to the classifier 14 respectively, so that the classifier 14 performs classification prediction based on each input label vector, obtaining K corresponding label prediction results, where the label prediction result y''_j corresponding to the label vector l_j can be expressed as:

y''_j = f_c(l_j)      (11)
Then, the K categories are compared with their corresponding label prediction results, and a label prediction loss L_label is obtained based on the comparison results. Specifically, for each category, a cross-entropy loss function can be used to obtain the label prediction loss under that category, and the label prediction losses of the categories are then summed to obtain the total label prediction loss L_label.
In the case where the label prediction loss is obtained using the label vectors, step 82 of determining the total loss in FIG. 8 may include determining the total loss L according to the text prediction loss L_text and the label prediction loss L_label. Specifically, in one embodiment, the total loss L may be determined as:

L = L_text + γ L_label      (12)

where γ is a hyperparameter.
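As a sketch, the training objective of equations (10) to (12) might be computed as follows, assuming cross-entropy for both loss terms and an assumed value of γ; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def total_loss(text_logits, y, label_logits, gamma=0.1):
    """Total prediction loss L = L_text + gamma * L_label (equation (12)).

    text_logits: (K,) classifier output f_c(S) for the training text.
    y: () long tensor, true category index of the training text.
    label_logits: (K, K) classifier outputs f_c(l_j) for the K label
                  vectors; label vector l_j should be predicted as class j.
    gamma: hyperparameter; the value here is assumed for illustration.
    """
    L_text = F.cross_entropy(text_logits.unsqueeze(0), y.unsqueeze(0))
    targets = torch.arange(label_logits.size(0))   # class j for label l_j
    L_label = F.cross_entropy(label_logits, targets)
    return L_text + gamma * L_label
```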
By introducing into the total loss a label prediction loss determined based on the label vectors, the classifier can be trained better and in a more targeted manner.
After the text classification model has been trained using a large number of training texts, the model can be used to perform classification prediction on input texts of unknown category. As described above, since the classification prediction model fuses semantic information at the level of text fragments of different lengths with the semantic information of the label description texts, it can classify text with higher accuracy.
According to an embodiment of another aspect, an apparatus for classification prediction using a text classification model is provided. The apparatus is used to predict, among K predetermined categories, the category corresponding to an input text. The text classification model used includes an embedding layer, a convolutional layer, an attention layer, and a classifier, the attention layer further including a first attention module, as shown in FIG. 1. The above classification prediction apparatus can be deployed in any device, platform, or device cluster having computing and processing capabilities. FIG. 9 is a schematic block diagram of a text classification prediction apparatus according to one embodiment. As shown in FIG. 9, the prediction apparatus 900 includes the following units.
The label vector obtaining unit 901 is configured to obtain K label vectors respectively corresponding to the K categories, where each label vector is obtained by performing word embedding on the label description text of the corresponding category;

the word sequence obtaining unit 902 is configured to perform word embedding on the input text using the embedding layer, to obtain a word vector sequence;

the fragment sequence obtaining unit 903 is configured to input the word vector sequence into the convolutional layer, where the convolutional layer performs convolution processing on the word vector sequence using several convolution windows corresponding to text fragments of several different lengths, obtaining several fragment vector sequences; the word vector sequence and the several fragment vector sequences constitute a vector sequence set;

the first attention unit 904 is configured to input each vector sequence in the vector sequence set into the first attention module for first attention processing, obtaining a first sequence vector corresponding to each vector sequence; the first attention processing includes determining, according to the similarity between each vector element in the input vector sequence and the K label vectors, a first weight factor corresponding to each vector element, and computing a weighted sum of the vector elements using the first weight factors;

the first representation obtaining unit 905 is configured to obtain the first attention representation of the input text according to the first sequence vectors;

the characterization vector determining unit 906 is configured to determine the characterization vector of the input text at least according to the first attention representation;

the prediction result obtaining unit 907 is configured to input the characterization vector into the classifier, to obtain the category prediction result of the input text among the K categories.
In one embodiment, the input text is a user question; correspondingly, the label description text corresponding to each of the K categories includes a standard question description text.
In one example, the label vector obtaining unit 901 is configured to predetermine the K label vectors in the following manner: for each of the K categories, obtaining the label description text corresponding to the category; performing word embedding on the label description text to obtain the word vectors of the description words contained in it; and combining the word vectors of the description words to obtain the label vector corresponding to the category.
According to one embodiment, in the first attention processing involved in the first attention unit 904, the first weight factor corresponding to each vector element is determined in the following manner: for each vector element in the input vector sequence, computing K similarities between the vector element and the K label vectors; and determining the first weight factor corresponding to the vector element based on the maximum of the K similarities.
Further, the K similarities between the vector element and the K label vectors can be computed in the following manner: computing the cosine similarity between the vector element and each label vector; or determining the similarity based on the Euclidean distance between the vector element and each label vector; or determining the similarity based on the dot product of the vector element and each label vector.
In one example, determining the first weight factor corresponding to the vector element based on the maximum of the K similarities may include: determining the mutual attention score of the vector element based on the maximum of the K similarities; and normalizing the mutual attention score of the vector element according to the mutual attention scores of the vector elements, to obtain the first weight factor corresponding to the vector element.
According to one embodiment, the first attention representation of the input text is obtained by combining the first sequence vectors, where the combining includes one of the following: summation, weighted summation, or averaging.
According to one embodiment, the attention layer of the text classification model further includes a second attention module. Correspondingly, the apparatus 900 further includes (not shown in the figure) a second attention unit and a second representation obtaining unit. The second attention unit is configured to input each vector sequence in the vector sequence set into the second attention module for second attention processing, obtaining a second sequence vector corresponding to each vector sequence; the second attention processing includes, for each vector element in the input vector sequence, determining a second weight factor corresponding to the vector element according to the similarity between the vector element and each other vector element in the input vector sequence, and computing a weighted sum of the vector elements of the input sequence using the second weight factors. The second representation obtaining unit is configured to obtain the second attention representation of the input text according to the second sequence vectors.
In this case, the characterization vector determining unit 906 in FIG. 9 is configured to determine the characterization vector according to the first attention representation and the second attention representation.
More specifically, in the second attention processing involved in the second attention unit, the second weight factor corresponding to a vector element can be determined by computing the similarities between the vector element and each of the other vector elements, and determining the second weight factor corresponding to the vector element based on the average of these similarities.
According to another embodiment, the attention layer further includes a third attention module in which an attention vector is maintained. Correspondingly, the apparatus 900 further includes (not shown in the figure) a total sequence forming unit and a third attention unit. The total sequence forming unit is configured to form a total sequence at least based on the concatenation of the vector sequences in the vector sequence set. The third attention unit is configured to perform third attention processing on the total sequence using the third attention module; the third attention processing includes, for each vector element in the total sequence, determining a third weight factor corresponding to the vector element according to the similarity between the vector element and the attention vector, and computing a weighted sum of the vector elements of the total sequence using the third weight factors, to obtain the third attention representation of the input text.
In the case where the attention layer includes the first attention module and the third attention module, the aforementioned characterization vector determining unit 906 is configured to determine the characterization vector according to the first attention representation and the third attention representation.
In the case where the attention layer includes the first attention module, the second attention module, and the third attention module, the aforementioned characterization vector determining unit 906 is configured to determine the characterization vector according to the first attention representation, the second attention representation, and the third attention representation.
Specifically, in one example, the characterization vector determining unit 906 may compute a weighted sum of the first attention representation, the second attention representation, and the third attention representation based on predetermined weight coefficients, to obtain the characterization vector.
In one embodiment, the attention layer further includes a fusion module. Correspondingly, the apparatus 900 further includes a fusion unit (not shown), configured to input each vector sequence in the vector sequence set into the fusion module for fusion conversion processing, obtaining a fusion sequence corresponding to each vector sequence. The fusion conversion processing includes, for each vector element in the input vector sequence, determining, according to the similarity between the vector element and each of the K label vectors, a label weight factor corresponding to each label vector, and converting the vector element, based on the label weight factors, into a fusion vector that is a weighted sum of the K label vectors, thereby converting the input vector sequence into a corresponding fusion sequence.
In this case, the total sequence forming unit may be configured to concatenate the vector sequences and the fusion sequences to obtain the total sequence.
In one embodiment, the input text is a training text having a category label indicating its true category. The apparatus 900 further includes a training unit (not shown), configured to obtain a text prediction loss according to the category prediction result and the category label; determine a total prediction loss at least according to the text prediction loss; and update the text classification model in a direction that reduces the total prediction loss.
In yet another embodiment, the training unit is further configured to: input the K label vectors corresponding to the K categories into the classifier respectively, obtaining K corresponding prediction results; compare the K categories with their corresponding prediction results, obtaining a label prediction loss based on the comparison results; and determine the total prediction loss according to the text prediction loss and the label prediction loss.
In this way, the above apparatus uses the text classification model to achieve accurate classification of input text.
According to an embodiment of another aspect, a computer-readable storage medium is further provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method described in conjunction with FIG. 2.
According to an embodiment of yet another aspect, a computing device is further provided, including a memory and a processor; executable code is stored in the memory, and when the processor executes the executable code, the method described in conjunction with FIG. 2 is implemented.
Those skilled in the art should be aware that, in one or more of the above examples, the functions described in this application can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific implementations described above further explain the purpose, technical solutions, and beneficial effects of this application in detail. It should be understood that the above are only specific implementations of this application and are not intended to limit its scope of protection; any modification, equivalent replacement, improvement, or the like made on the basis of the technical solutions of this application shall fall within its scope of protection.

Claims (18)

  1. A method for classification prediction using a text classification model, for predicting, among K predetermined categories, the category corresponding to an input text, the text classification model comprising an embedding layer, a convolutional layer, an attention layer, and a classifier, the attention layer comprising a first attention module, the method comprising:
    obtaining K label vectors respectively corresponding to the K categories, wherein each label vector is obtained by performing word embedding on the label description text of the corresponding category;
    performing word embedding on the input text using the embedding layer, to obtain a word vector sequence;
    inputting the word vector sequence into the convolutional layer, the convolutional layer performing convolution processing on the word vector sequence using several convolution windows corresponding to text fragments of several different lengths, to obtain several fragment vector sequences, the word vector sequence and the several fragment vector sequences constituting a vector sequence set;
    inputting each vector sequence in the vector sequence set into the first attention module for first attention processing, to obtain a first sequence vector corresponding to each vector sequence, wherein the first attention processing comprises determining, according to the similarity between each vector element in the input vector sequence and the K label vectors, a first weight factor corresponding to each vector element, and computing a weighted sum of the vector elements using the first weight factors;
    obtaining a first attention representation of the input text according to the first sequence vectors;
    determining a characterization vector of the input text at least according to the first attention representation; and
    inputting the characterization vector into the classifier, to obtain a category prediction result of the input text among the K categories.
  2. The method according to claim 1, wherein the input text is a user question, and the label description text corresponding to each of the K categories comprises a standard question description text.
  3. The method according to claim 1 or 2, wherein the K label vectors are predetermined by:
    for each of the K categories, obtaining the label description text corresponding to the category;
    performing word embedding on the label description text, to obtain word vectors of the description words contained in the label description text; and
    combining the word vectors of the description words, to obtain the label vector corresponding to the category.
  4. The method according to claim 1, wherein determining the first weight factor corresponding to each vector element according to the similarity between each vector element in the input vector sequence and the K label vectors comprises:
    for each vector element in the input vector sequence, computing K similarities between the vector element and the K label vectors; and
    determining the first weight factor corresponding to the vector element based on the maximum of the K similarities.
  5. The method according to claim 4, wherein computing the K similarities between the vector element and the K label vectors comprises:
    computing the cosine similarity between the vector element and each label vector; or
    determining the similarity based on the Euclidean distance between the vector element and each label vector; or
    determining the similarity based on the dot product of the vector element and each label vector.
  6. The method according to claim 4, wherein determining the first weight factor corresponding to the vector element based on the maximum of the K similarities comprises:
    determining a mutual attention score of the vector element based on the maximum of the K similarities; and
    normalizing the mutual attention score of the vector element according to the mutual attention scores of the vector elements, to obtain the first weight factor corresponding to the vector element.
  7. The method according to claim 1, wherein obtaining the first attention representation of the input text according to the first sequence vectors comprises:
    combining the first sequence vectors to obtain the first attention representation, the combining comprising one of the following: summation, weighted summation, and averaging.
  8. The method according to claim 1, wherein the attention layer further comprises a second attention module, and the method further comprises:
    inputting each vector sequence in the vector sequence set into the second attention module for second attention processing, to obtain a second sequence vector corresponding to each vector sequence, wherein the second attention processing comprises, for each vector element in the input vector sequence, determining a second weight factor corresponding to the vector element according to the similarity between the vector element and each other vector element in the input vector sequence, and computing a weighted sum of the vector elements of the input sequence using the second weight factors; and
    obtaining a second attention representation of the input text according to the second sequence vectors;
    wherein determining the characterization vector of the input text at least according to the first attention representation comprises determining the characterization vector according to the first attention representation and the second attention representation.
  9. The method according to claim 8, wherein determining the second weight factor corresponding to the vector element according to the similarity between the vector element and each other vector element in the input vector sequence comprises:
    computing the similarities between the vector element and each of the other vector elements; and
    determining the second weight factor corresponding to the vector element based on the average of these similarities.
  10. The method according to claim 1, wherein the attention layer further comprises a third attention module in which an attention vector is maintained, and the method further comprises:
    forming a total sequence at least based on the concatenation of the vector sequences in the vector sequence set;
    performing third attention processing on the total sequence using the third attention module, the third attention processing comprising, for each vector element in the total sequence, determining a third weight factor corresponding to the vector element according to the similarity between the vector element and the attention vector, and computing a weighted sum of the vector elements of the total sequence using the third weight factors, to obtain a third attention representation of the input text;
    wherein determining the characterization vector of the input text at least according to the first attention representation comprises determining the characterization vector according to the first attention representation and the third attention representation.
  11. The method according to claim 8, wherein the attention layer further comprises a third attention module in which an attention vector is maintained, and the method further comprises:
    forming a total sequence at least based on the concatenation of the vector sequences in the vector sequence set;
    performing third attention processing on the total sequence using the third attention module, the third attention processing comprising, for each vector element in the total sequence, determining a third weight factor corresponding to the vector element according to the similarity between the vector element and the attention vector, and computing a weighted sum of the vector elements of the total sequence using the third weight factors, to obtain a third attention representation of the input text;
    wherein determining the characterization vector of the input text at least according to the first attention representation comprises determining the characterization vector according to the first attention representation, the second attention representation, and the third attention representation.
  12. The method according to claim 10 or 11, wherein the attention layer further comprises a fusion module, and before the forming of the total sequence, the method further comprises:
    inputting each vector sequence in the vector sequence set into the fusion module for fusion conversion processing, to obtain a fusion sequence corresponding to each vector sequence, wherein the fusion conversion processing comprises, for each vector element in the input vector sequence, determining, according to the similarity between the vector element and each of the K label vectors, a label weight factor corresponding to each label vector, and converting the vector element, based on the label weight factors, into a fusion vector that is a weighted sum of the K label vectors, thereby converting the input vector sequence into a corresponding fusion sequence;
    wherein the forming of the total sequence comprises concatenating the vector sequences and the fusion sequences to obtain the total sequence.
  13. The method according to claim 11, wherein determining the characterization vector comprises:
    computing a weighted sum of the first attention representation, the second attention representation, and the third attention representation based on predetermined weight coefficients, to obtain the characterization vector.
  14. The method according to claim 1, wherein the input text is a training text having a category label indicating its true category, and the method further comprises:
    obtaining a text prediction loss according to the category prediction result and the category label;
    determining a total prediction loss at least according to the text prediction loss; and
    updating the text classification model in a direction that reduces the total prediction loss.
  15. The method according to claim 14, further comprising:
    inputting the K label vectors corresponding to the K categories into the classifier respectively, to obtain K corresponding prediction results; and
    comparing the K categories with their corresponding prediction results, and obtaining a label prediction loss based on the comparison results;
    wherein determining the total prediction loss comprises determining the total prediction loss according to the text prediction loss and the label prediction loss.
  16. An apparatus for classification prediction using a text classification model, for predicting, among K predetermined categories, the category corresponding to an input text, the text classification model comprising an embedding layer, a convolutional layer, an attention layer, and a classifier, the attention layer comprising a first attention module, the apparatus comprising:
    a label vector obtaining unit, configured to obtain K label vectors respectively corresponding to the K categories, wherein each label vector is obtained by performing word embedding on the label description text of the corresponding category;
    a word sequence obtaining unit, configured to perform word embedding on the input text using the embedding layer, to obtain a word vector sequence;
    a fragment sequence obtaining unit, configured to input the word vector sequence into the convolutional layer, the convolutional layer performing convolution processing on the word vector sequence using several convolution windows corresponding to text fragments of several different lengths, to obtain several fragment vector sequences, the word vector sequence and the several fragment vector sequences constituting a vector sequence set;
    a first attention unit, configured to input each vector sequence in the vector sequence set into the first attention module for first attention processing, to obtain a first sequence vector corresponding to each vector sequence, wherein the first attention processing comprises determining, according to the similarity between each vector element in the input vector sequence and the K label vectors, a first weight factor corresponding to each vector element, and computing a weighted sum of the vector elements using the first weight factors;
    a first representation obtaining unit, configured to obtain a first attention representation of the input text according to the first sequence vectors;
    a characterization vector determining unit, configured to determine a characterization vector of the input text at least according to the first attention representation; and
    a prediction result obtaining unit, configured to input the characterization vector into the classifier, to obtain a category prediction result of the input text among the K categories.
  17. A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed in a computer, the computer is caused to perform the method of any one of claims 1-15.
  18. A computing device, comprising a memory and a processor, wherein executable code is stored in the memory, and when the processor executes the executable code, the method of any one of claims 1-15 is implemented.
PCT/CN2020/134518 2020-01-16 2020-12-08 Method and apparatus for carrying out classification prediction by using text classification model WO2021143396A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010049397.9A CN111291183B (en) 2020-01-16 2020-01-16 Method and device for carrying out classification prediction by using text classification model
CN202010049397.9 2020-01-16

Publications (1)

Publication Number Publication Date
WO2021143396A1

Family

ID=71025468

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/134518 WO2021143396A1 (en) 2020-01-16 2020-12-08 Method and apparatus for carrying out classification prediction by using text classification model

Country Status (2)

Country Link
CN (1) CN111291183B (en)
WO (1) WO2021143396A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291183B (en) * 2020-01-16 2021-08-03 支付宝(杭州)信息技术有限公司 Method and device for carrying out classification prediction by using text classification model
CN111340605B (en) * 2020-05-22 2020-11-24 支付宝(杭州)信息技术有限公司 Method and device for training user behavior prediction model and user behavior prediction
CN112395419B (en) * 2021-01-18 2021-04-23 北京金山数字娱乐科技有限公司 Training method and device of text classification model and text classification method and device
CN113806545B (en) * 2021-09-24 2022-06-17 重庆理工大学 Comment text emotion classification method based on label description generation
CN113838468A (en) * 2021-09-24 2021-12-24 中移(杭州)信息技术有限公司 Streaming voice recognition method, terminal device and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090248394A1 (en) * 2008-03-25 2009-10-01 Ruhi Sarikaya Machine translation in continuous space
CN110046248A (en) * 2019-03-08 2019-07-23 阿里巴巴集团控股有限公司 Model training method, file classification method and device for text analyzing
CN110163220A (en) * 2019-04-26 2019-08-23 腾讯科技(深圳)有限公司 Picture feature extracts model training method, device and computer equipment
CN110347839A (en) * 2019-07-18 2019-10-18 湖南数定智能科技有限公司 A kind of file classification method based on production multi-task learning model
CN110609897A (en) * 2019-08-12 2019-12-24 北京化工大学 Multi-category Chinese text classification method fusing global and local features
CN111291183A (en) * 2020-01-16 2020-06-16 支付宝(杭州)信息技术有限公司 Method and device for carrying out classification prediction by using text classification model

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710800B (en) * 2018-11-08 2021-05-25 北京奇艺世纪科技有限公司 Model generation method, video classification method, device, terminal and storage medium
CN111428520B (en) * 2018-11-30 2021-11-23 腾讯科技(深圳)有限公司 Text translation method and device
CN110134789B (en) * 2019-05-17 2021-05-25 电子科技大学 Multi-label long text classification method introducing multi-path selection fusion mechanism
CN110442707B (en) * 2019-06-21 2022-06-17 电子科技大学 Seq2 seq-based multi-label text classification method
CN110362684B (en) * 2019-06-27 2022-10-25 腾讯科技(深圳)有限公司 Text classification method and device and computer equipment

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761935B (en) * 2021-08-04 2024-02-27 厦门快商通科技股份有限公司 Short text semantic similarity measurement method, system and device
CN113761935A (en) * 2021-08-04 2021-12-07 厦门快商通科技股份有限公司 Short text semantic similarity measurement method, system and device
CN113554241A (en) * 2021-09-02 2021-10-26 国网山东省电力公司泰安供电公司 User layering method and prediction method based on user electricity complaint behaviors
CN113554241B (en) * 2021-09-02 2024-04-26 国网山东省电力公司泰安供电公司 User layering method and prediction method based on user electricity complaint behaviors
CN114092949A (en) * 2021-11-23 2022-02-25 支付宝(杭州)信息技术有限公司 Method and device for training class prediction model and identifying interface element class
CN115795037A (en) * 2022-12-26 2023-03-14 淮阴工学院 Multi-label text classification method based on label perception
CN115795037B (en) * 2022-12-26 2023-10-20 淮阴工学院 Multi-label text classification method based on label perception
CN116561314A (en) * 2023-05-16 2023-08-08 中国人民解放军国防科技大学 Text classification method for selecting self-attention based on self-adaptive threshold
CN116561314B (en) * 2023-05-16 2023-10-13 中国人民解放军国防科技大学 Text classification method for selecting self-attention based on self-adaptive threshold
CN116611057A (en) * 2023-06-13 2023-08-18 北京中科网芯科技有限公司 Data security detection method and system thereof
CN116611057B (en) * 2023-06-13 2023-11-03 北京中科网芯科技有限公司 Data security detection method and system thereof
CN116662556A (en) * 2023-08-02 2023-08-29 天河超级计算淮海分中心 Text data processing method integrating user attributes
CN116662556B (en) * 2023-08-02 2023-10-20 天河超级计算淮海分中心 Text data processing method integrating user attributes

Also Published As

Publication number Publication date
CN111291183B (en) 2021-08-03
CN111291183A (en) 2020-06-16

Similar Documents

Publication Publication Date Title
WO2021143396A1 (en) Method and apparatus for carrying out classification prediction by using text classification model
US11270225B1 (en) Methods and apparatus for asynchronous and interactive machine learning using word embedding within text-based documents and multimodal documents
CN109101537B (en) Multi-turn dialogue data classification method and device based on deep learning and electronic equipment
US20220019745A1 (en) Methods and apparatuses for training service model and determining text classification category
CN110046248B (en) Model training method for text analysis, text classification method and device
US8331655B2 (en) Learning apparatus for pattern detector, learning method and computer-readable storage medium
US20160140425A1 (en) Method and apparatus for image classification with joint feature adaptation and classifier learning
CN111191791A (en) Application method, training method, device, equipment and medium of machine learning model
CN112015868A (en) Question-answering method based on knowledge graph completion
CN112785441B (en) Data processing method, device, terminal equipment and storage medium
CN113268609A (en) Dialog content recommendation method, device, equipment and medium based on knowledge graph
US20180137410A1 (en) Pattern recognition apparatus, pattern recognition method, and computer program product
CN114936623A (en) Multi-modal data fused aspect-level emotion analysis method
CN115222998B (en) Image classification method
US10733483B2 (en) Method and system for classification of data
CN112988970A (en) Text matching algorithm serving intelligent question-answering system
CN110543566B (en) Intention classification method based on self-attention neighbor relation coding
CN111950647A (en) Classification model training method and device
CN111339734A (en) Method for generating image based on text
CN116258938A (en) Image retrieval and identification method based on autonomous evolution loss
CN115293818A (en) Advertisement putting and selecting method and device, equipment and medium thereof
CN114970882A (en) Model prediction method and model system suitable for multiple scenes and multiple tasks
CN115017321A (en) Knowledge point prediction method and device, storage medium and computer equipment
CN111339303A (en) Text intention induction method and device based on clustering and automatic summarization
CN113343666B (en) Method, device, equipment and storage medium for determining confidence of score

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20913983

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20913983

Country of ref document: EP

Kind code of ref document: A1