CN111309918A - Multi-label text classification method based on label relevance
- Publication number: CN111309918A
- Application number: CN202010185642.9A
- Authority: CN (China)
- Prior art keywords: label, text, vector, labels, document
- Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
        - G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
          - G06F16/35—Clustering; Classification
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
      - G06N3/00—Computing arrangements based on biological models
        - G06N3/02—Neural networks
          - G06N3/08—Learning methods
            - G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention discloses a multi-label text classification method based on label relevance. The invention adds an attention mechanism, called the label attention mechanism, on top of the Seq2Seq model. Unlike the text attention mechanism built into Seq2Seq, which establishes the relationship between the text and the labels and helps the Decoder classify by highlighting key text information, the label attention mechanism is used to mine the associations between labels. By mining these complex associations, the label attention mechanism provides additional information to the Decoder module. At the same time, it helps the model avoid error propagation, so the classification performance of the model is improved.
Description
Technical Field
The invention belongs to the technical field of software, and particularly relates to a multi-label text classification method based on label relevance.
Background
On the internet, messages and topics often need to be categorized. For example, in text topic classification for news media and online community platforms, a piece of news may be related to both computers and social issues; in that case the topic should belong to two categories, "computer" and "society". Relying on manual labeling alone is infeasible when the massive volumes of news articles and posts encountered in practice have to be classified.
A common approach is to use machine learning models, which fall into two categories: problem transformation and algorithm adaptation. Problem transformation converts the multi-label classification problem into binary or multi-class classification problems and then solves them with existing machine learning algorithms; representative models include Binary Relevance and Classifier Chains. Algorithm adaptation modifies existing machine learning algorithms (such as SVM or kNN) so that they can be applied directly to multi-label classification; representative algorithms include Rank-SVM and ML-kNN. These machine-learning-based methods are limited by their text feature extraction capability and cannot capture text semantics well, so their classification accuracy is limited.
Recently, as research on neural networks has deepened, more and more deep neural network models have been applied to multi-label text classification, LSTM-based and CNN-based models being the most common. Deep neural networks have strong feature extraction capability and can capture the semantic information of text well, which greatly helps classification. However, they also have drawbacks; for example, CNN-based models are constrained by the size of their convolution windows and only extract local document information while ignoring global information.
The Seq2Seq model, an improvement on the LSTM model, addresses the task of multi-label text classification and is widely applied in natural language processing fields such as machine translation and text classification. Seq2Seq comprises two modules: an Encoder module and a Decoder module. The Encoder module extracts text information, and the Decoder module performs serialized classification according to the text information extracted by the Encoder. The Decoder module contains a text attention mechanism that assigns a weight to each word in the text according to the hidden state at the current time, so that words carrying important information are highlighted, helping the classifier make more accurate predictions. At each time t the model predicts one label, until a terminator is encountered. However, this model has two drawbacks. First, because the prediction at time t depends on the output at time t-1, an error at time t-1 corrupts the update of the hidden state at time t, which in turn causes the text attention mechanism to assign wrong weights, so a prediction error at time t becomes likely; worse, the error keeps being propagated, which greatly degrades the classification performance of the model. Second, the plain Seq2Seq model cannot take the relevance among labels into account, and ignoring label relevance may cause wrong classifications; for example, the label "computer" is strongly associated with "artificial intelligence" but only weakly associated with the label "society". Learning the complex relevance among labels is therefore crucial for improving the classification accuracy of the model.
Explanation of terms:
Embedding matrix: the initial representation of a word is a one-hot vector, i.e., only one position is 1 and all other positions are 0; this representation is too sparse and does not capture the semantic features of the word. To facilitate model learning, an Embedding matrix converts it into a dense vector. The Embedding matrix has dimensions K × V, where K is the dimension of each converted word vector and V is the size of the vocabulary.
Encoder module: the module in the Seq2Seq model that extracts document information; it consists of a Bi-LSTM network;
Decoder module: the module in the Seq2Seq model that decodes the document information and predicts labels; it receives the document information extracted by the Encoder module and then makes predictions based on that information; it is generally an LSTM network;
softmax function: used to normalize the prediction results. The model predicts a score for each label, which may be negative; softmax rescales all scores into the (0,1) interval so that they can be interpreted as the probability of belonging to each label;
Beam Search algorithm: under an ordinary greedy strategy, the Decoder module selects only the label with the highest probability at each time step as the prediction, so the optimal solution may be missed. The Beam Search algorithm keeps the k labels with the highest probability at each time step as the prediction result of the current time step, making it more likely that the optimal solution is found.
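For illustration only, the following Python sketch shows the two procedures defined above. It is a minimal sketch, not the implementation of the invention; `step_fn`, `start_token`, `eos_token` and the beam width of 2 are hypothetical placeholders.

```python
import numpy as np

def softmax(scores):
    # Rescale raw label scores (which may be negative) into (0, 1) probabilities.
    exp = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exp / exp.sum()

def beam_search(step_fn, start_token, eos_token, beam_width=2, max_steps=10):
    """Keep the beam_width most probable label sequences at every time step.

    step_fn(prefix) is assumed to return {label: probability} for the next
    label given the labels predicted so far.
    """
    beams = [([start_token], 0.0)]          # (label sequence, accumulated log-probability)
    for _ in range(max_steps):
        candidates = []
        for seq, logp in beams:
            if seq[-1] == eos_token:        # finished sequences are carried over unchanged
                candidates.append((seq, logp))
                continue
            for label, p in step_fn(seq).items():
                candidates.append((seq + [label], logp + np.log(p + 1e-12)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos_token for seq, _ in beams):
            break
    return beams
```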
Disclosure of Invention
In order to solve the above problems, the invention discloses a multi-label text classification method based on label relevance, which adds a label attention mechanism to the Seq2Seq model to mine the associations between labels. The label attention mechanism also helps the model avoid the problem of error propagation, so that the classification performance of the model is effectively improved.
A multi-label text classification method based on label relevance comprises the following steps:
Step one, reading a document consisting of n words, mapping each word in the document to a vector, and representing the document as a matrix with dimensions K × n, wherein K denotes the dimension of a word vector;
Step two, sequentially reading each encoded word w_i with the Seq2Seq model and obtaining a corresponding vector u_i representing the information contained in the word w_i, thereby obtaining n vectors denoted {u_1, u_2, …, u_n}, where 1 ≤ i ≤ n; each encoded word w_i in the document is input into the encoder, and after the encoder has encoded the document, the result is input into the decoder;
Step three, setting a label attention mechanism which, at each time t of the decoder, takes the labels {y_1, y_2, …, y_(t-1)} predicted at the previous t-1 times as input and assigns a weight to each label y_j according to the relevance between labels, y_j denoting the label predicted by the decoder at time j from the text {u_1, u_2, …, u_n} encoded by the encoder, j < t;
Step four, the text attention mechanism built into the Seq2Seq model assigns different weights to the encoded words {u_1, u_2, …, u_n} according to the amount of information each carries, establishing the correspondence between text words and labels and finally obtaining a vector c_t representing the text information;
after the text information vector c_t and the label association information vector γ_t are obtained, the decoder inputs the text information vector c_t and the label association information vector γ_t into a fully connected network, normalizes the result through the softmax function, outputs the probability of each label, and selects the label with the highest probability as the predicted label pred_t at the current time;
Step five, the Seq2Seq model compares the difference between the predicted label pred_t and the true label, updates the parameters in the network through the back-propagation algorithm of the neural network, and performs iterative training until the loss function converges, obtaining a trained multi-label text classification model;
Step six, inputting the text to be labeled into the trained multi-label text classification model, and selecting the k labels with the highest probability from the output as the classification labels of the text to be labeled.
In a further improvement, in the first step, each word in the document is mapped to a vector by using an Embedding matrix.
In a further improvement, in step three, the calculation formula that takes the labels {y_1, y_2, …, y_(t-1)} predicted at the previous t-1 times as input is as follows:
score_jt = E_y(y_j) V_a h_t
wherein h_t denotes the hidden state of the decoder at time t, E_y(y_j) denotes the vector corresponding to the label predicted at time j, and V_a denotes a parameter matrix used to learn the relevance among labels, so that a weight is assigned to each label; the label attention mechanism outputs a vector representing the extracted label relevance information; score_jt denotes the weight score assigned by the label attention mechanism to the label y_j predicted at time j before time t, and β_jt is the result of normalizing score_jt;
γ_t is the vector finally output by the label attention mechanism; it contains the correlation between labels and is obtained by multiplying the labels {y_1, y_2, …, y_(t-1)} predicted at all times before time t by their corresponding weights β_jt and summing;
the label association information vector γ_t mined by the label attention mechanism is input into the decoder to update the decoder hidden state.
In a further improvement, the loss function is a cross-entropy loss function, calculated as follows:

Loss = -Σ_(d=1..D) Σ_(l=1..L) y_dl log(p_dl)

wherein Loss denotes the cross-entropy loss function; D denotes the number of documents contained in the data set and L denotes the number of labels contained in the data set; y_dl denotes the true value of the d-th document for the l-th label, equal to 1 if the d-th document belongs to the l-th label and 0 otherwise; and p_dl is the predicted probability of the d-th document for the l-th label.
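As a sketch of the loss computation just described, assuming the ground-truth indicators y_dl and the predicted probabilities p_dl are stored as D × L arrays (the function and variable names are illustrative):

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Cross-entropy loss over D documents and L labels.

    y_true: (D, L) array of 0/1 ground-truth indicators y_dl.
    y_pred: (D, L) array of predicted probabilities p_dl.
    """
    return -np.sum(y_true * np.log(y_pred + eps))
```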
In a further improvement, in step six, the Beam Search algorithm is used to calculate the probability that the text to be labeled corresponds to each label; the Beam Search algorithm accumulates the probability with which the text to be labeled selects each label and uses the accumulated value as the final selection probability.
Drawings
FIG. 1 is a flow chart of the Beam Search algorithm;
FIG. 2 is a flow chart of the label attention mechanism;
FIG. 3 is a flow chart of the operation of the present invention.
Detailed Description
The present invention will be described in more detail with reference to the accompanying drawings and examples.
The method comprises the following specific implementation steps:
1) The words and labels of each document to be processed are converted into vectors using an Embedding matrix. We do not introduce an extra training set to train these word vectors; instead they are updated together with the other parameters of our model. Because the number of distinct words is very large, only the 50,000 most frequent words are kept and a dictionary is built for them, i.e., each word corresponds to a specific vector. All other, less frequent words are replaced by "UNK", which also corresponds to a vector. This is because rarely occurring words tend to carry little information and, in practical applications, the model is likely to encounter new words that are not in the dictionary; to improve the robustness of the model, only frequent words are used to build the dictionary. At the same time, we introduce an end symbol "EOS": when the model reads the vector corresponding to this symbol, it knows that the document has been fully read. The labels corresponding to each document are processed in the same manner, except that, because the label set is generally small, every label is mapped to a vector, and one additional label vector representing the end of the label sequence is added.
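A minimal Python sketch of this dictionary construction, assuming the 50,000-word cut-off described above; the function names and data layout are illustrative assumptions:

```python
from collections import Counter

def build_vocab(documents, max_words=50000):
    """Build a word-to-index dictionary from the most frequent words.

    "UNK" stands in for rare or unseen words and "EOS" marks the end of a
    document; both get their own entry (and hence their own embedding vector).
    """
    counts = Counter(word for doc in documents for word in doc)
    vocab = {"UNK": 0, "EOS": 1}
    for word, _ in counts.most_common(max_words):
        vocab[word] = len(vocab)
    return vocab

def encode(doc, vocab):
    # Map each word to its index, falling back to "UNK", and append "EOS".
    return [vocab.get(w, vocab["UNK"]) for w in doc] + [vocab["EOS"]]
```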
2) After the first step, every word in the text is represented as a vector, and the words of the document are input into the Encoder module to extract the semantic information of the document. The semantic information extracted by the Encoder module is then input to the Decoder module. Next, the label attention mechanism of the invention takes the labels predicted at previous times as input and generates a label correlation information vector, which is input to the decoder to provide it with label association information. After the hidden state of the decoder is updated, the text attention mechanism of the Seq2Seq model extracts the text information vector according to the hidden state at the current time. Finally, the decoder classifies using both the text information provided by the text attention mechanism and the label association information provided by the label attention mechanism. Specifically, as shown in fig. 2, the label attention mechanism can assign different weights to labels output at different historical times; by assigning a smaller weight to the label y_3 that was incorrectly predicted at time t_3, the "error propagation" phenomenon in the model can be avoided.
3) It is worth mentioning that the proposed label attention mechanism requires special handling in the model training stage, because the set of labels to which it assigns weights keeps changing. For example, at time 3 the labels to be weighted are {computer, society}, while at time 4 they are {computer, society, artificial intelligence}. In the training stage the full label set of the document is input at every step, so, to prevent the label attention mechanism from "seeing" the labels at time t and later, a mask is applied to the ground-truth label set that is input to the label attention during training, blocking access to labels from later times, as sketched below.
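A minimal sketch of such a mask, assuming the document's ground-truth labels are stored as a sequence; this is illustrative only, not the patent's exact implementation:

```python
import numpy as np

def causal_label_mask(num_labels_in_doc, t):
    """0/1 mask over the document's label sequence at decoder time step t.

    Positions 0 .. t-1 (labels already "predicted") stay visible,
    positions t and later are hidden from the label attention mechanism.
    """
    mask = np.zeros(num_labels_in_doc)
    mask[:t] = 1.0
    return mask

# Example: a document with 4 ground-truth labels, decoder at time step t = 2
print(causal_label_mask(4, 2))   # [1. 1. 0. 0.]
```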
Finally, the label predicted at each time is compared with the true label, and the parameters of the model are optimized through the objective function combined with back-propagation in the neural network, thereby achieving the goal of training the model.
The specific steps of the invention are shown in figure 3:
1) For a document {x_1, x_2, …, x_n} composed of n words, each word is mapped into a vector through an Embedding matrix; the specific calculation formula is as follows:
w_i = E × x_i
where E ∈ R^(K×V) is the Embedding matrix, K is the dimension of the word vector, and V is the size of the vocabulary. The vocabulary is constructed by ranking the words in the data set by frequency of occurrence and taking the top V words. Less frequent words are collectively represented by "UNK", which also has a corresponding vector. In general, when the data set is large, V can be set larger so that the model has a larger vocabulary; however, an overly large vocabulary slows down model training. We also introduce a special symbol into the vocabulary of the text data set: the end symbol "EOS", which corresponds to a vector; when the model reads this symbol, it indicates that the document content has been fully read. The label set is handled similarly, with an Embedding matrix E_y ∈ R^(d×L), where d is the dimension of the label vector and L is the size of the label set. In addition, because documents differ in length, overly long texts are truncated, i.e., a maximum length is set and the excess is discarded, while texts shorter than the maximum are padded with the end symbol "EOS".
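A sketch of the lookup w_i = E x_i together with the truncation and "EOS" padding described above; the sizes K, V and MAX_LEN are illustrative assumptions, not values fixed by the invention:

```python
import numpy as np

K, V, MAX_LEN = 8, 1000, 20          # tiny illustrative sizes
E = np.random.randn(K, V) * 0.01     # Embedding matrix, learned with the model

def embed_document(word_ids, eos_id):
    """Truncate/pad a document to MAX_LEN and look up w_i = E x_i for each word.

    word_ids: list of vocabulary indices; indexing a column of E is equivalent
    to multiplying E by the one-hot vector x_i, just cheaper.
    """
    ids = word_ids[:MAX_LEN]                        # discard the excess
    ids = ids + [eos_id] * (MAX_LEN - len(ids))     # pad short texts with "EOS"
    return E[:, ids]                                # K x MAX_LEN document matrix
```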
2) After passing through the Embedding matrix, we can represent a document as a matrix with dimensions K × n. The model reads each encoded word w _ i in turn and obtains a corresponding vector u _ i that represents the information contained in the word, such that a total of n such vectors are obtained, which we will represent as { u _1, u _2, …, u _ n }. These vectors are the textual information extracted by the Encoder module.
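A minimal PyTorch sketch of an Encoder of the kind described (a Bi-LSTM that turns the K × n document matrix into the vectors {u_1, u_2, …, u_n}); the class name and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Bi-LSTM encoder producing one vector u_i per word of the document."""
    def __init__(self, word_dim, hidden_dim):
        super().__init__()
        self.bilstm = nn.LSTM(word_dim, hidden_dim,
                              bidirectional=True, batch_first=True)

    def forward(self, word_vectors):          # word_vectors: (batch, n, K)
        u, _ = self.bilstm(word_vectors)      # u: (batch, n, 2 * hidden_dim)
        return u
```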
3) After being encoded by the Encoder module, the document enters the Decoder module. At each time t of the Decoder module, the label attention mechanism takes the labels {y_1, y_2, …, y_(t-1)} predicted at the previous t-1 times as input and assigns a weight to each label y_j according to the relevance between labels, calculated as follows:
score_jt = E_y(y_j) V_a h_t
here h_t is the hidden state of the decoder at time t, E_y(y_j) denotes the vector corresponding to the label predicted at time j, and V_a is a parameter matrix that can learn the relevance between labels, so that a weight is assigned to each label. The scores are normalized and the label vectors are combined, and the label attention mechanism finally outputs a vector representing the extracted label relevance information:

β_jt = softmax_j(score_jt),   γ_t = Σ_(j=1..t-1) β_jt E_y(y_j)

The label correlation information γ_t mined by the label attention mechanism is input into the decoder to update the hidden state.
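A NumPy sketch of this label attention computation (score_jt, its normalization into β_jt, and the weighted sum γ_t); the array shapes and the softmax normalization are assumptions consistent with the description above:

```python
import numpy as np

def label_attention(h_t, predicted_labels, E_y, V_a):
    """gamma_t = sum_j beta_jt * E_y(y_j), with beta_jt = softmax_j(score_jt).

    h_t:              decoder hidden state at time t, shape (H,)
    predicted_labels: indices of the labels y_1 .. y_(t-1) predicted so far
    E_y:              label embedding matrix, one row per label, shape (L, d)
    V_a:              learned parameter matrix, shape (d, H)
    """
    label_vecs = E_y[predicted_labels]        # (t-1, d) vectors E_y(y_j)
    scores = label_vecs @ V_a @ h_t           # (t-1,)  score_jt
    betas = np.exp(scores - scores.max())
    betas = betas / betas.sum()               # normalized weights beta_jt
    gamma_t = betas @ label_vecs              # (d,)    label association vector
    return gamma_t
```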
4) After the hidden state of the decoder has been updated, the text attention mechanism built into the Seq2Seq model assigns different weights to the words according to the amount of information each carries, establishing the correspondence between text words and labels and highlighting the words carrying important information; finally a vector c_t representing the text information is obtained.
5) After the text information vector c_t and the label association information vector γ_t are obtained, the Decoder module considers both kinds of information at the same time: it inputs them into a fully connected network, normalizes the output through the softmax function to obtain the probability of each label, and then selects the label with the highest probability as the predicted label pred_t at the current time. The model then compares the difference between the predicted label pred_t and the true label and updates the parameters in the network through the back-propagation algorithm of the neural network. This comparison depends on an objective function; here we choose the cross-entropy loss function defined above.
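A sketch of this classification step; concatenating c_t and γ_t before the fully connected layer is an assumption about how the two vectors are combined, and W and b stand for the layer's learned parameters:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_step(c_t, gamma_t, W, b):
    """Feed [c_t ; gamma_t] through a fully connected layer, normalize with
    softmax, and take the most probable label as pred_t."""
    features = np.concatenate([c_t, gamma_t])   # combine text and label information
    probs = softmax(W @ features + b)           # probability of every label
    pred_t = int(np.argmax(probs))              # predicted label at this time step
    return pred_t, probs
```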
After the above steps, a trained multi-label text classification model is obtained. It can be used directly in practical applications: an arbitrary text is input and passes through the Embedding layer, so that each word is mapped into a vector; the model then extracts the text information and inputs it into the Decoder module, and classification is carried out in combination with the label association information captured by the label attention mechanism. To further improve the classification performance, the model can adopt the Beam Search algorithm in the testing stage or in practical applications. Unlike the ordinary greedy strategy, Beam Search keeps the k labels with the highest probability at each prediction step, and after prediction finishes the highest-scoring of the k candidate results is selected as the actual output. The specific Beam Search procedure is shown in fig. 1: the true label combination is {"know", "artificial intelligence", "data mining"}. The labels marked in green are the 2 highest-scoring results picked by Beam Search at each time, and the labels marked in gray are those already predicted at earlier times. At the last time step the score 1.40 > 1.35, so the label combination {"know", "artificial intelligence", "data mining"} is finally selected.
The data set Reuters Corpus Volume I ("RCV1-V2"), an English news data set provided by Reuters, is used for comparison. The data set contains more than 800,000 text items, and 103 news topics are provided as the label set. On average, each text in the data set is annotated with 3.24 labels. The following evaluation metrics are used:
Hamming Loss: mainly evaluates the number of labels misclassified by the model; the smaller the value, the better the classification effect of the model;
Micro-F1: the harmonic mean of Precision and Recall, used to measure the overall classification performance of the model; the larger the value, the better;
Precision: the proportion of the labels predicted by the model that are correct;
Recall: used to evaluate whether the classification of the model is complete. In multi-label classification the model easily misses labels, i.e., some labels are never predicted, which precision cannot measure well; recall measures exactly this, and the larger the value, the better.
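A sketch of how these metrics can be computed, assuming the true and predicted label assignments are stored as 0/1 matrices of shape (documents, labels); the function names are illustrative:

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    # Fraction of label assignments that are wrong (lower is better).
    return np.mean(y_true != y_pred)

def micro_f1(y_true, y_pred):
    # Micro-averaged F1: precision and recall computed over all labels jointly.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    return 2 * precision * recall / (precision + recall + 1e-12)
```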
The above description is only one specific embodiment of the present invention, but the design concept of the present invention is not limited thereto; any insubstantial modification of the present invention made using this concept shall fall within the scope of protection of the invention.
Claims (5)
1. A multi-label text classification method based on label relevance is characterized by comprising the following steps:
Step one, reading a document consisting of n words, mapping each word in the document to a vector, and representing the document as a matrix with dimensions K × n, wherein K denotes the dimension of a word vector;
Step two, sequentially reading each encoded word w_i with the Seq2Seq model and obtaining a corresponding vector u_i representing the information contained in the word w_i, thereby obtaining n vectors denoted {u_1, u_2, …, u_n}, where 1 ≤ i ≤ n; each encoded word w_i in the document is input into the encoder, and after the encoder has encoded the document, the result is input into the decoder;
Step three, setting a label attention mechanism which, at each time t of the decoder, takes the labels {y_1, y_2, …, y_(t-1)} predicted at the previous t-1 times as input and assigns a weight to each label y_j according to the relevance between labels, y_j denoting the label predicted by the decoder at time j from the text {u_1, u_2, …, u_n} encoded by the encoder, j < t;
Step four, the text attention mechanism built into the Seq2Seq model assigns different weights to the encoded words {u_1, u_2, …, u_n} according to the amount of information each carries, establishing the correspondence between text words and labels and finally obtaining a vector c_t representing the text information;
after the text information vector c_t and the label association information vector γ_t are obtained, the decoder inputs the text information vector c_t and the label association information vector γ_t into a fully connected network, normalizes the result through the softmax function, outputs the probability of each label, and selects the label with the highest probability as the predicted label pred_t at the current time;
Step five, the Seq2Seq model compares the difference between the predicted label pred_t and the true label, updates the parameters in the network through the back-propagation algorithm of the neural network, and performs iterative training until the loss function converges, obtaining a trained multi-label text classification model;
Step six, inputting the text to be labeled into the trained multi-label text classification model, and selecting the k labels with the highest probability from the output as the classification labels of the text to be labeled.
2. The method for multi-label text classification based on label relevance according to claim 1, characterized in that in the first step, an Embedding matrix is used to map each word in the document into a vector.
3. The method as claimed in claim 1, wherein in step three, the calculation formula that takes the labels {y_1, y_2, …, y_(t-1)} predicted at the previous t-1 times as input is as follows:
score_jt = E_y(y_j) V_a h_t
wherein h_t denotes the hidden state of the decoder at time t, E_y(y_j) denotes the vector corresponding to the label predicted at time j, and V_a denotes a parameter matrix used to learn the relevance among labels, so that a weight is assigned to each label; the label attention mechanism outputs a vector representing the extracted label relevance information; score_jt denotes the weight score assigned by the label attention mechanism to the label y_j predicted at time j before time t, and β_jt is the result of normalizing score_jt;
γ_t is the vector finally output by the label attention mechanism; it contains the correlation between labels and is obtained by multiplying the labels {y_1, y_2, …, y_(t-1)} predicted at all times before time t by their corresponding weights β_jt and summing;
the label association information vector γ_t mined by the label attention mechanism is input into the decoder to update the decoder hidden state.
4. The method for multi-label text classification based on label relevance according to claim 1, wherein in step five the loss function is a cross-entropy loss function, calculated as follows:

Loss = -Σ_(d=1..D) Σ_(l=1..L) y_dl log(p_dl)

wherein Loss denotes the cross-entropy loss function; D denotes the number of documents contained in the data set and L denotes the number of labels contained in the data set; y_dl denotes the true value of the d-th document for the l-th label, equal to 1 if the d-th document belongs to the l-th label and 0 otherwise; and p_dl is the predicted probability of the d-th document for the l-th label.
5. The method for classifying text according to claim 1, wherein in step six the Beam Search algorithm is used to calculate the probability that the text to be labeled corresponds to each label; the Beam Search algorithm accumulates the probability with which the text to be labeled selects each label and uses the accumulated value as the final selection probability.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010185642.9A | 2020-03-17 | 2020-03-17 | Multi-label text classification method based on label relevance |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN111309918A | 2020-06-19 |
Family
ID=71151118
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010185642.9A (CN111309918A, withdrawn) | Multi-label text classification method based on label relevance | 2020-03-17 | 2020-03-17 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN111309918A (en) |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WW01 | Invention patent application withdrawn after publication | Application publication date: 20200619 |