CN114398488A - BiLSTM multi-label text classification method based on attention mechanism - Google Patents
BiLSTM multi-label text classification method based on attention mechanism
- Publication number: CN114398488A
- Application number: CN202210047500.5A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/353 — Information retrieval; clustering/classification of unstructured textual data into predefined classes
- G06F18/214 — Pattern recognition; generating training patterns, e.g. bagging or boosting
- G06N3/044 — Neural networks; recurrent networks, e.g. Hopfield networks
- G06N3/045 — Neural networks; combinations of networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/048 — Activation functions
- G06N3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
Abstract
The invention belongs to the field of natural language processing and multi-label text classification, and particularly relates to a BiLSTM multi-label text classification method based on an attention mechanism. Words in the text data and label data are embedded through BERT and Word2vec respectively; a BiLSTM module extracts context information from the word-embedded text data and label data to obtain a text representation and a label representation; an attention mechanism module produces a label-based text representation; a multi-label text classification model is trained through a loss function; and real-time data is input into the trained model to obtain its label classification prediction result. The invention uses BERT for word embedding and BiLSTM to extract context dependencies, and makes full use of the text-text, text-label and label-label information, thereby improving the accuracy of multi-label text classification and the normalized discounted cumulative gain (NDCG).
Description
Technical Field
The invention belongs to the field of natural language processing and multi-label text classification, and particularly relates to a BILSTM multi-label text classification method based on an attention mechanism.
Background
Text is one of the important carriers of information. The topics and scale of text information vary widely, and how to process text efficiently is a research problem of great significance, which has driven the rapid development of automatic text classification technology. Text classification is an important and classical problem in natural language processing (NLP). In the traditional text classification problem, each sample has only one category label, the category labels are mutually independent, and the classification granularity is relatively coarse; this is called single-label text classification. As text information becomes richer, the classification granularity becomes finer: one sample is related to multiple category labels, and certain dependencies exist among the labels; this is called multi-label text classification.
Multi-label text classification is an important branch of multi-label classification, and its methods fall into two main categories: traditional machine learning methods and deep learning-based methods. Traditional machine learning methods include problem transformation methods and algorithm adaptation methods. Deep learning-based methods process the multi-label text classification problem with various neural network models and can be divided, according to the network structure, into methods based on convolutional neural networks, recurrent neural networks, and Transformer structures. Although multi-label text classification has been widely studied, several problems remain:
1. Correlation between labels. The labels in a multi-label text classification problem are inherently related, but conventional methods often ignore this relevance, so multi-label text classification efficiency is not high.
2. Relevance between document content and label content. Existing work fuses document content and label content poorly, which affects classification precision.
Disclosure of Invention
In order to solve the above problems, the invention provides a BiLSTM multi-label text classification method based on an attention mechanism. A multi-label text classification model is constructed, comprising a BERT model, a Word2vec model, a BiLSTM module and an attention mechanism module, and the method comprises the following steps:
S1, performing word embedding on text data through a BERT model, and performing word embedding on label data through a Word2vec model;
S2, extracting context information from the word-embedded text data and label data through a BiLSTM module to obtain a text representation and a label representation;
S3, processing the text representation and the label representation with an attention mechanism module to obtain a label-based text representation;
S4, calculating the loss of the label-based text representation through a loss function until convergence, obtaining a trained multi-label text classification model;
S5, inputting real-time data into the trained multi-label text classification model to obtain its label classification prediction result.
Further, step S2 uses the BiLSTM module to learn the word-embedded text data and label data, obtaining the text representation and the label representation, expressed as:
H = [H→; H←];  H′ = [H′→; H′←];
wherein H is the text representation, H→ is the forward text representation, H← is the reverse text representation, h→p is the forward text representation at time step p, h←p is the reverse text representation at time step p; H′ is the label representation, H′→ is the forward label representation, H′← is the reverse label representation, h′→p is the forward label representation at time step p, h′←p is the reverse label representation at time step p; R denotes the dimension range, H ∈ R^(2k×n), H′ ∈ R^(2k×l), wherein:
h→p = LSTM(Vt, h→(p-1));  h←p = LSTM(Vt, h←(p-1));
h′→p = LSTM(V′t, h′→(p-1));  h′←p = LSTM(V′t, h′←(p-1));
wherein h→p, h←p, h′→p and h′←p all belong to R^k, k denotes the size of the LSTM hidden layer, Vt is the embedded vector of the t-th word in the text data, h→(p-1) is the forward text representation at time step p-1, h←(p-1) is the reverse text representation at time step p-1, V′t is the embedded vector of the t-th word in the label data, h′→(p-1) is the forward label representation at time step p-1, and h′←(p-1) is the reverse label representation at time step p-1.
Further, the step S3 of obtaining the label-based text representation comprises:
S11, sending the text representation into a self-attention mechanism to obtain the label document representation under the self-attention mechanism;
S12, sending the word-embedded label data and the text representation into a label attention mechanism to obtain the document representation via all labels;
S13, fusing the label document representation under the self-attention mechanism obtained in step S11 with the document representation via all labels obtained in step S12 to obtain a fused document representation;
S14, sending the label text into a self-attention mechanism for processing, and fusing the processing result with the fused document representation of S13 to obtain the label-based text representation.
Further, step S11 uses the label attention score to obtain a linear combination of the context words for each label in the text data, and obtains the label document representation of the text under the self-attention mechanism from this linear combination. The label attention score and the linear combination of the context words for each label are respectively expressed as:
A(s)=softmax(W2tanh(W1H));
M(s)j = A(s)j·HT;
wherein A(s) is the label attention score, A(s) ∈ R^(l×n), R denotes the dimension range, W1 ∈ R^(da×2k) and W2 ∈ R^(l×da) are self-attention parameters, da is a hyper-parameter, H is the text representation, tanh(·) is the activation function, A(s)j represents the contribution of all words to the j-th label, M(s)j is the label document representation along the j-th label under the self-attention mechanism, and HT is the transpose matrix of the text representation H.
Further, the step S12 of obtaining the document representation via all labels comprises:
converting the word-embedded label data into a trainable matrix, constructing the semantic relation between the text representation and the trainable matrix by linearly combining the context words of the labels, and obtaining the document representation via all labels M(l) = [M(l)→; M(l)←] according to this semantic relation, wherein:
M(l)→ = Â→·(H→)T,  Â→ = C·H→;
M(l)← = Â←·(H←)T,  Â← = C·H←;
wherein C represents the trainable matrix of the word-embedded label data, R denotes the dimension range, C ∈ R^(l×k), H→ is the forward text representation, H← is the reverse text representation, Â→ is the forward representation of the linear combination of the labels' context words, Â← is the reverse representation of the linear combination of the labels' context words, M(l)→ is the forward part of the document representation via all labels, (H→)T is the transpose matrix of the forward text representation, M(l)← is the reverse part of the document representation via all labels, and (H←)T is the transpose matrix of the reverse text representation.
Further, the fusion process of step S13 comprises:
αj = sigmoid(Lα·M(s)j);  βj = sigmoid(Lβ·M(l)j);  with the constraint αj + βj = 1;
Mj = αj·M(s)j + βj·M(l)j;
wherein Mj is the first fused document representation along the j-th label, M(s)j is the label document representation along the j-th label, M(l)j is the document representation of the j-th label, αj is the self-attention weight, βj is the label attention weight, Lα is the first parameter, and Lβ is the second parameter.
Further, the process of obtaining the label-based text representation comprises:
S21, capturing the dependency relationship of each label in the label text through a self-attention mechanism to obtain the label word attention score of the label text;
S22, obtaining a linear combination of each label according to the label word attention score of the label text, and obtaining the label-specific label representation under the self-attention mechanism through this linear combination;
S23, fusing the label-specific label representation under the self-attention mechanism with the fused document representation to obtain the label-based text representation.
Further, before the fusion in step S23, the fused document representation is processed through a fully connected layer to obtain a first text, the label representation is processed through a fully connected layer to obtain a second text, and the first text and the second text are fused to obtain the label-based text representation. The processing formulas are:
a=sigmoid(W5M)
d=sigmoid(W6M′(s))
z=BN[a,d]
wherein a is the first text, d is the second text, M is the fused document representation, M′(s) is the label representation, BN[·] denotes batch normalization, z is the label-based text representation, and W5 and W6 are weights.
Further, the prediction probability ŷ of the classification is calculated from the label-based text representation through a sigmoid function, expressed as:
ŷ = sigmoid(W7·reshape(z)T + b);
wherein reshape(·) is the reshape function, b is the bias, W7 is the weight, sigmoid(·) is the sigmoid function, and reshape(z)T is the transpose of the reshaped label-based text representation.
Further, the loss function is expressed as:
L = −Σ(i=1..N) Σ(j=1..l) [ yij·log(ŷij) + (1−yij)·log(1−ŷij) ];
wherein N is the total number of text data, l is the total number of label data, ŷij is the prediction probability, and yij ∈ {0,1} indicates the true classification of the i-th document along the j-th label.
The beneficial effects of the invention are:
The invention embeds the words of the text data with the BERT model, embeds the words of the label data with the Word2vec model, and converts the word-embedded label data into a trainable matrix, thereby capturing the relation between text and labels, improving classification precision, and enhancing the sensitivity to labels within the text. Compared with the text data, the label data generally contains dozens to over a hundred labels; processing the label data with Word2vec reduces complexity and improves processing speed. After word embedding, BiLSTM extracts the context dependencies; the text and the labels are each processed with a self-attention mechanism, improving the correlation of text with text and of labels with labels. In addition, the text context extracted by BiLSTM and the word-embedded labels are processed with a label attention mechanism, improving the correlation between text and labels. The text under the self-attention mechanism and the text and labels under the label attention mechanism are fused, and the fusion result is fused again with the labels under the self-attention mechanism, thereby improving the accuracy of multi-label text classification and the normalized discounted cumulative gain (NDCG).
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a mechanism diagram of the LSTM of the present invention;
FIG. 3 is a diagram of the bi-directional LSTM model architecture of the present invention;
FIG. 4 is a block diagram of a BILSTM multi-label text classification method based on an attention mechanism according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A BiLSTM multi-label text classification method based on an attention mechanism, as shown in FIG. 1, comprises the following steps:
S1, performing word embedding on text data through a BERT model, and performing word embedding on label data through a Word2vec model;
S2, extracting context information from the word-embedded text data and label data through a BiLSTM module to obtain a text representation and a label representation;
S3, processing the text representation and the label representation with an attention mechanism module to obtain a label-based text representation;
S4, calculating the loss of the label-based text representation through a loss function until convergence, obtaining a trained multi-label text classification model;
S5, inputting real-time data into the trained multi-label text classification model to obtain its label classification prediction result.
Specifically, a general block diagram of the BiLSTM text classification method based on the attention mechanism is shown in FIG. 4. The multi-label text classification model comprises a BERT model, a Word2vec model, a BiLSTM module, and an attention mechanism module, and the specific implementation flow comprises:
S11, inputting text data into the BERT model for word embedding, and performing word embedding on the label data through the Word2vec model;
S12, extracting context information from the word-embedded text data and label data with the BiLSTM model to obtain the text representation and the label representation;
S13, sending the text representation into a self-attention mechanism to obtain the label document representation under the self-attention mechanism;
S14, sending the word-embedded label data and the text representation into a label attention mechanism to obtain the document representation via all labels;
S15, fusing the label document representation under the self-attention mechanism obtained in S13 with the document representation via all labels obtained in S14 to obtain the fused document representation, namely A in FIG. 4;
S16, sending the label text into a self-attention mechanism for processing, and fusing the processing result with the fused document representation of S15 to obtain the label-based text representation, namely B in FIG. 4;
S17, processing the label-based text representation through a sigmoid function to obtain the final label classification prediction result.
In one embodiment, text data is input into the BERT model, which processes it through word embedding, sentence embedding, and position embedding in sequence to obtain a text output vector containing word, sentence, and position information, denoted {V1, V2, ..., Vp, ..., Vn}, where n represents the maximum embedded word length; in this embodiment n = 300, and the dimension of the BERT model is set to 768.
Word embedding is performed on the label data through Word2vec with embedding dimension k, and the label output vector is denoted {V′1, V′2, ..., V′p, ..., V′l}, where l denotes the number of embedded labels; in this embodiment k = 300.
In an embodiment, the BiLSTM model learns the word-embedded text data and label data to obtain the text representation and the label representation. The structure of the LSTM is shown in FIG. 2, and the operations in the LSTM structure are expressed as:
Df=sigmoid(Wf[xt,st-1]+bf);
Din=sigmoid(Win[xt,st-1]+bin);
C̃t=tanh(Wc[xt,st-1]+bc);
Ct=Df*Ct-1+Din*C̃t;
Do=sigmoid(Wo[xt,st-1]+bo);
st=Do*tanh(Ct);
wherein xt represents the input vector at time t; Wf, Win, Wc and Wo respectively represent the forget gate weight, the input gate weight, the input unit weight and the output gate weight at time t; bf, bin, bc and bo respectively represent the forget gate bias, the input gate bias, the input unit bias and the output gate bias; Ct-1 represents the cell state information at time t-1; Df and Din respectively represent the forget gate output and the input gate output; C̃t represents the candidate cell state at time t; Ct represents the updated cell state information; Do represents the output gate output; and st represents the hidden layer state at time t.
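The gate operations above can be sketched in NumPy as follows. This is an illustrative, non-authoritative sketch: the function name `lstm_step`, the parameter layout, and the toy dimensions are assumptions for demonstration, not part of the claimed method.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, s_prev, c_prev, params):
    """One LSTM step; returns (hidden state s_t, cell state C_t)."""
    W_f, b_f = params["f"]   # forget gate weight/bias
    W_i, b_i = params["i"]   # input gate weight/bias
    W_c, b_c = params["c"]   # input unit (candidate) weight/bias
    W_o, b_o = params["o"]   # output gate weight/bias
    z = np.concatenate([x_t, s_prev])      # [x_t, s_{t-1}]
    D_f = sigmoid(W_f @ z + b_f)           # forget gate output
    D_in = sigmoid(W_i @ z + b_i)          # input gate output
    C_tilde = np.tanh(W_c @ z + b_c)       # candidate cell state
    C_t = D_f * c_prev + D_in * C_tilde    # updated cell state
    D_o = sigmoid(W_o @ z + b_o)           # output gate output
    s_t = D_o * np.tanh(C_t)               # hidden layer state
    return s_t, C_t

# toy usage: input dim d=4, hidden size k=3
rng = np.random.default_rng(0)
d, k = 4, 3
params = {g: (rng.standard_normal((k, d + k)), np.zeros(k))
          for g in ("f", "i", "c", "o")}
s, c = np.zeros(k), np.zeros(k)
s, c = lstm_step(rng.standard_normal(d), s, c, params)
```

Because the hidden state is a product of a sigmoid and a tanh, each component of `s` stays within (-1, 1).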
The BiLSTM operates on the basis of the LSTM:
Preferably, the BiLSTM model learns the word-embedded text data and label data to obtain the text representation and the label representation, expressed as:
H = [H→; H←];  H′ = [H′→; H′←];
wherein H is the text representation, H→ is the forward text representation, H← is the reverse text representation, h→p is the forward text representation at time step p, h←p is the reverse text representation at time step p; H′ is the label representation, H′→ is the forward label representation, H′← is the reverse label representation, h′→p is the forward label representation at time step p, h′←p is the reverse label representation at time step p; R denotes the dimension range, H ∈ R^(2k×n), H′ ∈ R^(2k×l), wherein:
h→p = LSTM(Vt, h→(p-1));  h←p = LSTM(Vt, h←(p-1));
h′→p = LSTM(V′t, h′→(p-1));  h′←p = LSTM(V′t, h′←(p-1));
wherein h→p, h←p, h′→p and h′←p all belong to R^k, k denotes the size of the LSTM hidden layer, Vt is the embedded vector of the t-th word in the text data, h→(p-1) is the forward text representation at time step p-1, h←(p-1) is the reverse text representation at time step p-1, V′t is the embedded vector of the t-th word in the label data, h′→(p-1) is the forward label representation at time step p-1, and h′←(p-1) is the reverse label representation at time step p-1.
As shown in FIG. 3, the word-embedded text data and label data are learned by the BiLSTM; at time step p, the hidden state is updated from the current input and the state of step p-1, where k represents the size of the LSTM hidden layer and is set to 300 in this embodiment.
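The bidirectional concatenation H = [H→; H←] can be sketched as follows. To keep the example short, a simplified tanh recurrent cell stands in for the full LSTM gates; the function name, weights, and toy dimensions are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def run_direction(X, W_x, W_s):
    """Run a simplified recurrent cell over the columns of X; returns k x n states."""
    k, n = W_s.shape[0], X.shape[1]
    states = np.zeros((k, n))
    s = np.zeros(k)
    for p in range(n):
        s = np.tanh(W_x @ X[:, p] + W_s @ s)   # state update from input and previous state
        states[:, p] = s
    return states

rng = np.random.default_rng(1)
k, d, n = 3, 4, 5                   # hidden size k, embedding dim d, sequence length n
X = rng.standard_normal((d, n))     # embedded word vectors V_1..V_n as columns
Wx_f, Ws_f = rng.standard_normal((k, d)), rng.standard_normal((k, k))
Wx_b, Ws_b = rng.standard_normal((k, d)), rng.standard_normal((k, k))
H_fwd = run_direction(X, Wx_f, Ws_f)                     # forward direction
H_bwd = run_direction(X[:, ::-1], Wx_b, Ws_b)[:, ::-1]   # backward direction, re-aligned
H = np.vstack([H_fwd, H_bwd])       # H in R^{2k x n}
```

The same construction applied to the embedded label vectors yields H′ ∈ R^(2k×l).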
A multi-label document may be tagged with multiple labels. Each document should have the context most relevant to its corresponding labels; in other words, each document may contain multiple labels, and the words in one document contribute differently to each label.
In one embodiment, in order to obtain the different contribution to each label, a self-attention mechanism is adopted, specifically: the label attention score is used to obtain a linear combination of the context words for each label in the text data, and the label document representation of the text under the self-attention mechanism is obtained from this linear combination. The label attention score and the linear combination of the context words for each label are respectively expressed as:
A(s)=softmax(W2tanh(W1H));
M(s)j = A(s)j·HT;
wherein A(s) is the label attention score, A(s) ∈ R^(l×n), W1 ∈ R^(da×2k) and W2 ∈ R^(l×da) are the self-attention parameters to be trained, da is a hyper-parameter (da = 200 in this embodiment), H is the text representation, tanh(·) is the activation function, A(s)j represents the contribution of all words in the text data to the j-th label, M(s)j is the label document representation along the j-th label under the self-attention mechanism, and HT is the transpose matrix of the text representation. Finally, the label document representation of the text under the self-attention mechanism M(s) is obtained, M(s) ∈ R^(l×2k).
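The computation A(s) = softmax(W2·tanh(W1·H)) followed by the linear combination over the text states can be sketched in NumPy as follows; the toy dimensions are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable row-wise softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
k, n, l, d_a = 3, 5, 4, 6          # hidden size, sequence length, labels, hyper-parameter d_a
H = rng.standard_normal((2 * k, n))            # text representation, 2k x n
W1 = rng.standard_normal((d_a, 2 * k))         # self-attention parameter
W2 = rng.standard_normal((l, d_a))             # self-attention parameter
A_s = softmax(W2 @ np.tanh(W1 @ H), axis=-1)   # label attention score A(s), l x n
M_s = A_s @ H.T                                # label document representation M(s), l x 2k
```

Each row of `A_s` sums to 1, so row j of `M_s` is a convex combination of the word states weighted by their contribution to the j-th label.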
In order to utilize the semantic information of the labels, after Word2vec preprocessing, the labels are expressed as a trainable matrix C ∈ R^(l×k), i.e. in the same latent k-dimensional space as the words. By combining the label embeddings with the text word embeddings produced by the BiLSTM, the semantic relationship between each pair of words and labels can be determined explicitly.
In one embodiment, the semantic relation between the text representation and the word-embedded label data is constructed by linearly combining the context words of the labels, and the document representation via all labels M(l) = [M(l)→; M(l)←] is obtained from this semantic relation, wherein:
M(l)→ = Â→·(H→)T,  Â→ = C·H→;
M(l)← = Â←·(H←)T,  Â← = C·H←;
wherein C represents the trainable matrix of the word-embedded label data, C ∈ R^(l×k), H→ is the forward text representation, H← is the reverse text representation, Â→ is the forward representation of the linear combination of the labels' context words, Â← is the reverse representation of the linear combination of the labels' context words, M(l)→ is the forward part of the document representation via all labels, (H→)T is the transpose matrix of the forward text representation, M(l)← is the reverse part of the document representation via all labels, and (H←)T is the transpose matrix of the reverse text representation.
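The label attention branch — scoring each label embedding against the BiLSTM text states and then linearly combining the context words — can be sketched as follows. The unnormalized score form (plain matrix products of C with the directional states) is an assumption for illustration, as are the toy dimensions.

```python
import numpy as np

rng = np.random.default_rng(3)
k, n, l = 3, 5, 4
H_fwd = rng.standard_normal((k, n))    # forward text representation H->
H_bwd = rng.standard_normal((k, n))    # reverse text representation H<-
C = rng.standard_normal((l, k))        # trainable label embedding matrix, l x k

A_fwd = C @ H_fwd                      # label-word semantic scores, l x n
A_bwd = C @ H_bwd
M_l_fwd = A_fwd @ H_fwd.T              # forward document representation via all labels, l x k
M_l_bwd = A_bwd @ H_bwd.T
M_l = np.hstack([M_l_fwd, M_l_bwd])    # M(l) in R^{l x 2k}
```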
M(s) and M(l) are both label-specific representations of the document, but they differ: M(s) emphasizes the document content, while M(l) leans toward the semantic association between the document content and the label text. Two weight vectors (α, β ∈ R^l) are introduced to determine the importance of the two parts; they are obtained by feeding M(s) and M(l) into fully connected layers:
αj = sigmoid(Lα·M(s)j);  βj = sigmoid(Lβ·M(l)j);
wherein M(s)j is the label document representation along the j-th label, M(l)j is the document representation of the j-th label, αj is the self-attention weight, βj is the label attention weight, Lα is the first parameter, and Lβ is the second parameter. A constraint αj + βj = 1 is added to the two weight parameters, and the first fused document representation along the j-th label is obtained according to the fusion weights, namely:
Mj = αj·M(s)j + βj·M(l)j;
Through the introduced self-attention weights and this fusion method, the fused first fused document representation M is obtained, M ∈ R^(l×2k).
In one embodiment, the label score of each label is obtained, so as to obtain the label-specific label representation under the self-attention mechanism and further the label-based text representation. The specific steps are:
S21, capturing the dependency relationship of each label in the label text through a self-attention mechanism to obtain the label word attention score of the label text;
S22, obtaining a linear combination of each label according to the label word attention score of the label text, and obtaining the label-specific label representation under the self-attention mechanism through this linear combination;
S23, fusing the label-specific label representation under the self-attention mechanism with the fused document representation to obtain the label-based text representation.
Specifically, the label word attention score of the label text is expressed as:
A′(s) = softmax(W4·tanh(W3·H′));
wherein A′(s) is the label word attention score, W3 ∈ R^(da×2k) and W4 ∈ R^(l×da) are self-attention parameters, da is a hyper-parameter, and H′ is the label representation; in this embodiment da = 200.
Specifically, step S22 comprises:
M′(s)j = A′(s)j·H′T;
wherein A′(s)j represents the contribution of all labels to the j-th label, M′(s)j represents the label representation specific to the j-th label under the self-attention mechanism, and the matrix M′(s) ∈ R^(l×2k) is the label-specific label representation under the self-attention mechanism.
In this embodiment, the label attention mechanism relates the various labels in the label data to the textual information in the text data, i.e. it captures the association between text and labels, hence the name. The self-attention mechanism is used twice in the invention: the first time based on the association of text with text (specifically, between the text content and the labels within the text), and the second time based on the association of labels with labels.
The adopted fusion mode has the advantage that, without changing the dimensionality, it accelerates training and reduces the dependency among parameters. The specific formulas are as follows:
a=sigmoid(W5M)
d=sigmoid(W6M`(s))
z=BN[a,d]
wherein W5 ∈ R^(1×l), W6 ∈ R^(1×l), a ∈ R^(1×2k), d ∈ R^(1×2k), z ∈ R^(1×4k), and z is the label-based text representation.
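The fully connected fusion a = sigmoid(W5·M), d = sigmoid(W6·M′(s)), z = BN[a, d] can be sketched as follows. A simple zero-mean, unit-variance normalization stands in for batch normalization here, since true BN statistics are computed over a mini-batch; dimensions are toy assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(5)
l, k = 4, 3
M = rng.standard_normal((l, 2 * k))        # fused document representation, l x 2k
M_label = rng.standard_normal((l, 2 * k))  # label-specific label representation M'(s)
W5 = rng.standard_normal((1, l))
W6 = rng.standard_normal((1, l))

a = sigmoid(W5 @ M)                        # first text, 1 x 2k
d = sigmoid(W6 @ M_label)                  # second text, 1 x 2k
z = np.hstack([a, d])                      # concatenation, 1 x 4k
# stand-in for batch normalization: normalize to zero mean, unit variance
z = (z - z.mean()) / (z.std() + 1e-5)
```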
Once the comprehensive label-specific document representation is available, the matrix can be reshaped through a reshape function into a vector of l rows and 1 column and then output through a final sigmoid function. Mathematically, the prediction probability of each label can be calculated as:
ŷ=sigmoid(W7reshape(zT)+b)
W7∈Rl×4k;
where reshape(·) is the reshape function, b is the bias, W7 is the weight, sigmoid() is the sigmoid function, and zT is the transpose of the label-based text representation. The output value is converted into a probability by the sigmoid function, and the cross-entropy loss can be used as the loss function:
L=-Σ(i=1..N)Σ(j=1..l)[yij log(ŷij)+(1-yij)log(1-ŷij)]
where N is the number of training documents, l is the number of labels, ŷij is the prediction probability, yij∈{0,1} indicates whether the ith document carries the jth label, W5 is the fully connected layer parameter, and W7 is the output layer parameter.
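The output layer and loss described above can be sketched as below. Z, W7, b and the ground-truth matrix Y are random toy stand-ins, not trained values; the point is the per-document sigmoid output in R^{l x 1} and the averaged binary cross-entropy over N documents and l labels.

```python
import numpy as np

l, k, N = 4, 3, 2
rng = np.random.default_rng(2)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

W7 = rng.standard_normal((l, 4 * k))   # output layer weights, W7 in R^{l x 4k}
b = rng.standard_normal((l, 1))        # bias

Z = rng.standard_normal((N, 1, 4 * k))              # one z per document, 1 x 4k
Y = rng.integers(0, 2, size=(N, l)).astype(float)   # toy ground-truth labels

# y_hat = sigmoid(W7 z^T + b): an l x 1 probability vector per document
Y_hat = np.stack([sigmoid(W7 @ z.T + b).ravel() for z in Z])

# cross-entropy loss averaged over N documents and l labels
loss = -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
```

Since the sigmoid output is strictly between 0 and 1, both log terms stay finite and the loss is a positive scalar.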
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (10)
1. The BILSTM multi-label text classification method based on the attention mechanism is characterized by constructing a multi-label text classification model, wherein the multi-label text classification model comprises a bert model, a Word2vec model, a BILSTM module and an attention mechanism module, and the BILSTM multi-label text classification method based on the attention mechanism comprises the following steps:
s1, Word embedding is carried out on text data through a bert model, and Word embedding is carried out on label data through a Word2vec model;
s2, extracting context information from the text data and the label data after word embedding through a BILSTM module to obtain text representation and label representation;
s3, processing the text representation and the label representation by adopting an attention mechanism module to obtain a text representation based on a label;
s4, calculating the loss of the text representation based on the label through a loss function until convergence to obtain a trained multi-label text classification model;
and S5, inputting the real-time data into the trained multi-label text classification model to obtain a label classification prediction result of the real-time data.
2. The method of claim 1, wherein in step S2 the BILSTM module learns the text data and the label data after word embedding to obtain the text representation and label representation, which are expressed as:
H=[H→,H←]
H`=[H`→,H`←]
where H is the text representation, H→ is the forward text representation, H← is the reverse text representation, h→p denotes the forward text representation at time step p, h←p denotes the reverse text representation at time step p, H` is the label representation, H`→ is the forward label representation, H`← is the reverse label representation, h`→p denotes the forward label representation at time step p, h`←p denotes the reverse label representation at time step p, R denotes the dimension range, H belongs to R2k×n, and H` belongs to R2k×l, wherein:
h→p=LSTM(h→p-1,Vt)
h←p=LSTM(h←p-1,Vt)
h`→p=LSTM(h`→p-1,V`t)
h`←p=LSTM(h`←p-1,V`t)
where h→p, h←p, h`→p and h`←p all belong to Rk, k denotes the size of the LSTM hidden layer, Vt is the embedding vector of the t-th word in the text data, h→p-1 denotes the forward text representation at time step p-1, h←p-1 denotes the reverse text representation at time step p-1, V`t is the embedding vector of the t-th word in the label data, h`→p-1 denotes the forward label representation at time step p-1, and h`←p-1 denotes the reverse label representation at time step p-1.
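The bidirectional recurrence of claim 2 can be illustrated with a toy NumPy sketch. A simple tanh update stands in for the full gated LSTM cell to keep the example short; the point shown is the forward and backward passes over the sequence and the concatenation into a 2k x n representation.

```python
import numpy as np

k, n, emb = 3, 5, 4   # hidden size, sequence length, embedding dim (toy values)
rng = np.random.default_rng(3)
V = rng.standard_normal((n, emb))   # word embeddings V_1..V_n
W = rng.standard_normal((k, k))     # random stand-ins for recurrent weights
U = rng.standard_normal((k, emb))

def run(seq):
    """One directional pass; tanh update is a simplified stand-in for LSTM."""
    h = np.zeros(k)
    out = []
    for v in seq:
        h = np.tanh(W @ h + U @ v)  # h_p depends on h_{p-1} and the current word
        out.append(h)
    return np.array(out)

H_fwd = run(V)               # forward states h->_1 .. h->_n
H_bwd = run(V[::-1])[::-1]   # backward states h<-_1 .. h<-_n, realigned
H = np.concatenate([H_fwd, H_bwd], axis=1).T   # H in R^{2k x n}
print(H.shape)  # (6, 5)
```

The same two passes over the embedded label data would produce H` in R^{2k x l}.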
3. The method of claim 1, wherein the step S3 of obtaining the label-based text representation comprises:
s11, sending the text representation into a self-attention mechanism to obtain a label document representation under the self-attention mechanism;
s12, sending the label data and text representation after word embedding into a label attention mechanism to obtain document representation of all labels;
s13, fusing the label document representation under the self-attention mechanism obtained in step S11 with the document representation across all labels obtained in step S12 to obtain a fused document representation;
and S14, sending the label text into a self-attention mechanism for processing, and fusing the processing result with the fusion document representation of S13 to obtain a text representation based on the label.
4. The method according to claim 3, wherein step S11 obtains the linear combination of each tagged contextual word in the text data by using the tag attention score, obtains the tagged document representation of the text representation under the self-attention mechanism according to the linear combination of each tagged contextual word, and the tag attention score and the linear combination of each tagged contextual word are respectively expressed as:
A(s)=softmax(W2tanh(W1H));
wherein A is(s)In order to score the attention of the tag,W1、W2as a self-attention parameter, daFor a hyper-parameter, H is a text representation, tanh () is an activation function,representing the contribution of all words to the jth label,along the jth tagLabel document representation under the self-attention mechanism, HTA transpose matrix for the text representation H.
5. The method of claim 3, wherein the step S12 of obtaining the document representation across all labels comprises:
converting the label data after word embedding into a trainable matrix, constructing a semantic relation between the text representation and the trainable matrix by linearly combining the context words of the labels, and obtaining the document representation across all labels according to that semantic relation, wherein:
Â→=CH→
Â←=CH←
M→(t)=Â→H→T
M←(t)=Â←H←T
where C represents the trainable matrix of the label data after word embedding, C belongs to Rl×k, H→ is the forward text representation, H← is the reverse text representation, Â→ is the forward representation of the linear combination of the labels' context words, Â← is the reverse representation of the linear combination of the labels' context words, M→(t) is the forward representation of the document representation across all labels, H→T is the transpose matrix of the forward text representation, M←(t) is the reverse representation of the document representation across all labels, and H←T is the transpose matrix of the reverse text representation.
6. The method for classifying BILSTM multi-label text based on attention mechanism as claimed in claim 3, wherein the fusing procedure of step S13 includes:
7. The method of claim 3, wherein obtaining the label-based text representation comprises:
s21, capturing the dependency relationship of each label in the label text through a self-attention mechanism to obtain the attention score of the label word of the label text;
s22, acquiring a linear combination of each label according to the attention score of the label words of the label text, and obtaining label representation specific to the label under the self-attention mechanism through the linear combination of each label;
and S23, fusing the label representation specific to the label under the self-attention mechanism with the fused document representation to obtain a text representation based on the label.
8. The method of claim 7, wherein before the fusing in step S23, the fused document representation is processed through a fully connected layer to obtain a first text, the label representation is processed through a fully connected layer to obtain a second text, and the first text and the second text are fused to obtain the label-based text representation, according to the following processing formulas:
a=sigmoid(W5M)
d=sigmoid(W6M`(s))
z=BN[a,d]
wherein a is the first text, d is the second text, M is the fused document representation, M`(s) is the label representation, BN[·] is batch normalization, z is the label-based text representation, and W5, W6 are weights.
9. The attention-mechanism-based BILSTM multi-label text classification method of claim 8, wherein, on the basis of the label-based text representation, the prediction probability ŷ of the classification is calculated through a sigmoid function, expressed as:
ŷ=sigmoid(W7reshape(zT)+b)
where reshape(·) is the reshape function, b is the bias, W7 is the weight, sigmoid() is the sigmoid function, and zT is the transpose of the label-based text representation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210047500.5A CN114398488A (en) | 2022-01-17 | 2022-01-17 | Bilstm multi-label text classification method based on attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114398488A true CN114398488A (en) | 2022-04-26 |
Family
ID=81231064
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210047500.5A Pending CN114398488A (en) | 2022-01-17 | 2022-01-17 | Bilstm multi-label text classification method based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114398488A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109582789A (en) * | 2018-11-12 | 2019-04-05 | 北京大学 | Text multi-tag classification method based on semantic primitive information |
CN110019653A (en) * | 2019-04-08 | 2019-07-16 | 北京航空航天大学 | A kind of the social content characterizing method and system of fusing text and label network |
CN110209823A (en) * | 2019-06-12 | 2019-09-06 | 齐鲁工业大学 | A kind of multi-tag file classification method and system |
CN110442723A (en) * | 2019-08-14 | 2019-11-12 | 山东大学 | A method of multi-tag text classification is used for based on the Co-Attention model that multistep differentiates |
US20200327381A1 (en) * | 2019-04-10 | 2020-10-15 | International Business Machines Corporation | Evaluating text classification anomalies predicted by a text classification model |
CN113626589A (en) * | 2021-06-18 | 2021-11-09 | 电子科技大学 | Multi-label text classification method based on mixed attention mechanism |
Non-Patent Citations (3)
Title |
---|
YANRU DONG et al.: "A Fusion Model-Based Label Embedding and Self-Interaction Attention for Text Classification", IEEE Access, vol. 8, 21 November 2019 (2019-11-21), pages 30548, XP011772629, DOI: 10.1109/ACCESS.2019.2954985 * |
LIU Jie et al.: "Multi-label text classification with a fused attention mechanism", Microelectronics & Computer, 4 January 2024 (2024-01-04), pages 26 - 34 * |
SUN Wei: "Research on multi-label text classification based on attention and graph convolution", China Master's Theses Full-text Database, Information Science and Technology, no. 09, 15 September 2021 (2021-09-15), pages 138 - 714 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115905533A (en) * | 2022-11-24 | 2023-04-04 | Chongqing University of Posts and Telecommunications | Intelligent multi-label text classification method |
CN115905533B (en) * | 2022-11-24 | 2023-09-19 | 湖南光线空间信息科技有限公司 | Multi-label text intelligent classification method |
CN116562251A (en) * | 2023-05-19 | 2023-08-08 | China University of Mining and Technology (Beijing) | Form classification method for stock information disclosure long document |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||