CN111522908A - Multi-label text classification method based on BiGRU and attention mechanism

Multi-label text classification method based on BiGRU and attention mechanism

Info

Publication number
CN111522908A
CN111522908A (application CN202010275820.7A)
Authority
CN
China
Prior art keywords
bigru
attention mechanism
web
attention
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010275820.7A
Other languages
Chinese (zh)
Inventor
施凌鹏
卢士达
顾中坚
李天宇
张黎首
刘逸逸
李姝
黄静韬
吴金龙
沈邵骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Shanghai Electric Power Co Ltd
Original Assignee
State Grid Shanghai Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Shanghai Electric Power Co Ltd filed Critical State Grid Shanghai Electric Power Co Ltd
Priority to CN202010275820.7A
Publication of CN111522908A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Abstract

The invention provides a multi-label text classification method based on BiGRU and attention mechanism, which comprises the following steps: S1, acquiring a plurality of web texts; S2, preprocessing the plurality of web texts; S3, extracting deep information features of the web texts by using pre-trained word vectors; S4, adding corresponding weights to the deep information features according to the attention mechanism; S5, carrying out probability classification on the data obtained in step S4 by using a BiGRU; and S6, outputting the probability of each web text on each type of label. Advantages: the method shortens training time by adopting pre-trained word vectors, and the attention mechanism enables the neural network to focus on the information most important for classification, improving the classification effect.

Description

Multi-label text classification method based on BiGRU and attention mechanism
Technical Field
The invention relates to the field of network text classification, in particular to a multi-label text classification method based on BiGRU and attention mechanism.
Background
Emotion recognition is one of the important topics in natural language processing. Today, people publish opinions online through microblogs, news websites, forums, and the like; such posts have no fixed length, an unlimited vocabulary, no strict grammar rules, and a strong subjective tendency. Among them, negative speech is an important subject that urgently needs attention: if the emotion of speech cannot be recognized correctly, it is impossible to prevent network violence or behaviors that damage the reputation of individuals and even enterprises. Against this background, emotion recognition of grid-related network texts has high research significance.
Text emotion recognition has mainly relied on two approaches: emotion polarity dictionaries and traditional machine learning, involving steps such as constructing emotion resources, sentence segmentation, feature extraction, and quality analysis. The Internet era has spawned many new words, which greatly affects emotion classification models based on emotion polarity dictionaries: existing dictionaries are limited, so such models cannot effectively recognize newly coined or popular words. Manek and Shenoy used traditional machine learning algorithms to analyze the sentiment of comments, comparing the performance of naive Bayes, autoencoders (AE), and support vector machines (SVM) mainly in terms of accuracy and F-value; the results showed that the support vector machine classified best. With the development of deep learning research, deep neural networks have shown excellent performance in natural language processing. Kim used a convolutional neural network (CNN) to solve the emotion recognition problem with good results. Santos used a deep convolutional neural network to analyze the emotion contained in a text, and Irsoy showed that the long short-term memory network (LSTM), as a recurrent neural network model, is an effective method for text emotion recognition. Bahdanau applied an attention model originally developed for machine translation to NLP (natural language processing). Qu and Wang proposed an emotion analysis model based on a hierarchical attention network whose effect greatly improves on the traditional recurrent neural network. Tian Shengwei et al. combined a bidirectional LSTM with the attention mechanism to achieve good recognition of Uyghur temporal events. Zhang Yuhuan et al. combined GRU and LSTM so that a text emotion classification model reaches higher accuracy in a short time. Others have performed emotion analysis on user reviews with a neural network built from an attention mechanism and a BiLSTM: the Bi-LSTM extracts text features from the word vectors, which are then fed into an attention layer to highlight the information important for text classification.
Disclosure of Invention
The invention aims to provide a multi-label text classification method based on BiGRU and attention mechanism: first, a plurality of web texts is acquired; the web texts are preprocessed; deep information features of the web texts are extracted using pre-trained word vectors; corresponding weights are added to the deep information features according to an attention mechanism; the weighted data is classified by probability using a BiGRU; and the probability of each web text on each type of label is output. Adopting pre-trained word vectors and a BiGRU greatly shortens the time required to train the neural network, while the attention mechanism improves classification accuracy.
In order to achieve the above purpose, the invention is realized by the following technical scheme:
A multi-label text classification method based on BiGRU and attention mechanism comprises the following steps:
S1, acquiring a plurality of web texts;
S2, preprocessing the plurality of web texts;
S3, extracting deep information features of the web texts by using pre-trained word vectors;
S4, adding corresponding weights to the deep information features according to the attention mechanism;
S5, carrying out probability classification on the data obtained in step S4 by using a BiGRU;
and S6, outputting the probability of each web text on each type of label.
Preferably, the step S2 is specifically:
the first m characters of each web text are extracted as input neurons, and web texts with fewer than m characters are automatically padded with spaces.
Preferably, the step S3 is specifically:
each input neuron is represented by a multi-dimensional word vector, the similarity between the word vectors is calculated, and two word vectors with higher similarity are combined into one word vector to serve as a deep information feature.
Preferably, the shorter the distance between two word vectors, the higher the similarity of the two word vectors.
Preferably, the step S4 specifically includes:
the data obtained in step S3 is weighted using the attention mechanism, which introduces a query vector, calculates the correlation between the query vector and each input vector (i.e., word vector) by an attention scoring function, and introduces an attention variable representing the selected index position.
Preferably, the labels in step S5 include 6 categories of tags.
Preferably, the tags comprise:
"toxic", "severe_toxic", "obscene", "threat", "insult", and "identity_hate".
Preferably, in the step S6,
the probability of each web text on each type of label is output by a fully connected layer.
Compared with the prior art, the invention has the following advantages:
(1) The multi-label text classification method based on BiGRU and attention mechanism of the invention acquires a plurality of web texts; preprocesses them; extracts deep information features of the web texts using pre-trained word vectors; adds corresponding weights to the deep information features according to an attention mechanism; performs probability classification on the weighted data using a BiGRU; and outputs the probability of each web text on each type of label to obtain the label classification of the web text. Adopting pre-trained word vectors and a BiGRU greatly shortens the time required to train the neural network, while the attention mechanism improves classification accuracy; compared with the baseline model BiLSTM, the method has wide applicability and can be applied and deployed on web text information related to the power grid;
(2) The method further increases training speed and shortens training time by adopting pre-trained word vectors and a BiGRU, and the attention mechanism lets the neural network focus on the information important for improving the classification effect; compared with the prior art, fusing the BiGRU with the attention mechanism allows the method to reach the same high accuracy with less training time.
Drawings
FIG. 1 is a network architecture for multi-label text classification according to an embodiment of the present invention;
fig. 2 is a diagram illustrating a result of the data set label classification in the embodiment of the present invention.
Detailed Description
The present invention will now be described in further detail by way of a preferred embodiment thereof, with reference to the accompanying drawings.
The invention provides a multi-label text classification method based on a bidirectional gated recurrent neural network (BiGRU) and an attention mechanism. Emotion recognition of grid-related network texts is an important component of natural language processing, and it faces the problems that such texts have no fixed grammar or writing format and that emotion information is dispersed throughout the text. The method comprises the following steps:
and S1, acquiring a plurality of web texts.
And S2, preprocessing the plurality of web texts.
In the embodiment, the first m characters of each web text are extracted as the input neurons, and web texts with fewer than m characters are automatically padded with spaces.
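As an illustration, a minimal Python sketch of this truncation-and-padding step follows; character-level handling and the value m = 200 (used later in the embodiment) are assumptions made for the example.

```python
# Minimal sketch of step S2, assuming character-level inputs and m = 200.
def preprocess(texts, m=200):
    """Keep the first m characters of each web text; pad shorter
    texts with spaces so every input has exactly m characters."""
    return [t[:m].ljust(m) for t in texts]

comments = ["This is a short comment.", "x" * 500]
padded = preprocess(comments)
assert all(len(t) == 200 for t in padded)
```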
And S3, extracting deep information features of the web text by using pre-trained word vectors.
The step S3 specifically includes: each input neuron is represented by a multi-dimensional word vector, the similarity between word vectors is calculated, and the two most similar word vectors are combined into one word vector that serves as a deep information feature. The shorter the distance between two word vectors, the higher their similarity. In this way the network text is represented better, and its subsequent training time is shortened.
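A hedged sketch of this similarity-based merging is given below; the patent states only that shorter distance means higher similarity, so Euclidean distance and averaging the closest pair of vectors are assumptions made for illustration.

```python
# Sketch of step S3's merging rule (assumed: Euclidean distance,
# merge-by-averaging); 100-dimensional vectors mirror the embodiment.
import numpy as np

def merge_most_similar(vectors):
    """Find the pair of word vectors with the shortest distance
    (highest similarity) and replace them with their mean."""
    best, best_d = (0, 1), np.inf
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            d = np.linalg.norm(vectors[i] - vectors[j])
            if d < best_d:
                best_d, best = d, (i, j)
    i, j = best
    merged = (vectors[i] + vectors[j]) / 2.0  # combined deep-information feature
    return [v for k, v in enumerate(vectors) if k not in (i, j)] + [merged]

vecs = [np.random.randn(100) for _ in range(5)]
print(len(merge_most_similar(vecs)))  # 4
```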
And S4, adding corresponding weights to the deep information features according to the attention mechanism.
The step S4 specifically includes: the data obtained in step S3 is weighted using the attention mechanism, which introduces a query vector, calculates the correlation between the query vector and each input vector (i.e., word vector) by an attention scoring function, and introduces an attention variable representing the selected index position.
The attention mechanism was first introduced in computer vision, inspired by human visual processing: the human brain does not process all visual information but focuses on specific parts. The mechanism has since been applied in many fields, including image caption generation, text classification, speech recognition, and machine translation.
In a neural network, the attention mechanism can be regarded as a resource allocation scheme: more attention, or computing resources, is allocated to important information, which helps solve the information overload problem. In practice, attention mechanisms fall into two categories. One is top-down focused attention, which is usually conscious, task-related, and actively directed at a certain object. The other is bottom-up unintentional attention, which is unrelated to the task, driven primarily by the outside world, and also known as saliency-based attention. For example, in convolutional neural networks (CNNs), pooling and gating mechanisms can be considered saliency-based attention mechanisms.
In the present embodiment, as described in step S3, the input data is represented using word vectors. Let $[x_1, \ldots, x_N]$ denote the task-related input vectors (i.e., the input neurons), where $N$ is the number of input vectors. To give specific data more weight, the attention mechanism introduces a query vector $q$, calculates the correlation between the query vector and each input vector via an attention scoring function, and introduces an attention variable $t \in [1, N]$ representing the selected index position. The specific calculation method is as follows:
$$\alpha_i = p(t = i \mid X, q) = \operatorname{softmax}\big(s(x_i, q)\big) = \frac{\exp\big(s(x_i, q)\big)}{\sum_{j=1}^{N}\exp\big(s(x_j, q)\big)}$$
where $\alpha_i$ is the attention distribution, $s(x_i, q)$ is the attention scoring function, softmax is the normalized exponential function, and $p$ is the probability of selecting the $i$-th input vector. The attention scoring function can be defined in various ways; this embodiment adopts a self-attention model based on the scaled dot product, which is defined as follows:
$$s(x_i, q) = \frac{x_i^{\top} q}{\sqrt{d}}$$
where $d$ represents the dimension of the input vectors. The scaled dot-product model improves on the plain dot-product model by dividing by $\sqrt{d}$: when $d$ is large, the dot products have high variance, which drives softmax into regions of small gradient; the scaling resolves this problem.
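The two formulas above can be exercised with a short NumPy sketch; treating every input vector in turn as the query $q$ (self-attention over the whole sequence) and the batch-free shapes are simplifying assumptions, not the patent's exact implementation.

```python
# Scaled dot-product self-attention: alpha = softmax(X q / sqrt(d)),
# then an attention-weighted sum of the inputs.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    """X: (N, d) word vectors. Each row serves as a query against
    all rows, with scores scaled by sqrt(d) as in the formula above."""
    N, d = X.shape
    scores = X @ X.T / np.sqrt(d)  # s(x_i, q) for every pair
    alpha = softmax(scores)        # attention distribution per query
    return alpha @ X               # weighted deep-information features

X = np.random.randn(200, 100)      # 200 tokens, 100-d embeddings
print(self_attention(X).shape)     # (200, 100)
```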
And S5, carrying out probability classification on the data obtained in step S4 by using the BiGRU. In the present embodiment, the tags include 6 types, namely "toxic", "severe_toxic", "obscene", "threat", "insult", and "identity_hate".
The BiGRU (bidirectional gated recurrent neural network) is an improvement on the BiLSTM (bidirectional long short-term memory network): replacing the LSTM modules with GRUs greatly increases training speed while preserving classification accuracy. The GRU merges the hidden state and the cell state of the LSTM into a single state, which significantly shortens training time on large text corpora. More specifically, after the GRU reads the word embedding vector and the hidden-layer state vector, it produces the output vector and a new hidden-layer state vector through gated computations.
The BiLSTM is an extension of the ordinary RNN (recurrent neural network). An RNN differs from an ordinary neural network in that each neuron receives not only the input at the current moment but also the output of the previous neuron, which lets the model take the preceding text into account. In practice, considering only the preceding text is not enough; the following text must be considered as well. To solve this, the bidirectional RNN (BiRNN) was born: it adds a reverse pass to the ordinary RNN, i.e., the input sequence is reversed and processed once more, and the final result stacks the forward and reverse RNN outputs. In theory a BiRNN can use context information, but in practice it struggles with long-term dependencies. A simple example: when generating an English sentence, if the sentence is long, the network may fail to remember whether the subject is singular or plural when choosing the predicate verb. To address this, the LSTM introduces gating mechanisms, including a forget gate, an input gate, and an output gate; the forget gate controls what proportion of the previous moment's information passes through.
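To make the gating concrete, here is a compact sketch of a single GRU step, showing how the update and reset gates replace the LSTM's separate hidden and cell states with one state h; the random weight initialization is purely illustrative, since a trained layer learns these parameters.

```python
# One GRU time step (illustrative weights; biases omitted for brevity).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(x @ Wz + h @ Uz)              # update gate
    r = sigmoid(x @ Wr + h @ Ur)              # reset gate
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1 - z) * h + z * h_cand           # single merged state

d_in, d_h = 100, 256
rng = np.random.default_rng(0)
params = [rng.standard_normal(s) * 0.1 for s in [(d_in, d_h), (d_h, d_h)] * 3]
h = gru_step(rng.standard_normal(d_in), np.zeros(d_h), *params)
print(h.shape)  # (256,)
```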
And S6, outputting the probability of each web text on different types of labels.
In this embodiment, the probability of each web text on each type of label is output by a fully connected layer.
The multi-label text classification method based on BiGRU and attention mechanism was tested on a data set from Kaggle, the well-known data-science competition platform. The data set consists of comments on Wikipedia, each comment being one piece of web text. In the present embodiment, the tags include 6 types, namely "toxic", "severe_toxic", "obscene", "threat", "insult", and "identity_hate". Each comment may carry several tags or none, and the method of the invention gives the probability of each comment on the 6 types of tags, thereby realizing multi-label text classification.
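As a small illustration of the multi-label setup, each comment's ground truth can be encoded as a 6-dimensional 0/1 vector and compared with the per-label probabilities the model outputs; the comment, probabilities, and 0.5 threshold below are invented for the example.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical comment tagged both "toxic" and "insult".
y_true = [1, 0, 0, 0, 1, 0]
y_pred = [0.93, 0.12, 0.40, 0.02, 0.81, 0.05]  # sigmoid outputs per label

predicted = [lab for lab, p in zip(LABELS, y_pred) if p >= 0.5]
print(predicted)  # ['toxic', 'insult']
```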
As shown in fig. 1, in this embodiment, multi-label text classification can be performed with the network architecture of fig. 1, which comprises an "Input layer", an "Embedding layer", a "Self-attention layer", a "BiGRU", and an "Output layer".
In this embodiment, the "Input layer" receives the input neurons; the "Embedding layer" applies pre-trained vectors to shorten training time and better represent the web text; the "Self-attention layer" assigns weights to specific word vectors to improve classification accuracy; and the "Output layer" replaces the usual softmax with a fully connected layer in which each neuron outputs a value in the range [0, 1] representing the probability of a specific class. The "Input layer" takes 200 neurons, i.e., the first 200 characters of each comment, padded with spaces when a comment is shorter than 200 characters. The "Embedding layer", "Self-attention layer", and "BiGRU" process 100, 128, and 256 neurons respectively, and the "Output layer" is a fully connected layer of 6 neurons, each computing the probability of the comment on one of the 6 types of tags.
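For concreteness, the following is a hedged end-to-end sketch of the FIG. 1 architecture in Keras. The patent names the layers and their sizes but no framework, so the use of tf.keras, the built-in Attention layer standing in for the scaled dot-product self-attention, and a Bidirectional GRU of 128 units per direction (2 × 128 = 256) are all assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size = 400_000  # GloVe 6B vocabulary size
seq_len = 200         # first 200 characters/tokens per comment

inputs = layers.Input(shape=(seq_len,))
x = layers.Embedding(vocab_size, 100)(inputs)   # would load GloVe weights here
x = layers.Attention(use_scale=True)([x, x])    # self-attention over the sequence
x = layers.Bidirectional(layers.GRU(128))(x)    # 256-dimensional BiGRU output
outputs = layers.Dense(6, activation="sigmoid")(x)  # one probability per tag

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```

Sigmoid outputs with a binary cross-entropy loss match the multi-label setting, where each of the 6 tags is predicted independently of the others.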
In order to fully learn the features of the web text, pre-trained word vectors are used in the "Embedding layer". This embodiment uses GloVe word vectors, which are built from co-occurrence-matrix decomposition; each neuron is represented by a 100-dimensional vector, and the shorter the distance between two vectors, the higher the similarity of the two neurons. The GloVe vector set, provided by the Stanford University research team, was trained on a corpus of 6 billion tokens with a vocabulary of 400K words.
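Below is a sketch of building the embedding matrix from the Stanford GloVe vectors; the file name glove.6B.100d.txt and the word_index mapping produced by a tokenizer are illustrative assumptions.

```python
import numpy as np

def load_glove_matrix(path, word_index, dim=100):
    """Map each word in the tokenizer's index to its 100-d GloVe
    vector; out-of-vocabulary words keep a zero vector."""
    matrix = np.zeros((len(word_index) + 1, dim))
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], np.asarray(parts[1:], dtype="float32")
            if word in word_index:
                matrix[word_index[word]] = vec
    return matrix

# embedding_matrix = load_glove_matrix("glove.6B.100d.txt", word_index)
# The matrix would then be passed to the Embedding layer via weights=[...].
```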
In addition, the "Self-attention layer" in this embodiment introduces the query, key, and value (Q, K, V) vector sequences and uses a scaled dot-product self-attention model as the attention scoring function, so that different connection weights can be generated "dynamically"; the method can also handle variable-length information sequences.
As shown in FIG. 2, which gives the label classification results for the Kaggle data set, web texts of the "toxic" category are the most numerous in the data set. Applying the multi-label text classification method to the Kaggle data set and verifying the resulting data shows that the classification accuracy of the BiGRU-and-attention-based method exceeds 98%.
In summary, the invention provides a multi-label text classification method based on BiGRU and attention mechanism: a plurality of web texts is acquired and preprocessed; deep information features of the web texts are extracted using pre-trained word vectors; corresponding weights are added to the deep information features according to an attention mechanism; the weighted data is classified by probability using a BiGRU; and the probability of each web text on each type of label is output. Pre-trained word vectors and the BiGRU greatly shorten the time required to train the neural network, while the attention mechanism improves classification accuracy to a certain extent compared with the baseline model BiLSTM. The test of the fused BiGRU-and-attention model on the data set also shows the wide applicability of the multi-label text classification method, which can be applied and deployed on network text information related to the power grid.
While the present invention has been described in detail with reference to the preferred embodiments thereof, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (8)

1. A multi-label text classification method based on BiGRU and attention mechanism, characterized by comprising the following steps:
S1, acquiring a plurality of web texts;
S2, preprocessing the plurality of web texts;
S3, extracting deep information features of the web texts by using pre-trained word vectors;
S4, adding corresponding weights to the deep information features according to the attention mechanism;
S5, carrying out probability classification on the data obtained in step S4 by using a BiGRU;
and S6, outputting the probability of each web text on each type of label.
2. The method for multi-label text classification based on BiGRU and attention mechanism according to claim 1, wherein the step S2 is specifically:
the first m characters of each web text are extracted as input neurons, and web texts with fewer than m characters are automatically padded with spaces.
3. The method for multi-label text classification based on BiGRU and attention mechanism according to claim 2, wherein the step S3 is specifically:
each input neuron is represented by a multi-dimensional word vector, the similarity between the word vectors is calculated, and two word vectors with higher similarity are combined into one word vector to serve as a deep information feature.
4. The BiGRU and attention mechanism based multi-label text classification method of claim 3,
the shorter the distance between two word vectors, the higher the similarity of the two word vectors.
5. The method for multi-label text classification based on BiGRU and attention mechanism according to claim 3 or 4, wherein the step S4 specifically includes:
the data obtained in step S3 is weighted using the attention mechanism, which introduces a query vector, calculates the correlation between the query vector and each input vector (i.e., word vector) by an attention scoring function, and introduces an attention variable representing the selected index position.
6. The BiGRU and attention mechanism based multi-label text classification method of claim 1, wherein
the labels in step S5 include 6 categories of tags.
7. The BiGRU and attention mechanism-based multi-label text classification method of claim 6, wherein the tags comprise:
"toxic", "severe_toxic", "obscene", "threat", "insult", and "identity_hate".
8. The method for multi-label text classification based on BiGRU and attention mechanism according to claim 1, wherein in step S6,
the probability of each web text on each type of label is output by a fully connected layer.
CN202010275820.7A 2020-04-09 2020-04-09 Multi-label text classification method based on BiGRU and attention mechanism Pending CN111522908A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010275820.7A CN111522908A (en) 2020-04-09 2020-04-09 Multi-label text classification method based on BiGRU and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010275820.7A CN111522908A (en) 2020-04-09 2020-04-09 Multi-label text classification method based on BiGRU and attention mechanism

Publications (1)

Publication Number Publication Date
CN111522908A 2020-08-11

Family

ID=71902112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010275820.7A Pending CN111522908A (en) 2020-04-09 2020-04-09 Multi-label text classification method based on BiGRU and attention mechanism

Country Status (1)

Country Link
CN (1) CN111522908A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558487A (en) * 2018-11-06 2019-04-02 华南师范大学 Document classification method based on hierarchical multi-attention networks
CN110209823A (en) * 2019-06-12 2019-09-06 齐鲁工业大学 Multi-label text classification method and system
CN110502749A (en) * 2019-08-02 2019-11-26 中国电子科技集团公司第二十八研究所 Text relation extraction method based on a double-layer attention mechanism and bidirectional GRU

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
饶竹一 et al.: "Multi-label text classification model based on BiGRU and attention mechanism" *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112214599A (en) * 2020-10-20 2021-01-12 电子科技大学 Multi-label text classification method based on statistics and pre-training language model
CN112214599B (en) * 2020-10-20 2022-06-24 电子科技大学 Multi-label text classification method based on statistics and pre-training language model
CN112100384A (en) * 2020-11-10 2020-12-18 北京智慧星光信息技术有限公司 Data viewpoint extraction method, device, equipment and storage medium
CN112100384B (en) * 2020-11-10 2021-02-02 北京智慧星光信息技术有限公司 Data viewpoint extraction method, device, equipment and storage medium
CN112732871A (en) * 2021-01-12 2021-04-30 上海畅圣计算机科技有限公司 Multi-label classification method for acquiring client intention label by robot
CN112905796A (en) * 2021-03-16 2021-06-04 山东亿云信息技术有限公司 Text emotion classification method and system based on re-attention mechanism
CN112905796B (en) * 2021-03-16 2023-04-18 山东亿云信息技术有限公司 Text emotion classification method and system based on re-attention mechanism
CN113268592A (en) * 2021-05-06 2021-08-17 天津科技大学 Short text object emotion classification method based on multi-level interactive attention mechanism
CN113268592B (en) * 2021-05-06 2022-08-05 天津科技大学 Short text object emotion classification method based on multi-level interactive attention mechanism
CN113377953A (en) * 2021-05-31 2021-09-10 电子科技大学 Entity fusion and classification method based on PALC-DCA model
CN113837464A (en) * 2021-09-22 2021-12-24 浙大城市学院 Load prediction method of cogeneration boiler based on CNN-LSTM-Attention
WO2023124110A1 (en) * 2021-12-30 2023-07-06 深圳市检验检疫科学研究院 Label perception-based gated recurrent acquisition method

Similar Documents

Publication Publication Date Title
Wang et al. An LSTM approach to short text sentiment classification with word embeddings
CN110609897B (en) Multi-category Chinese text classification method integrating global and local features
CN111522908A (en) Multi-label text classification method based on BiGRU and attention mechanism
CN107943784B Relation extraction method based on generative adversarial networks
Guo et al. CRAN: a hybrid CNN-RNN attention-based model for text classification
CN108984530A (en) A kind of detection method and detection system of network sensitive content
CN111401061A Method for identifying news opinion involved in case based on BERT and BiLSTM-Attention
Shih et al. Investigating siamese lstm networks for text categorization
Wahid et al. Cricket sentiment analysis from Bangla text using recurrent neural network with long short term memory model
Ahmad et al. A hybrid deep learning technique for personality trait classification from text
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN111639186B (en) Multi-category multi-label text classification model and device with dynamic embedded projection gating
CN111078833A (en) Text classification method based on neural network
CN113254655B (en) Text classification method, electronic device and computer storage medium
Liu et al. A multi-label text classification model based on ELMo and attention
CN110297986A (en) A kind of Sentiment orientation analysis method of hot microblog topic
CN111985243A (en) Emotion model training method, emotion analysis device and storage medium
CN111813939A (en) Text classification method based on representation enhancement and fusion
Naqvi et al. Roman Urdu news headline classification empowered with machine learning
CN112199503B (en) Feature-enhanced unbalanced Bi-LSTM-based Chinese text classification method
CN111368524A (en) Microblog viewpoint sentence recognition method based on self-attention bidirectional GRU and SVM
CN113408282B (en) Method, device, equipment and storage medium for topic model training and topic prediction
Chen et al. Learning the chinese sentence representation with LSTM autoencoder
Wang et al. Generalised zero-shot learning for entailment-based text classification with external knowledge
CN115269833A (en) Event information extraction method and system based on deep semantics and multitask learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination