CN111078833B - Text classification method based on neural network - Google Patents

Text classification method based on neural network

Info

Publication number
CN111078833B
CN111078833B
Authority
CN
China
Prior art keywords
word
text
information
level
phrase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911223541.XA
Other languages
Chinese (zh)
Other versions
CN111078833A (en)
Inventor
黄少滨
吴汉瑜
李熔盛
申林山
姜梦奇
范贺添
谷虹润
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN201911223541.XA priority Critical patent/CN111078833B/en
Publication of CN111078833A publication Critical patent/CN111078833A/en
Application granted granted Critical
Publication of CN111078833B publication Critical patent/CN111078833B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G06F16/35 - Clustering; Classification
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of text classification, and particularly relates to a text classification method based on a neural network. The invention can extract semantic information and structural information at different levels of a text, including word-level semantic information, word-level structural information, phrase-level semantic information and phrase-level structural information. To obtain the final representation of the text, the invention further provides two methods for fusing the four kinds of information: static fusion and attention-based dynamic fusion. Based on a neural network, the invention comprehensively utilizes the semantic information and structural information at different levels of the text and improves the accuracy of text classification.

Description

Text classification method based on neural network
Technical Field
The invention belongs to the technical field of text classification, and particularly relates to a text classification method based on a neural network.
Background
Text classification is an important component of many natural language processing tasks; it can be applied to sentiment classification, question classification and web page retrieval, and text representation plays an important role in it. Early text classification techniques were mostly based on traditional machine learning algorithms such as naive Bayes and support vector machines. These methods usually require domain experts to manually design and extract features from the text, which is time-consuming and labor-intensive. In recent years, neural network models based on deep learning have demonstrated strong performance in many natural language processing tasks, such as machine translation, sentiment analysis and text classification. Most neural network models are based on CNNs, RNNs or attention mechanisms.
A convolutional neural network (CNN) can be used to model text: n-gram information of the text can be extracted through sliding windows, and the most discriminative words or phrases in the text can be selected through max pooling. However, choosing the window size is an important problem: structural information is lost when the window is too small, while too large a window introduces too many parameters and makes training difficult.
Recursive neural networks (RecursiveNN) model text with tree structures, can effectively capture the structural information of the text, and have proven effective for constructing text representations. However, the performance of a recursive neural network depends to a large extent on the quality of the constructed text tree, constructing the text tree is very time-consuming, and the relationships between sentences in a text are difficult to model with a tree structure, so it also cannot make good use of semantic information and structural information.
Unlike recursive neural networks, recurrent neural networks (RecurrentNN) are sequential models that naturally fit the modeling of text and can capture its structural information, but they are biased models in which later words are more dominant than earlier words in the text.
Attention mechanisms have been applied to many natural language processing tasks with great success and have proven effective in capturing text semantics. With a small number of parameters, an attention mechanism can learn the proportion that each part of the text contributes to its overall semantic information, assigning higher weights to important words or phrases; however, word order information is ignored, so the structural information of the text cannot be well utilized.
In recent years, neural network models based on deep learning have demonstrated strong performance in many natural language processing tasks, such as machine translation, sentiment analysis and text classification. Most neural network models are based on convolutional neural networks (CNN), recurrent neural networks (RNN) or attention mechanisms.
CNN-based models
Convolutional neural networks (CNNs) were introduced from the computer vision field into the natural language processing field with great success. Kim proposed extracting text features with several convolution kernels of different sizes for sentence classification, and Kalchbrenner et al. combined a dynamic k-max pooling mechanism with a CNN to achieve good results in sentence modeling. Zhang et al. proposed a character-level convolutional neural network model for text classification. Because shallow CNNs do not handle long-range dependencies in sentences well, some deep CNN models have been proposed, such as the Very Deep CNN (VDCNN) by Conneau et al. and the deep pyramid CNN by Johnson et al.
RNN-based models
Recurrent neural networks (RecurrentNN) are sequence models widely used in the field of natural language processing. Tang et al. used gated recurrent neural networks for sentiment classification. Some researchers have attempted to modify the structure of the RNN: Wang proposed using a Disconnected RNN for text classification, and Yu et al. similarly proposed modeling sentences with a Sliced RNN and achieved good results.
Model based on attention mechanism
Bahdanau et al. first applied the attention mechanism to machine translation. Yang et al. used a hierarchical attention network with bidirectional GRUs to model and classify documents. Vaswani et al. proposed the Transformer, a model based entirely on the self-attention mechanism, with significant success in machine translation. Lin et al. proposed a structured self-attentive sentence embedding.
Text classification is the basis of many natural language processing tasks, and text representation is the key to text classification. A text representation can be understood as a high-level feature of the text, and its quality directly influences the performance of text classification. Traditional text representation methods cannot represent the text well; the bag-of-words model, for example, represents each word as a high-dimensional sparse vector and ignores both the order of the words in the text and their semantic information. In recent years, with the development of deep learning, most good text classification models have been based on neural networks: they represent the text as a low-dimensional real-valued vector and then feed the vector into a softmax function to predict the probability of each category, but these models cannot make good use of the semantic information and structural information of the text.
Disclosure of Invention
Aiming at the problem that traditional neural network models cannot effectively utilize the semantic information and structural information of a text, the invention aims to provide a text classification method based on a neural network.
The purpose of the invention is achieved by the following technical solution, which comprises the following steps:
Step 1: input a text to be classified and preprocess it to obtain the word vector x_i corresponding to each word in the text;
Step 2: apply the attention mechanism directly to the word vectors x_i to obtain word-level semantic information I_wse, and apply a bidirectional LSTM network directly to the word vectors x_i to obtain word-level structural information I_wst;
Step 3: apply a convolutional neural network to the word vectors x_i to obtain phrase information D;
Step 4: apply the attention mechanism to the phrase information D to obtain phrase-level semantic information I_pse, and apply a bidirectional LSTM network to the phrase information D to obtain phrase-level structural information I_pst;
Step 5: fuse the word-level semantic information I_wse, the word-level structural information I_wst, the phrase-level semantic information I_pse and the phrase-level structural information I_pst to obtain the final text vector representation I_T;
Step 6: input the final text vector representation I_T into a softmax classifier to obtain the probability corresponding to each class, and take the class with the highest probability as the class to which the text belongs:
p = softmax(W_c · I_T + b_c)
where W_c is the weight of the softmax classifier and b_c is the corresponding bias.
The present invention may further comprise: the preprocessing of the text in step 1 specifically comprises the following steps:
Step 1.1: detect the length of the input text; if the length of the input text is greater than the specified length, truncate the text; if the length of the input text is less than the specified length, pad the text;
Step 1.2: segment the text into words, index the words by word frequency, and convert the text into the corresponding index sequence;
Step 1.3: convert each index in the index sequence into the word vector of its corresponding word, completing the preprocessing of the text.
Obtaining the word-level semantic information I_wse in step 2 specifically comprises: let the input sentence be w_1, w_2, w_3, ..., w_s, and let the corresponding word vectors be x_1, x_2, x_3, ..., x_s; since each word in a sentence contributes differently to the overall semantic information of the sentence, the attention mechanism acts directly on the word vectors to learn the proportion α_i that each word contributes to the word-level semantic information; the word vector x_i of each word is multiplied by its contribution proportion α_i and the products are accumulated to obtain the word-level semantic information I_wse:
I_wse = Σ_{i=1..s} α_i · x_i
where x_i ∈ R^d is the word vector of word w_i and d is the dimension of the vector;
α_i = exp(u_i^T · u_w) / Σ_{j=1..s} exp(u_j^T · u_w)
u_i = tanh(W_w · x_i + b_w)
where tanh is an activation function, u_i^T is the transpose of u_i, and W_w, b_w, u_w are parameters of the attention mechanism.
Obtaining the word-level structural information I_wst in step 2 specifically comprises: the word-level structural information I_wst is formed by concatenating the final state hf_s of the forward LSTM and the final state hb_1 of the backward LSTM:
hf_i = LSTM_fw(x_i), i = 1, ..., s
hb_i = LSTM_bw(x_i), i = s, ..., 1
I_wst = [hf_s ; hb_1]
Fusing the word-level semantic information I_wse, the word-level structural information I_wst, the phrase-level semantic information I_pse and the phrase-level structural information I_pst in step 5 to obtain the final text vector representation I_T specifically comprises: static fusion is adopted, i.e. the text representation is the average of the word-level semantic information, the word-level structural information, the phrase-level semantic information and the phrase-level structural information:
I_T = (I_wse + I_wst + I_pse + I_pst) / 4.
Fusing the word-level semantic information I_wse, the word-level structural information I_wst, the phrase-level semantic information I_pse and the phrase-level structural information I_pst in step 5 to obtain the final text vector representation I_T specifically comprises: dynamic fusion based on the attention mechanism is adopted, i.e. the attention mechanism is applied to the four kinds of information to automatically learn the contribution proportion γ_i of each part of information to the final text vector representation I_T; here I_wse, I_wst, I_pse, I_pst are denoted I_1, I_2, I_3, I_4 respectively:
I_T = Σ_{i=1..4} γ_i · I_i
γ_i = exp(u_i^T · u_t) / Σ_{j=1..4} exp(u_j^T · u_t)
u_i = tanh(W_t · I_i + b_t)
where tanh is an activation function, u_i^T is the transpose of u_i, and W_t, b_t, u_t are parameters of the attention mechanism.
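For illustration only, the following NumPy sketch shows how the static fusion and the attention-based dynamic fusion defined above could be computed for one text; the four information vectors and the attention parameters W_t, b_t, u_t are filled with random stand-in values rather than learned ones.

    import numpy as np

    rng = np.random.default_rng(0)
    I = rng.standard_normal((4, 300))              # I_1..I_4: the four 300-dimensional information vectors (toy values)

    # Static fusion: plain average of the four vectors.
    I_T_static = I.mean(axis=0)                    # shape (300,)

    # Dynamic fusion: u_i = tanh(W_t I_i + b_t), gamma_i = softmax(u_i^T u_t), I_T = sum_i gamma_i I_i.
    W_t = 0.01 * rng.standard_normal((300, 300))   # attention parameters (learned in practice)
    b_t = np.zeros(300)
    u_t = 0.01 * rng.standard_normal(300)

    u = np.tanh(I @ W_t + b_t)                     # (4, 300)
    scores = u @ u_t                               # (4,)
    gamma = np.exp(scores) / np.exp(scores).sum()  # contribution proportions, sum to 1
    I_T_dynamic = gamma @ I                        # weighted sum, shape (300,)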
The invention has the beneficial effects that:
the invention provides a text classification method based on a neural network, which aims to solve the problem that the traditional text classification method cannot simultaneously and effectively utilize semantic information and structural information of a text. In order to obtain the final representation of the text, the invention further provides two fusion methods to fuse four kinds of information, namely static fusion and dynamic fusion based on an attention mechanism. The invention is based on the neural network, comprehensively utilizes the semantic information and the structural information of different levels of the text, and improves the accuracy of text classification.
Drawings
Fig. 1 is an overall architecture diagram of the present invention.
FIG. 2 is a schematic diagram of the static fusion of the present invention.
FIG. 3 is a schematic diagram of dynamic fusion according to the present invention.
FIG. 4 is a visualization of the experimental results of obtaining the word-level semantic information I_wse with the attention mechanism.
FIG. 5 is a visualization of the experimental results of obtaining the phrase-level semantic information I_pse with the attention mechanism.
Fig. 6 is an overall flow chart of the present invention.
FIG. 7 is a table of experimental data in an example of the present invention.
FIG. 8 is a sample analysis table according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Text classification is the basis of many natural language processing tasks, and text representation is the key to text classification. A text representation can be understood as a high-level feature of the text, and its quality directly influences the performance of text classification. Traditional text representation methods cannot represent the text well; the bag-of-words model, for example, represents each word as a high-dimensional sparse vector and ignores both the order of the words in the text and their semantic information. In recent years, with the development of deep learning, most good text classification models have been based on neural networks: they represent the text as a low-dimensional real-valued vector and then feed the vector into a softmax function to predict the probability of each category, but these models cannot make good use of the semantic information and structural information of the text. The model provided by the invention is also based on a neural network, but it can comprehensively utilize semantic information and structural information at different levels of the text and thereby improve the accuracy of text classification.
Aiming at the problem that traditional neural network models cannot effectively utilize the semantic information and structural information of a text, the invention designs a novel text classification model based on a neural network. The model can extract semantic information and structural information at different levels of the text, including word-level semantic information, word-level structural information, phrase-level semantic information and phrase-level structural information; the four parts of information are then fused with the fusion methods provided by the invention to form the representation of the text, and the representation of the text is finally input into a softmax function for classification.
A text classification method based on a neural network comprises the following steps:
Step 1: input a text to be classified and preprocess it to obtain the word vector x_i corresponding to each word in the text;
Step 2: apply the attention mechanism directly to the word vectors x_i to obtain word-level semantic information I_wse, and apply a bidirectional LSTM network directly to the word vectors x_i to obtain word-level structural information I_wst;
Step 3: apply a convolutional neural network to the word vectors x_i to obtain phrase information D;
Step 4: apply the attention mechanism to the phrase information D to obtain phrase-level semantic information I_pse, and apply a bidirectional LSTM network to the phrase information D to obtain phrase-level structural information I_pst;
Step 5: fuse the word-level semantic information I_wse, the word-level structural information I_wst, the phrase-level semantic information I_pse and the phrase-level structural information I_pst to obtain the final text vector representation I_T;
Step 6: input the final text vector representation I_T into a softmax classifier to obtain the probability corresponding to each class, and take the class with the highest probability as the class to which the text belongs:
p = softmax(W_c · I_T + b_c)
where W_c is the weight of the softmax classifier and b_c is the corresponding bias.
The preprocessing of the text in step 1 specifically comprises the following steps:
Step 1.1: detect the length of the input text; if the length of the input text is greater than the specified length, truncate the text; if the length of the input text is less than the specified length, pad the text;
Step 1.2: segment the text into words, index the words by word frequency, and convert the text into the corresponding index sequence;
Step 1.3: convert each index in the index sequence into the word vector of its corresponding word, completing the preprocessing of the text.
Obtaining the word-level semantic information I_wse in step 2 specifically comprises: let the input sentence be w_1, w_2, w_3, ..., w_s, and let the corresponding word vectors be x_1, x_2, x_3, ..., x_s; since each word in a sentence contributes differently to the overall semantic information of the sentence, the attention mechanism acts directly on the word vectors to learn the proportion α_i that each word contributes to the word-level semantic information; the word vector x_i of each word is multiplied by its contribution proportion α_i and the products are accumulated to obtain the word-level semantic information I_wse:
I_wse = Σ_{i=1..s} α_i · x_i
where x_i ∈ R^d is the word vector of word w_i and d is the dimension of the vector;
α_i = exp(u_i^T · u_w) / Σ_{j=1..s} exp(u_j^T · u_w)
u_i = tanh(W_w · x_i + b_w)
where tanh is an activation function, u_i^T is the transpose of u_i, and W_w, b_w, u_w are parameters of the attention mechanism.
Obtaining the word-level structural information I_wst in step 2 specifically comprises: the word-level structural information I_wst is formed by concatenating the final state hf_s of the forward LSTM and the final state hb_1 of the backward LSTM:
hf_i = LSTM_fw(x_i), i = 1, ..., s
hb_i = LSTM_bw(x_i), i = s, ..., 1
I_wst = [hf_s ; hb_1]
Fusing the word-level semantic information I_wse, the word-level structural information I_wst, the phrase-level semantic information I_pse and the phrase-level structural information I_pst in step 5 to obtain the final text vector representation I_T specifically comprises: static fusion is adopted, i.e. the text representation is the average of the word-level semantic information, the word-level structural information, the phrase-level semantic information and the phrase-level structural information:
I_T = (I_wse + I_wst + I_pse + I_pst) / 4.
Fusing the word-level semantic information I_wse, the word-level structural information I_wst, the phrase-level semantic information I_pse and the phrase-level structural information I_pst in step 5 to obtain the final text vector representation I_T specifically comprises: dynamic fusion based on the attention mechanism is adopted, i.e. the attention mechanism is applied to the four kinds of information to automatically learn the contribution proportion γ_i of each part of information to the final text vector representation I_T; here I_wse, I_wst, I_pse, I_pst are denoted I_1, I_2, I_3, I_4 respectively:
I_T = Σ_{i=1..4} γ_i · I_i
γ_i = exp(u_i^T · u_t) / Σ_{j=1..4} exp(u_j^T · u_t)
u_i = tanh(W_t · I_i + b_t)
where tanh is an activation function, u_i^T is the transpose of u_i, and W_t, b_t, u_t are parameters of the attention mechanism.
The invention can be summarized as follows:
1) Preprocess the text corpus and acquire word-level semantic information and word-level structural information.
2) Acquire phrase-level semantic information and phrase-level structural information.
3) Fuse the word-level semantic information, word-level structural information, phrase-level semantic information and phrase-level structural information to obtain the vector representation of the final text for text classification.
For the acquisition of word-level semantic information, the attention mechanism acts directly on the input word vectors to obtain the proportion that each word contributes to the word-level semantic information, and the contribution proportions are then multiplied with the corresponding word vectors and accumulated to obtain the word-level semantic information; for the acquisition of word-level structural information, a bidirectional LSTM network acts directly on the word vectors, and the word-level structural information is formed by concatenating the final state of the forward LSTM with the final state of the backward LSTM.
For the acquisition of phrase-level semantic information, a convolutional neural network first acts on the word vectors to obtain phrase information, and the attention mechanism then acts on the phrase information to obtain the phrase-level semantic information; for the acquisition of phrase-level structural information, a bidirectional LSTM acts on the phrase information, and the phrase-level structural information is formed by concatenating the final state of the forward LSTM with the final state of the backward LSTM.
For the fusion of the word-level semantic information, word-level structural information, phrase-level semantic information and phrase-level structural information, the invention provides two fusion modes: static fusion (i.e. the average of the four kinds of information) and attention-based dynamic fusion (i.e. the attention mechanism is used to learn the proportion that each of the four kinds of information contributes to the overall text representation, and the information is then multiplied by these proportions and accumulated).
Example 1:
(1) The input of the invention is a text composed of a sequence of words; the word vector corresponding to each word in the input text is obtained by looking it up in the 300-dimensional GloVe pre-trained word vectors and serves as the input of the neural network.
(2) The attention mechanism acts on the word vectors to obtain the proportion that each word contributes to the word-level semantic information, and the contribution proportion of each word is then multiplied with the corresponding word vector and accumulated to obtain the word-level semantic information; a bidirectional LSTM acts on the word vectors, and the final state of the forward LSTM is concatenated with the final state of the backward LSTM to obtain the word-level structural information.
(3) A convolutional neural network acts on the word vectors to obtain hidden representations of the phrases; self-attention acts on the hidden representations of the phrases to obtain the proportion that each phrase contributes to the phrase-level semantic information, and the contribution proportion of each phrase is then multiplied with the corresponding phrase hidden representation and accumulated to obtain the phrase-level semantic information; the phrase-level structural information is derived by applying a bidirectional LSTM to the hidden representations of the phrases.
(4) The final text representation is obtained by applying the static fusion method or the attention-based dynamic fusion method to the word-level semantic information, word-level structural information, phrase-level semantic information and phrase-level structural information; the text representation, serving as the high-level feature of the text, is then fed into the softmax function to predict the category to which the text belongs.
1. Preprocessing text
First, the text is tokenized, using the NLTK tokenizer as the word segmentation tool. The words are then indexed by word frequency, starting from index 1, and the text is converted into the corresponding index sequence. The predefined model requires inputs of a fixed length, so the input text is processed accordingly: if the length of the input text is greater than the specified length, the text is truncated; if it is less than the specified length, the text is padded by prepending zeros. After the input text has been converted into an index sequence, each index is converted into the word vector of the corresponding word by looking it up in the 300-dimensional GloVe pre-trained word vectors; word vectors of words not in GloVe are initialized from a random uniform distribution. The converted word vectors serve as the input of the neural network.
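A minimal Python sketch of this preprocessing is given below for illustration; the helper names, the fixed length MAX_LEN = 50, the lowercasing and the uniform initialization range are assumptions not fixed by the invention, and the dictionary glove (word to 300-dimensional vector) is assumed to be loaded elsewhere.

    from collections import Counter

    import numpy as np
    from nltk.tokenize import word_tokenize   # requires the NLTK 'punkt' tokenizer data

    MAX_LEN = 50          # assumed fixed input length
    EMB_DIM = 300

    def build_vocab(corpus_texts):
        """Index words by frequency, starting from 1 (0 is reserved for padding)."""
        counts = Counter(w for t in corpus_texts for w in word_tokenize(t.lower()))
        return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

    def to_indices(text, vocab):
        """Tokenize, map words to indices, truncate to MAX_LEN and pad with leading zeros."""
        idx = [vocab.get(w, 0) for w in word_tokenize(text.lower())][:MAX_LEN]   # unseen words map to 0 here (a simplification)
        return np.array([0] * (MAX_LEN - len(idx)) + idx, dtype="int32")

    def build_embedding_matrix(vocab, glove):
        """Row i holds the GloVe vector of the word with index i; words not in GloVe keep a uniform random row."""
        E = np.random.uniform(-0.05, 0.05, (len(vocab) + 1, EMB_DIM)).astype("float32")
        E[0] = 0.0                                  # padding row
        for w, i in vocab.items():
            if w in glove:
                E[i] = glove[w]
        return E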
2. Acquisition of word-level information
Let the input sentence of length s be w_1, w_2, w_3, ..., w_s, and let the corresponding word vectors be x_1, x_2, x_3, ..., x_s, where x_i ∈ R^d is the word vector of word w_i and d is the dimension of the vector. Because each word in the sentence contributes differently to the overall semantic information of the sentence, the attention mechanism acts directly on the word vectors to learn the proportion α_i that each word contributes to the word-level semantic information, and the word vector x_i of each word is then multiplied by its contribution proportion α_i and accumulated to obtain the word-level semantic information I_wse, namely:
u_i = tanh(W_w · x_i + b_w)
α_i = exp(u_i^T · u_w) / Σ_{j=1..s} exp(u_j^T · u_w)
I_wse = Σ_{i=1..s} α_i · x_i
where tanh is an activation function, u_i^T is the transpose of u_i, and W_w, b_w, u_w are parameters of the attention mechanism.
The word-level structural information I_wst is obtained with a bidirectional LSTM, namely:
hf_i = LSTM_fw(x_i), i = 1, ..., s
hb_i = LSTM_bw(x_i), i = s, ..., 1
I_wst = [hf_s ; hb_1]
i.e. the word-level structural information I_wst is formed by concatenating the final state hf_s of the forward LSTM with the final state hb_1 of the backward LSTM.
The word vector is 300 dimensions, the word-level semantic information is 300 dimensions, the hidden state dimensions of the forward LSTM and the reverse LSTM are both 150 dimensions, and the word-level structural information is the concatenation of the two states, so that the word-level structural information is 300 dimensions.
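As an illustrative reading of this word-level stage (not the authors' reference implementation), a tf.keras sketch could look as follows; vocab_size and the embedding matrix E are assumed to come from the preprocessing step above.

    import tensorflow as tf
    from tensorflow.keras import layers

    class AttentionPooling(layers.Layer):
        """u_i = tanh(W x_i + b); alpha_i = softmax(u_i^T u); output = sum_i alpha_i x_i."""
        def __init__(self, units, **kwargs):
            super().__init__(**kwargs)
            self.proj = layers.Dense(units, activation="tanh")    # W, b
            self.score = layers.Dense(1, use_bias=False)          # context vector u

        def call(self, x):                                         # x: (batch, steps, dim)
            a = tf.nn.softmax(self.score(self.proj(x)), axis=1)    # (batch, steps, 1)
            return tf.reduce_sum(a * x, axis=1)                    # (batch, dim)

    word_ids = layers.Input(shape=(None,), dtype="int32")          # index sequence from the preprocessing step
    x = layers.Embedding(vocab_size, 300,
                         embeddings_initializer=tf.keras.initializers.Constant(E))(word_ids)  # word vectors x_1..x_s

    I_wse = AttentionPooling(300)(x)                                # word-level semantic information, 300-d
    I_wst = layers.Bidirectional(layers.LSTM(150))(x)               # concatenated final forward/backward states, 300-d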
3. Phrase-level information acquisition
Since a convolutional neural network can extract n-gram features of a sentence, the window size of the convolutional neural network is set to n to extract phrase information of length n from the sentence. Phrase information of lengths 3, 4 and 5 in the input text is extracted using 100 convolution kernels with window sizes of 3, 4 and 5 respectively, and the outputs are then spliced to obtain the phrase information. Let the convolved outputs be d_1, d_2, d_3, ..., d_s. Since each phrase in the sentence contributes differently to the overall semantic information of the sentence, the attention mechanism is applied to the phrase-level representations to learn the proportion β_i that each phrase contributes to the phrase-level semantic information, and the hidden representation vector d_i of each phrase is then multiplied by its contribution proportion β_i and accumulated to obtain the phrase-level semantic information I_pse, analogous to the acquisition of the word-level semantic information:
I_pse = Σ_{i=1..s} β_i · d_i
The phrase-level structural information I_pst is obtained with a bidirectional LSTM, analogous to the acquisition of the word-level structural information.
Because 100 convolution kernels are used for each of the window sizes 3, 4 and 5, the dimension of the spliced phrase information is 300, and the dimension of the phrase-level semantic information extracted by the attention mechanism is also 300. For the phrase-level structural information, the same bidirectional LSTM structure as for the word-level structural information is used, in which the forward LSTM and the backward LSTM each have 150 dimensions, and the phrase-level structural information is the concatenation of their final states, so its dimension is 300.
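Continuing the sketch above, the phrase-level stage could be realized as follows; using 'same' padding is an assumption made here so that the three convolution outputs keep the same length and can be concatenated into 300-dimensional phrase representations d_1, ..., d_s.

    # Phrase information D: 100 kernels for each window size 3, 4 and 5, concatenated per position.
    convs = [layers.Conv1D(100, k, padding="same", activation="relu")(x) for k in (3, 4, 5)]
    D = layers.Concatenate(axis=-1)(convs)                           # d_1..d_s, 300-d each

    I_pse = AttentionPooling(300)(D)                                 # phrase-level semantic information, 300-d
    I_pst = layers.Bidirectional(layers.LSTM(150))(D)                # phrase-level structural information, 300-d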
4. Fusion method and classification
For the obtained word-level semantic information I_wse, word-level structural information I_wst, phrase-level semantic information I_pse and phrase-level structural information I_pst, the invention proposes two different fusion strategies to fuse them into the final text representation: static fusion and attention-based dynamic fusion.
For static fusion, as shown in FIG. 2, the text representation is the average of the word-level semantic information, word-level structural information, phrase-level semantic information and phrase-level structural information, i.e. the representation of text T is
I_T = (I_wse + I_wst + I_pse + I_pst) / 4
For dynamic fusion, as shown in FIG. 3, the attention mechanism is applied to the four different pieces of information to automatically learn the contribution proportion γ_i of each piece of information to the final text representation. Here I_wse, I_wst, I_pse, I_pst are denoted I_1, I_2, I_3, I_4 respectively, and the representation of text T is calculated as:
u_i = tanh(W_t · I_i + b_t)
γ_i = exp(u_i^T · u_t) / Σ_{j=1..4} exp(u_j^T · u_t)
I_T = Σ_{i=1..4} γ_i · I_i
This yields the text representation I_T. Since the word-level semantic information I_wse, the word-level structural information I_wst, the phrase-level semantic information I_pse and the phrase-level structural information I_pst all have 300 dimensions, the representation of the final text, i.e. the high-level feature of the text, also has 300 dimensions.
The text representation vector I_T is then fed into a softmax classifier to obtain the probability corresponding to each category:
p = softmax(W_c · I_T + b_c)
where W_c is the weight of the softmax classifier and b_c is the corresponding bias.
To obtain the parameters of the model, the following cross-entropy loss function is minimized:
L = − Σ_{i=1..N} Σ_{j=1..C} y_ij · log(p_ij)
where N is the number of samples in the dataset, C is the number of classes, y_ij is the true value of the i-th sample for the j-th class, and p_ij is the probability predicted by the neural network for the i-th sample and the j-th class. The model parameters are trained with the Adam optimizer, which combines the advantages of the AdaGrad and RMSProp optimization algorithms: it considers both the first-moment and second-moment estimates of the gradient when computing the update step, automatically adjusts the learning rate, and is simple and effective.
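A sketch of how the pieces above could be assembled and trained is shown below; the dynamic fusion reuses the same attention pooling, applied to the stack of the four 300-dimensional vectors. num_classes, the training arrays and the EarlyStopping patience are illustrative assumptions; the batch size, number of epochs and learning rate follow the experimental setup described later.

    # Dynamic fusion: stack I_wse, I_wst, I_pse, I_pst into (batch, 4, 300) and attend over the four vectors.
    stacked = layers.Lambda(lambda t: tf.stack(t, axis=1))([I_wse, I_wst, I_pse, I_pst])
    I_T = AttentionPooling(300)(stacked)                             # final text representation, 300-d
    probs = layers.Dense(num_classes, activation="softmax")(I_T)     # p = softmax(W_c I_T + b_c)

    model = tf.keras.Model(word_ids, probs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy",                   # the cross-entropy loss above
                  metrics=["accuracy"])
    model.fit(x_train, y_train, validation_data=(x_val, y_val),
              batch_size=32, epochs=20,
              callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=1)])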
After the model parameters are trained, the model is saved. When texts outside the corpus need to be classified, the text is first preprocessed, the model is then loaded, the word-level semantic information, word-level structural information, phrase-level semantic information and phrase-level structural information are computed, and the four kinds of information are fused with the static fusion method or the attention-based dynamic fusion method to obtain the final text representation. Finally, the text representation vector is fed into the softmax function to compute the probability of each category, and the category with the highest probability is the category to which the text belongs.
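As a usage illustration with hypothetical helper and label names, classifying a text outside the corpus with the trained model could then look like:

    import numpy as np

    def classify(text, model, vocab, labels):
        """Preprocess a raw text, run the trained model and return the most probable class label."""
        x = to_indices(text, vocab)                  # index sequence from the preprocessing sketch
        probs = model.predict(x[np.newaxis, :])      # shape (1, num_classes)
        return labels[int(np.argmax(probs))]

    # e.g. classify("a thoughtful and provocative film", model, vocab, ["negative", "positive"])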
5. Experiment of the invention
In order to prove that the model effect provided by the invention is superior to other models, the model is compared with other baseline models on a plurality of public text classification data sets, and the evaluation index is the classification accuracy.
The datasets used in the experiments are introduced below:
the MR data set is a two-category movie review data set published by Pang et al, consisting of 5331 positive samples and 5331 negative samples.
The SUBJ dataset is a binary dataset published by Pang et al.; all sentences in the dataset are labeled as either subjective or objective.
The TREC dataset is a six-class question classification dataset published by Li et al.; the sample labels are abbreviation, entity, description, location, numeric and human.
The CR dataset is a binary dataset published by Hu et al containing customer reviews, whose labels are positive and negative, respectively.
The Stanford Sentiment Treebank dataset is a five-class movie review dataset published by Socher et al.; its labels are very negative, negative, neutral, positive and very positive.
The AGNews dataset is a news classification dataset issued by Zhang et al, and labels of the AGNews dataset are World, Sports, Business, Sci/Tech respectively.
The experimental setup was as follows:
all experiments were performed on a Windows system using the deep learning framework Keras. For initialization of word vectors, the input to the neural network is initialized with 300 dimensional GloVe word vectors, and for words not in GloVe, their word vectors are initialized with a uniform distribution. The initialization of other weights of the model adopts Xavier uniform distribution, the initialization of bias is 0, the hidden state dimensions of the bidirectional LSTM are all 150, and 100 convolution kernels with window sizes of 3, 4 and 5 are used respectively. For the activation function, a Linear modified Units (Rectified Linear Units) ReLU activation function is applied to the convolutional layer, and the activation function of the fully-connected layer is tanh. For regularization, dropout is used to apply after the Embedding layer, after the convolution layer, and after the fully connected layer, respectively. In addition, no further regularization term is introduced. For model optimization, an Adam optimizer was used to minimize the loss, with the learning rate set at 1 e-4. For model training, set the size of each batch to 32, epoch (total round) to 20, and accuracy on the validation set begins to decline using EarlyStoping.
The results of the experiment are shown in FIG. 7:
All models are divided into six parts: the first part contains CNN-based models, the second part RNN-based models, the third part reinforcement-learning-based models, the fourth part models based on capsule neural networks, the fifth part attention-based models, and the last part the models proposed by the invention.
Compared with the other models, the dynamic model proposed by the invention achieves the highest performance on four of the six public text classification datasets; on the MR dataset (accuracy 83.4) and the CR dataset (accuracy 87.0) the improvement over the other models is substantial. The static model proposed by the invention also achieves competitive results. Compared with the CNN-based, RNN-based and attention-based models, the dynamic model clearly surpasses them on all six datasets. The reinforcement-learning-based model and the capsule-network-based model achieve the highest accuracy on the SST5 and AGNews datasets respectively, but the proposed model also achieves comparable results on these two datasets. This shows that the model can effectively extract the features of the text and has strong generalization capability.
The most important difference from the other models is that the proposed model can extract semantic information and structural information at different levels and fuse them to obtain the text representation, whereas the other models learn only part of the semantic information or only part of the structural information and cannot combine the two. The main reason the model achieves the best performance is that it can extract word-level semantic and structural information as well as phrase-level semantic and structural information of the text, and the attention-based dynamic combination method can dynamically adjust the weights of the four parts of information to form the final text representation.
To demonstrate that the proposed model can extract word-level semantic information and phrase-level semantic information, visualization experiments were carried out on some samples. For the word-level semantic information, the attention mechanism can learn the proportion that each word contributes to the word-level semantics. As shown in FIG. 4, the sample "a sample open human heart together by means of skewed em element" is taken from the MR dataset, and its class label is Positive. It can be seen that the key words "pleasant" and "killed" are assigned higher weights by the attention mechanism, i.e. the word-level semantic information is learned.
The phrase-level semantic information is similar to the word-level semantic information. As shown in FIG. 5, the sample "it's not differential to spot the custom early-on in this predicted able threshold" is taken from the MR dataset, with the class label Negative. It is difficult to find words with negative sentiment in the sentence, but the phrase-level semantic information still learns key phrases such as "this predictive threshold" and assigns them higher weights.
To investigate why the dynamic model proposed by the invention achieves the best performance on four of the six datasets, some samples were selected for analysis, as shown in FIG. 8, where Att_wse denotes the attention value of the word-level semantic information, Att_pse the attention value of the phrase-level semantic information, Att_wst the attention value of the word-level structural information, and Att_pst the attention value of the phrase-level structural information.
For the MR movie review "a thughtful, provocative and instant humanizing file", the model can extract the semantic information of words such as "thughtful", "provocative" and "humanizing" and assigns a higher weight to the word-level semantic information, so the sample is classified as positive.
For the MR movie review "i didn't laugh, i didn't smile, i survived", although the attention mechanism may focus on the word "didn't", the sentence also contains words such as "laugh" and "smile"; considering only the word-level semantic information could therefore cause misclassification. In this case the model extracts the semantic information of phrases such as "didn't laugh" and "didn't smile", and the attention mechanism assigns a higher weight to the phrase-level semantic information, so the sample is classified as negative.
For the "nice machines, but i connector user quality preliminary low now" in the CR dataset, the word-level structure information "nice … … but … … low" is learned and is therefore correctly classified as negative.
For the TREC question "What type of the curve is used in Australia?", focusing only on the semantic information could cause a classification error, because the word "Australia" may lead the model to give a higher weight to the location class; the model, however, can learn the phrase-level structural information of "what type of ...", so the question is classified as entity.
The invention provides a new neural network model for text classification. To address the problem that traditional text classification methods cannot simultaneously and effectively utilize the semantic information and structural information of a text, the proposed model can extract semantic information and structural information at different levels of the text, including word-level semantic information, word-level structural information, phrase-level semantic information and phrase-level structural information. The model takes a text as input and outputs the category to which the model predicts the text belongs. To obtain the final representation of the text, the invention further provides two methods for fusing the four kinds of information: static fusion and attention-based dynamic fusion. Compared with traditional methods, the proposed text classification model can utilize more information of the text, and experiments show that it achieves higher performance than traditional text classification models on several public text classification datasets.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A text classification method based on a neural network is characterized by comprising the following steps:
Step 1: input a text to be classified and preprocess it to obtain the word vector x_i corresponding to each word in the text;
Step 2: apply the attention mechanism directly to the word vectors x_i to obtain word-level semantic information I_wse, and apply a bidirectional LSTM network directly to the word vectors x_i to obtain word-level structural information I_wst;
Step 3: apply a convolutional neural network to the word vectors x_i to obtain phrase information D;
Step 4: apply the attention mechanism to the phrase information D to obtain phrase-level semantic information I_pse, and apply a bidirectional LSTM network to the phrase information D to obtain phrase-level structural information I_pst;
Step 5: fuse the word-level semantic information I_wse, the word-level structural information I_wst, the phrase-level semantic information I_pse and the phrase-level structural information I_pst to obtain the final text vector representation I_T;
Step 6: input the final text vector representation I_T into a softmax classifier to obtain the probability corresponding to each class, and take the class with the highest probability as the class to which the text belongs:
p = softmax(W_c · I_T + b_c)
where W_c is the weight of the softmax classifier and b_c is the corresponding bias.
2. The neural-network-based text classification method according to claim 1, characterized in that the preprocessing of the text in step 1 specifically comprises the following steps:
Step 1.1: detect the length of the input text; if the length of the input text is greater than the specified length, truncate the text; if the length of the input text is less than the specified length, pad the text;
Step 1.2: segment the text into words, index the words by word frequency, and convert the text into the corresponding index sequence;
Step 1.3: convert each index in the index sequence into the word vector of its corresponding word, completing the preprocessing of the text.
3. The neural-network-based text classification method according to claim 1 or 2, characterized in that obtaining the word-level semantic information I_wse in step 2 specifically comprises: let the input sentence be w_1, w_2, w_3, ..., w_s, and let the corresponding word vectors be x_1, x_2, x_3, ..., x_s; since each word in a sentence contributes differently to the overall semantic information of the sentence, the attention mechanism acts directly on the word vectors to learn the proportion α_i that each word contributes to the word-level semantic information; the word vector x_i of each word is multiplied by its contribution proportion α_i and the products are accumulated to obtain the word-level semantic information I_wse:
I_wse = Σ_{i=1..s} α_i · x_i
where x_i ∈ R^d is the word vector of word w_i and d is the dimension of the vector;
α_i = exp(u_i^T · u_w) / Σ_{j=1..s} exp(u_j^T · u_w)
u_i = tanh(W_w · x_i + b_w)
where tanh is an activation function, u_i^T is the transpose of u_i, and W_w, b_w, u_w are parameters of the attention mechanism;
obtaining the word-level structural information I_wst in step 2 specifically comprises: the word-level structural information I_wst is formed by concatenating the final state hf_s of the forward LSTM and the final state hb_1 of the backward LSTM:
hf_i = LSTM_fw(x_i), i = 1, ..., s
hb_i = LSTM_bw(x_i), i = s, ..., 1
I_wst = [hf_s ; hb_1]
4. The neural-network-based text classification method according to claim 1 or 2, characterized in that fusing the word-level semantic information I_wse, the word-level structural information I_wst, the phrase-level semantic information I_pse and the phrase-level structural information I_pst in step 5 to obtain the final text vector representation I_T specifically comprises: static fusion is adopted, i.e. the text representation is the average of the word-level semantic information, the word-level structural information, the phrase-level semantic information and the phrase-level structural information:
I_T = (I_wse + I_wst + I_pse + I_pst) / 4.
5. The neural-network-based text classification method according to claim 3, characterized in that fusing the word-level semantic information I_wse, the word-level structural information I_wst, the phrase-level semantic information I_pse and the phrase-level structural information I_pst in step 5 to obtain the final text vector representation I_T specifically comprises: static fusion is adopted, i.e. the text representation is the average of the word-level semantic information, the word-level structural information, the phrase-level semantic information and the phrase-level structural information:
I_T = (I_wse + I_wst + I_pse + I_pst) / 4.
6. The neural-network-based text classification method according to claim 1 or 2, characterized in that fusing the word-level semantic information I_wse, the word-level structural information I_wst, the phrase-level semantic information I_pse and the phrase-level structural information I_pst in step 5 to obtain the final text vector representation I_T specifically comprises: dynamic fusion based on the attention mechanism is adopted, i.e. the attention mechanism is applied to the four kinds of information to automatically learn the contribution proportion γ_i of each part of information to the final text vector representation I_T; here I_wse, I_wst, I_pse, I_pst are denoted I_1, I_2, I_3, I_4 respectively:
I_T = Σ_{i=1..4} γ_i · I_i
γ_i = exp(u_i^T · u_t) / Σ_{j=1..4} exp(u_j^T · u_t)
u_i = tanh(W_t · I_i + b_t)
where tanh is an activation function, u_i^T is the transpose of u_i, and W_t, b_t, u_t are parameters of the attention mechanism.
7. The neural-network-based text classification method according to claim 3, characterized in that fusing the word-level semantic information I_wse, the word-level structural information I_wst, the phrase-level semantic information I_pse and the phrase-level structural information I_pst in step 5 to obtain the final text vector representation I_T specifically comprises: dynamic fusion based on the attention mechanism is adopted, i.e. the attention mechanism is applied to the four kinds of information to automatically learn the contribution proportion γ_i of each part of information to the final text vector representation I_T; here I_wse, I_wst, I_pse, I_pst are denoted I_1, I_2, I_3, I_4 respectively:
I_T = Σ_{i=1..4} γ_i · I_i
γ_i = exp(u_i^T · u_t) / Σ_{j=1..4} exp(u_j^T · u_t)
u_i = tanh(W_t · I_i + b_t)
where tanh is an activation function, u_i^T is the transpose of u_i, and W_t, b_t, u_t are parameters of the attention mechanism.
CN201911223541.XA 2019-12-03 2019-12-03 Text classification method based on neural network Active CN111078833B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911223541.XA CN111078833B (en) 2019-12-03 2019-12-03 Text classification method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911223541.XA CN111078833B (en) 2019-12-03 2019-12-03 Text classification method based on neural network

Publications (2)

Publication Number Publication Date
CN111078833A CN111078833A (en) 2020-04-28
CN111078833B true CN111078833B (en) 2022-05-20

Family

ID=70312658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911223541.XA Active CN111078833B (en) 2019-12-03 2019-12-03 Text classification method based on neural network

Country Status (1)

Country Link
CN (1) CN111078833B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112231477B (en) * 2020-10-20 2023-09-22 淮阴工学院 Text classification method based on improved capsule network
CN112131391B (en) * 2020-11-25 2021-09-17 江苏电力信息技术有限公司 Power supply service client appeal text classification method based on capsule network
CN113157919B (en) * 2021-04-07 2023-04-25 山东师范大学 Sentence text aspect-level emotion classification method and sentence text aspect-level emotion classification system
CN113033218B (en) * 2021-04-16 2023-08-15 沈阳雅译网络技术有限公司 Machine translation quality evaluation method based on neural network structure search
CN113297364B (en) * 2021-06-07 2023-06-09 吉林大学 Natural language understanding method and device in dialogue-oriented system
CN113779192A (en) * 2021-08-23 2021-12-10 河海大学 Text classification algorithm of bidirectional dynamic route based on labeled constraint
CN113869065B (en) * 2021-10-15 2024-04-12 梧州学院 Emotion classification method and system based on 'word-phrase' attention mechanism
CN114579712B (en) * 2022-05-05 2022-07-15 中科雨辰科技有限公司 Text attribute extraction and matching method based on dynamic model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10936862B2 (en) * 2016-11-14 2021-03-02 Kodak Alaris Inc. System and method of character recognition using fully convolutional neural networks

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107301246A (en) * 2017-07-14 2017-10-27 河北工业大学 Chinese Text Categorization based on ultra-deep convolutional neural networks structural model
CN107491541A (en) * 2017-08-24 2017-12-19 北京丁牛科技有限公司 File classification method and device
CN108460089A (en) * 2018-01-23 2018-08-28 哈尔滨理工大学 Diverse characteristics based on Attention neural networks merge Chinese Text Categorization
CN108363753A (en) * 2018-01-30 2018-08-03 南京邮电大学 Comment text sentiment classification model is trained and sensibility classification method, device and equipment
CN108446275A (en) * 2018-03-21 2018-08-24 北京理工大学 Long text emotional orientation analytical method based on attention bilayer LSTM
CN108595590A (en) * 2018-04-19 2018-09-28 中国科学院电子学研究所苏州研究院 A kind of Chinese Text Categorization based on fusion attention model
CN108717439A (en) * 2018-05-16 2018-10-30 哈尔滨理工大学 A kind of Chinese Text Categorization merged based on attention mechanism and characteristic strengthening
CN108829818A (en) * 2018-06-12 2018-11-16 中国科学院计算技术研究所 A kind of file classification method
CN108984745A (en) * 2018-07-16 2018-12-11 福州大学 A kind of neural network file classification method merging more knowledge mappings
CN109299268A (en) * 2018-10-24 2019-02-01 河南理工大学 A kind of text emotion analysis method based on dual channel model
CN109492108A (en) * 2018-11-22 2019-03-19 上海唯识律简信息科技有限公司 Multi-level fusion Document Classification Method and system based on deep learning
CN109840279A (en) * 2019-01-10 2019-06-04 山东亿云信息技术有限公司 File classification method based on convolution loop neural network
CN109858032A (en) * 2019-02-14 2019-06-07 程淑玉 Merge more granularity sentences interaction natural language inference model of Attention mechanism
CN110334210A (en) * 2019-05-30 2019-10-15 哈尔滨理工大学 A kind of Chinese sentiment analysis method merged based on BERT with LSTM, CNN

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Bidirectional LSTM with attention mechanism and convolutional layer for text classification; Gang Liu; Neurocomputing; 2019-04-14; vol. 337; 325-338 *
Convolutional neural network text classification model based on the Attention mechanism (基于Attention机制的卷积神经网络文本分类模型); 赵云山; 《应用科学学报》; 2019-07-30; vol. 37, no. 4; 541-510 *
Question classification method for community question answering based on Bi-LSTM and CNN with an attention mechanism (基于Bi-LSTM和CNN并包含注意力机制的社区问答问句分类方法); 史梦飞等; 《计算机系统应用》; 2018-09-15; no. 09; 159-164 *
Aspect-level microblog sentiment classification based on convolutional memory networks (基于卷积记忆网络的视角级微博情感分类); 廖祥文等; 《模式识别与人工智能》; 2018-03-15; no. 03; 25-35 *
Chinese short text classification model based on a hybrid neural network (基于混合神经网络的中文短文本分类模型); 陈巧红; 《浙江理工大学学报(自然科学版)》; 2019-03-31; vol. 41, no. 4; 509-516 *
Text classification based on a phrase attention mechanism (基于短语注意机制的文本分类); 江伟等; 《中文信息学报》; 2018-02-15; no. 02; 106-113+123 *

Also Published As

Publication number Publication date
CN111078833A (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN111078833B (en) Text classification method based on neural network
CN110609897B (en) Multi-category Chinese text classification method integrating global and local features
Zulqarnain et al. Efficient processing of GRU based on word embedding for text classification
Zhang et al. A text sentiment classification modeling method based on coordinated CNN‐LSTM‐attention model
CN109325231B (en) Method for generating word vector by multitasking model
CN111401061A (en) Method for identifying news opinion involved in case based on BERT and Bi L STM-Attention
CN109977413A (en) A kind of sentiment analysis method based on improvement CNN-LDA
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN110263325B (en) Chinese word segmentation system
CN111027595B (en) Double-stage semantic word vector generation method
CN112364638B (en) Personality identification method based on social text
CN112001186A (en) Emotion classification method using graph convolution neural network and Chinese syntax
CN111368088A (en) Text emotion classification method based on deep learning
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN111522908A (en) Multi-label text classification method based on BiGRU and attention mechanism
CN110046223B (en) Film evaluation emotion analysis method based on improved convolutional neural network model
CN112163089B (en) High-technology text classification method and system integrating named entity recognition
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
Liu et al. A multi-label text classification model based on ELMo and attention
CN110297986A (en) A kind of Sentiment orientation analysis method of hot microblog topic
Chen et al. Deep neural networks for multi-class sentiment classification
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN114462420A (en) False news detection method based on feature fusion model
CN113326374A (en) Short text emotion classification method and system based on feature enhancement
CN112199503B (en) Feature-enhanced unbalanced Bi-LSTM-based Chinese text classification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant