CN111858933A - Character-based hierarchical text emotion analysis method and system - Google Patents
Character-based hierarchical text emotion analysis method and system
- Publication number
- CN111858933A (application number CN202010659957.2A)
- Authority
- CN
- China
- Prior art keywords
- text
- character
- neural network
- sentence
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a character-based hierarchical text emotion analysis method and system, wherein the method comprises the following steps: preprocessing given text data, including designing a character set, dividing the text into sentences, and obtaining a numeric representation of the text based on the character set; establishing a character-level neural network model: the preprocessed text data is input into the character-level model and passes, in order, through an embedding layer, a convolutional neural network layer and a decoding layer, which extract and output a feature vector for each sentence in the text; establishing a sentence-level neural network model: taking the output of the character-level network as input, the model outputs the probability distribution of the text's emotion classification after passing, in order, through a recurrent neural network layer, an attention layer and a decoding layer. The invention extracts the initial features of the text at the character level, and the sentence-level network not only captures temporal information but also makes the network attend to the sentences most informative for the emotion analysis result, improving the accuracy and robustness of the model.
Description
Technical Field
The invention relates to the technical field of emotion analysis in natural language processing, and in particular to a character-based hierarchical text emotion analysis method and system.
Background
With the rapid growth of information on the Internet in recent years, people are exposed to vast amounts of text such as news, blogs and comments through terminals such as mobile phones and computers. Extracting important information from large volumes of text, such as summaries and emotional tendencies, has become an urgent need for quickly digesting information in the era of information explosion. Emotional tendency, as a higher-level abstraction of text information, has important application value. The character-based hierarchical text emotion analysis method with an attention mechanism provides an efficient solution for extracting emotional tendencies from large amounts of text; it can help grasp the public's main attitudes toward hot events, candidates, commodities, movies and other things, and has great application potential for roles such as consumers, managers and competitors.
Most existing deep-learning-based text emotion analysis methods analyze text at the word level, and such methods have the following pain points: 1. In the world's languages, the number of words is huge; for example, the number of common words in English is as high as thirty to forty thousand, and words keep changing flexibly from generation to generation. 2. To express relationships between words, such as synonyms, shared roots, etc., a large number of vectorized word representations must be pre-trained; the training requires massive text corpora as samples, and the computing resources consumed are even more considerable. 3. There are low-frequency-word and OOV (out-of-vocabulary) problems: some uncommon words may appear only in articles on specific topics, so the pre-trained word vectors either do not include a vectorized representation of the word at all (the OOV problem) or include a representation that is insufficiently trained (the low-frequency-word problem).
Disclosure of Invention
Aiming at the problems faced by word-based text emotion analysis methods, namely the huge and flexibly changing vocabulary, the difficulty of modeling relationships between words, and the low-frequency-word and OOV problems, the invention provides a character-based neural network, which differs from existing methods of this kind.
However, character-based models are prone to overfitting and poor robustness, owing to the diversity of character combinations and the nature of convolutional networks. In view of this problem, the invention starts from a hierarchical idea: a sentence-level network is added on top of the character-level network, and the vectorized representation of each sentence is extracted from its character sequence by the character-level network. This addition significantly alleviates the overfitting problem that character-level networks are prone to, improves the robustness of the model, and makes its performance more stable.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a character-based hierarchical text emotion analysis method, which comprises the following steps:
text preprocessing: preprocessing given text data, including designing a character set, dividing sentences in the text, and obtaining a text representation in a digital form based on the character set;
establishing a neural network model at a character level: inputting the preprocessed text data into a neural network model at a character level, sequentially passing through a model embedding layer, a convolutional neural network layer and a decoding layer, and extracting and outputting a feature vector of each sentence in the text;
establishing a sentence-level neural network model: taking the output of the character-level network as input and outputting the probability distribution of the text's emotion classification sequentially through a recurrent neural network layer, an attention layer and a decoding layer.
As a preferred technical solution, the text preprocessing specifically includes:
designing a character set that comprises the basic characters of the language of the given text, and packaging the character set into a dictionary that supports looking up a character's subscript from the character and, conversely, the character corresponding to a subscript;
dividing sentences in the text: dividing a single text into a set of sentences, using the sentence terminators of the given text's language as separators;
obtaining a numeric text representation based on the character set: converting each sentence of each text from a character sequence into the corresponding subscript sequence using the dictionary, completing the conversion of the text from character form to numeric form.
As a preferred technical solution, the obtaining of the text representation in a numeric form based on the character set specifically includes:
character segmentation: dividing each sentence of the text into a plurality of characters and storing the characters in a character type array;
case conversion: replacing all characters that form words in the original text with their lower-case forms;
text digitization: comparing the dictionary, converting all characters in the text into corresponding subscripts in the dictionary, and converting the text from a character form to a digital form;
unifying sentence length: if the sentence length exceeds the set threshold value, cutting, and discarding the sentence part exceeding the length; if the sentence length does not reach the set threshold value, the subscript 0 is used for filling until the sentence length reaches the threshold value;
unifying text length: if the number of sentences in the text exceeds a set threshold value, cutting, and discarding partial sentences exceeding the number; and if the number of sentences in the text does not reach the set threshold value, filling the text with sentences with uniform length with subscripts of 0 until the number of sentences reaches the threshold value.
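The preprocessing pipeline above (character dictionary, sentence division, digitization, and length unification) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the character set, the regex-based sentence splitter, and the thresholds `MAX_SENT_LEN` and `MAX_NUM_SENT` are all hypothetical choices.

```python
import re
import numpy as np

# Hypothetical minimal character set; a real system would cover the full
# basic-character inventory of the target language.
CHAR_SET = list("abcdefghijklmnopqrstuvwxyz0123456789 .,!?")
CHAR2IDX = {c: i + 1 for i, c in enumerate(CHAR_SET)}  # reserve subscript 0 for padding

MAX_SENT_LEN = 16   # sentence-length threshold (illustrative hyper-parameter)
MAX_NUM_SENT = 4    # text-length threshold (illustrative hyper-parameter)

def preprocess(text):
    """Divide into sentences, lowercase, digitize, and pad/truncate."""
    # Split on sentence terminators (periods, exclamation and question marks).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    digitized = []
    for sent in sentences[:MAX_NUM_SENT]:            # unify text length: truncate
        idxs = [CHAR2IDX.get(c, 0) for c in sent.lower()][:MAX_SENT_LEN]
        idxs += [0] * (MAX_SENT_LEN - len(idxs))     # unify sentence length: pad with 0
        digitized.append(idxs)
    while len(digitized) < MAX_NUM_SENT:             # pad with all-zero sentences
        digitized.append([0] * MAX_SENT_LEN)
    return np.array(digitized, dtype=np.int64)

mat = preprocess("Great movie! I loved it.")
print(mat.shape)  # (4, 16)
```

The resulting two-dimensional array has one row per sentence (first dimension) and one subscript per character (second dimension), matching the array layout described in the character-segmentation step.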
As a preferred technical solution, the establishing of the neural network model at the character level specifically includes the steps of:
the model embedding layer takes each sentence of the preprocessed text as an input unit, converts the subscript of each character in the sentence into a corresponding unique vectorized representation, and thereby converts each sentence's representation from a one-dimensional subscript sequence into a two-dimensional character-vector sequence;
the convolutional neural network layer adopts a plurality of one-dimensional convolutional kernels with different sizes, simultaneously carries out convolutional operation and global maximum pooling operation on the two-dimensional character vector sequence, and splices operation results to obtain output results of multi-convolutional-kernel operation;
and the decoding layer takes the output result of the multi-convolution kernel operation as input, extracts the feature vector of the sentence through the full-connection layer, and the feature vector of the sentence is used as the input of the neural network model at the sentence level.
As a preferred technical solution, converting the subscript of each character of the sentence into a corresponding unique vectorized representation specifically adopts the following step: one-hot coding the character subscripts in the sentence.
As a preferred technical solution, performing the convolution operation and global max-pooling operation on the two-dimensional character-vector sequence specifically includes: performing a single-layer convolution operation on the two-dimensional character-vector sequence followed by the nonlinear activation function ReLU, with the stride of the convolution set to 1.
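The character-level convolution step, stride-1 single-layer convolution with ReLU, global max pooling, and channel-wise splicing across kernels of several sizes, can be sketched numerically as follows. This is an illustrative NumPy sketch, not the patent's implementation: the toy dimensions, the kernel sizes (2, 3, 4) and the channel count are assumed hyper-parameter values.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu_gmp(x, kernels):
    """Single-layer 1-D convolution (stride 1) + ReLU + global max pooling.
    x:       (seq_len, emb_dim) character-vector sequence of one sentence.
    kernels: list of weight tensors, each of shape (k, emb_dim, out_channels)."""
    feats = []
    for w in kernels:
        k = w.shape[0]
        # All stride-1 windows of width k: (num_windows, k, emb_dim).
        windows = np.stack([x[i:i + k] for i in range(x.shape[0] - k + 1)])
        conv = np.einsum("nke,keo->no", windows, w)   # (num_windows, out_channels)
        conv = np.maximum(conv, 0.0)                  # ReLU activation
        feats.append(conv.max(axis=0))                # global max pooling per channel
    return np.concatenate(feats)                      # splice along the channel dimension

x = rng.normal(size=(16, 41))                               # one sentence; emb_dim 41 is illustrative
kernels = [rng.normal(size=(k, 41, 8)) for k in (2, 3, 4)]  # hypothetical kernel sizes/channels
sent_feat = conv1d_relu_gmp(x, kernels)
print(sent_feat.shape)  # (24,)
```

Each kernel size captures character n-grams of a different width, and the concatenated maxima form the sentence feature passed to the decoding layer.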
As a preferred technical solution, the establishing of the sentence-level neural network model specifically includes the steps of:
the recurrent neural network layer takes the output of the character-level neural network model as input and obtains the output of each time step and the context vector through a bidirectional recurrent neural network;
the attention layer adopts an attention mechanism: the outputs of the bidirectional recurrent neural network serve as the value items, those outputs after passing through a fully connected layer serve as the key items, and the context vector serves as the query item; the weight distribution over the recurrent neural network's outputs at each time step is obtained, and the outputs are multiplied by their weights and summed to obtain the vector representation of the whole text;
the vector representation of the whole text is passed through a fully connected layer to output the numeric distribution over emotion classes, and a softmax operation converts the result into a probability distribution over emotion classes; the class with the largest probability is the emotion analysis prediction.
As a preferred technical solution, using the outputs of the bidirectional recurrent neural network as the value items, those outputs after a fully connected layer as the key items, and the context vector as the query item, obtaining the weight distribution over the recurrent neural network's outputs at each time step, and multiplying the outputs by their weights and summing them to obtain the vector representation of the whole text specifically includes:
the output of each time step is passed through a single-layer perceptron with Tanh as the activation function to obtain a hidden representation of the output, which serves as the key item in the attention mechanism;
taking the context vector as the query item of the attention mechanism, the key item of each time step is multiplied with the query item in turn to obtain the attention score for each time step; a softmax operation is then applied to the attention distribution, converting it into a probability distribution so that the attention proportions of all time steps sum to 1;
taking the output of each time step as the value item of the attention mechanism, each time step's attention proportion is multiplied by the corresponding value, and the results of all time steps are summed to obtain the weighted sum of all sentence vectors in the text, i.e., the feature vector of the text.
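The attention pooling described above can be sketched as follows. This is an illustrative NumPy sketch under assumed toy dimensions and random parameters; `W` and `b` stand in for the single-layer perceptron that produces the key items, and `context` for the final hidden state used as the query item.

```python
import numpy as np

rng = np.random.default_rng(1)

def attention_pool(outputs, context, W, b):
    """Attention pooling over per-time-step RNN outputs.
    outputs: (T, d) bidirectional-RNN outputs (the value items).
    context: (d,) context vector, i.e. final hidden state (the query item).
    W, b:    single-layer perceptron producing the key items."""
    keys = np.tanh(outputs @ W + b)            # hidden representation: key items
    scores = keys @ context                    # key . query per time step
    weights = np.exp(scores - scores.max())    # softmax over time steps...
    weights = weights / weights.sum()          # ...so the proportions sum to 1
    text_vec = weights @ outputs               # weighted sum of sentence vectors
    return text_vec, weights

T, d = 4, 6                                    # toy sizes (assumptions)
outputs = rng.normal(size=(T, d))
context = rng.normal(size=d)
W, b = rng.normal(size=(d, d)), np.zeros(d)
text_vec, weights = attention_pool(outputs, context, W, b)
print(text_vec.shape)  # (6,)
```

The returned `weights` are the attention proportions of the sentences, and `text_vec` is the feature vector of the whole text fed to the decoding layer.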
The invention also provides a hierarchical text emotion analysis system based on characters, which comprises the following components: the system comprises a text preprocessing module, a character-level neural network model establishing module and a sentence-level neural network model establishing module;
the text preprocessing module is used for preprocessing given text data, including designing a character set, dividing sentences in the text, and obtaining a numeric text representation based on the character set;
The character-level neural network model establishing module is used for inputting the preprocessed text data into the character-level neural network model, sequentially passing through the model embedding layer, the convolutional neural network layer and the decoding layer, and extracting and outputting a feature vector of each sentence in the text;
the sentence-level neural network model building module is used for outputting probability distribution of emotion classification of the text by taking output of the character-level network as input and sequentially passing through the recurrent neural network layer, the attention layer and the decoding layer.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The method effectively addresses the problems faced by word-based text emotion analysis methods, namely the huge and flexibly changing vocabulary, the relationships between words, and the low-frequency-word and OOV problems, as well as the overfitting and poor robustness of character-based models; it significantly reduces the storage and computing-resource overhead required for text emotion analysis and improves the accuracy and robustness of the model.
(2) First, the features of a given text are extracted based on the designed character set; then, a single-layer convolutional neural network (CNN) with multiple convolution kernels extracts sentence-level feature vectors of the given text; next, a bidirectional recurrent neural network (RNN) with an attention mechanism extracts the feature vector of the whole text; finally, a fully connected layer is applied and a softmax operation yields the probability distribution of the text's emotion classification. Since the initial features are extracted at the character level, no pre-trained word vectors are needed, the low-frequency-word problem is avoided, and the method generalizes well across languages; the sentence-level network not only captures temporal information but also makes the network attend to sentences informative for the emotion analysis result, improving the accuracy and robustness of the model.
Drawings
FIG. 1 is a schematic diagram of an overall framework of a character-based hierarchical text emotion analysis method according to the present embodiment;
FIG. 2 is a block diagram of the data preprocessing of the present embodiment;
FIG. 3 is a block diagram of a character-level neural network model according to the present embodiment;
FIG. 4 is a framework diagram of the sentence-level neural network model according to the present embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Examples
As shown in fig. 1, the present embodiment provides a character-based hierarchical text emotion analysis method, which includes the following steps:
s1: text preprocessing: as shown in fig. 2, given text data is preprocessed, including designing a character set, dividing sentences in the text, and obtaining a text representation in a digital form based on the character set;
step S1 is an input data preprocessing, specifically including the following substeps:
s11: designing a set of characters
Designing a character set that comprises the basic characters of the language of the given text, and packaging the character set into a dictionary so that a character's subscript can be found from the character and the corresponding character can be found from a subscript;
generally, the basic characters of a language mainly include its letters or written characters (such as the letters in English), the Arabic numerals (0-9), and punctuation marks (such as commas, periods, question marks and exclamation marks);
s12: partitioning sentences in text
Dividing a single text into a set of sentences, using the sentence terminators of the language of the given text as separators; in general, the terminators are mainly periods, exclamation marks, question marks, and the like;
s13: deriving a textual representation in digital form based on a character set
Converting each sentence in each text from the character sequence to the corresponding subscript sequence by using the dictionary formed in step S11, thereby completing the conversion of the text from the character form to the number form.
This comprises the following more detailed steps:
character segmentation. Each sentence of the text is equivalent to a character string, and the character string is divided into a plurality of characters and stored in the character type array. Thus, a text becomes a two-dimensional array, with a first dimension storing a plurality of sentences of text and a second dimension storing a plurality of characters under each sentence.
Case conversion. In some languages, such as English, the characters that make up a word have distinct upper- and lower-case forms. The presence of both cases can hurt the model's representations: case does not change the meaning of the words the characters form, upper-case characters occur relatively rarely compared with lower-case ones, and including upper-case characters in the character set would aggravate the model's overfitting. For these reasons, all characters that form words in the original text are replaced with their lower-case forms.
Text digitization. All characters in the text are converted into their corresponding subscripts in the dictionary against the dictionary formed in step S11, so that the text is converted from a character form into a number form.
Uniform sentence length. In order to process text data in batch and improve the text processing efficiency of the model, the lengths of all sentences in the text need to be unified. If the sentence length exceeds the set threshold value, cutting, and discarding the sentence part exceeding the length; and if the sentence length does not reach the set threshold value, filling the sentence length with the subscript 0 until the sentence length reaches the threshold value.
Uniform text length. In order to process text data in batch and improve the efficiency of model processing, the text length needs to be unified. Since the sentence length is unified in the previous step, the step only needs to unify the number of sentences in the text. If the number of sentences in the text exceeds a set threshold value, cutting, and discarding partial sentences exceeding the number; and if the number of sentences in the text does not reach the set threshold value, filling the text with sentences with uniform length with subscripts of 0 until the number of sentences reaches the threshold value.
S2: establishing a neural network model at a character level: as shown in fig. 3, inputting the preprocessed text data into a neural network model at a character level, sequentially passing through a model embedding layer, a convolutional neural network layer and a decoding layer, and extracting and outputting a feature vector of each sentence in the text;
step S2 is to establish a neural network model at a character level, which specifically includes the following substeps:
s21: model embedding layer
Taking each sentence of the preprocessed text as an input unit, converting the subscript of each character of the sentence into a corresponding unique vectorization expression, and specifically comprising the following steps:
taking each sentence of the preprocessed text as an input unit, performing one-hot coding on the subscript of the character in the sentence, converting the subscript of each character in the sentence into corresponding unique vectorization expression, and converting the expression form of each sentence into a two-dimensional character vector sequence from a one-dimensional subscript sequence;
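The one-hot conversion in step S21 can be sketched as follows; the vocabulary size and the example subscripts are hypothetical.

```python
import numpy as np

def one_hot_sentence(idxs, vocab_size):
    """Convert a 1-D subscript sequence into a 2-D one-hot character-vector sequence."""
    onehot = np.zeros((len(idxs), vocab_size))
    onehot[np.arange(len(idxs)), idxs] = 1.0   # one 1 per character, at its subscript
    return onehot

seq = [3, 0, 7, 7]                  # hypothetical subscripts (0 = padding)
onehot = one_hot_sentence(seq, vocab_size=10)
print(onehot.shape)  # (4, 10)
```

Each row is the unique vectorized representation of one character, so the sentence becomes a two-dimensional character-vector sequence as described above.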
s22: convolutional neural network layer
Using a plurality of one-dimensional convolution kernels with different sizes to simultaneously carry out convolution operation and global maximum pooling operation on the sentence sequence output by the step S21, splicing operation results together, and specifically comprising the following steps:
Using several one-dimensional convolution kernels of different sizes, a single-layer convolution operation is performed on the sentence sequence output by step S21, followed by the nonlinear activation function ReLU; the stride of the convolution is 1, and the number of kernels, the kernel sizes and the number of output channels are model hyper-parameters to be tuned for the specific dataset and training process. The result of each kernel's operation is then passed through a global max-pooling layer to obtain the maximum value of each channel, and the outputs are concatenated along the channel dimension, yielding the features extracted from the sentence based on the n-gram idea;
S23: decoding layer
Taking the output of the multi-kernel convolution in step S22 as input, the feature vector of the sentence is extracted through a fully connected layer; this feature vector serves as the input to the sentence-level model, and its dimension is a model hyper-parameter to be tuned for the specific dataset and training process.
S3: establishing a sentence-level neural network model: as shown in fig. 4, the output of the character-level network is used as input, and the probability distribution of emotion classification of the text is output through the recurrent neural network layer, the attention layer, and the decoding layer in this order.
The step S3 of establishing a sentence-level neural network model specifically includes the following substeps:
s31: recurrent neural network layer
The output of the character-level network, namely the vector representation of each sentence of the text, is taken as input, and the output of each time step and the context vector (i.e., the final hidden state) are obtained through a bidirectional recurrent neural network; the hidden-layer dimension is a model hyper-parameter to be tuned for the specific dataset and training process;
s32: attention layer
By adopting an attention mechanism, the outputs of the recurrent neural network in step S31 are used as the value items, those outputs after a fully connected layer as the key items, and the context vector as the query item; the weight distribution over the recurrent neural network's outputs at each time step is obtained, and the outputs are multiplied by their weights and summed to obtain the vector representation of the whole text. This comprises the following more detailed steps:
The output of each time step in step S31 is passed through a single-layer perceptron with Tanh as the activation function to obtain a hidden representation of the output, keeping the dimension unchanged before and after the conversion; this hidden representation serves as the key item in the attention mechanism;
Using the context vector from step S31 as the query item of the attention mechanism, the key item of each time step is multiplied with the query item in turn to obtain the attention score for each time step. A softmax operation is then performed on the attention distribution, converting it into a probability distribution so that the attention proportions of all time steps sum to 1;
The output of each time step in step S31 is used as the value item of the attention mechanism, and the attention proportion of each time step (the probability distribution from the previous step) is multiplied by the corresponding value. The results of all time steps are summed to obtain the weighted sum of all sentence vectors in the text, i.e., the feature vector of the text;
s33: decoding layer
The vector representation of the text output by step S32 is passed through a fully connected layer to produce the numeric distribution over emotion classes, and a softmax operation converts the result into a probability distribution over emotion classes; the class with the highest probability is the emotion analysis prediction.
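The decoding step, a fully connected layer followed by softmax, can be sketched as follows; the three emotion classes and the random weights are illustrative assumptions, not values from the patent.

```python
import numpy as np

def decode(text_vec, W, b):
    """Fully connected layer + softmax over emotion classes.
    Returns the class probability distribution and the predicted class."""
    logits = text_vec @ W + b
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    probs = exp / exp.sum()
    return probs, int(np.argmax(probs))   # largest probability = prediction

rng = np.random.default_rng(2)
text_vec = rng.normal(size=6)                  # text feature vector (toy size)
W, b = rng.normal(size=(6, 3)), np.zeros(3)    # 3 emotion classes (assumption)
probs, pred = decode(text_vec, W, b)
```

`probs` is the probability distribution of the emotion classification, and `pred` indexes the class with the highest probability.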
The embodiment also provides a hierarchical text emotion analysis system based on characters, which comprises: the system comprises a text preprocessing module, a character-level neural network model establishing module and a sentence-level neural network model establishing module;
in this embodiment, the text preprocessing module is configured to preprocess given text data, including designing a character set, dividing sentences in a text, and obtaining a text representation in a digital form based on the character set;
in this embodiment, the character-level neural network model building module is configured to input the preprocessed text data into the character-level neural network model, sequentially pass through the model embedding layer, the convolutional neural network layer, and the decoding layer, and extract and output a feature vector of each sentence in the text;
in this embodiment, the sentence-level neural network model building module is configured to output the probability distribution of emotion classification of a text by using the output of the character-level network as input and sequentially passing through the recurrent neural network layer, the attention layer, and the decoding layer.
In this embodiment, a hierarchical neural network model is built from the character level to the sentence level, and from the sentence level to the text level; it can be used for emotion classification of common texts such as comments and blogs. The model works as follows: 1. first, the initial features of a given text are extracted based on the designed character set; 2. then, a single-layer convolutional neural network (CNN) with multiple convolution kernels extracts the sentence-level feature vectors of the given text; 3. next, a bidirectional recurrent neural network (RNN) with an attention mechanism extracts the feature vector of the whole text; 4. finally, a fully connected layer is applied and a softmax operation on its result yields the probability distribution of the text emotion classification.
This embodiment extracts the initial features of the text at the character level, so it requires no pre-trained word vectors, has no low-frequency-word problem, and generalizes well across languages; the sentence-level network not only captures temporal information but also lets the model attend to the sentences that most benefit the emotion analysis result, improving the accuracy and robustness of the model.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to them; any other change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the present invention shall be construed as an equivalent and is included in the scope of the present invention.
Claims (9)
1. A hierarchical text emotion analysis method based on characters is characterized by comprising the following steps:
text preprocessing: preprocessing given text data, including designing a character set, dividing sentences in the text, and obtaining a text representation in a digital form based on the character set;
establishing a neural network model at a character level: inputting the preprocessed text data into a neural network model at a character level, sequentially passing through a model embedding layer, a convolutional neural network layer and a decoding layer, and extracting and outputting a feature vector of each sentence in the text;
Establishing a sentence-level neural network model: and taking the output of the character level network as input, and outputting the probability distribution of the emotion classification of the text sequentially through a recurrent neural network layer, an attention layer and a decoding layer.
2. The method for hierarchical emotion analysis of texts based on characters as claimed in claim 1, wherein the specific steps of text preprocessing include:
designing a character set which comprises the basic characters of the language of the given text, and packaging the character set into a dictionary, the dictionary being used to look up a character's corresponding index and, conversely, the character corresponding to a given index;
dividing sentences in the text: dividing a single text into a set of a plurality of sentences with sentence terminators of the language of the given text as separators;
deriving a textual representation in numeric form based on a character set: and converting each sentence in each text from the character sequence into a corresponding subscript sequence based on the dictionary, and completing the conversion of the text from the character form to the digital form.
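The three preprocessing steps of this claim can be sketched as follows; the character set, the reserved padding index 0, and the simplistic sentence splitting are illustrative assumptions:

```python
# Hypothetical minimal character set; a real embodiment would pick the basic
# characters of the target language.
charset = list("abcdefghijklmnopqrstuvwxyz0123456789 .,!?'\"")
char_to_idx = {c: i + 1 for i, c in enumerate(charset)}  # 0 is reserved for padding
idx_to_char = {i: c for c, i in char_to_idx.items()}     # reverse lookup

text = "Great phone. Terrible battery!"
# Naive sentence division on the language's terminators ('.' and '!').
sentences = [s.strip() for s in text.replace("!", ".").split(".") if s.strip()]
# Character segmentation, lowercasing, and digitization against the dictionary;
# characters outside the set map to 0.
digits = [[char_to_idx.get(c, 0) for c in s.lower()] for s in sentences]
```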
3. The method for hierarchical emotion analysis of texts based on characters as claimed in claim 2, wherein the text representation in digital form is obtained based on the character set, the specific steps including:
Character segmentation: dividing each sentence of the text into a plurality of characters and storing the characters in a character type array;
case conversion: converting all word-forming (alphabetic) characters in the original text to lowercase;
text digitization: comparing the dictionary, converting all characters in the text into corresponding subscripts in the dictionary, and converting the text from a character form to a digital form;
unifying sentence length: if a sentence's length exceeds the set threshold, it is truncated and the part beyond the threshold is discarded; if it falls short of the threshold, it is padded with the index 0 until the threshold is reached;
unifying text length: if the number of sentences in the text exceeds the set threshold, the excess sentences are discarded; if it falls short of the threshold, the text is padded with all-zero sentences of the unified length until the number of sentences reaches the threshold.
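The two length-unification steps can be sketched with a single helper; the thresholds and the example indices are illustrative assumptions:

```python
def pad_or_truncate(seq, length, pad=0):
    """Clip seq to `length`, or right-pad it with `pad` until it reaches `length`."""
    return seq[:length] + [pad] * max(0, length - len(seq))

MAX_CHARS, MAX_SENTS = 6, 3   # hypothetical thresholds for sentence and text length
sentences = [[3, 1, 4, 1, 5, 9, 2, 6], [2, 7]]   # two digitized sentences

# Unify sentence length (the first sentence is truncated, the second zero-padded) ...
fixed = [pad_or_truncate(s, MAX_CHARS) for s in sentences]
# ... then unify text length with all-zero sentences.
fixed = pad_or_truncate(fixed, MAX_SENTS, pad=[0] * MAX_CHARS)
```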
4. The method for hierarchical emotion analysis of texts based on characters according to claim 1, wherein the establishing of the neural network model at the character level comprises the following specific steps:
the model embedding layer takes each sentence of the preprocessed text as an input unit, converts the subscript of each character of the sentence into corresponding unique vectorization expression, and converts the expression form of each sentence from a one-dimensional subscript sequence into a two-dimensional character vector sequence;
The convolutional neural network layer adopts a plurality of one-dimensional convolutional kernels with different sizes, simultaneously carries out convolutional operation and global maximum pooling operation on the two-dimensional character vector sequence, and splices operation results to obtain output results of multi-convolutional-kernel operation;
and the decoding layer takes the output result of the multi-convolution kernel operation as input, extracts the feature vector of the sentence through the full-connection layer, and the feature vector of the sentence is used as the input of the neural network model at the sentence level.
5. The method according to claim 4, wherein the index of each character in the sentence is converted into a corresponding unique vectorized representation by performing one-hot encoding on the character indices in the sentence.
6. The method for hierarchical emotion analysis of texts based on characters according to claim 4, wherein the convolution operation and the global maximum pooling operation are performed on the two-dimensional character vector sequence, the specific steps including: performing a single-layer convolution operation on the two-dimensional character vector sequence followed by the nonlinear activation function ReLU, with the stride of the convolution operation set to 1.
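A NumPy sketch of the character-level convolution described in claims 4 to 6 (one-hot embedding, stride-1 one-dimensional convolution, ReLU, global max pooling, and splicing over kernel widths); the character-set size, kernel widths, and filter counts are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

V, L = 40, 12                       # character-set size, padded sentence length
idxs = rng.integers(1, V, size=L)   # a digitized sentence
X = np.eye(V)[idxs]                 # one-hot embedding layer: L x V character vectors

def conv_maxpool(X, kernel):
    """Stride-1 one-dimensional convolution, ReLU, then global max pooling."""
    k, _, n_f = kernel.shape        # kernel width, character dim, number of filters
    L = X.shape[0]
    feats = np.stack([
        np.einsum("kv,kvf->f", X[t:t + k], kernel)  # filter responses at position t
        for t in range(L - k + 1)
    ])                                              # (L - k + 1) x n_f
    return np.maximum(feats, 0.0).max(axis=0)       # ReLU, then max over positions

# Several one-dimensional kernels of different widths, 4 filters each.
kernels = [rng.normal(size=(k, V, 4)) for k in (3, 4, 5)]
# Splice (concatenate) the pooled outputs: the feature vector of the sentence.
sentence_vec = np.concatenate([conv_maxpool(X, K) for K in kernels])
```

The concatenated vector (here 3 widths x 4 filters = 12 dimensions) is what the decoding layer of the character-level model consumes.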
7. The method for hierarchical emotion analysis of texts based on characters according to claim 1, wherein the step of establishing the neural network model at sentence level comprises the following specific steps:
the recurrent neural network layer takes the output of the character-level neural network model as input, and obtains the output of each time step and a context vector through a bidirectional recurrent neural network;
the attention layer adopts an attention mechanism: the output of the bidirectional recurrent neural network serves as the value item, that output after passing through a fully connected layer serves as the key item, and the context vector serves as the query item; the weight distribution over the recurrent neural network's output at each time step is thus obtained, and the outputs are multiplied by their weights and summed to obtain the vector representation of the whole text;
the vector representation of the whole text is passed through a fully connected layer to output a numerical score for each emotion class, and a softmax operation converts the result into a probability distribution over the emotion classes; the class with the highest probability is the emotion analysis prediction.
8. The method for analyzing hierarchical text emotion based on characters according to claim 7, wherein the method comprises the following specific steps of taking the output of the bidirectional recurrent neural network as a value item, taking the output after the bidirectional recurrent neural network is connected with a full connection layer as a key item, taking a context vector as a query item, obtaining the weight distribution of the output of each time step of the recurrent neural network, multiplying the output and the weight, and then adding the multiplied output and the weight to obtain the vector representation of the whole text:
The output of each time step passes through a single-layer multi-layer perceptron, Tanh is taken as an activation function to obtain a hidden representation of the output, and the hidden representation is taken as a key item in an attention mechanism;
taking the context vector as the query item of the attention mechanism, the key items are multiplied (dot product) with the query item in sequence to obtain the attention score of each time step; a softmax operation is then performed on the attention distribution, converting it into a probability distribution so that the attention proportions of all time steps sum to 1;
and taking the output of each time step as a value item of an attention mechanism, multiplying the attention proportion occupied by each time step by a corresponding value, and adding the obtained results of all the time steps to obtain the sum of all sentence vectors in the text based on the weight, namely the feature vector of the text.
9. A hierarchical character-based text emotion analysis system, comprising: the system comprises a text preprocessing module, a character-level neural network model establishing module and a sentence-level neural network model establishing module;
the text preprocessing module is used for preprocessing given text data, including designing a character set, dividing the sentences in the text, and obtaining a text representation in digital form based on the character set;
The character-level neural network model establishing module is used for inputting the preprocessed text data into the character-level neural network model, sequentially passing through the model embedding layer, the convolutional neural network layer and the decoding layer, and extracting and outputting a feature vector of each sentence in the text;
the sentence-level neural network model building module is used for outputting probability distribution of emotion classification of the text by taking output of the character-level network as input and sequentially passing through the recurrent neural network layer, the attention layer and the decoding layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010659957.2A CN111858933B (en) | 2020-07-10 | 2020-07-10 | Hierarchical text emotion analysis method and system based on characters |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111858933A true CN111858933A (en) | 2020-10-30 |
CN111858933B CN111858933B (en) | 2024-08-06 |
Family
ID=73153005
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010659957.2A Active CN111858933B (en) | 2020-07-10 | 2020-07-10 | Hierarchical text emotion analysis method and system based on characters |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111858933B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112686898A (en) * | 2021-03-15 | 2021-04-20 | 四川大学 | Automatic radiotherapy target area segmentation method based on self-supervision learning |
CN112699679A (en) * | 2021-03-25 | 2021-04-23 | 北京沃丰时代数据科技有限公司 | Emotion recognition method and device, electronic equipment and storage medium |
CN114297379A (en) * | 2021-12-16 | 2022-04-08 | 中电信数智科技有限公司 | Text binary classification method based on Transformer |
US11966702B1 (en) * | 2020-08-17 | 2024-04-23 | Alphavu, Llc | System and method for sentiment and misinformation analysis of digital conversations |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107038480A (en) * | 2017-05-12 | 2017-08-11 | 东华大学 | A kind of text sentiment classification method based on convolutional neural networks |
CN108628823A (en) * | 2018-03-14 | 2018-10-09 | 中山大学 | In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training |
US20180329886A1 (en) * | 2017-05-15 | 2018-11-15 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Artificial intelligence based method and apparatus for generating information |
CN108846017A (en) * | 2018-05-07 | 2018-11-20 | 国家计算机网络与信息安全管理中心 | The end-to-end classification method of extensive newsletter archive based on Bi-GRU and word vector |
CN109543722A (en) * | 2018-11-05 | 2019-03-29 | 中山大学 | A kind of emotion trend forecasting method based on sentiment analysis model |
US20190188257A1 (en) * | 2016-09-05 | 2019-06-20 | National Institute Of Information And Communications Technology | Context analysis apparatus and computer program therefor |
CN110134771A (en) * | 2019-04-09 | 2019-08-16 | 广东工业大学 | A kind of implementation method based on more attention mechanism converged network question answering systems |
CN110147452A (en) * | 2019-05-17 | 2019-08-20 | 北京理工大学 | A kind of coarseness sentiment analysis method based on level BERT neural network |
CN110210037A (en) * | 2019-06-12 | 2019-09-06 | 四川大学 | Category detection method towards evidence-based medicine EBM field |
WO2019214145A1 (en) * | 2018-05-10 | 2019-11-14 | 平安科技(深圳)有限公司 | Text sentiment analyzing method, apparatus and storage medium |
CN110825845A (en) * | 2019-10-23 | 2020-02-21 | 中南大学 | Hierarchical text classification method based on character and self-attention mechanism and Chinese text classification method |
US20200065384A1 (en) * | 2018-08-26 | 2020-02-27 | CloudMinds Technology, Inc. | Method and System for Intent Classification |
CN110866117A (en) * | 2019-10-25 | 2020-03-06 | 西安交通大学 | Short text classification method based on semantic enhancement and multi-level label embedding |
Non-Patent Citations (4)
Title |
---|
CÍCERO NOGUEIRA DOS SANTOS ET AL: "Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts", PROCEEDINGS OF COLING 2014, THE 25TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS: TECHNICAL PAPERS, 29 August 2014 (2014-08-29), pages 69 - 78, XP055676919 *
FEIRAN HUANG ET AL: "Character-level Convolutional Networks for Text Classification", ACM TRANS. MULTIMEDIA COMPUT. COMMUN. APPL., vol. 16, no. 03, 5 July 2020 (2020-07-05), pages 1 - 19 * |
SONG Yan et al.: "Research on Text Classification Based on Hierarchical Feature Extraction", Computer Applications and Software, vol. 37, no. 02, 29 February 2020 (2020-02-29), pages 68 - 72 *
WANG Liya et al.: "Chinese Text Sentiment Analysis with an Attention Mechanism Introduced into a CNN-BiGRU Network", Journal of Computer Applications, vol. 39, no. 10, 10 October 2019 (2019-10-10), pages 2841 - 2846 *
Also Published As
Publication number | Publication date |
---|---|
CN111858933B (en) | 2024-08-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111291195B (en) | Data processing method, device, terminal and readable storage medium | |
KR102155768B1 (en) | Method for providing question and answer data set recommendation service using adpative learning from evoloving data stream for shopping mall | |
CN111858933B (en) | Hierarchical text emotion analysis method and system based on characters | |
CN110825845A (en) | Hierarchical text classification method based on character and self-attention mechanism and Chinese text classification method | |
CN104834747A (en) | Short text classification method based on convolution neutral network | |
CN103473280A (en) | Method and device for mining comparable network language materials | |
CN112434535A (en) | Multi-model-based factor extraction method, device, equipment and storage medium | |
CN111259153B (en) | Attribute-level emotion analysis method of complete attention mechanism | |
CN112860896A (en) | Corpus generalization method and man-machine conversation emotion analysis method for industrial field | |
US20230073602A1 (en) | System of and method for automatically detecting sarcasm of a batch of text | |
Banik et al. | Gru based named entity recognition system for bangla online newspapers | |
CN113704416A (en) | Word sense disambiguation method and device, electronic equipment and computer-readable storage medium | |
CN113377953B (en) | Entity fusion and classification method based on PALC-DCA model | |
CN116258137A (en) | Text error correction method, device, equipment and storage medium | |
CN113159831A (en) | Comment text sentiment analysis method based on improved capsule network | |
CN115759119B (en) | Financial text emotion analysis method, system, medium and equipment | |
Niam et al. | Hate speech detection using latent semantic analysis (lsa) method based on image | |
Shahade et al. | Multi-lingual opinion mining for social media discourses: An approach using deep learning based hybrid fine-tuned smith algorithm with adam optimizer | |
CN115269834A (en) | High-precision text classification method and device based on BERT | |
CN111241273A (en) | Text data classification method and device, electronic equipment and computer readable medium | |
Kang et al. | Sentiment analysis on Malaysian airlines with BERT | |
CN115757680A (en) | Keyword extraction method and device, electronic equipment and storage medium | |
CN114911940A (en) | Text emotion recognition method and device, electronic equipment and storage medium | |
CN114691836A (en) | Method, device, equipment and medium for analyzing emotion tendentiousness of text | |
CN115292495A (en) | Emotion analysis method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||