WO2019080863A1

WO2019080863A1 - Text sentiment classification method, storage medium and computer

Info

Publication number: WO2019080863A1
Application number: PCT/CN2018/111607
Authority: WO
Inventors: 曾伟波; 郑耀松; 倪时龙; 苏江文; 许成功; 吕君玉; 何天尝; 林祥仙
Original assignee: 福建亿榕信息技术有限公司; 国家电网有限公司; 国网信息通信产业集团有限公司
Priority date: 2017-10-26
Filing date: 2018-10-24
Publication date: 2019-05-02
Also published as: CN107590134A

Abstract

A text sentiment classification method, a storage medium and a computer. Said method comprises the following steps: constructing a sentiment dictionary for input text, the step of constructing a sentiment dictionary comprising selecting and expressing parts of speech, extracting base-level feature vectors; extracting mid-level features, and in combination with the sentiment dictionary, acquiring word vectors of training samples and pooling the word vectors of the training samples, so as to obtain mid-level feature vectors; performing weighted fusion on the base-level feature vectors and the mid-level feature vectors, so as to obtain fused feature vectors; calculating a classification result on the basis of a base-level feature vector classification model, a mid-level feature vector classification model and a fused feature vector classification model. The present invention solves the problem in the prior art that the sentiment classification is not efficient and stable enough.

Description

Text sentiment classification method, storage medium and computer

Cross-reference to related applications

The present application is based on, and claims priority to, a Chinese patent application filed on October 26, 2017.

Technical field

The present invention relates to the field of machine learning, and in particular to a method and a storage medium for text sentiment classification.

Background art

Sentiment classification is mainly used to analyze or predict the emotional category of a text that carries an emotional orientation. The categories are generally either positive and negative, or positive, negative and neutral. According to the size and granularity of the object studied, sentiment analysis techniques can be roughly divided into three levels: word-level, sentence-level and chapter-level sentiment analysis.

Word-level sentiment classification can be divided into dictionary-based and corpus-based sentiment classification models. The dictionary-based model relies on the synonym and antonym relations in an existing dictionary to judge the emotional tendency of words in the text. Some scholars use words such as "good" and "bad" as reference words, and then calculate the difference between the mutual information of a registered word with each of the reference words. Some scholars use HowNet to detect the fuzzy emotion categories of adjectives in the text, and calculate net coverage scores to distinguish adjectives whose emotion category is uncertain from core adjectives whose emotion category is determined. The corpus-based model mainly identifies the sentiment orientation of words by statistical analysis of existing corpora. Some scholars have proposed a method based on the theory of emotional consistency: they hold that different conjunctions carry latent semantic relations, so the use of conjunctions in a corpus can be exploited to mine the sentiment of unregistered words. Other scholars have proposed a method to address the domain dependence of emotional words. First, an existing corpus is used to extract the emotional words and the objects they evaluate, which are then formed into sentiment collocation pairs. A heuristic algorithm is used to calculate the sentiment of each pair, and the results are assembled into a sentiment collocation dictionary, which resolves the context dependence of emotional words to a certain extent.

Sentence-level sentiment classification can be divided into two sub-directions: semantics-based and statistics-based sentiment classification. Semantics-based sentiment classification matches the sentence against a sentiment dictionary to find its emotional words, and then calculates the emotion of the whole sentence from the emotional intensity or polarity of those words. Some scholars try to use rhetorical structure theory to determine the sentiment orientation of sentences. First, according to the theory, the sentences are divided into different blocks of text elements, and each block is assigned a weight according to its importance to the overall emotion of the document. The emotional prediction is then obtained as the weighted sentiment score of the sentence as a whole. Statistics-based sentiment analysis relies on machine learning: a model is trained on labeled data by a machine learning algorithm, and the model is then used to predict the emotional tendency of unseen text. Some scholars construct feature vectors from the number of positive and negative emotion words, negation words, special keywords, part-of-speech tags, emojis and the like, and use machine learning to classify text with an emotional tendency. With the rise of deep learning, some scholars use a recurrent neural network to combine phrase vectors and word vectors and feed them into a classifier as features to analyze sentiment orientation. Experiments demonstrate the effectiveness of this method.

Chapter-level sentiment classification mainly studies the overall emotion of chapter-level texts such as news articles and blogs. The focus of the research is on the semantic information of the text. Some scholars have proposed methods that analyze the evaluation phrases appearing in chapter-level texts: by analyzing the sentiment orientation of these phrases, a sentiment dictionary is constructed semi-automatically and then used to analyze the overall emotion of the text. Machine-learning-based sentiment analysis of chapter-level texts is more common. This approach uses sentiment words, phrases and other resources to construct a sentiment classification model of chapter-level text with a support vector machine. In addition, another method divides the chapter-level text into sentences and uses the maximum entropy algorithm to analyze the sentiment of each sentence; the emotional tendency of each sentence is then combined with its position and other sentence characteristics to form the features of the text, which are fed into a support vector machine to train a chapter-level sentiment classifier. This method also achieves good results.

Summary of the invention

Accordingly, it is necessary to provide a text sentiment classification method to solve the problem that sentiment classification in the prior art is not sufficiently efficient or stable.

To achieve the above object, the inventors provide a text sentiment classification method, comprising the following steps: constructing a sentiment dictionary for an input text, the sentiment dictionary construction step including part-of-speech selection and expression and underlying feature vector extraction; performing middle layer feature extraction, in which the word vectors of the training samples are collected in combination with the sentiment dictionary and pooled to obtain a middle layer feature vector; performing weighted fusion on the underlying feature vector and the middle layer feature vector to obtain a fused feature vector; and calculating a classification result based on an underlying feature vector classification model, a middle layer feature vector classification model and a fused feature vector classification model, respectively.

In the above solution, the underlying feature vector extraction specifically represents the underlying features with a vector space model, in which the feature of each dimension is a normalized TF-IDF weight.

In the above solution, the weighted fusion of the underlying feature vector and the middle layer feature vector is expressed as

[formula not reproduced in the source text]

where L is the underlying feature vector, M is the middle layer feature vector, the weight symbol in the formula denotes the weight of the underlying feature, and || denotes concatenation.

In the above solution, the step of pooling the word vectors comprises: dividing the dimensions of the underlying feature vector into several parts, summing the word vectors in each part, and then combining the summation results in sequential order.

A text sentiment classification storage medium stores a computer program which, when executed by a processor, implements the following steps: constructing a sentiment dictionary for the input text, the sentiment dictionary construction step including part-of-speech selection and expression and underlying feature vector extraction; performing middle layer feature extraction, in which the word vectors of the training samples are collected in combination with the sentiment dictionary and pooled to obtain a middle layer feature vector; performing weighted fusion on the underlying feature vector and the middle layer feature vector to obtain a fused feature vector; and calculating a classification result based on the underlying feature vector classification model, the middle layer feature vector classification model and the fused feature vector classification model, respectively.

Specifically, the weighted fusion of the underlying feature vector and the middle layer feature vector is expressed as

[formula not reproduced in the source text]

where L is the underlying feature vector and M is the middle layer feature vector.

In the above solution, the step of pooling the word vectors further includes dividing the dimensions of the underlying feature vector into several parts, summing the word vectors in each part, and then combining the summation results in sequential order.

A computer comprising the above described storage medium.

Unlike the prior art, the present invention learns an efficient, stable and low-dimensional sentiment dictionary, and then uses that dictionary together with feature fusion and classifier fusion to improve classification accuracy. Generating the classification result from three classifiers, over the underlying, middle layer and fused feature vectors respectively, makes the final classification result more stable and robust. The pooling process also reduces the amount of computation required by the method. In summary, the present invention solves the problem that text sentiment classification in the prior art is not sufficiently efficient and its classification accuracy is insufficient.

Brief description of the drawings

FIG. 1 is a flowchart of a text sentiment classification method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the whole process of a text sentiment classification method according to an embodiment of the present invention;

FIG. 3 is a diagram showing the pooling process according to an embodiment of the present invention;

FIG. 4 is a feature fusion diagram according to an embodiment of the present invention.

Detailed description

The technical content, structural features, objects and effects of the technical solutions are described in detail below with reference to specific embodiments and the accompanying drawings.

Please refer to FIG. 1, which shows a text sentiment classification method. The method is based on a sentiment classification model built on the extreme learning machine. The extreme learning machine is a single-hidden-layer feedforward neural network (SLFN) consisting of an input layer, a hidden layer and an output layer; the input layer is fully connected to the hidden layer, and the hidden layer is fully connected to the output layer. The method of the invention may begin with the following steps.
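As an illustration of the network structure described above, the following is a minimal pure-Python sketch of an extreme learning machine: the input-to-hidden weights are drawn at random and left fixed, and only the hidden-to-output weights are solved in closed form. The sigmoid activation, the ridge term, the class name `ELM` and the helper `solve` are assumptions made for this sketch; the text does not specify them.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


class ELM:
    """Single-hidden-layer feedforward network with fixed random hidden weights."""

    def __init__(self, n_in, n_hidden, n_classes, ridge=1e-3, seed=0):
        rng = random.Random(seed)
        # Input-to-hidden weights and biases are random and never trained.
        self.w = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
                  for _ in range(n_hidden)]
        self.b = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
        self.n_classes = n_classes
        self.ridge = ridge  # assumed regularisation, not taken from the text
        self.beta = None

    def hidden(self, x):
        """Hidden-layer output for one sample (fully connected to the input)."""
        return [sigmoid(sum(wi * xi for wi, xi in zip(row, x)) + bias)
                for row, bias in zip(self.w, self.b)]

    def fit(self, samples, labels):
        """Labels are integers 0 .. n_classes - 1, one-hot encoded as targets."""
        h = [self.hidden(x) for x in samples]
        t = [[1.0 if c == y else 0.0 for c in range(self.n_classes)]
             for y in labels]
        m = len(self.b)
        # Normal equations: (H^T H + ridge * I) beta = H^T T
        lhs = [[sum(row[i] * row[j] for row in h) + (self.ridge if i == j else 0.0)
                for j in range(m)] for i in range(m)]
        rhs = [[sum(hr[i] * tr[c] for hr, tr in zip(h, t))
                for c in range(self.n_classes)] for i in range(m)]
        self.beta = solve(lhs, rhs)
        return self

    def scores(self, x):
        """Output-layer vector: one score per class."""
        hx = self.hidden(x)
        return [sum(hv * row[c] for hv, row in zip(hx, self.beta))
                for c in range(self.n_classes)]

    def predict(self, x):
        s = self.scores(x)
        return max(range(len(s)), key=s.__getitem__)


# Toy data: the class equals the first input coordinate.
samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
labels = [0, 0, 1, 1]
model = ELM(n_in=2, n_hidden=20, n_classes=2).fit(samples, labels)
```

Because training reduces to one linear solve, three such models (one per feature type) can be trained independently and cheaply, which is presumably why this classifier suits the three-model scheme of the method.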

In step S100, a sentiment dictionary is constructed for the input text; the sentiment dictionary construction step includes part-of-speech selection and expression and underlying feature vector extraction. In some embodiments, as shown in FIG. 2, the sentiment dictionary construction step includes two processes: part-of-speech selection and underlying feature selection. For part-of-speech selection, the present invention uses nouns, verbs, adjectives and adverbs together as reference words, and the sentiment dictionary can be the set of reference words of these four parts of speech that appear in all the selected corpora. Combining words of different parts of speech to form the latent semantic information of a document ensures the coverage of the sentiment dictionary to the greatest extent while retaining the semantic information of the document. Underlying feature vector extraction applies a feature selection principle based on the chi-square statistic to further select the feature words that best represent the emotional polarity of the text. The selected underlying features are expressed with a vector space model, in which the feature of each dimension of the vector is a normalized TF-IDF weight.
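The chi-square selection and normalized TF-IDF representation of step S100 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: documents are assumed to be already segmented and filtered to the four parts of speech, the function names are hypothetical, and the smoothed IDF and L2 normalization are common choices that the text does not specify.

```python
import math
from collections import Counter


def chi_square(docs, labels, term, target_label):
    """Chi-square statistic of a term against one class (2x2 contingency table)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target_label
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(docs, labels, top_k):
    """Keep the top_k terms with the highest chi-square score over all classes."""
    vocab = sorted({t for tokens in docs for t in tokens})
    classes = sorted(set(labels))
    score = {t: max(chi_square(docs, labels, t, c) for c in classes) for t in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda t: (-score[t], t))[:top_k]


def tfidf_vector(tokens, features, docs):
    """Vector space representation: one L2-normalized TF-IDF weight per feature."""
    counts = Counter(tokens)
    n_docs = len(docs)
    vec = []
    for term in features:
        df = sum(1 for doc in docs if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumption)
        vec.append(counts[term] * idf)
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


docs = [["good", "service", "fast"], ["good", "quality"],
        ["bad", "service", "slow"], ["bad", "quality"]]
labels = [1, 1, 0, 0]
features = select_features(docs, labels, top_k=2)
underlying = tfidf_vector(docs[0], features, docs)
```

In this toy corpus, "good" and "bad" separate the two classes perfectly and are selected, while "service" and "quality" occur equally in both classes and score zero.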

In step S102, middle layer feature extraction is performed: in combination with the sentiment dictionary, the word vectors of the training samples are collected and pooled to obtain the middle layer feature vector. Specifically, a Skip-gram model can be trained in an unsupervised manner, and the training samples are then input to the trained model to generate the word vectors of the training samples. The specific pooling steps are shown in FIG. 3:

(1) Divide the word vectors of the text into several parts, sum the word vectors in each part, and then combine the summation results in sequential order. Suppose the text contains x words, of which t words remain after underlying feature extraction. The text is then represented as T = (w_1, w_2, ..., w_t), where each word has a word vector with k-dimensional features;

(2) Divide the word vectors in the text T into N parts to form N word vector groups, each group containing t/N word vectors;

(3) For each word vector group, accumulate all the word vectors in the group, so that each group yields one feature vector v(z) whose dimension is also k;

(4) Concatenate the feature vectors of the N word vector groups to obtain a new vector for the whole document: v(z_1) || v(z_2) || ... || v(z_N), where || denotes concatenation.
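The four pooling steps above can be sketched in a few lines of Python. The function name `pool_word_vectors` is hypothetical, and the handling of the case where t is not divisible by N (groups of nearly equal size) is an assumption; the text only describes groups of t/N word vectors.

```python
def pool_word_vectors(word_vectors, n_groups):
    """Split t word vectors into n_groups consecutive groups, sum each group
    element-wise, and concatenate the group sums in order (length n_groups * k)."""
    t = len(word_vectors)
    if t == 0 or n_groups <= 0:
        raise ValueError("need at least one word vector and one group")
    k = len(word_vectors[0])
    pooled = []
    for g in range(n_groups):
        # Integer boundaries give groups of size t/N when N divides t.
        start = g * t // n_groups
        end = (g + 1) * t // n_groups
        group_sum = [0.0] * k
        for vec in word_vectors[start:end]:
            for j, value in enumerate(vec):
                group_sum[j] += value
        pooled.extend(group_sum)  # v(z_1) || v(z_2) || ... || v(z_N)
    return pooled


# t = 4 words, k = 2 dimensions, N = 2 groups
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
middle = pool_word_vectors(vectors, n_groups=2)
```

The result always has N * k dimensions regardless of how many words the text contains, so documents of different lengths map to middle layer feature vectors of the same size.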

As shown in FIG. 1 and FIG. 4, the present invention further performs step S104, in which the underlying feature vector and the middle layer feature vector are weighted and fused to obtain a fused feature vector, and step S106, in which the classification result is calculated based on the underlying feature vector classification model, the middle layer feature vector classification model and the fused feature vector classification model, respectively. In some embodiments, referring to FIG. 2, the sentiment of an input sample is classified as follows: the underlying feature, the middle layer feature and the fused feature of the sample to be classified are fed into the corresponding trained extreme learning machine sentiment classification models; the output vectors of the three classification models are added together to obtain the final discriminant vector; and the label corresponding to the maximum value in that vector is the final emotion category.
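The decision rule described above, summing the three output vectors and taking the label of the largest entry, can be sketched as follows. The function name, the label names and the score values are illustrative assumptions only.

```python
def fuse_predictions(score_vectors, class_labels):
    """Add the classifiers' output vectors element-wise and return the label
    whose summed score is largest, together with the discriminant vector."""
    discriminant = [sum(column) for column in zip(*score_vectors)]
    best = max(range(len(discriminant)), key=discriminant.__getitem__)
    return class_labels[best], discriminant


class_labels = ["negative", "neutral", "positive"]
underlying_scores = [0.2, 0.7, 0.1]  # output of the underlying feature model
middle_scores = [0.5, 0.3, 0.2]      # output of the middle layer feature model
fused_scores = [0.4, 0.4, 0.2]       # output of the fused feature model
label, discriminant = fuse_predictions(
    [underlying_scores, middle_scores, fused_scores], class_labels)
```

Here the middle layer model alone would choose "negative", but the summed vector favors "neutral", which illustrates how the combination can override a single classifier.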

Through the above steps, the present invention learns an efficient, stable and low-dimensional sentiment dictionary, and then uses that dictionary together with feature fusion and classifier fusion to improve classification accuracy. Generating the classification result from three classifiers, over the underlying, middle layer and fused feature vectors respectively, makes the final classification result more stable and robust. The pooling process also reduces the amount of computation required by the method. In summary, the present invention solves the problem that text sentiment classification in the prior art is not sufficiently efficient and its classification accuracy is insufficient.

In some further embodiments, the weighted fusion of the underlying feature vector and the middle layer feature vector is expressed as

[formula not reproduced in the source text]

where L is the underlying feature vector, M is the middle layer feature vector, the weight symbol in the formula denotes the weight of the underlying feature, and || denotes concatenation. In this manner, the ratio in which the underlying feature vector and the middle layer feature vector are combined can be adjusted according to user needs. Tuning this ratio allows the combination to be fitted more closely to the data, which helps improve the classification accuracy of the model.
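Since the fusion formula itself is not available in this text, the following Python sketch shows only one plausible reading of it: the underlying feature vector is scaled by a weight and concatenated with the scaled middle layer feature vector. The symbol `alpha`, the complementary weight `1 - alpha` on the middle layer vector, and the function name are all assumptions of this sketch, not taken from the source.

```python
def weighted_fusion(underlying, middle, alpha):
    """Assumed form: fused = alpha * L || (1 - alpha) * M, with 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    scaled_underlying = [alpha * v for v in underlying]
    scaled_middle = [(1.0 - alpha) * v for v in middle]
    return scaled_underlying + scaled_middle  # list concatenation plays the role of ||


fused = weighted_fusion([1.0, 2.0], [4.0], alpha=0.25)
```

Under this reading, the fused vector has as many dimensions as L and M together, and moving `alpha` toward 1 makes the underlying features dominate the fused representation.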

In some embodiments, a preprocessing step may be performed before step S100 to remove information that is irrelevant to the task. Preprocessing includes normalizing the encoding format, removing illegal characters, word segmentation and part-of-speech tagging, and stop word removal. Encoding normalization unifies the text encoding, for example by converting the text content to UTF-8; illegal characters can be filtered by regular expression matching; word segmentation and part-of-speech tagging are performed with the ICTCLAS Chinese lexical analysis system; and stop word removal uses a stop word list to filter out words that appear frequently in the text but carry little meaning for sentiment analysis. Preprocessing makes the segmented text better suited to the classifier and speeds up text recognition by the method of the invention.
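A minimal sketch of this preprocessing pipeline is given below. It is illustrative only: the regular expression defining "illegal" characters and the stop word set are assumptions, and whitespace splitting stands in for the ICTCLAS segmenter and part-of-speech tagger, which this sketch does not call.

```python
import re

# Assumed definition of "illegal" characters: anything that is neither a
# word character nor whitespace (punctuation, markup symbols, and so on).
ILLEGAL_CHARS = re.compile(r"[^\w\s]")


def preprocess(raw_bytes, stop_words, tokenize=str.split):
    """Decode to a single encoding, strip illegal characters, segment the text,
    and drop stop words. `tokenize` is a placeholder for a real segmenter."""
    text = raw_bytes.decode("utf-8", errors="ignore")  # unify the encoding
    text = ILLEGAL_CHARS.sub(" ", text)                # regex-based filtering
    tokens = tokenize(text)                            # word segmentation
    return [tok for tok in tokens if tok not in stop_words]


stop_words = {"the", "is", "b"}
tokens = preprocess("the food <b>is</b> great!!".encode("utf-8"), stop_words)
```

For Chinese text, `tokenize` would be replaced by a function that wraps the chosen lexical analyzer and returns the segmented words, optionally together with their part-of-speech tags.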

Further, the underlying feature vector extraction specifically represents the underlying features with a vector space model, in which the feature of each dimension is a normalized TF-IDF weight.

In the weighted fusion, L is the underlying feature vector and M is the middle layer feature vector.

Preferably, the step of pooling the word vectors further comprises: dividing the dimensions of the underlying feature vector into several parts, summing the word vectors in each part, and then combining the summation results in sequential order.

A computer comprising the above-described storage medium. By providing the above storage medium and computer, the present invention solves the problem that text sentiment classification in the prior art is not sufficiently efficient and its classification accuracy is insufficient.

It should be noted that although the above embodiments have been described herein, the scope of patent protection of the invention is not limited thereby. Therefore, changes and modifications made to the embodiments described herein based on the innovative concept of the present invention, or equivalent structures or equivalent process transformations made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, are all included in the scope of patent protection of the present invention.

Claims

1. A text sentiment classification method, comprising the following steps:

constructing a sentiment dictionary for the input text, the sentiment dictionary construction step including part-of-speech selection and expression and underlying feature vector extraction;

performing middle layer feature extraction, in which the word vectors of the training samples are collected in combination with the sentiment dictionary and pooled to obtain a middle layer feature vector;

performing weighted fusion on the underlying feature vector and the middle layer feature vector to obtain a fused feature vector, and calculating a classification result based on an underlying feature vector classification model, a middle layer feature vector classification model and a fused feature vector classification model, respectively.
2. The text sentiment classification method according to claim 1, wherein the underlying feature vector extraction represents the underlying features with a vector space model, in which the feature of each dimension is a normalized TF-IDF weight.
3. The text sentiment classification method according to claim 1, wherein the weighted fusion of the underlying feature vector and the middle layer feature vector is expressed as

[formula not reproduced in the source text]

where L is the underlying feature vector, M is the middle layer feature vector, the weight symbol in the formula denotes the weight of the underlying feature, and || denotes concatenation.
4. The text sentiment classification method according to claim 1, wherein the step of pooling the word vectors comprises:

dividing the dimensions of the underlying feature vector equally into several parts, summing the word vectors in each part, and combining the summation results in sequential order.
5. A text sentiment classification storage medium storing a computer program which, when executed by a processor, implements the following steps: constructing a sentiment dictionary for the input text, the sentiment dictionary construction step including part-of-speech selection and expression and underlying feature vector extraction;

performing middle layer feature extraction, in which the word vectors of the training samples are collected in combination with the sentiment dictionary and pooled to obtain a middle layer feature vector;

performing weighted fusion on the underlying feature vector and the middle layer feature vector to obtain a fused feature vector, and calculating a classification result based on an underlying feature vector classification model, a middle layer feature vector classification model and a fused feature vector classification model, respectively.
6. The text sentiment classification storage medium according to claim 5, wherein the underlying feature vector extraction represents the underlying features with a vector space model, in which the feature of each dimension is a normalized TF-IDF weight.
7. The text sentiment classification storage medium according to claim 5, wherein the weighted fusion of the underlying feature vector and the middle layer feature vector is expressed as

[formula not reproduced in the source text]

where L is the underlying feature vector, M is the middle layer feature vector, the weight symbol in the formula denotes the weight of the underlying feature, and || denotes concatenation.
8. The text sentiment classification storage medium according to claim 5, wherein the step of pooling the word vectors further comprises dividing the dimensions of the underlying feature vector into a plurality of parts, summing the word vectors in each part, and combining the summation results in sequential order.
9. A computer comprising the storage medium of any of claims 5-8.