CN104965822A - Emotion analysis method for Chinese texts based on computer information processing technology


Info

Publication number
CN104965822A
CN104965822A
Authority
CN
China
Prior art keywords
comment
chinese
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510452024.5A
Other languages
Chinese (zh)
Other versions
CN104965822B (en)
Inventor
高琰
晁绪耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN201510452024.5A
Publication of CN104965822A
Application granted
Publication of CN104965822B
Status: Expired - Fee Related
Anticipated expiration


Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses an emotion analysis method for Chinese texts based on computer information processing technology. Chinese product comments are segmented into words, and a bag-of-words model is used to generate a vector representation of each comment. The vector of every comment is fed to the visible units of a restricted Boltzmann machine (RBM), a deep learning model. The RBM extracts the sentiment features of the Chinese texts, and the extracted features are input to a support vector machine (SVM) for text emotion classification. The RBM improves the relevance of the features to emotional semantics, while the SVM improves the accuracy of emotion classification of Chinese product comments.

Description

Chinese text emotion analysis method based on computer information processing technology
Technical Field
The invention relates to natural language processing technology, and in particular to a Chinese text emotion analysis method based on computer information processing technology.
Background
With the rapid development of the internet and electronic commerce, people increasingly shop online and publish their opinions about products on the internet. Sentiment analysis is a popular research field in artificial intelligence; it aims to enable a computer to extract, from subjective text generated by a user, the author's emotional tendency toward the evaluated object, namely positive or negative. Positive means the user likes the product and holds a favorable, approving attitude toward it; negative means the user holds an unfavorable attitude toward the product because it is defective or unsatisfactory.
At present, the main methods for analyzing the sentiment of online product comments are those based on sentiment knowledge and those based on machine learning.
Methods based on sentiment knowledge mainly rely on existing sentiment dictionaries and linguistic knowledge to classify the emotional tendency of a text. For example, Polanyi et al. focused on constructing sentiment dictionaries and sentiment rules for polar words and related content words; Velikovich et al. showed how to learn a large sentiment dictionary from a seed dictionary and a generation framework for sentiment analysis of text; Lee et al. determined the emotional tendency of a text by comparing the numbers of positive and negative words, i.e., whichever class of words is more numerous gives the emotional tendency of the whole text. Such sentiment dictionaries depend mainly on expert formulation, however, and cannot be comprehensive.
Machine learning-based methods include supervised and unsupervised learning. Turney et al. proposed a method for performing sentiment analysis on document-level text using unsupervised learning. Tanghufeng et al. used an n-gram model to extract nouns, verbs, adverbs, adjectives and the like as text features, classified texts with KNN, Bayes and SVM classifiers respectively, and trained on different scales and different numbers of features to identify the emotional tendency of texts. Sunyhan et al. proposed an unsupervised topic-emotion mixture model that collects the emotion label and topic label of each sentence without annotating the text, thereby obtaining the emotion words of the text and realizing document-level emotion classification. However, product comments in Chinese face a language with many meanings and rules, and when machine learning performs automatic emotion analysis on Chinese texts, accuracy is often low because the selected or extracted features are inappropriate.
In machine learning-based emotion analysis, feature extraction is an important factor determining the effectiveness of the analysis. Deep learning combines low-level features to form more abstract high-level representations of attribute classes or features, discovering a distributed feature representation of the data. Deep learning builds neural networks that simulate the human brain for analytical learning, mimicking the mechanisms the human brain uses to interpret data such as images, sounds and text. The RBM is a deep learning algorithm; it has deep learning's characteristic of learning layer by layer to obtain structural information about the data, and it can be used to extract features that represent emotion polarity.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a Chinese text sentiment analysis method based on computer information processing technology.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a Chinese text sentiment analysis method based on computer information processing technology comprises the following steps:
1) inputting a Chinese product comment text set; performing Chinese word segmentation with the precise (accurate) mode of the Chinese word segmentation tool and determining the part of speech of each word; filtering by part of speech, keeping only nouns, adjectives, adverbs and conjunctions, to obtain the segmented corpus; computing, over the segmented text, the weight of each word in each comment; and generating a comment vector;
2) performing feature selection on the comment vectors with a restricted Boltzmann machine; the restricted Boltzmann machine consists of a visible layer V and a hidden layer H; the node neurons of the visible layer are binary random variables; the nodes of the visible layer are mutually independent, and the nodes of the hidden layer are mutually independent; the connection weight between visible-layer and hidden-layer nodes is W; the comment vectors generated in step 1) are input to the visible unit layer;
3) training the restricted Boltzmann machine by the contrastive divergence method; the parameters of the restricted Boltzmann machine are initialized, and after 100 iterations the conditional probability $p(h_j = 1 \mid v)$ that hidden-layer node $h_j$ equals 1 given the visible layer $v$, and the conditional probability $p(v_i = 1 \mid h)$ that visible-layer node $v_i$ equals 1 given the hidden layer $h$, are obtained, where $p(h_j = 1 \mid v) = \sigma(b_j + \sum_i v_i w_{ij})$ and $p(v_i = 1 \mid h) = \sigma(a_i + \sum_j h_j w_{ij})$, with $\sigma(x) = 1/(1 + \exp(-x))$; the offsets $a_i$ and $b_j$ all have initial value 0; $w_{ij}$ is the connection weight between visible-layer node $i$ and hidden-layer node $j$ and obeys a normal distribution with mean 0 and standard deviation 1;
4) reconstructing each visible-layer node $v_i$ from the hidden layer; if the reconstructed value of $v_i$ is less than or equal to 0.01, the visible node $v_i$ is considered unable to represent higher-level structure and the word corresponding to $v_i$ is discarded; if the reconstructed value of $v_i$ is greater than 0.01, the difference between the reconstructed value and the input value of $v_i$ is taken; the maximum number of emotional features per comment is set to 30 and the differences are sorted in ascending order: the first 30 words $v_i$ have smaller differences than the following words, indicating that $v_i$ hardly changed after reconstruction, so the Chinese words corresponding to them are taken as feature words, while the later words have larger differences than the first 30, indicating that $v_i$ changed greatly after reconstruction, so the corresponding words are treated as non-feature words; if fewer than 30 visible nodes of a comment have reconstructed values greater than 0.01, all the words corresponding to those visible nodes are kept as feature words; the retained feature words together form the vector space of each comment;
5) inputting the comment vectors newly generated after step 4) into a support vector machine for classification to obtain the emotion category of each comment vector; the emotion category is positive or negative.
In step 1), the comment vector is generated as follows: using the bag-of-words model from natural language processing, if a word appears in a comment, the weight of that word in the comment vector is set to 1, otherwise it is set to 0; the length of the comment vector is the size of the dictionary of extracted words.
The number of nodes of the visible layer is 476 and the number of nodes of the hidden layer is 50.
Compared with the prior art, the invention has the following beneficial effects. The method uses the deep learning algorithm RBM to form a high-level abstraction of the original feature information of Chinese product comments, reconstructs the original feature information from that high-level information, and performs feature selection according to the reconstructed values, obtaining feature words that better express high-level structural information. The method can extract features that express the emotional essence of product comments. Whereas feature extraction with an RBM produces new features that are combinations of the original ones, feature selection with an RBM picks a comment's feature words by comparing the reconstructed comment with the original comment, so the selected features are more intuitive than extracted ones. Feature selection is performed on the original features of Chinese product comments by the RBM, and the comment vectors generated from the feature words are then input to the SVM for sentiment classification. Experimental results show that the method achieves high precision.
Drawings
FIG. 1 is a flow chart of the RBM-SVM algorithm;
FIG. 2 is a bag-of-words text vector representation;
FIG. 3 is a word space after RBM reconstruction;
FIG. 4 compares a comment's features before and after feature-selection dimension reduction.
Detailed Description
The specific implementation process of the invention is as follows:
step 1: 1706 comment texts in a Chinese comment corpus related to electronic products are extracted from the data hall, wherein 853 positive comments and 853 negative comments exist.
Step 2: the 1706 comment texts are segmented using the precise (accurate) mode of the Chinese word segmentation tool, and the part of speech of each word is judged. Stop words are removed, and part-of-speech filtering rules are defined as follows:
a. establish a part-of-speech set expressing emotional tendency: {noun, adjective, adverb, conjunction};
b. judge the part of speech of each word in the comment; if a word's part of speech is not in the set, delete the word, otherwise keep it.
Step 3: using the bag-of-words model from natural language processing, a word's entry is set to 1 if the word appears in the comment and 0 otherwise; this serves as the word-vector representation of the text, and the length of the vector is the size of the dictionary of extracted words. For example, take two comments d1 and d2; after processing, the nouns, adjectives, adverbs, conjunctions, etc. extracted from them are denoted w1, w2, w3, w4 and w5. The word vectors represented by the bag-of-words model are then as in the sketch below (see also FIG. 2).
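For concreteness, a minimal Python sketch of this vectorization; the words w1–w5 and comments d1, d2 are the illustrative placeholders from the example above, and which words occur in which comment is assumed:

```python
# Bag-of-words sketch: each comment becomes a binary vector over the
# dictionary of retained words (nouns, adjectives, adverbs, conjunctions).
dictionary = ["w1", "w2", "w3", "w4", "w5"]        # extracted word dictionary

def to_vector(comment_words, dictionary):
    # weight is 1 if the word appears in the comment, 0 otherwise
    return [1 if w in comment_words else 0 for w in dictionary]

d1 = {"w1", "w2", "w4"}   # hypothetical: words occurring in comment d1
d2 = {"w2", "w3", "w5"}   # hypothetical: words occurring in comment d2
print(to_vector(d1, dictionary))   # [1, 1, 0, 1, 0]
print(to_vector(d2, dictionary))   # [0, 1, 1, 0, 1]
```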
and 4, step 4: a two-layer Restricted Boltzmann Machine (RBM) is established, the number of nodes of the visible layer is 476, and the number of nodes of the hidden layer is 50. The individual visible cells within the visible layer are independent of each other. The node i of the visible layer is connected with the node j of the hidden layer, and the connection weight value is Wji. The confined boltzmann machine has a joint configuration of energy between the visible layer and the hidden layer, which can be expressed as:
$E(v, h; \theta) = -\sum_{ij} w_{ij} v_i h_j - \sum_i a_i v_i - \sum_j b_j h_j$

where $i$ indexes the visible-layer units, $j$ indexes the hidden-layer units, $\theta = \{w_{ij}, a_i, b_j\}$ are the model parameters, $w_{ij}$ is the connection weight between visible-layer node $i$ and hidden-layer node $j$, and $a_i$ and $b_j$ are the offsets of the visible layer and the hidden layer, respectively. The joint probability distribution is:
$p_\theta(v, h) = \frac{1}{Z(\theta)} \exp(-E(v, h; \theta)) = \frac{1}{Z(\theta)} \prod_{ij} e^{w_{ij} v_i h_j} \prod_i e^{a_i v_i} \prod_j e^{b_j h_j}$

where $Z(\theta) = \sum_{v,h} \exp(-E(v, h; \theta))$ is the normalizing partition function; the joint probability gives the probability of a joint sample of the hidden layer and the visible layer.
Because there are no direct connections between hidden-layer units in an RBM, the hidden-layer nodes are conditionally independent given the visible layer, i.e. $p(h \mid v) = \prod_j p(h_j \mid v)$, so an unbiased sample under the data distribution is readily obtained. Given a randomly selected visible layer $v$, the probability that hidden-layer unit $h_j$ takes the binary state 1 is:
$p(h_j = 1 \mid v) = \sigma\Bigl(b_j + \sum_i v_i w_{ij}\Bigr) \qquad (3\text{-}3)$

where the function

$\sigma(x) = \frac{1}{1 + \exp(-x)} \qquad (3\text{-}4)$
Here $(v_i, h_j)$ is an unbiased sample, and this function, like the activation function in a neural network, defines a conditional Bernoulli distribution. Since there are likewise no direct connections between visible-layer units, a sample of the visible units is easy to obtain, and the probability that a given visible-layer unit takes the binary state 1 is:
$p(v_i = 1 \mid h) = \sigma\Bigl(a_i + \sum_j h_j w_{ij}\Bigr)$
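These two conditionals can be sketched in a few lines of NumPy. The layer sizes (476 visible, 50 hidden) and the initializations follow steps 4 and 7; the function and variable names are illustrative, not from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 476, 50

W = rng.normal(0.0, 1.0, size=(n_visible, n_hidden))  # w_ij ~ N(0, 1)
a = np.zeros(n_visible)                               # visible offsets a_i
b = np.zeros(n_hidden)                                # hidden offsets b_j

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v):
    # p(h_j = 1 | v) = sigma(b_j + sum_i v_i w_ij), eq. (3-3)
    return sigmoid(b + v @ W)

def p_v_given_h(h):
    # p(v_i = 1 | h) = sigma(a_i + sum_j h_j w_ij)
    return sigmoid(a + W @ h)
```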
and 5: and inputting the segmented Chinese comments into the RBM, wherein each word in the dictionary corresponds to one node of the visible layer. And (5) adopting RBM to perform unsupervised learning.
Step 6: the RBM is trained rapidly and without supervision by a method called contrastive divergence (CD), which in an RBM performs k steps of Gibbs sampling, i.e. learning with CD-k Gibbs sampling. In the CD-k algorithm, the state $v^0$ of the visible nodes is initialized with a training sample or a random initial state, and the following alternating sampling is performed, where $h^0$ is drawn from $p(h \mid v^0)$, $v^1$ is drawn from $p(v \mid h^0)$, and so on:
$h^0 \sim p(h \mid v^0),\; v^1 \sim p(v \mid h^0) \qquad (3\text{-}13)$
$h^1 \sim p(h \mid v^1),\; v^2 \sim p(v \mid h^1) \qquad (3\text{-}14)$
$\ldots,\; v^{k+1} \sim p(v \mid h^k) \qquad (3\text{-}15)$
Let S be the training set, and let $(v_1, \ldots, v_m)$ and $(h_1, \ldots, h_n)$ be the visible and hidden nodes of the RBM, respectively. The CD-k Gibbs procedure is sketched below; its parameters are as follows: $\Delta w_{ij}$ is the increment of the weight $w_{ij}$, $\Delta a_i$ is the increment of the offset $a_i$, and $\Delta b_j$ is the increment of the offset $b_j$.
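A minimal CD-k sketch consistent with the chain (3-13)–(3-15) and the increments above; it reuses W, a, b and the conditional functions from the earlier sketch, and its structure is an assumption rather than the patent's own listing:

```python
def cd_k_update(v0, k=10, lr=0.01, rng=np.random.default_rng(1)):
    """One CD-k parameter update from a single binary comment vector v0."""
    global W, a, b
    v = v0
    ph0 = p_h_given_v(v0)                    # positive-phase hidden probabilities
    for _ in range(k):                       # alternate sampling, (3-13)..(3-15)
        h = (rng.random(n_hidden) < p_h_given_v(v)).astype(float)
        v = (rng.random(n_visible) < p_v_given_h(h)).astype(float)
    phk = p_h_given_v(v)                     # negative-phase hidden probabilities
    W += lr * (np.outer(v0, ph0) - np.outer(v, phk))  # delta w_ij
    a += lr * (v0 - v)                                # delta a_i
    b += lr * (ph0 - phk)                             # delta b_j
    return v                                 # reconstruction of v0 after k steps
```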
Step 7: initialize the parameters of the restricted Boltzmann machine (RBM): as in step 4, the weights W of the RBM obey a normal distribution with mean 0 and standard deviation 1; the offsets $a_i$ and $b_j$ all have initial value 0; the learning rate is 0.01.
Step 8: the number of iterations is set to 100, and the CD-k Gibbs algorithm finally yields the values of the weight $w_{ij}$ between visible node $i$ and hidden node $j$ and of the offsets $a_i$ and $b_j$; $p(v_i = 1 \mid h)$ is then computed with the formula given above.
Step 9: the word space of each comment is reconstructed according to $p(v_i = 1 \mid h)$. If the reconstructed value of $v_i$ is less than or equal to 0.01, the visible node $v_i$ is considered unable to represent higher-level structure, and the word corresponding to $v_i$ is discarded. If the reconstructed value of $v_i$ is greater than 0.01, the difference between the reconstructed value and the input value is taken. If this difference is small, the word hardly changed after reconstruction, so the corresponding Chinese word is looked up and can serve as a feature representation; conversely, if the difference is large, the word changed greatly after reconstruction and cannot serve as a feature representation of the comment. Accordingly, the differences are sorted in ascending order and the first 30 are selected as feature words. If fewer than 30 visible nodes of a comment have reconstructed values greater than 0.01, all the words corresponding to those visible nodes are kept as feature words. Feature selection thus reduces the word-vector space of the comment.
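A minimal sketch of this selection rule; v_input and v_recon are assumed to be a comment's original and reconstructed visible vectors, and words to map visible-node indices back to dictionary words, while the threshold 0.01 and the cap of 30 are the values stated above:

```python
def select_features(v_input, v_recon, words, threshold=0.01, max_feats=30):
    """Keep words whose visible units reconstruct above `threshold`, ranked
    by how little they change after reconstruction (|recon - input|)."""
    kept = [(abs(v_recon[i] - v_input[i]), words[i])
            for i in range(len(words)) if v_recon[i] > threshold]
    kept.sort(key=lambda pair: pair[0])        # smallest difference first
    return [w for _, w in kept[:max_feats]]    # at most 30 feature words
```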
Step 10: the number of Gibbs sampling steps K is selected experimentally. In the experiment, the 1706 comment texts are randomly divided into a training set and a test set at a ratio of 3:1. The hidden layer of the RBM is set to 2 states, represented by 0 and 1: 0 means the judged emotion of the product comment is negative, i.e. the user holds a negative attitude toward the product, and 1 means the judged emotion is positive, i.e. the user holds a positive attitude toward the product. The RBM is used as a clusterer and outputs the emotion category of each comment, with K set to 5, 10, 15, 20 and 25 in turn. As K varies, the RBM's recognition results on product comments are measured by three indices: precision, recall and F-measure.
TABLE 1: Recognition results of the RBM on product comments for different values of K
The experimental results show that the F-measure is highest when K = 10, indicating the best overall performance, so K is set to 10.
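For reference, a small sketch of the three indices, assuming they are computed from true-positive, false-positive and false-negative counts in the usual way:

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure (their harmonic mean) from counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)

# e.g. 400 true positives, 50 false positives, 80 false negatives
print(precision_recall_f(400, 50, 80))   # -> approximately (0.889, 0.833, 0.860)
```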
Step 11: the dimension-reduced comments are input to a support vector machine (SVM) for emotion classification. The SVM is a maximum-margin linear classifier in feature space; its learning strategy is to maximize the margin, which is ultimately converted into solving a convex quadratic programming problem. The objective function of the SVM is:
$\max W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$

$\text{s.t.} \sum_{i=1}^{N} \alpha_i y_i = 0; \quad C \ge \alpha_i \ge 0, \; i = 1, \ldots, N$
solving the objective function to obtain:
according to <math><mrow> <msup> <mi>&alpha;</mi> <mo>*</mo> </msup> <mo>=</mo> <mrow> <mo>(</mo> <msubsup> <mi>&alpha;</mi> <mn>1</mn> <mo>*</mo> </msubsup> <mo>,</mo> <mo>...</mo> <mo>...</mo> <msubsup> <mi>&alpha;</mi> <mn>1</mn> <mo>*</mo> </msubsup> <mo>)</mo> </mrow> <mo>,</mo> </mrow></math> Computing <math><mrow> <msup> <mi>w</mi> <mo>*</mo> </msup> <mo>=</mo> <msubsup> <mi>&Sigma;</mi> <mrow> <mi>i</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>l</mi> </msubsup> <msub> <mi>y</mi> <mi>i</mi> </msub> <msubsup> <mi>&alpha;</mi> <mi>i</mi> <mo>*</mo> </msubsup> <msub> <mi>x</mi> <mi>i</mi> </msub> <mo>,</mo> </mrow></math> Selection of alpha*A component ofSatisfy the requirement of <math><mrow> <mn>0</mn> <mo><</mo> <msubsup> <mi>&alpha;</mi> <mi>j</mi> <mo>*</mo> </msubsup> <mo><</mo> <mi>C</mi> <mo>,</mo> </mrow></math>
And calculate out according to the above <math><mrow> <msup> <mi>b</mi> <mo>*</mo> </msup> <mo>=</mo> <msub> <mi>y</mi> <mi>j</mi> </msub> <mo>-</mo> <msubsup> <mi>&Sigma;</mi> <mrow> <mi>i</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>l</mi> </msubsup> <msub> <mi>y</mi> <mi>i</mi> </msub> <msubsup> <mi>&alpha;</mi> <mi>i</mi> <mo>*</mo> </msubsup> <mi>K</mi> <mrow> <mo>(</mo> <msub> <mi>x</mi> <mi>i</mi> </msub> <mo>,</mo> <msub> <mi>x</mi> <mi>j</mi> </msub> <mo>)</mo> </mrow> </mrow></math>
where $\alpha = \{\alpha_1, \ldots, \alpha_N\}$; $K(x_i, x_j)$ is the kernel function of Chinese comments $x_i$ and $x_j$; $y_i$ is the label of Chinese comment $x_i$; and C is a penalty factor, whose value we choose as 0.0001 in the experiment.
In the SVM, the kernel function we use for Chinese comments $x_1$ and $x_2$ is the radial basis function (RBF), given by the following equation; the kernel function maps the feature space from a low-dimensional space to a high-dimensional space.
$K(x_1, x_2) = \exp\left(-\frac{\lVert x_1 - x_2 \rVert^2}{2\sigma^2}\right)$
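A sketch of this final stage using scikit-learn's SVC, which is an assumption — the patent names no implementation; the RBF kernel, C = 0.0001 and the roughly 3:1 split follow the values stated in the text, and the data here are random stand-ins for the selected comment vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-ins: X holds the comment vectors produced by the RBM
# feature-selection stage, y the labels (1 = positive, 0 = negative).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1706, 476)).astype(float)
y = rng.integers(0, 2, size=1706)

clf = SVC(kernel="rbf", C=0.0001)   # RBF kernel, penalty factor C as above
clf.fit(X[:1280], y[:1280])         # ~3:1 train/test split as in step 10
pred = clf.predict(X[1280:])        # emotion category of each test comment
```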
Step 12: the emotion analysis method combining the RBM and the SVM is compared with methods using the SVM or the RBM alone, and precision, recall and F-measure are computed from the outputs against the manually annotated results. The experimental results are shown in the following table:
TABLE 2: Recognition results of the different methods
As Table 2 shows, combining the RBM with the SVM greatly improves emotion analysis accuracy. Compared with using the SVM alone, the combination improves both precision and recall. Compared with using the RBM alone, the combination's precision is much higher; although recall drops slightly, the F-measure is well above that of the RBM alone, so the combined RBM-SVM emotion analysis method has the best overall performance.

Claims (3)

1. A Chinese text sentiment analysis method based on a computer information processing technology is characterized by comprising the following steps:
1) inputting a Chinese product comment text set; performing Chinese word segmentation with the precise (accurate) mode of the Chinese word segmentation tool and determining the part of speech of each word; filtering by part of speech, keeping only nouns, adjectives, adverbs and conjunctions, to obtain the segmented corpus; computing, over the segmented text, the weight of each word in each comment; and generating a comment vector;
2) performing feature selection on the comment vectors with a restricted Boltzmann machine; the restricted Boltzmann machine consists of a visible layer V and a hidden layer H; the node neurons of the visible layer are binary random variables; the nodes of the visible layer are mutually independent, and the nodes of the hidden layer are mutually independent; the connection weight between visible-layer and hidden-layer nodes is W; the comment vectors generated in step 1) are input to the visible unit layer;
3) training the restricted Boltzmann machine by the contrastive divergence method; the parameters of the restricted Boltzmann machine are initialized, and after 100 iterations the conditional probability $p(h_j = 1 \mid v)$ that hidden-layer node $h_j$ equals 1 given the visible layer $v$, and the conditional probability $p(v_i = 1 \mid h)$ that visible-layer node $v_i$ equals 1 given the hidden layer $h$, are obtained, where $p(h_j = 1 \mid v) = \sigma(b_j + \sum_i v_i w_{ij})$ and $p(v_i = 1 \mid h) = \sigma(a_i + \sum_j h_j w_{ij})$, with $\sigma(x) = 1/(1 + \exp(-x))$; the offsets $a_i$ and $b_j$ all have initial value 0; $w_{ij}$ is the connection weight between visible-layer node $i$ and hidden-layer node $j$ and obeys a normal distribution with mean 0 and standard deviation 1;
4) reconstructing visible-layer node $v_i$ according to the conditional probability $p(v_i = 1 \mid h)$; if the reconstructed value of $v_i$ is less than or equal to 0.01, the visible node $v_i$ is considered unable to represent higher-level structure and the word corresponding to $v_i$ is discarded; if the reconstructed value of $v_i$ is greater than 0.01, the difference between the reconstructed value and the input value of $v_i$ is taken; the maximum number of emotional features per comment is set to 30 and the differences are sorted in ascending order: the first 30 words $v_i$ have smaller differences than the following words, indicating that $v_i$ hardly changed after reconstruction, so the Chinese words corresponding to them are taken as feature words, while the later words have larger differences than the first 30, indicating that $v_i$ changed greatly after reconstruction, so the corresponding words are treated as non-feature words; if fewer than 30 visible nodes of a comment have reconstructed values greater than 0.01, all the words corresponding to those visible nodes are kept as feature words; the retained feature words together form the vector space of each comment;
5) inputting the comment vectors newly generated after step 4) into a support vector machine for classification to obtain the emotion category of each comment vector; the emotion category is positive or negative.
2. The Chinese text sentiment analysis method based on computer information processing technology according to claim 1, wherein in step 1) the comment vector is generated as follows: using the bag-of-words model from natural language processing, if a word appears in a comment, the weight of that word in the comment vector is set to 1, otherwise it is set to 0; the length of the comment vector is the size of the dictionary of extracted words.
3. The Chinese text sentiment analysis method based on computer information processing technology according to claim 2, wherein the visible layer has 476 nodes and the hidden layer has 50 nodes.
CN201510452024.5A 2015-07-29 2015-07-29 Chinese text sentiment analysis method based on computer information processing technology Expired - Fee Related CN104965822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510452024.5A CN104965822B (en) 2015-07-29 2015-07-29 Chinese text sentiment analysis method based on computer information processing technology


Publications (2)

Publication Number Publication Date
CN104965822A true CN104965822A (en) 2015-10-07
CN104965822B CN104965822B (en) 2017-08-25

Family

ID=54219858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510452024.5A Expired - Fee Related CN104965822B (en) Chinese text sentiment analysis method based on computer information processing technology

Country Status (1)

Country Link
CN (1) CN104965822B (en)



Patent Citations (2)

Publication number Priority date Publication date Assignee Title
WO2009017464A1 (en) * 2007-07-31 2009-02-05 Agency For Science, Technology And Research Relation extraction system
CN104331706A (en) * 2014-10-29 2015-02-04 西安电子科技大学 Polarization SAR image classification based on RBM and SVM

Non-Patent Citations (3)

Title
RONAN COLLOBERT et al.: "A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning", Proceedings of the 25th International Conference on Machine Learning
REN Yuanyuan et al.: "Product feature extraction and sentiment orientation determination of Chinese online reviews" (中文网络评论的产品特征提取及情感倾向判定), Computer Systems & Applications (计算机系统应用)
ZHANG Chunxia et al.: "Restricted Boltzmann Machine" (受限波尔兹曼机), Chinese Journal of Engineering Mathematics (工程数学学报)

Cited By (30)

Publication number Priority date Publication date Assignee Title
CN105589941A (en) * 2015-12-15 2016-05-18 北京百分点信息科技有限公司 Emotional information detection method and apparatus for web text
CN106933847B (en) * 2015-12-30 2019-12-27 北京神州泰岳软件股份有限公司 Method and device for establishing data classification model
CN106933847A (en) * 2015-12-30 2017-07-07 北京神州泰岳软件股份有限公司 A kind of method and apparatus for setting up data classification model
CN105930368A (en) * 2016-04-13 2016-09-07 深圳大学 Emotion classification method and system
CN105930368B (en) * 2016-04-13 2019-07-30 深圳大学 A kind of sensibility classification method and system
CN106126502A (en) * 2016-07-07 2016-11-16 四川长虹电器股份有限公司 A kind of emotional semantic classification system and method based on support vector machine
CN106126502B (en) * 2016-07-07 2018-10-30 四川长虹电器股份有限公司 A kind of emotional semantic classification system and method based on support vector machines
WO2018028077A1 (en) * 2016-08-11 2018-02-15 中兴通讯股份有限公司 Deep learning based method and device for chinese semantics analysis
CN106446117A (en) * 2016-09-18 2017-02-22 西安电子科技大学 Text analysis method based on poisson-gamma belief network
CN106502394A (en) * 2016-10-18 2017-03-15 哈尔滨工业大学深圳研究生院 Term vector computational methods and device based on EEG signals
CN106502394B (en) * 2016-10-18 2019-06-25 哈尔滨工业大学深圳研究生院 Term vector calculation method and device based on EEG signals
CN108021549B (en) * 2016-11-04 2019-08-13 华为技术有限公司 Sequence conversion method and device
WO2018082569A1 (en) * 2016-11-04 2018-05-11 华为技术有限公司 Sequence conversion method and device
CN108021549A (en) * 2016-11-04 2018-05-11 华为技术有限公司 Sequence conversion method and device
US11132516B2 (en) 2016-11-04 2021-09-28 Huawei Technologies Co., Ltd. Sequence translation probability adjustment
CN106547740A (en) * 2016-11-24 2017-03-29 四川无声信息技术有限公司 Text message processing method and device
CN106844632A (en) * 2017-01-20 2017-06-13 清华大学 Based on the product review sensibility classification method and device that improve SVMs
CN106844632B (en) * 2017-01-20 2020-02-21 清华大学 Product comment emotion classification method and device based on improved support vector machine
CN106919673A (en) * 2017-02-21 2017-07-04 浙江工商大学 Text mood analysis system based on deep learning
CN106919673B (en) * 2017-02-21 2019-08-20 浙江工商大学 Text mood analysis system based on deep learning
US10924572B2 (en) 2017-04-13 2021-02-16 Tencent Technology (Shenzhen) Company Limited Information push method and apparatus, information sending method and apparatus, system, and storage medium
WO2018188567A1 (en) * 2017-04-13 2018-10-18 腾讯科技(深圳)有限公司 Server information pushing method and device, terminal information sending method and device, information pushing system and storage medium
CN107193801A (en) * 2017-05-21 2017-09-22 北京工业大学 A kind of short text characteristic optimization and sentiment analysis method based on depth belief network
CN107193801B (en) * 2017-05-21 2020-07-03 北京工业大学 Short text feature optimization and emotion analysis method based on deep belief network
CN107608956A (en) * 2017-09-05 2018-01-19 广东石油化工学院 A kind of reader's mood forecast of distribution algorithm based on CNN GRNN
CN107608956B (en) * 2017-09-05 2021-02-19 广东石油化工学院 Reader emotion distribution prediction algorithm based on CNN-GRNN
CN107918487A (en) * 2017-10-20 2018-04-17 南京邮电大学 A kind of method that Chinese emotion word is identified based on skin electrical signal
CN109783806A (en) * 2018-12-21 2019-05-21 众安信息技术服务有限公司 A kind of text matching technique using semantic analytic structure
CN109783806B (en) * 2018-12-21 2023-05-02 众安信息技术服务有限公司 Text matching method utilizing semantic parsing structure
CN109464147A (en) * 2019-01-09 2019-03-15 浙江强脑科技有限公司 Breath signal detection method, device and computer readable storage medium

Also Published As

Publication number Publication date
CN104965822B (en) 2017-08-25

Similar Documents

Publication Publication Date Title
CN104965822B (en) Chinese text sentiment analysis method based on computer information processing technology
Socher et al. Semi-supervised recursive autoencoders for predicting sentiment distributions
CN104217226B (en) Conversation activity recognition methods based on deep neural network Yu condition random field
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
Chang et al. Research on detection methods based on Doc2vec abnormal comments
US20130179169A1 (en) Chinese text readability assessing system and method
CN110347836B (en) Method for classifying sentiments of Chinese-Yue-bilingual news by blending into viewpoint sentence characteristics
CN112001187A (en) Emotion classification system based on Chinese syntax and graph convolution neural network
CN112364638B (en) Personality identification method based on social text
CN112256866B (en) Text fine-grained emotion analysis algorithm based on deep learning
CN109101490B (en) Factual implicit emotion recognition method and system based on fusion feature representation
CN110717330A (en) Word-sentence level short text classification method based on deep learning
Azim et al. Text to emotion extraction using supervised machine learning techniques
Sboev et al. Deep learning network models to categorize texts according to author's gender and to identify text sentiment
Alroobaea Sentiment analysis on amazon product reviews using the recurrent neural network (rnn)
Nassiri et al. Arabic L2 readability assessment: Dimensionality reduction study
CN111813939A (en) Text classification method based on representation enhancement and fusion
CN115146031B (en) Short text position detection method based on deep learning and auxiliary features
CN107491490B (en) Text emotion classification method based on emotion center
CN113190681B (en) Fine granularity text classification method based on capsule network mask memory attention
CN115906824A (en) Text fine-grained emotion analysis method, system, medium and computing equipment
Rustamov et al. An application of hidden Markov models in subjectivity analysis
Imad et al. Automated Arabic News Classification using the Convolutional Neural Network.
CN115269833A (en) Event information extraction method and system based on deep semantics and multitask learning
Ghelani A review on student automatic grading system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170825

Termination date: 20210729