CN106569996A

CN106569996A - Chinese-microblog-oriented emotional tendency analysis method

Info

Publication number: CN106569996A
Application number: CN201610898432.8A
Authority: CN
Inventors: 郝志峰; 梁礼欣; 蔡瑞初; 温雯
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2016-03-30
Filing date: 2016-10-14
Publication date: 2017-04-19
Anticipated expiration: 2036-10-14
Also published as: CN106569996B

Abstract

The invention discloses a Chinese-microblog-oriented emotional tendency analysis method. The method comprises the following modules: an undersampling module, a microblog text preprocessing module, a Word2vec-based microblog expansion module, a feature extraction module, an emotion analysis model training module, and an emotional tendency judgment module. Finally, the trained AWCRF model is used to judge the emotional tendency of a microblog to be predicted. The method effectively addresses emotional tendency classification when the emotional tendency distribution of a Chinese microblog data set is unbalanced. It is simple to implement, achieves a high recognition rate, and has strong practical value.

Description

A sentiment orientation analysis method for Chinese microblogs

Technical field

The invention belongs to the technical field of network information processing, and in particular relates to a sentiment orientation analysis method for Chinese microblogs.

Background art

Microblogging is a new kind of social platform that is popular with many users. More and more people publish their opinions through microblogs, so thoroughly analyzing and mining the sentiment in users' microblogs is of great significance. The purpose of sentiment analysis is to mine users' opinions from microblog text and identify their sentiment orientation. For example, an enterprise can use microblogs to obtain users' evaluations of its products and services. As with traditional sentiment analysis, sentiment analysis methods for microblogs fall into two classes. One class is based on sentiment dictionaries and rules: these methods identify sentiment orientation from the number of positive and negative sentiment words in a sentence. The other class is based on machine learning: these methods train a classifier on suitably selected features.

However, the above methods all ignore the effect that an unbalanced sentiment orientation distribution in a Chinese microblog data set has on sentiment classification. That is, when the number of negative-sentiment sentences in a data set differs greatly from the number of positive-sentiment sentences, the discrimination accuracy of the classifier suffers. In real life, the topics or events discussed in microblogs often carry a strong sentiment orientation themselves, which makes the sentiment orientation distribution of many topics unbalanced. For example, topics such as "#edible oil price rise#" and "#leather-shoe jelly#" are clearly negative, whereas the topic "#Tu Youyou wins prize#" is clearly positive. This imbalance is a key reason why many machine learning algorithms perform poorly, especially in recognizing the minority sentiment class. In addition, compared with traditional text, microblogs are generally very short, which makes it difficult for traditional methods to extract information that helps sentiment classification, and there is currently no sentiment dictionary large enough to cover all sentiment words.

Summary of the invention

To solve the above problems, the present invention proposes a sentiment orientation analysis method for Chinese microblogs, whose main steps are as follows:

(1) Undersampling module. An Affinity Propagation algorithm is used to reduce the number of majority-class samples in the training set and thereby balance it, reducing the effect of the unbalanced sentiment orientation distribution of the data set on classification performance.

(2) Microblog text preprocessing module. The microblog text is cleaned, and word segmentation, part-of-speech tagging, stop-word processing, and similar operations are performed.

(3) Word2vec-based microblog expansion module. Word2vec is used to find the top k similar words of each word in a microblog, and the microblog is expanded with them.

(4) Feature extraction module. The relevant dictionaries are loaded, and features are extracted from the microblogs preprocessed above.

(5) Sentiment analysis model training module. The AWCRF model is trained on the training set that has been balanced and expanded above.

(6) Sentiment orientation discrimination module. The trained AWCRF model is used to determine the sentiment orientation of a microblog to be predicted.

Brief description of the drawings

Fig. 1 is a flowchart of the analysis method of the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to the accompanying drawing. The invention addresses the sentiment orientation classification problem for Chinese microblog data sets whose sentiment orientation distribution is unbalanced. Fig. 1 shows the overall algorithm flow of the invention.

Each step is described in detail below:

1. Undersampling module

The invention uses the Affinity Propagation algorithm to reduce the number of majority-class samples in the training set and thereby balance the training set.

The undersampling technique of the invention consists of the following steps:

(1) Given a training set t₁, divide it into a majority class maj₁ and a minority class min₁;

(2) Cluster the majority class maj₁ into several clusters with the Affinity Propagation clustering algorithm, denoted C = {c₁, c₂, ..., cₙ};

(3) To build a balanced data set, randomly select samples from each cluster of C in proportion to obtain maj₂, such that the sample size of maj₂ is close to that of min₁;

(4) Combine maj₂ and min₁ to obtain a balanced training set t₂;

(5) Use the balanced training set t₂ in place of t₁ as the final training set.
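Steps (3) to (5) above can be sketched in a few lines of Python. This is a minimal illustration and not the patented implementation: the function names, the rule of keeping at least one sample per cluster, and the fixed random seed are all assumptions. The cluster labels for maj₁ are taken as given; in practice they could come from any Affinity Propagation implementation, for example `sklearn.cluster.AffinityPropagation().fit_predict(X)` applied to a vectorized form of the majority-class microblogs.

```python
import random
from collections import defaultdict


def undersample_majority(majority, cluster_labels, target_size, seed=0):
    """Pick roughly `target_size` majority samples, proportionally per cluster."""
    rng = random.Random(seed)

    # Group majority-class samples by the cluster they were assigned to.
    clusters = defaultdict(list)
    for sample, label in zip(majority, cluster_labels):
        clusters[label].append(sample)

    ratio = target_size / len(majority)
    selected = []
    for label in sorted(clusters):
        members = clusters[label]
        # Assumption: keep at least one sample so that no cluster disappears.
        quota = min(len(members), max(1, round(len(members) * ratio)))
        selected.extend(rng.sample(members, quota))
    return selected


def balance_training_set(t1, cluster_labels, majority_label, seed=0):
    """Build t2 from t1, a list of (text, label) pairs.

    `cluster_labels` holds one cluster id per majority-class sample,
    in the order those samples appear in t1.
    """
    maj1 = [s for s in t1 if s[1] == majority_label]
    min1 = [s for s in t1 if s[1] != majority_label]
    maj2 = undersample_majority(maj1, cluster_labels, len(min1), seed)
    return maj2 + min1


# Toy data: 8 negative (majority) and 2 positive (minority) microblogs.
t1 = [(f"neg{i}", "neg") for i in range(8)] + [("pos0", "pos"), ("pos1", "pos")]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # hypothetical clustering of the 8 negatives
t2 = balance_training_set(t1, labels, "neg")
```

Sampling inside each cluster, rather than from the majority class as a whole, is what lets the reduced set maj₂ still reflect every group of similar microblogs found by the clustering step.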

2. Microblog text preprocessing module

This module cleans the microblog text and performs word segmentation, part-of-speech tagging, stop-word processing, and similar operations.
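A minimal sketch of this preprocessing stage is given below. The source does not specify the cleaning rules or the segmenter, so the regular expressions, the function names, and the `segment` callback are assumptions for illustration. A real system would plug in a Chinese word segmenter with part-of-speech tagging; the toy segmenter here only splits on whitespace.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@[\w\-]+")
SPACE_RE = re.compile(r"\s+")


def clean_microblog(text):
    """Remove URLs and @mentions, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def preprocess(text, segment, stop_words):
    """Clean the text, segment it, and drop stop words.

    `segment` is any callable that returns (word, pos_tag) pairs.
    """
    cleaned = clean_microblog(text)
    return [(w, pos) for w, pos in segment(cleaned) if w not in stop_words]


def toy_segment(text):
    # Placeholder segmenter: whitespace split, every tag set to "x" (unknown).
    return [(w, "x") for w in text.split()]


tokens = preprocess("@user 这 电影 真 好看 http://t.cn/abc", toy_segment, {"这"})
print(tokens)
```

Passing the segmenter in as a parameter keeps the cleaning and stop-word logic independent of whichever segmentation tool is used.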

3. Word2vec-based microblog expansion module

The invention expands a microblog by using Word2vec to find the top k similar words of each word in it. This involves two steps: training word vectors and expanding the microblog.

(1) Training word vectors. A large microblog corpus was collected through the Sina Weibo API. Useless symbols and URLs were filtered out, leaving 5 GB of microblog data as the training set. Word vectors are then trained with the Skip-gram model in Word2vec, and these vectors are used to find the similar words of each word in a microblog.

(2) Expanding the microblog. First, given a microblog sentence tᵢ, word segmentation yields its word sequence {w₁, w₂, ..., wₙ}. Then the word vectors trained above are used to find the top k similar words of each word in tᵢ, thereby expanding the microblog. The expanded microblog can be expressed as {w₁, w₂, ..., wₙ, w₁₁, w₁₂, ..., w₁ₖ, w₂₁, w₂₂, ..., w₂ₖ, ..., wₙ₁, wₙ₂, ..., wₙₖ}, where {w₁₁, w₁₂, ..., w₁ₖ} are the top k similar words of w₁. Emoticons and punctuation marks are retained in the microblog as they are, so the expanded microblog contains more information than the original.
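The expansion step can be illustrated with a small self-contained sketch. It assumes the word vectors are already available as a plain dictionary and uses cosine similarity to rank neighbors; the toy two-dimensional vectors, the function names, and the tie-breaking rule are hypothetical. With a trained Word2vec model, the neighbor lookup would normally be delegated to the library (for example, gensim exposes `most_similar(word, topn=k)` on its keyed vectors).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(word, vectors, k):
    """Return the k vocabulary words closest to `word` by cosine similarity."""
    if word not in vectors:
        return []  # emoticons, punctuation, and unknown words are not expanded
    scored = [(cosine(vectors[word], vec), other)
              for other, vec in vectors.items() if other != word]
    # Highest similarity first; ties broken alphabetically (an assumption).
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


def expand_microblog(words, vectors, k):
    """Keep w1..wn, then append the top-k similar words of w1, w2, ..., wn."""
    expanded = list(words)
    for w in words:
        expanded.extend(top_k_similar(w, vectors, k))
    return expanded


# Hypothetical 2-d word vectors standing in for trained Skip-gram vectors.
toy_vectors = {
    "好看": [0.9, 0.1],
    "精彩": [0.8, 0.2],
    "难看": [-0.9, 0.1],
    "电影": [0.1, 0.9],
    "影片": [0.2, 0.8],
}
expanded = expand_microblog(["电影", "好看", "[哈哈]"], toy_vectors, k=1)
print(expanded)
```

The emoticon `[哈哈]` has no vector in this toy vocabulary, so it stays in the sequence unchanged and contributes no extra words, which mirrors the statement that emoticons and punctuation are retained as they are.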

4. Feature extraction module

(1) Load the relevant dictionaries, including a sentiment dictionary, an emoticon dictionary, a popular-word dictionary, and a negation-word dictionary, for use in feature extraction.

(2) Using the dictionaries loaded above, extract the predefined features from the preprocessed microblog text, vectorize the text, and convert it into a form that the sentiment analysis model training module can process.
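The source does not list the predefined features, so the following sketch only shows one plausible shape for dictionary-based feature extraction. Every feature name, the lexicon layout, and the rule that a negation word flips the polarity of the next token are assumptions made for illustration.

```python
def extract_features(tokens, lexicons):
    """Map each (word, pos_tag) token to a dict of dictionary-based features."""
    features = []
    negated = False
    for word, pos in tokens:
        polarity = lexicons["sentiment"].get(word, 0)  # +1, -1, or 0 if unknown
        feats = {
            "word": word,
            "pos": pos,
            "polarity": -polarity if negated else polarity,
            "is_emoticon": word in lexicons["emoticon"],
            "is_popular": word in lexicons["popular"],
            "is_negation": word in lexicons["negation"],
        }
        features.append(feats)
        # Assumption: a negation word affects only the token right after it.
        negated = feats["is_negation"]
    return features


# Hypothetical miniature lexicons.
lexicons = {
    "sentiment": {"好看": 1, "难看": -1},
    "emoticon": {"[哈哈]"},
    "popular": set(),
    "negation": {"不"},
}
feats = extract_features([("不", "d"), ("好看", "a"), ("[哈哈]", "x")], lexicons)
print(feats[1]["polarity"])
```

One feature dictionary per token is a common input format for CRF toolkits, which is why the sketch returns a list of dictionaries instead of a single dense vector.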

5. Sentiment analysis model training module

The invention applies a CRF model to the data processed by the undersampling and Word2vec expansion techniques described above, yielding the AWCRF model. The feature vectors that the feature extraction module extracts from the microblogs are then used as input, and the AWCRF model is trained with the L-BFGS algorithm. The model not only overcomes the effect of the unbalanced sentiment distribution in the training set, but also enriches the sentiment information of each microblog sentence, which alleviates the effect of insufficient sentiment dictionary coverage. In addition, because there are fewer training samples, the model trains quickly and efficiently, which gives it strong practical value.
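For reference, the quantity that L-BFGS optimizes when training a CRF can be written out as follows. The source does not state the CRF variant or the regularizer, so this assumes a standard linear-chain CRF with feature functions $f_k$, weights $\lambda_k$, and an L2 penalty with variance $\sigma^2$; the symbols are introduced here and do not appear in the original text.

```latex
\begin{align}
p(y \mid x; \lambda) &= \frac{1}{Z(x)}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
& Z(x) &= \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big) \\
\mathcal{L}(\lambda) &= -\sum_{i=1}^{N} \log p\big(y^{(i)} \mid x^{(i)}; \lambda\big)
  + \frac{1}{2\sigma^2} \sum_{k} \lambda_k^2 \\
\frac{\partial \mathcal{L}}{\partial \lambda_k} &=
  -\sum_{i=1}^{N} \Big( F_k\big(y^{(i)}, x^{(i)}\big)
  - \mathbb{E}_{y \sim p(\cdot \mid x^{(i)}; \lambda)} \big[ F_k\big(y, x^{(i)}\big) \big] \Big)
  + \frac{\lambda_k}{\sigma^2},
& F_k(y, x) &= \sum_{t=1}^{T} f_k(y_{t-1}, y_t, x, t)
\end{align}
```

L-BFGS needs only the objective value and this gradient at each iteration, and it approximates curvature from a short history of past gradients, which is why it is a common choice for CRF training.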

6. Sentiment orientation discrimination module

First, the data to be predicted undergo text preprocessing, Word2vec-based expansion, feature extraction, and related operations to obtain their feature vectors. Then these feature vectors are used as input to the trained AWCRF model, which determines the sentiment orientation of the microblog to be predicted.
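The prediction flow amounts to running a new microblog through the same stages used for training and then calling the trained model. The sketch below shows only that composition: the function name and the four callables are hypothetical stand-ins, and the toy lambdas are not the AWCRF model.

```python
def predict_sentiment(text, preprocess, expand, extract, model):
    """Run one microblog through the same stages that were used for training."""
    tokens = preprocess(text)      # clean, segment, tag, drop stop words
    expanded = expand(tokens)      # Word2vec-based expansion
    features = extract(expanded)   # dictionary-based feature extraction
    return model(features)         # trained classifier returns a label


# Toy stand-ins for each stage (assumptions, for illustration only).
positive_words = {"好看", "精彩"}
label = predict_sentiment(
    "电影 好看",
    preprocess=lambda text: text.split(),
    expand=lambda tokens: tokens + ["精彩"],
    extract=lambda tokens: {"positive_hits": sum(t in positive_words for t in tokens)},
    model=lambda feats: "positive" if feats["positive_hits"] > 0 else "negative",
)
print(label)
```

Reusing the exact training-time stages at prediction time matters because the model's features are only meaningful if they are computed the same way in both phases.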

Claims

1. A sentiment orientation analysis method for Chinese microblogs, characterized by comprising the following modules:

(1) an undersampling module, which uses an Affinity Propagation algorithm to reduce the number of majority-class samples in the training set and thereby balance the training set, so as to reduce the effect of the unbalanced sentiment orientation distribution of the data set on classification performance;

(2) a microblog text preprocessing module, which cleans the microblog text and performs word segmentation, part-of-speech tagging, stop-word processing, and similar operations;

(3) a Word2vec-based microblog expansion module, which uses Word2vec to find the top k similar words of each word in a microblog and thereby expands the microblog;

(4) a feature extraction module, which loads the relevant dictionaries and extracts features from the microblogs preprocessed above;

(5) a sentiment analysis model training module, which trains the AWCRF model on the training set that has been balanced and expanded above;

(6) a sentiment orientation discrimination module, which uses the trained AWCRF model to determine the sentiment orientation of a microblog to be predicted.

2. The method according to claim 1, characterized in that step (1) further comprises the following steps:

(2-1) given a training set t₁, dividing it into a majority class maj₁ and a minority class min₁;

(2-2) clustering the majority class maj₁ into several clusters with the Affinity Propagation clustering algorithm, denoted C = {c₁, c₂, ..., cₙ};

(2-3) to build a balanced data set, randomly selecting samples from each cluster of C in proportion to obtain maj₂, such that the sample size of maj₂ is close to that of min₁;

(2-4) combining maj₂ and min₁ to obtain a balanced training set t₂;

(2-5) using the balanced training set t₂ in place of t₁ as the final training set.

3. The method according to claim 1, characterized in that step (3) further comprises the following steps:

(3-1) training word vectors: a large microblog corpus is collected through the Sina Weibo API, useless symbols and URLs are filtered out, and the remaining 5 GB of microblog data are used as the training set; word vectors are then trained with the Skip-gram model in Word2vec, and these vectors are used to find the similar words of each word in a microblog;

(3-2) expanding the microblog: first, given a microblog sentence tᵢ, word segmentation yields its word sequence {w₁, w₂, ..., wₙ}; then the word vectors trained above are used to find the top k similar words of each word in tᵢ, thereby expanding the microblog. The expanded microblog can be expressed as {w₁, w₂, ..., wₙ, w₁₁, w₁₂, ..., w₁ₖ, w₂₁, w₂₂, ..., w₂ₖ, ..., wₙ₁, wₙ₂, ..., wₙₖ}, where {w₁₁, w₁₂, ..., w₁ₖ} are the top k similar words of w₁. Emoticons and punctuation marks are retained in the microblog as they are, so the expanded microblog contains more information than the original.

4. The method according to claim 1, characterized in that step (5) further comprises the following steps:

(4-1) applying a CRF model to the data processed by the undersampling technique and the Word2vec technique, thereby obtaining the AWCRF model;

(4-2) using the feature vectors that the feature extraction module extracts from the microblogs as input, and training the AWCRF model with the L-BFGS algorithm.

5. The method according to claim 1, characterized in that step (6) further comprises the following steps:

(5-1) performing text preprocessing, Word2vec-based expansion, feature extraction, and related operations on the data to be predicted, thereby obtaining the feature vectors of the data to be predicted;

(5-2) using the feature vectors of the data to be predicted as input to the AWCRF model, and using the trained AWCRF model to determine the sentiment orientation of the microblog to be predicted.