CN105005553A

CN105005553A - Emotional thesaurus based short text emotional tendency analysis method

Info

Publication number: CN105005553A
Application number: CN201510342473.4A
Authority: CN
Inventors: 张海仙; 章毅
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2015-06-19
Filing date: 2015-06-19
Publication date: 2015-10-28
Anticipated expiration: 2035-06-19
Also published as: CN105005553B

Abstract

A short-text sentiment orientation analysis method based on a sentiment dictionary is disclosed. First, a basic sentiment dictionary is constructed by word-frequency statistics, and it is then expanded by judging the sentiment orientation of each candidate word from its statistical correlation with the words in the basic dictionary. Then, on the basis of the sentiment dictionary, each evaluation sentence S is taken as a unit and each sentiment word WS in the sentence is taken as a separator. A sentiment weight is calculated for the phrase (WSi-1, WSi) between every two separators, where the phrase (WSi-1, WSi) contains the word WSi but not WSi-1. The weights of all phrases are combined by weighted summation into the overall sentiment orientation value weight(S) of S, which determines the polarity of S: if weight(S) is greater than 0, the comment is positive; otherwise S is considered negative. Polarity classification of evaluation sentences is thereby realized.

Description

Short-text sentiment orientation analysis method based on a sentiment dictionary

Technical field

The present invention relates to the technical field of sentiment orientation classification of short texts, and provides a short-text sentiment orientation analysis method based on a sentiment dictionary.

Background technology

In the more than ten years since the concept of the Internet community was proposed, researchers around the world have paid close attention to techniques for detecting and studying Internet communities, and have made substantial progress.

Researchers first analyzed the topological structure of the Internet in depth. Contrary to what one might imagine, the interconnections of the Internet and of many other networks are not completely random, so a random graph cannot fully describe the structure of an Internet community. In particular, as more and more Internet data were analyzed, the random-graph view of network structure was seriously challenged. The actual structure of the Internet is far more complex than imagined, and the relationships among links, websites, pages, users and administrators are diverse. The Internet contains many regions whose internal connections are tight while their connections to the outside are weak. These regions are Internet communities, and their structural features cannot be described clearly by a random graph.

With the proposal of the Internet community concept and the deepening of related research, developers have designed many types of Internet community detection algorithms for structure detection, and have continually improved and optimized these algorithms according to experimental results.

Compared with traditional methods, most current algorithms take full account of the parallelism, real-time behavior and scalability of network operation in order to overcome physical limitations. For example, Sadi et al. proposed a community detection method in which parallel ants look for circles. Without affecting the quality of the result, this compresses the Internet structure graph down to a stable size to reduce the running cost of the algorithm, making it possible to process large-scale networks. Leung et al. proposed an improved label propagation algorithm that adds heuristics to perform real-time community detection on large-scale networks. Leskovec et al. compared a number of existing Internet community detection methods and found that community detection on large-scale networks cannot be solved by a single simple algorithm: it is a very complex problem that must take into account network structure, data distribution, the quality of network crawling and other factors. As detection techniques mature, the usability of Internet community detection keeps improving. Compared with traditional brute-force algorithms, community detection has increasingly become an art. As an emerging direction, community detection will have a great influence on mining the structure of the Internet.

Summary of the invention

The object of the present invention is to provide a short-text sentiment orientation analysis method based on a sentiment dictionary.

The short-text sentiment orientation analysis method based on a sentiment dictionary is characterized by comprising the following steps:

Step 1: construct the sentiment dictionary. A basic sentiment dictionary is formed by word-frequency statistics; the SO-PMI method then computes the statistical correlation between each candidate word and the words in the basic sentiment dictionary to determine the candidate's sentiment orientation, thereby expanding the basic dictionary.

Step 2: construct the sentiment analysis model. On the basis of the sentiment dictionary, each evaluation sentence S is taken as a unit and each sentiment word WS in the sentence as a separator. A sentiment weight is computed for the phrase (WSi-1, WSi) between two separators, and the weights of all phrases are then combined by weighted summation into the overall sentiment orientation value weight(S) of S. The polarity of S is judged as follows: if weight(S) is greater than 0, the comment is positive; otherwise S is considered negative. Polarity classification of evaluation sentences is thereby realized. The phrase (WSi-1, WSi) includes the word WSi but not the word WSi-1.

In the above technical scheme, the SO-PMI method comprises the following steps:

Step 2-1: segment the text with the ICTCLAS system and obtain the part of speech of each word.

Step 2-2: compute the SO-PMI value only for the two kinds of candidate word defined by word.propertyal ∈ {a, ad, an, ag, al} and word.propertyal ∈ {vn, vd, vi, vg, vl}; candidate words of all other parts of speech are treated directly as neutral words.

The SO-PMI value of these two kinds of candidate word is computed as follows:

Compute the PMI values between the candidate word and the positive basic sentiment words, then the PMI values between the candidate word and the negative basic sentiment words, and finally subtract the latter from the former to obtain the SO-PMI value of the candidate word. SO-PMI is calculated as follows:

SO\text{-}PMI(word) = \sum_{posWord \in posWords} PMI(word, posWord) - \sum_{negWord \in negWords} PMI(word, negWord)

(formula 1)

Here posWords is the positive basic sentiment dictionary, negWords is the negative basic sentiment dictionary, and word is the candidate word.

The relation between the SO-PMI value and sentiment orientation is given by the following formula:

(formula 2)

Step 2-4: add to posWords the synonyms of the positive basic sentiment words, together with the words that satisfy word.propertyal ∈ {a, ad, an, ag, al} or word.propertyal ∈ {vn, vd, vi, vg, vl} and are judged by formula 2 to have a positive orientation.

Step 2-5: add to negWords the synonyms of the negative basic sentiment words, together with the words that satisfy word.propertyal ∈ {a, ad, an, ag, al} or word.propertyal ∈ {vn, vd, vi, vg, vl} and are judged by formula 2 to have a negative orientation, thereby obtaining a comprehensive comparison sample of sentiment words.

Because the present invention adopts the above technical scheme, it has the following beneficial effects:

Our experimental results show that, on a data set of 100,000 product reviews, the accuracy of a purely machine-learning-based sentiment orientation analysis method is 67.9% and that of a purely sentiment-dictionary-based method is 83.27%, whereas the combined method proposed here reaches 85.9%. It performs much better than the machine-learning-based method and also better than the purely dictionary-based method.

Embodiment

The invention provides a short-text sentiment orientation analysis method based on a sentiment dictionary.

Construction method of the sentiment dictionary

A sentiment dictionary is a set of words that express positive or negative human emotions. To make it easy to compute a quantified sentiment orientation value for a short product review, the dictionary here also stores a sentiment orientation value for each word, where +1 represents the strongest positive emotion and -1 represents the strongest negative emotion.

The sentiment dictionary construction method we designed comprises two parts: forming a basic sentiment dictionary by word-frequency statistics; and, based on an improved SO-PMI method, determining the sentiment orientation of each candidate word from its statistical correlation with the words in the basic sentiment dictionary, thereby expanding the basic dictionary.

Construction of the basic sentiment dictionary

The basic sentiment dictionary is the foundation of, and the key to, short-text sentiment analysis based on natural language processing. This work computes the sentiment orientation value of a short product review according to whether the words in the corpus appear in the sentiment dictionary and the sentiment orientation values of those that do. Which words the dictionary includes, whether they are representative of the product review domain, and whether their sentiment orientation values are accurate all affect the accuracy of the classification result. The first step in addressing these issues is to build an accurate basic sentiment dictionary.

A common way to form a basic sentiment dictionary is to select a series of sentiment words from HowNet, submit them one by one to the Google search engine, rank them by the number of hits Google returns, and take the several words with the most hits as the basic sentiment words. Because the corpus in this work comes only from product reviews on e-commerce websites, the vocabulary of HowNet is too broad for this work. Moreover, the hit count returned by a search engine does not reflect whether a word is representative of a product review corpus. This method is therefore unsuitable for this work.

This work uses a method based on word-frequency statistics to select basic sentiment words semi-automatically. Because the sentiment-bearing words in short product reviews are mostly adjectives and verbs, plus a small number of nouns, after preprocessing it suffices to run automatic word-frequency statistics over the adjectives, verbs and nouns of a sufficiently large set of short product reviews. From the high-frequency words, the 20 most frequent positive sentiment words and the 20 most frequent negative sentiment words are then selected by hand, and together they form the basic sentiment dictionary of this work.
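The automatic counting half of this semi-automatic procedure might be sketched as follows. This is only an illustration: the function name `top_candidates`, the prefix-based part-of-speech filter and the tiny sample are assumptions, not part of the disclosed method, and the manual selection of the 20 positive and 20 negative words still has to follow.

```python
from collections import Counter

# Assumed POS prefixes for adjectives, verbs and nouns in an
# ICTCLAS-style tag set; adjust to the tagger actually used.
CANDIDATE_POS_PREFIXES = ("a", "v", "n")


def top_candidates(tagged_reviews, k):
    """Return the k most frequent (word, count) pairs among
    adjectives, verbs and nouns.

    tagged_reviews: list of reviews, each a list of (word, pos) pairs.
    """
    counts = Counter(
        word
        for review in tagged_reviews
        for word, pos in review
        if pos.startswith(CANDIDATE_POS_PREFIXES)
    )
    return counts.most_common(k)


# Hypothetical, already segmented and tagged sample reviews.
sample = [
    [("quality", "n"), ("very", "d"), ("good", "a")],
    [("good", "a"), ("like", "v")],
    [("bad", "a"), ("quality", "n")],
]
print(top_candidates(sample, 2))
```

The ranked list is only the input to the manual step: a person still decides which high-frequency words are positive or negative sentiment words.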

The positive and negative sentiment words finally included in the basic dictionary by this method are listed in Table 1.

Table 1: basic sentiment dictionary

Because the basic sentiment words express a very strong sentiment orientation, we assign a sentiment orientation value of +1 to the positive basic sentiment words and -1 to the negative basic sentiment words.

Expansion of the sentiment dictionary

The basic sentiment dictionary is very small and cannot cover all the sentiment-bearing words that occur in the product review corpus. It therefore needs to be expanded into a relatively complete sentiment dictionary. We expand it in two ways: adding synonyms, and adding candidate words that carry a sentiment orientation.

Adding synonyms

In short product reviews, many words of praise or criticism are synonyms of one another, so adding synonyms helps us identify sentiment words more broadly. To this end, we use the Harbin Institute of Technology Chinese thesaurus [33] to expand the basic sentiment dictionary with synonyms. However, many synonyms in that thesaurus are highly literary words that are never used in a product review corpus, such as "of inferior quality" as a synonym of "bad". To improve the performance of the sentiment orientation computation, we still need to screen out the commonly used synonyms manually. After synonym expansion, the sentiment dictionary grows to 256 words. Because they are synonyms, we set the sentiment orientation value of every synonym of a positive word in the basic sentiment dictionary to +1, and that of every synonym of a negative word to -1.

Adding related sentiment words

Although building a completely exhaustive sentiment dictionary is very difficult, a dictionary with wider coverage can be built effectively by analyzing the correlation between each word in the corpus and the sentiment words in the dictionary, and adding the highly correlated words to the dictionary. This work uses a statistics-based method, pointwise mutual information (PMI), to compute the correlation between a candidate word and the sentiment words in the dictionary and thus decide whether the candidate should be treated as a sentiment word. If so, it is added to the sentiment dictionary.

Pointwise mutual information computes the correlation between words on the basis of mutual information theory. Its basic idea is to measure the probability that two words word_i and word_j co-occur in a product review sentence. The greater the co-occurrence probability, the higher the correlation between the two words, as shown below:

(formula 5-1)

Here p(word_i ∧ word_j) is the probability that word_i and word_j co-occur in the corpus, computed as in formula (6-1), where n is the total number of product reviews in the corpus and numSentence(word_i, word_j) is the number of reviews that contain both word_i and word_j. P(word_i) and P(word_j) are the proportions of reviews in the corpus that contain word_i and word_j respectively. They are computed as in formulas 6-2 and 6-3, where numSentence(word_i) is the number of reviews in the corpus that contain word_i. PMI(word_i, word_j) in formula (5-1) represents the amount of information obtained about one of the two variables when the other occurs, which fully reflects the statistical correlation between word_i and word_j: when PMI is greater than 0, the two words are correlated, and the larger the PMI value, the stronger the correlation; when PMI equals 0, the two words are statistically independent; when PMI is less than 0, the two words are mutually exclusive.

p(word_i \wedge word_j) = \frac{numSentence(word_i, word_j)}{n}

(formula 6-1)

P(word_i) = \frac{numSentence(word_i)}{n}

(formula 6-2)

P(word_j) = \frac{numSentence(word_j)}{n}

(formula 6-3)
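As an illustration of how these review-level probabilities combine, the following minimal sketch computes a PMI value from review counts. It assumes the standard definition, the logarithm of p(word_i ∧ word_j) / (P(word_i) · P(word_j)); the base-2 logarithm, the function name `pmi` and the choice to return `None` when the value is undefined are assumptions made for this example only.

```python
import math


def pmi(word_i, word_j, reviews):
    """Pointwise mutual information of two words.

    reviews: list of reviews, each given as a set of words.
    Returns None when any count is zero, because the logarithm
    is then undefined.
    """
    n = len(reviews)
    n_i = sum(1 for r in reviews if word_i in r)    # numSentence(word_i)
    n_j = sum(1 for r in reviews if word_j in r)    # numSentence(word_j)
    n_ij = sum(1 for r in reviews if word_i in r and word_j in r)
    if n == 0 or n_i == 0 or n_j == 0 or n_ij == 0:
        return None
    p_i, p_j, p_ij = n_i / n, n_j / n, n_ij / n
    # Base 2 is an assumption; the source does not state the base.
    return math.log2(p_ij / (p_i * p_j))
```

With four hypothetical reviews in which "good" and "like" always appear together in two of them, the ratio is 0.5 / (0.5 × 0.5) = 2, so the sketch gives a positive value, matching the reading that PMI greater than 0 indicates correlation.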

Applying the principle of PMI to sentiment polarity analysis yields the SO-PMI algorithm. SO-PMI uses the idea of PMI to compute the statistical correlation between a candidate word and each group of basic sentiment words, and judges the sentiment orientation of the word from these correlations taken together. The calculation proceeds as follows: first, compute the PMI values between the candidate word and the positive basic sentiment words; then, compute the PMI values between the candidate word and the negative basic sentiment words; finally, subtract the latter from the former to obtain the SO-PMI value of the candidate word. Suppose the positive basic sentiment dictionary is posWords and the negative basic sentiment dictionary is negWords; then for a candidate word word, SO-PMI is computed as in formula 6-4:

SO\text{-}PMI(word) = \sum_{posWord \in posWords} PMI(word, posWord) - \sum_{negWord \in negWords} PMI(word, negWord)

(formula 6-4)

The relation between the SO-PMI value and sentiment orientation is shown in formula 6-5:

(formula 6-5)
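The subtraction in formula 6-4 can be sketched as below. This is a hedged illustration rather than the patented implementation: the helper `pmi` uses a base-2 logarithm and treats a pair that never co-occurs as contributing 0, both of which are assumptions, and the names `so_pmi`, `pos_words` and `neg_words` are chosen for this example.

```python
import math


def pmi(word_i, word_j, reviews):
    """PMI over reviews given as sets of words (base-2 log assumed)."""
    n = len(reviews)
    n_i = sum(word_i in r for r in reviews)
    n_j = sum(word_j in r for r in reviews)
    n_ij = sum(word_i in r and word_j in r for r in reviews)
    if n_ij == 0:
        return 0.0  # assumption: no co-occurrence contributes nothing
    return math.log2((n_ij / n) / ((n_i / n) * (n_j / n)))


def so_pmi(word, pos_words, neg_words, reviews):
    """Sum of PMI with positive seeds minus sum of PMI with negative seeds."""
    pos = sum(pmi(word, p, reviews) for p in pos_words)
    neg = sum(pmi(word, q, reviews) for q in neg_words)
    return pos - neg


# Hypothetical toy corpus and seed words.
corpus = [{"great", "good"}, {"great", "good"}, {"bad"}, {"bad", "poor"}]
print(so_pmi("great", {"good"}, {"bad"}, corpus))
print(so_pmi("poor", {"good"}, {"bad"}, corpus))
```

In this toy corpus "great" only co-occurs with the positive seed and "poor" only with the negative seed, so the two values have opposite signs.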

When the SO-PMI method was applied to the product review corpus of this experiment, we found the following problems:

1) Many single-character verbs and proper nouns are neutral in meaning, but they may co-occur very often in the corpus with a particular sentiment word in the dictionary, which pushes their SO-PMI far from the neutral value. Take the verb "hits": its PMI value with the positive words in the dictionary is 18.97 and its PMI value with the negative words is 0, so its SO-PMI value is much larger than 0. The noun "thinkpad" behaves similarly. Such cases bring many words without any sentiment orientation into the sentiment dictionary, which adds pointless computational cost to the classification method and harms classification accuracy.

2) The SO-PMI of many neutral words is often not exactly 0: it may be close to 0, or it may deviate considerably from 0. Setting the threshold that separates positive from negative sentiment words at 0 is therefore unsuitable for this work.

3) Omission problem: because the corpus used in this work consists of product reviews in short-text form, the reviews usually contain few words and the number of basic sentiment words is also small. The probability that a candidate word co-occurs with a basic sentiment word is therefore low, so its SO-PMI value tends toward 0, even though from the viewpoint of sentiment analysis the candidate word is strongly related to the basic sentiment words. As a result, many words that should be included in the sentiment dictionary are missed, which produces a feature sparsity problem.

The SO-PMI algorithm therefore needs to be adapted to the characteristics of the corpus of this work in order to solve the above three problems. We propose the following three improvements.

1) For problem 1, after ICTCLAS word segmentation we obtain the part of speech of each word and stipulate that the SO-PMI value is computed only for the two kinds of candidate word defined by formula 6-7 and formula 6-8; candidate words of all other parts of speech are treated directly as neutral words.

word.propertyal ∈ {a, ad, an, ag, al} (formula 6-7)

word.propertyal ∈ {vn, vd, vi, vg, vl} (formula 6-8)

Because adjectives generally carry a sentiment orientation, the SO-PMI value is computed for all adjectives.

At the same time, we discard nouns and words of the remaining parts of speech, because words of these parts of speech seldom carry a sentiment orientation. Owing to the particular nature of a product review corpus, most nouns are product type names or brand names, such as "clothes" or "Mei Di". To prevent these neutral words from being brought into the sentiment dictionary by mistake, and to improve the efficiency of the candidate expansion algorithm, we do not compute the SO-PMI value of nouns either. However, we manually add to the sentiment dictionary a series of nouns that carry a strong sentiment orientation, together with their synonyms. Some of the manually added nominal sentiment words are shown in Table 2.

Table 2: manually selected nominal sentiment words

2) For problem 2, after observing a large amount of data, we readjust the relation between the SO-PMI value and sentiment orientation to formula (6-9).

(formula 6-9)

3) For problem 3, we further expand the posWords and negWords dictionaries used in formula 6-4. Specifically: (a) the synonyms of the positive basic sentiment words, together with the words that satisfy formula 6-7 or formula 6-8 and are judged positive by formula 6-9, are added to posWords; (b) the synonyms of the negative basic sentiment words, together with the words that satisfy formula 6-7 or formula 6-8 and are judged negative by formula 6-9, are added to negWords. This gives candidate words a more complete comparison sample of sentiment words and avoids omitting candidate words that carry a sentiment orientation.

Strictly, posWords is defined as follows:

1) if w is a positive word in the basic sentiment dictionary, then w ∈ posWords;

2) if w is a synonym of some positive word in the basic sentiment dictionary, then w ∈ posWords;

3) if w satisfies formula 6-7 or formula 6-8, and 1.36 < SO-PMI(w) < 23, then w ∈ posWords.

Similarly, negWords is defined as follows:

1) if w is a negative word in the basic sentiment dictionary, then w ∈ negWords;

2) if w is a synonym of some negative word in the basic sentiment dictionary, then w ∈ negWords;

3) if w satisfies formula 6-7 or formula 6-8, and -16 < SO-PMI(w) < -1, then w ∈ negWords.
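Rule 3 of both definitions can be illustrated with a small sketch that combines the part-of-speech filter of formulas 6-7 and 6-8 with the two SO-PMI ranges stated above. The function name `classify_candidate` is hypothetical, and treating every value outside the two ranges as neutral is an assumption, since formula 6-9 itself is not reproduced in the text.

```python
ADJ_TAGS = {"a", "ad", "an", "ag", "al"}     # formula 6-7
VERB_TAGS = {"vn", "vd", "vi", "vg", "vl"}   # formula 6-8


def classify_candidate(pos_tag, so_pmi_value):
    """Classify a candidate word as 'positive', 'negative' or 'neutral'."""
    # Words of any other part of speech are treated directly as neutral.
    if pos_tag not in ADJ_TAGS | VERB_TAGS:
        return "neutral"
    if 1.36 < so_pmi_value < 23:
        return "positive"
    if -16 < so_pmi_value < -1:
        return "negative"
    # Assumption: values outside both ranges are treated as neutral.
    return "neutral"
```

Under this sketch a word tagged as a plain verb with a value of 18.97, like the "hits" example above, is filtered out by its part of speech before the threshold is ever consulted.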

Following the improved SO-PMI algorithm, we take 100,000 word-segmented reviews as input, compute the sentiment orientation value of every candidate word that satisfies formula 6-7 or formula 6-8 and does not duplicate a basic sentiment word, select the candidates that satisfy formula 6-9, and add each such word to the sentiment dictionary together with its sentiment orientation value. SO-PMI inevitably misclassifies the polarity of some sentiment words, so manual denoising is required. After the expansion, the dictionary contains 2393 sentiment words, of which 1302 are positive and 1091 are negative. This completes the construction of the sentiment dictionary for this work.

Design of the emotion model

This section describes in detail the model we use for sentiment analysis of product reviews, namely the emotion model. Its main idea is: on the basis of the sentiment dictionary, each evaluation sentence S is taken as a unit and each sentiment word WS in the sentence as a separator; a sentiment weight is computed for the phrase (WSi-1, WSi) between two separators; and the weights of all phrases are then combined by weighted summation into the overall sentiment orientation value of S, thereby realizing polarity classification of evaluation sentences. By convention, the phrase (WSi-1, WSi) includes the word WSi but not the word WSi-1.

The model consists of 6 modules: analysis of sentiment words, analysis of negation words, analysis of adverbs, analysis of fixed collocations, analysis of adversatives, and analysis of rhetorical questions and exclamatory sentences.

In designing these 6 modules, we modified the traditional emotion model to varying degrees so that it suits sentiment orientation analysis of short product reviews on e-commerce websites. For example, in the analysis of sentiment words, we take into account the effect of the different parts of speech of a sentiment word on sentiment orientation, and introduce an Ad dictionary to give special treatment to words that have both adjective and adverb parts of speech. As another example, in the analysis of fixed collocations, we consider the effect of a fixed collocation on the sentiment of the sentence or of a sentiment word, divide the collocations into 4 kinds, and give each kind its own special treatment, so that the sentiment orientation classification is more accurate.

Analysis of sentiment words

Sentiment words are analyzed as follows. For each word word in the reviews to be analyzed, scan the sentiment dictionary to determine whether word is in it. If it is, word is treated as a sentiment word and its sentiment orientation value is read from the dictionary and returned; if it is not, word is treated as a neutral word and 0 is returned. This loop continues until every word of the whole review set has been examined. The process is implemented by the algorithm analyzeSentimentWord.

However, some words have both adjective and adverb parts of speech: in some cases they carry an emotional attitude, while in others they act only as adverbs. In the latter case, computing the sentiment orientation value of the word solely from whether it appears in the sentiment dictionary is inaccurate.

This work distinguishes the following two cases.

1) When a word has both adjective and adverb parts of speech, the part of speech that the ICTCLAS tool assigns differs according to the function of the word in the sentence, as in example 6.1. By combining the sentiment dictionary with the part of speech of the word, we can therefore judge whether the word is acting as a sentiment word.

Example 6.1:

Sentence 1: taste/n very/d general/a ./wj

Here "general" is a negative sentiment word, and its part of speech is a.

Sentence 2: general/ad not/d meeting/v goes offline/n ./wj

Here "general" is an adverb modifying "goes offline" that indicates the strength of the sentiment, and its part of speech is ad.

Our solution to this problem is to set up an Ad dictionary that contains the words having both adjective and adverb parts of speech, and to stipulate: if word belongs to the Ad dictionary and its part of speech contains the character "d", then the word is not treated as a sentiment word.

The Ad dictionary we determined is shown in Table 6.3.

Table 6.3: Ad dictionary

Good, many, really, especially, easily, strongly, completely, directly, substantially

2) For some sentiment words with both adjective and adverb character, the adverbial use has only recently come into use, as with "little" in the second sentence of example 6.2. In this case ICTCLAS cannot recognize the adverb part of speech of the word, and whether it is acting as an adverb can only be judged from the words around it.

Example 6.2:

Sentence 1: very/d little/a /ude thing/n ./wj

Here "little" is a negative sentiment word, and its part of speech is a.

Sentence 2: little/a expensive/a ./wj

Here "little" is an adverb modifying "expensive" that indicates the strength of the sentiment; its part of speech should be d, but ICTCLAS still identifies it as a.

Example 6.2 shows that when "little" is paired with an adjective, it functions as an adverb. The word "greatly" follows the same rule. So, from the part of speech of the following word, we can judge whether such a word is acting as a sentiment word. The rule is: if the word that follows a word with both adjective and adverb parts of speech is an adjective (a), the word is not regarded as a sentiment word.

Suppose the part of speech of the word word is property and the part of speech of the word that follows it is nextProperty. We use weight(word) to denote the sentiment orientation value (also called the weight) of word, and isSentiment(word) to denote whether word is a sentiment word. The pseudocode of the sentiment word analysis algorithm is as follows:

Algorithm: analyzeSentimentWord (sentiment word analysis)

Input: word, property, nextProperty

Output: weight(word), isSentiment(word)

if (isInSentimentLexicon(word)) then
    if (isInAdLexicon(word) && property.contains("d")) then
        weight(word) := 0;
    else if ((word == "greatly" || word == "little") && nextProperty.contains("a")) then
        weight(word) := 0;
    else
        weight(word) := getWeightFromSentimentLexicon(word);
    end if
else
    weight(word) := 0;
end if
if (weight(word) == 0) then
    isSentiment(word) := false;
else
    isSentiment(word) := true;
end if

In the sentiment word analysis algorithm, the functions isInSentimentLexicon and isInAdLexicon determine whether a word is in the sentiment dictionary and in the Ad dictionary respectively, and the function getWeightFromSentimentLexicon obtains the sentiment orientation value of a word from the sentiment dictionary.

By computing the sentiment orientation value of each word, we obtain the actual sentiment words (those whose weight is not 0) and filter out the dictionary words that play no emotional role in a particular sentence (those whose weight is 0).
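The sentiment word analysis above might be rendered in Python roughly as follows. The lexicon contents here are tiny hypothetical stand-ins for the real sentiment dictionary and Ad dictionary, and the function returns the pair (weight, isSentiment) instead of setting them separately.

```python
# Hypothetical miniature lexicons, for illustration only.
SENTIMENT_LEXICON = {"good": 1.0, "general": -1.0, "little": -1.0}
AD_LEXICON = {"good", "general"}
SIZE_WORDS = {"greatly", "little"}


def analyze_sentiment_word(word, pos, next_pos):
    """Return (weight, is_sentiment) for one tagged word.

    pos is the POS tag of the word, next_pos the tag of the next word.
    """
    if word not in SENTIMENT_LEXICON:
        weight = 0.0                      # neutral word
    elif word in AD_LEXICON and "d" in pos:
        weight = 0.0                      # used as an adverb here
    elif word in SIZE_WORDS and "a" in next_pos:
        weight = 0.0                      # modifies a following adjective
    else:
        weight = SENTIMENT_LEXICON[word]
    return weight, weight != 0


# Example 6.1: "general" tagged a carries sentiment, tagged ad does not.
print(analyze_sentiment_word("general", "a", "wj"))
print(analyze_sentiment_word("general", "ad", "d"))
```

As in the pseudocode, the part-of-speech checks are substring tests, so a tag such as "ad" satisfies both the "a" and the "d" condition.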

Analysis of negation words

A negation word is a word that expresses negation, and its presence can change the sentiment orientation of the original sentence. In example 6.3, "like" is a positive evaluation, but when the negation word "not" is placed before it, the positive evaluation becomes a negative one. The negation word dictionary determined in this work, shown in Table 6.4, contains 40 negation words in total.

Table 6.4 negative word dictionary

Besides a single negation word, double negation also occurs frequently in Chinese, that is, an even number of negation words appear in one sentence. In example 6.4, "cannot" and "not" are both negation words; they modify the sentiment word "like" together, so the positive sentiment orientation of "like" is ultimately restored.

Example 6.3: I/rr not/d like/vi it/rr ./wj

Example 6.4: I/rr cannot/d not/d like/vi it/rr ./wj

Negation words are analyzed here as follows: when a sentiment word WSi appears in the sentence, count the number negNum(WSi-1, WSi) of negation words between WSi and the previous separator WSi-1, that is, within one phrase. If negNum is odd, the sentiment value of the phrase is the negated sentiment orientation value of the sentiment word; otherwise the original sentiment orientation value is kept.

The number of negation words is counted as follows: scan the words of sentence s one by one; on reaching WSi-1, take it as the starting point and fetch the following words word one by one, calling the function isNegWord to determine whether word is in the negation word dictionary. If it is, negNum is incremented by one. This continues until the next sentiment word WSi is reached. During this process each word is stored in turn in the array variable phrase(WSi-1, WSi), which completes the extraction of one phrase.

Let phrases denote the set of all phrases in a short review sentence, phrases[i] the i-th phrase, weight(phrases) the set of sentiment orientation values of all phrases, weight(phrases[i]) the sentiment orientation value of the i-th phrase, and negNum the number of negation words contained in a phrase. The negation word analysis algorithm analyzeNegWord is described as follows:

Algorithm: analyzeNegWord (negation word analysis)

Input: short comment s to be analyzed

Output: phrases, weight(phrases)

i := 0; negNum := 0;
for each word w in s
    phrases[i].appendWord(w); // append word w to the end of the word sequence of phrases[i]
    if (isInNegLexicon(w)) then
        negNum++;
    else
        weight := analyzeSentimentWord(w);
        if (weight != 0) then
            if (negNum % 2 == 0) then
                weight(phrases[i]) := weight;
            else
                weight(phrases[i]) := -1 * weight;
            end if
            i++; negNum := 0;
        end if
    end if
end for

In the negation word analysis algorithm, the function isInNegLexicon checks whether a word is in the negation word dictionary, and the function analyzeSentimentWord returns the sentiment orientation value produced for the word by the sentiment word analysis algorithm.
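The negation word analysis can be sketched in Python as below. This is a minimal illustration, not the patent's implementation: the toy lexicons, the `sentiment_weight` callback (standing in for analyzeSentimentWord), and the choice to give a trailing segment without a sentiment word the weight 0.0 are all assumptions made for the example.

```python
from typing import Callable, List, Set, Tuple


def analyze_neg_word(
    words: List[str],
    neg_lexicon: Set[str],
    sentiment_weight: Callable[[str], float],
) -> Tuple[List[List[str]], List[float]]:
    """Cut a comment into segments that end at sentiment words and score each one."""
    phrases: List[List[str]] = []
    weights: List[float] = []
    current: List[str] = []
    neg_num = 0
    for w in words:
        current.append(w)              # append w to the current segment
        if w in neg_lexicon:
            neg_num += 1
            continue
        weight = sentiment_weight(w)   # 0 means "not a sentiment word"
        if weight != 0:
            # An odd number of negation words flips the sign of the weight.
            weights.append(weight if neg_num % 2 == 0 else -weight)
            phrases.append(current)
            current, neg_num = [], 0
    if current:
        # Assumption: leftover words form a final segment with neutral weight.
        phrases.append(current)
        weights.append(0.0)
    return phrases, weights


# Hypothetical toy lexicons, using the English glosses of Examples 6.3 and 6.4.
toy_sentiment = {"like": 1.0}
toy_negations = {"not", "cannot"}
single = analyze_neg_word(
    ["I", "not", "like", "it", "."], toy_negations, lambda w: toy_sentiment.get(w, 0.0)
)
double = analyze_neg_word(
    ["I", "cannot", "not", "like", "it", "."], toy_negations, lambda w: toy_sentiment.get(w, 0.0)
)
```

With these toy inputs, the single negation yields a negative weight for the first segment and the double negation keeps it positive.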

Analysis of adverbs

Adverbs are words that express the intensity of a sentiment. For example, "very" in "I am very delighted with it" expresses a strong positive sentiment, whereas "comparatively" in "I comparatively like it" expresses only a relatively weak positive sentiment. According to how strongly an adverb modifies a sentiment word, we divide adverbs into 4 classes and assign each class a numerical value representing sentiment intensity. The resulting adverb dictionary used in this work is shown in Table 6.5.

Table 6.5 Adverb dictionary

As in the analysis of negation words, we look up the intensity of each adverb in a segment in the adverb dictionary, and take the product of these intensity values and the previously obtained sentiment orientation value of the segment as the new sentiment orientation value of the segment.

Let phrase denote a segment obtained by the negation word analysis algorithm, weight(phrase) the weight of phrase obtained by that algorithm, and degree the intensity of the adverbs. The adverb analysis algorithm analyzeAdvWord is described below:

Algorithm: analyzeAdvWord (adverb analysis)

Input: phrase, weight(phrase)

Output: weight(phrase)

degree := 1.0;
for each word w in phrase
    if (isInAdvLexicon(w)) then
        degree := degree * getDegreeFromAdvLexicon(w);
    end if
end for
weight(phrase) := degree * weight(phrase);

In the adverb analysis algorithm, the function isInAdvLexicon checks whether a word is in the adverb dictionary, and the function getDegreeFromAdvLexicon returns the sentiment intensity of an adverb from the adverb dictionary.
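A short Python sketch of the adverb analysis follows. The adverb dictionary is modelled as a plain mapping from adverb to intensity; the intensity values shown are hypothetical, since the contents of Table 6.5 are not reproduced in this text.

```python
from typing import Dict, List


def analyze_adv_word(phrase: List[str], weight: float, adv_lexicon: Dict[str, float]) -> float:
    """Scale a segment's sentiment weight by the intensities of its adverbs."""
    degree = 1.0
    for w in phrase:
        if w in adv_lexicon:            # plays the role of isInAdvLexicon
            degree *= adv_lexicon[w]    # plays the role of getDegreeFromAdvLexicon
    return degree * weight


# Hypothetical intensities, chosen only for illustration.
toy_adverbs = {"very": 2.0, "comparatively": 0.5}
strong = analyze_adv_word(["I", "very", "like"], 1.0, toy_adverbs)
weak = analyze_adv_word(["I", "comparatively", "like"], 1.0, toy_adverbs)
```

Because the intensities are multiplied, several adverbs in one segment compound rather than add.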

Analysis of fixed collocations

Through experiments we found that certain specific collocations occur in some comments. Some of these collocations contain sentiment words, yet the collocation changes the sentiment that the sentiment word contributes to the whole sentence; others contain no sentiment word, yet give the whole sentence a sentiment orientation. In both cases it is not enough to compute sentiment weights from sentiment words, negation words, and adverbs alone; the fixed collocations also need to be analyzed.

Fixed collocations are divided here into the following four kinds.

1) Fixed collocations made up of adverbs (d) or conjunctions (c), such as Example 6.5. We stipulate that they are analyzed before any of the other analyses described in this chapter begin.

Example 6.5 If/c again/d beautiful/a 1 point/m just/d good/a /y ./wj

Although this sentence contains positive sentiment words such as "beautiful" and "good", the fixed collocation "if ... just good" gives the sentence a negative sentiment.

We design the algorithm matchAdvConjPatterns to handle fixed collocations of adverbs and conjunctions. The procedure is: based on the rule set acPatterns for fixed collocations of adverbs and conjunctions, use regular expressions to test whether the evaluation sentence S matches acPatterns. If it does, the algorithms described earlier are no longer used to compute the sentiment orientation values of the segments in S; instead, S is assigned a sentiment weight directly.

Let posWord denote some positive basic sentiment word. There are the following 4 groups of fixed collocation rules:

(1) if if if/again/more///a bit/many/can ... just ... + posWord+ ... }

(2). ... again/more/have ...

(3) just. ... too ... "

(4) need// obtain/after ./occupy ./sky/all/also/heavy. ... ...

They can be expressed as regular expressions:

(2) "best"+[\u4E00-\u9FA5]*+["again"|"more"|"having"]+[\u4E00-\u9FA5]*

(3) "be exactly"+[\u4E00-\u9FA5]*+"too"+[\u4E00-\u9FA5]*

The pseudocode of the matchAdvConjPatterns algorithm is as follows:

Algorithm: matchAdvConjPatterns (matching fixed collocations of adverbs and conjunctions)

Input: S, acPatterns (the set of regular expressions for the fixed collocation rules of adverbs and conjunctions)

Output: weight(S)

if (acPatterns.match(S)) then
    weight(S) := -0.5;
else
    compute weight(S) with the other methods described in this chapter;
end if
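The matching step can be sketched with Python's `re` module as below. The Chinese literals are assumed back-translations of the English glosses in rules (2) and (3) and are not taken from the source text; the `fallback` callback stands in for "the other methods described in this chapter", and using `search` rather than a full match is likewise an assumption.

```python
import re
from typing import Callable

# Zero or more CJK characters, as in the [\u4E00-\u9FA5]* parts of the rules.
CJK = r"[\u4E00-\u9FA5]*"

# Assumed back-translations: "best" -> 最好, "again"/"more"/"having" -> 再/更/有,
# "be exactly" -> 就是, "too" -> 太. Replace with the real rule set where available.
AC_PATTERNS = [
    re.compile("最好" + CJK + "(?:再|更|有)" + CJK),  # rule (2)
    re.compile("就是" + CJK + "太" + CJK),            # rule (3)
]


def match_adv_conj_patterns(
    s: str, fallback: Callable[[str], float], patterns=AC_PATTERNS
) -> float:
    """Return -0.5 when S matches a fixed collocation, else defer to the fallback."""
    if any(p.search(s) for p in patterns):
        return -0.5
    return fallback(s)


matched = match_adv_conj_patterns("最好再便宜一点", lambda s: 0.7)
unmatched = match_adv_conj_patterns("很好", lambda s: 0.7)
```

Passing the rule set as a parameter keeps the function independent of any particular set of patterns.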

2) Fixed collocations of ambiguous sentiment words. When some sentiment words are combined with different words, their sentiment orientation differs: it keeps its initial value, is negated, or becomes neutral. Such sentiment words are called ambiguous sentiment words here; see Example 6.6 and Example 6.7.

Example 6.6

Sentence 1: cost performance/n very/d high/a ./wj

Sentence 2: price/n high/a /y ./wj

"High" is a positive sentiment word in the sentiment dictionary. In sentence 1, where "high" is combined with "cost performance", the sentiment orientation value of the whole sentence takes the original weight of the sentiment word; in sentence 2, where it is combined with "price", it conveys a negative sentiment.

Example 6.7

Sentence 1: quite/d large/a /ude ./wj

Sentence 2: large/a /y point/qt ./wj

"Large" is a positive sentiment word in the sentiment dictionary. We find that when an adjective like "large" is combined with "point", its original sentiment orientation is reversed.

As with the handling of fixed collocations of adverbs and conjunctions described above, two groups of rules are defined here for fixed collocations of ambiguous sentiment words: the first group negates the sentiment orientation value, and the second group resets it to zero. The sentiment orientation value of a segment is recomputed according to whichever rule applies. The first group of rules is denoted ambigNegPatterns and the second ambigZeroPatterns; negWord denotes some negative sentiment word. The fixed collocation rules for ambiguous sentiment words are defined as follows.

ambigNegPatterns comprises the following 5 rules:

[\u4E00-\u9FA5]*+["just"|"afterwards"|"again"]+[\u4E00-\u9FA5]*+"price reduction"+[\u4E00-\u9FA5]*

[\u4E00-\u9FA5]*+"price reduction"+[\u4E00-\u9FA5]*+"too fast"+[\u4E00-\u9FA5]*

[\u4E00-\u9FA5]*+"point"+[\u4E00-\u9FA5]*

"use"+[\u4E00-\u9FA5]*+"for a long time"

ambigZeroPatterns comprises the following 3 rules:

[\u4E00-\u9FA5]*+"temporarily"+[\u4E00-\u9FA5]*

[\u4E00-\u9FA5]*+"also useless"

[\u4E00-\u9FA5]*+"not knowing"+[\u4E00-\u9FA5]*+"how"

The pseudocode of the ambiguous sentiment word analysis algorithm is as follows:

Algorithm: analyzeAmbigEmotionWord (analysis of ambiguous sentiment words)

Input: phrase, weight(phrase)

Output: weight(phrase)

if (ambigNegPatterns.match(phrase)) then
    weight(phrase) := -1 * weight(phrase);
else if (ambigZeroPatterns.match(phrase)) then
    weight(phrase) := 0;
end if
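A compact Python sketch of this two-group check follows. Only one rule per group is shown, and the Chinese literals 点 ("point") and 暂时 ("temporarily") are assumed back-translations of the English glosses rather than text from the source.

```python
import re

CJK = r"[\u4E00-\u9FA5]*"

# One illustrative rule per group; the literals are assumptions, not the source rules.
AMBIG_NEG_PATTERNS = [re.compile(CJK + "点" + CJK)]
AMBIG_ZERO_PATTERNS = [re.compile(CJK + "暂时" + CJK)]


def analyze_ambig_emotion_word(
    phrase: str,
    weight: float,
    neg_patterns=AMBIG_NEG_PATTERNS,
    zero_patterns=AMBIG_ZERO_PATTERNS,
) -> float:
    """Negate, zero, or keep a segment weight depending on which rule group matches."""
    if any(p.search(phrase) for p in neg_patterns):
        return -weight
    if any(p.search(phrase) for p in zero_patterns):
        return 0.0
    return weight


reversed_weight = analyze_ambig_emotion_word("大了点", 0.5)
neutral_weight = analyze_ambig_emotion_word("暂时没问题", 0.5)
```

As in the pseudocode, the negating group is tested first, so a segment matching both groups is negated rather than zeroed.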

3) Fixed collocations of reverse sentiment words. The sentiment orientation of some sentiment words is reversed when they are preceded by an adverb of strong intensity, that is, an adverb with a large weight. We call such sentiment words reverse sentiment words.

Example 6.8 too/d large/a /y ./wj

When the sentiment word "large" is joined to an adverb with a large weight such as "too", its positive sentiment orientation is reversed.

To analyze this type of fixed collocation, we build the reverse sentiment dictionary shown in Table 6.6 to store reverse sentiment words, and assume that an adverb whose weight is greater than 0.5 can reverse the sentiment value of a reverse sentiment word.

Table 6.6 Reverse sentiment dictionary

Bright, large, easy, small and exquisite, light, in vain, simply, tightly, thin, gently, heavy, long, high

The processing of fixed collocations of reverse sentiment words can be described as follows.

While scanning from the previous sentiment word to the sentiment word Wsi, the variable advWeight records the weight of each adverb encountered, so that when the scan reaches Wsi, advWeight holds the weight of the adverb nearest to Wsi. First check whether advWeight is greater than 0.5, then check whether a negation word occurs before Wsi. If no negation word occurs, the function isOppositeWord(Wsi) checks whether Wsi belongs to the reverse sentiment dictionary; if it does, the sentiment value of phrase(Wsi-1, Wsi) is reversed. If a negation word does occur before Wsi, the sentiment value of phrase(Wsi-1, Wsi) is left unchanged, because in the combination "negation word + adverb + sentiment word" the sentiment word does not behave as a negative sentiment word. For example, in the short sentence "not too easy", the positive sentiment word "easy" is not turned into a negative sentiment word by the preceding adverb "too".

Let negNum(Wsi-1, Wsi) be the number of negation words contained in phrase(Wsi-1, Wsi). The algorithm for processing fixed collocations of reverse sentiment words can be described in pseudocode as follows:

Algorithm: matchOppositeEmotionPatterns (matching fixed collocations of reverse sentiment words)

Input: weight(phrase(Wsi-1, Wsi)), advWeight, negNum(Wsi-1, Wsi)

Output: weight(phrase(Wsi-1, Wsi))

if ((advWeight > 0.5) && (negNum(Wsi-1, Wsi) == 0) && (isOppositeWord(Wsi))) then
    weight(phrase(Wsi-1, Wsi)) := -1 * weight(phrase(Wsi-1, Wsi));
end if
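The three-way condition can be sketched in Python as below. The reverse sentiment dictionary is passed in as a set; the two-word toy set and the numeric weights are hypothetical values used only to exercise the function.

```python
from typing import Set


def match_opposite_emotion_patterns(
    weight: float,
    adv_weight: float,
    neg_num: int,
    sentiment_word: str,
    reverse_lexicon: Set[str],
) -> float:
    """Flip a segment weight when a strong adverb precedes a reverse sentiment word."""
    if adv_weight > 0.5 and neg_num == 0 and sentiment_word in reverse_lexicon:
        return -weight
    return weight


# Hypothetical entries and weights, for illustration only.
toy_reverse = {"large", "easy"}
too_large = match_opposite_emotion_patterns(0.6, 0.9, 0, "large", toy_reverse)
not_too_easy = match_opposite_emotion_patterns(0.6, 0.9, 1, "easy", toy_reverse)
```

The second call mirrors the "not too easy" case: a negation word is present, so the weight passed in is returned unchanged.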

4) Fixed collocations of negation words. We stipulate that this analysis is carried out only when the whole comment S contains no sentiment word.

In the field of product reviews, we notice that some evaluation sentences contain no sentiment words at all, only negation words, such as Example 6.9. Under the method described in Section 6.2.2, such a sentence would be identified as neutral. Yet we can clearly perceive its negative sentiment, which is conveyed by the negation words.

Example 6.9 electric fans/n all/d do not have/d !/wt

This sentence contains two negation words ("no" and "not having"), yet its sentiment orientation is negative.

The solution to this problem is to summarize several fixed collocations of negation words into a rule set, negPatterns.

On reaching the end of S, first check whether S matches a negPatterns rule; if it does, the weight of S is set to -0.5. If it matches no rule, the parity of the number of negation words is used to assign S a sentiment orientation value.

At present negPatterns contains only one rule; new rules can be added later.

["no"|"not having"]+[\u4E00-\u9FA5]*+"just"+["not having"|"None"]+[\u4E00-\u9FA5]*

The algorithm for processing fixed collocations of negation words can be described in pseudocode as follows:

Algorithm: matchNegPatterns (matching fixed collocations of negation words)

Input: S, negNum(S)

Output: weight(S)

if ((negPatterns.match(S)) || (negNum(S) % 2 != 0)) then
    weight(S) := -0.5;
else
    weight(S) := 0.5;
end if
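A Python sketch of this rule follows. The single pattern uses assumed back-translations of the English glosses (不/没有 for "no"/"not having", 就 for "just", 没有/无 for "not having"/"None"); these literals and the sample sentences are illustrative assumptions, not the source rule.

```python
import re

CJK = r"[\u4E00-\u9FA5]*"

# Assumed rendering of the single negPatterns rule.
NEG_PATTERNS = [re.compile("(?:不|没有)" + CJK + "就(?:没有|无)" + CJK)]


def match_neg_patterns(s: str, neg_num: int, patterns=NEG_PATTERNS) -> float:
    """Score a comment that has no sentiment words from its negation words alone."""
    if any(p.search(s) for p in patterns) or neg_num % 2 != 0:
        return -0.5
    return 0.5


by_pattern = match_neg_patterns("不到一个月就没有声音了", 2)  # rule matches
by_parity = match_neg_patterns("风扇都没有", 1)               # odd number of negations
```

The pattern test comes first, so a matching comment is scored negative even when its negation count is even.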

Analysis of adversatives

An adversative is a word that reverses the meaning of a sentence.

Example 6.11

than/p market/n /ude1 cheap/a ,/wd but/c sells/v after/f also/d do not have/v market/n good/a ./wj

The first clause expresses a positive sentiment; however, once the adversative "but" appears, the sentiment of the sentence leans negative.

The adversative dictionary designed in this work is shown in Table 6.7.

Table 6.7 Adversative dictionary

A sentence containing an adversative can be abstracted into the following structure:

phrase(Wsi-1, Wsi) + punctuation mark + adversative + phrase(Wsi, Wsi+1)

Given the effect of the adversative, the sentiment orientations of weight(phrase(Wsi-1, Wsi)) and weight(phrase(Wsi, Wsi+1)) should be opposite.

The adversative analysis is as follows. After the sentiment word, negation word, and adverb analyses, scan onward from the current sentiment word Wsi to find the next sentiment word Wsi+1. If an adversative is encountered during this scan, negate weight(phrase(Wsi-1, Wsi)), so that the sentiment orientation of phrase(Wsi-1, Wsi) leans toward that of the segment phrase(Wsi, Wsi+1) following the adversative. Let phrases denote the set of all segments in a short comment, weight(phrases) the sentiment orientation values of these segments, numPhrases the total number of segments, phrases[i] the i-th segment, and isTransitionWord(word) the function that checks whether the word word belongs to the adversative dictionary. The adversative analysis algorithm can be described in pseudocode as follows:

Algorithm: analyzeTransitionWord (adversative analysis)

Input: phrases, weight(phrases)

Output: weight(phrases)

for (i = 0; i < numPhrases - 1; i += 2)
    for word in phrases[i+1]
        if (isTransitionWord(word)) then
            weight(phrases[i]) := -1 * weight(phrases[i]);
            break;
        end if
    end for
end for
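The adversative pass can be sketched in Python as follows. The step of 2 over segment pairs is kept exactly as in the pseudocode; the toy adversative set and the sample segments (English glosses loosely based on Example 6.11) are assumptions for illustration.

```python
from typing import List, Set


def analyze_transition_word(
    phrases: List[List[str]], weights: List[float], transition_lexicon: Set[str]
) -> List[float]:
    """Negate the weight of a segment when the following segment holds an adversative."""
    new_weights = list(weights)  # leave the caller's list untouched
    for i in range(0, len(phrases) - 1, 2):
        # any() stops at the first adversative, like the break in the pseudocode.
        if any(word in transition_lexicon for word in phrases[i + 1]):
            new_weights[i] = -new_weights[i]
    return new_weights


toy_segments = [["than", "market", "cheap"], [",", "but", "after-sales", "not", "good"]]
adjusted = analyze_transition_word(toy_segments, [0.5, -0.5], {"but"})
```

After the pass, the first segment's weight has the same sign as the segment that follows the adversative.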

Analysis of exclamatory sentences and rhetorical questions

Exclamatory sentences and rhetorical questions are both sentence patterns that intensify the sentiment orientation of a sentence. Exclamatory sentences only intensify, whereas rhetorical questions can also reverse the sentiment orientation.

For exclamatory sentences, we take the exclamation mark "!" as the marker of an exclamatory sentence and denote it exc. Its sentiment weight is computed as follows: when an exclamation mark is scanned, we search backward for the sentiment word Wsi-1 nearest to the exclamation mark and use the sentiment orientation value of Wsi-1 as the weight of exc.
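The exclamation mark rule can be sketched in Python as below. Tracking the most recent sentiment word during a forward scan is equivalent to searching backward from each exclamation mark. Returning 0.0 when no sentiment word precedes the mark, and the toy lexicon, are assumptions of this example.

```python
from typing import Callable, List


def exclamation_weights(
    words: List[str], sentiment_weight: Callable[[str], float], mark: str = "!"
) -> List[float]:
    """For each exclamation mark, return the weight of the nearest preceding sentiment word."""
    result: List[float] = []
    last_weight = 0.0  # assumption: neutral when no sentiment word has been seen yet
    for w in words:
        if w == mark:
            result.append(last_weight)
            continue
        weight = sentiment_weight(w)
        if weight != 0:
            last_weight = weight
    return result


toy_sentiment = {"good": 1.0, "bad": -1.0}
exc = exclamation_weights(["really", "good", "!"], lambda w: toy_sentiment.get(w, 0.0))
```

Each returned value is the weight of one exc marker, which can then be added to the segment weights of the comment.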

Unlike traditional sentiment models, our sentiment model applies no special processing to rhetorical questions. This is because, in the set of short product reviews, most sentences containing rhetorical question words do not actually reverse their meaning but instead express doubt about the product. For instance, in Example 6.12, although " " is a typical rhetorical question word, it does not actually reverse the sentiment orientation in this sentence.

Example 6.12

The inside/f unexpectedly/d has/vyou 60/m many/m M/x /ude1 file/n ,/wd /d are/vshi second hand/n ?/ww

Sentiment orientation value weighted calculation

Compute the sentiment orientation values of all segments contained in a comment S. Adding these values gives the sentiment orientation value weight(S) of S. The sentiment polarity of S is then judged as follows: if weight(S) is greater than 0, the comment is a positive comment; otherwise, S is considered a negative comment.
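The final step can be sketched in a few lines of Python. The function name and the string labels are illustrative choices; the threshold follows the rule stated above, under which a total of exactly 0 counts as negative.

```python
from typing import List, Tuple


def classify_comment(segment_weights: List[float]) -> Tuple[float, str]:
    """Sum the segment weights of a comment and derive its polarity."""
    total = sum(segment_weights)
    # Per the stated rule, only a strictly positive total is a positive comment.
    return total, ("positive" if total > 0 else "negative")


total, polarity = classify_comment([0.5, -0.25, 0.25])
```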

Claims

1. A short text sentiment orientation analysis method based on a sentiment dictionary, characterized by comprising the following steps:

Step 1: construct the sentiment dictionary. A basic sentiment dictionary is formed by a word frequency statistics method; the SO-PMI method is then used to compute the statistical correlation between candidate words and the words in the basic sentiment dictionary and thereby determine their sentiment orientation, so as to expand the basic dictionary;

Step 2: construct the sentiment analysis model. On the basis of the sentiment dictionary, each evaluation sentence S is taken as a unit and each sentiment word WS in the sentence as a separator; a sentiment weight is computed for the segment phrase(WSi-1, WSi) between every two separators, and the weights of all segments are then summed with weighting to obtain the overall sentiment orientation value weight(S) of each evaluation sentence S. The sentiment polarity of each evaluation sentence S is judged as follows: if weight(S) is greater than 0, the comment is a positive comment; otherwise, the evaluation sentence S is considered a negative comment. Polarity classification of evaluation sentences is thereby achieved. The segment phrase(WSi-1, WSi) contains the word WSi but does not contain the word WSi-1.

2. The short text sentiment orientation analysis method based on a sentiment dictionary according to claim 1, characterized in that the SO-PMI method comprises the following steps:

The SO-PMI value of each of the two kinds of candidate word word is calculated specifically as:

$$\text{SO-PMI}(\mathit{word}) = \sum_{\mathit{posWord} \in \mathit{posWords}} \mathrm{PMI}(\mathit{word}, \mathit{posWord}) - \sum_{\mathit{negWord} \in \mathit{negWords}} \mathrm{PMI}(\mathit{word}, \mathit{negWord})$$

(formula 1)

(formula 2)