CN103544246A

CN103544246A - Method and system for constructing multi-emotion dictionary for internet

Info

Publication number: CN103544246A
Application number: CN201310470531.2A
Authority: CN
Inventors: 刘奕群; 马少平; 张敏; 金奕江; 张阔
Original assignee: Tsinghua University; Beijing Sogou Technology Development Co Ltd
Current assignee: Tsinghua University; Beijing Sogou Technology Development Co Ltd
Priority date: 2013-10-10
Filing date: 2013-10-10
Publication date: 2014-01-29

Abstract

Provided are a method and system for constructing a multi-emotion dictionary for the internet. The method includes the following steps: an internet text corpus is obtained from the internet; data preprocessing is carried out on the obtained text corpus to obtain candidate words of the emotion dictionary; new words are extracted from the obtained text corpus to obtain further candidate words of the emotion dictionary; an undirected graph model is constructed from the obtained candidate words; and the multiple emotion scores of each node in the undirected graph are computed iteratively by means of the undirected graph model and a label propagation algorithm to construct the emotion dictionary. With the method and system, different seed words can be adopted to construct emotion dictionaries for different emotions, so that the results of emotion recognition are richer.

Description

Method and system for constructing a multiple-sentiment dictionary for the internet

Technical field

The present invention relates to the field of intelligent processing of network information, and in particular to a method and system for constructing a sentiment dictionary from the emotions and moods expressed in internet text.

Background technology

With the development of the internet, social media have emerged in large numbers. Social media are exchange platforms that use the internet as their medium and let users share opinions and experiences; they have gathered a large amount of user-generated content that directly reflects people's moods, viewpoints and preferences. Text content in social media includes blogs, microblogs, forum discussions, product reviews and so on. It is the carrier through which users express personal emotion, and it has a significant impact on public opinion, brand reputation and product evaluation. Sentiment analysis of text from these media has therefore become a hot topic in recent years. Text sentiment analysis is the computer technology of identifying the emotional tendency expressed by a piece of text. In theory, the emotions people express in text are very complex: besides approval (praise) and opposition (criticism), a text may also express moods such as happiness, anger, sadness, fear and surprise. Current research in computational linguistics, however, generally divides emotional tendency into commendatory and derogatory, sometimes adding neutral or mixed. Simplification to this degree meets people's needs to a certain extent and has broad application prospects.

Identifying the user sentiment embodied in text has therefore become a key technology in the network information field and plays an important role in business, politics and social events. For example, in the product reviews of an e-commerce website, automatically identifying whether consumers praise or criticize a product, or even each attribute of the product, can help other consumers make purchasing decisions that suit them and can help manufacturers find the product's strengths and weaknesses and improve it. On film review websites, viewers evaluate factors such as a film's plot, actors and cinematography; identifying the polarity of these evaluations automatically gives a comprehensive picture of the audience's response to a film. In business, the word-of-mouth that a group of users forms about a brand or product is one kind of user information that merchants value: evaluations passed from user to user affect a merchant's reputation, and merchants can extend a product's influence and steer consumer behavior through marketing in internet media. By capturing hot topics related to a given industry on microblogs and analyzing their sentiment trend, stock movements can be predicted. In many political events, netizens use the internet as a platform for transmitting and publishing information, so during elections in many countries the leanings of voters and of the different camps are reflected to some extent on microblogs; researchers therefore use relevant microblogs for advance prediction or after-the-fact analysis and to explore the influence of online public opinion on elections.

The outstanding difference between social media text and traditional media text is that its language is non-standard and its wording is free. Traditional natural language processing methods usually perform grammatical analysis on text and depend on linguistic knowledge. For social media text, however, the wording may be neither standard nor grammatical, so the accuracy of traditional analysis drops sharply. Moreover, some user-coined new words do not exist in traditional dictionaries (so-called "unregistered words"), or their meanings have changed greatly, which severely limits the application of traditional methods.

The output of text sentiment analysis is usually a category such as commendatory or derogatory, so text sentiment analysis can be carried out with machine learning methods as a classification task. On the product review and film review websites mentioned above, users usually attach a rating to each review; this rating can serve as a score of the review's sentiment, i.e. as a label for the review text, so the reviews and ratings can be used as a training corpus for supervised machine learning. These methods all use words (tuples) as features and combine them with a classifier (such as a naive Bayes model, a maximum entropy model or a support vector machine) to carry out supervised training and testing. Without a sufficient training corpus, supervised learning methods have no room to work. For internet text as voluminous as microblogs, manual annotation can cover only a very small number of posts, which restricts the applicable fields and scale. By analogy with using review ratings as class labels, one can assume that emoticons in microblog text (such as the smiley ":-)" or the crying face ":-(") indicate its emotional tendency and train with the occurrence of these symbols as class labels. As class labels, however, these emoticons are often noisy, and they are limited by the variants and the variety of the symbols. Sentiment classification based on supervised learning is therefore severely constrained, and unsupervised methods based on a sentiment dictionary still play a very important role.

A sentiment dictionary is a dictionary containing sentiment words and their emotional tendencies. These sentiment words are mainly adjectives that express a clear emotional tendency, for example "good" and "bad", or "happy" and "sad". In practice, manually built sentiment dictionaries are limited by cost and scale and are hard to extend. A sentiment dictionary can instead be built automatically from a text corpus by exploiting features of the text. Such automatic methods usually start from a small subset of sentiment words (or rules), then use the connections between words to expand the set gradually and compute the emotional tendency of more words. The automatic construction of a sentiment dictionary mainly faces the following problems:

Selection of candidate sentiment words: most sentiment words are adjectives, so usually only adjectives are used as candidate sentiment words. For slightly more complex cases, rules can be used to extract richer sentiment words or sentiment phrases.

Measuring lexical relations: to spread from a small set of sentiment seed words (seed words for short) to a large set of words, the lexical relations should reflect the emotional connections between words. These connections generally include the following. Co-occurrence relations: commendatory words tend to occur together with commendatory words and derogatory words with derogatory words, so co-occurrence within a sentence can establish a connection between words. Relations established through conjunctions within a sentence (such as "and" or "but"): these are far fewer than co-occurrence relations but of higher quality. At a deeper level, semantic relations, such as the synonym and antonym relations of WordNet.

Propagation of emotional tendency: words and the connections between them form a graph, and a suitable computation is needed to propagate the emotional tendency scores of the seed words to more words. For example, in a graph built from synonym and antonym relations, words of the same polarity can be clustered according to the type of the edges; alternatively, point-wise mutual information (PMI) can be used to compute the relation between a new word and existing words. In graph-based models, graph propagation or label propagation can also be used.

These problems show that although sentiment analysis with a sentiment dictionary avoids the bottleneck of a training corpus, the construction of the sentiment dictionary itself is critical. If the dictionary is small, many sentiment words are missed and the emotional tendency of a text cannot be identified; short texts in particular are then unlikely to hit any sentiment word. If the dictionary is of low quality, the sentiment analysis results will also be wrong.

Summary of the invention

In view of the above, it is necessary to provide a method and system for constructing a multiple-sentiment dictionary for the internet, which use the co-occurrence relations of the basic units (such as words and symbols) that express emotion in internet text, combined with new word discovery, to construct a sentiment dictionary automatically by iterative propagation.

A method for constructing a multiple-sentiment dictionary for the internet comprises: an obtaining step of obtaining an internet text corpus from the internet; a data preprocessing step of preprocessing the obtained text corpus to obtain candidate words of the sentiment dictionary; a new word extraction step of extracting new words from the obtained text corpus to obtain candidate words of the sentiment dictionary; a graph model building step of building an undirected graph model from the resulting candidate words of the sentiment dictionary; and an iterative computation step of iteratively computing, using the undirected graph model and a label propagation algorithm, the multiple emotion scores of each node in the undirected graph to build the sentiment dictionary.

A system for constructing a multiple-sentiment dictionary for the internet comprises: an acquisition module for obtaining an internet text corpus from the internet; a data preprocessing module for preprocessing the obtained text corpus to obtain candidate words of the sentiment dictionary; a new word extraction module for extracting new words from the obtained text corpus to obtain candidate words of the sentiment dictionary; a graph model building module for building an undirected graph model from the resulting candidate words of the sentiment dictionary; and an iterative computation module for iteratively computing, using the undirected graph model and a label propagation algorithm, the multiple emotion scores of each node in the undirected graph to build the sentiment dictionary.

Compared with the prior art, the present invention addresses the shortcomings of the sentiment dictionaries used in existing sentiment analysis algorithms for internet text and proposes a method for building a sentiment dictionary for identifying multiple emotions in internet text. Unlike traditional methods, this method builds the dictionary from informal text elements that are characteristic of internet text, such as emotion marks, internet neologisms, emoticons and misspelled words, and is not limited to the traditional sentiment words of a single language or domain. Adopting different seed words makes it possible to construct sentiment dictionaries for different moods (such as happiness, anger, sadness, fear and surprise), so that the results of emotion recognition are richer.

Brief description of the drawings

Fig. 1 is a diagram of the application environment of the system for constructing a multiple-sentiment dictionary for the internet according to the present invention.

Fig. 2 is a functional block diagram of a preferred embodiment of the system for constructing a multiple-sentiment dictionary for the internet according to the present invention.

Fig. 3 is a flowchart of a preferred embodiment of the method for constructing a multiple-sentiment dictionary for the internet according to the present invention.

Fig. 4 is a schematic diagram of typical high-frequency tuples.

Fig. 5 is a schematic diagram of the undirected graph model.

Fig. 6 is a schematic diagram of the co-occurrence matrix.

Fig. 7 is a schematic diagram of the commendatory and derogatory scores of words.

Fig. 8 is a schematic diagram of the mood scores of words.

Description of main reference numerals

Computing device	1
System for constructing a multiple-sentiment dictionary for the internet	10
Memory	20
Processor	30
Display device	40
Input device	50
Acquisition module	100
Data preprocessing module	101
New word extraction module	102
Graph model building module	103
Iterative computation module	104

The following detailed description further illustrates the present invention with reference to the above drawings.

Embodiment

Fig. 1 shows the application environment of a preferred embodiment of the system 10 for constructing a multiple-sentiment dictionary for the internet according to the present invention (hereinafter "system 10"). The system 10 runs on a computing device 1. The computing device 1 further comprises a memory 20, a processor 30, a display device 40 and an input device 50 connected by a data bus. The computing device 1 may be a computer, a mobile phone, a PDA (personal digital assistant) or the like.

The memory 20 stores the program code and other data of the system 10. The display device 40 may be the liquid crystal display of a computer, the touch screen of a mobile phone, and so on. The input device 50, for example a keyboard or a mouse, is used to enter the various data set by the user.

Fig. 2 is a functional block diagram of a preferred embodiment of the system 10. In one embodiment, the system 10 mainly comprises an acquisition module 100, a data preprocessing module 101, a new word extraction module 102, a graph model building module 103 and an iterative computation module 104. Modules 100-104 are program segments consisting of computer instructions that perform specific functions; the term "module" is better suited than "program" to describing how the software executes on the computing device 1. The computer instructions of modules 100-104 are stored in the memory 20 and executed by the processor 30 of the computing device 1. The specific functions of modules 100-104 are described below with reference to Fig. 3.

Fig. 3 is a flowchart of a preferred embodiment of the method for constructing a multiple-sentiment dictionary for the internet according to the present invention. Depending on requirements, the order of the steps in the flowchart may be changed and some steps may be omitted.

Step S10: the acquisition module 100 obtains an internet text corpus from the internet.

Step S11: the data preprocessing module 101 preprocesses the obtained text corpus to obtain candidate words of the sentiment dictionary.

The data preprocessing module 101 needs to segment the obtained text corpus into words. A corpus in a space-delimited language (such as English) can be segmented directly on spaces; for languages such as Chinese and Japanese, which do not use spaces as separators, a candidate word set can be obtained by extracting n-tuples (n-grams). From this candidate word set, a certain proportion of high-frequency tuples (usually stop words) and low-frequency tuples (usually personal names, non-words and the like) are removed, and only the remaining mid-frequency tuples are kept as candidate words of the sentiment dictionary. Note that if a suitable word segmentation tool is applied to a given language and then combined with n-tuples to generate the candidate word set, n-tuples that are not words can be removed, which improves the precision of the sentiment dictionary. This processing does not affect the validity of the overall method.

As the content source for building the sentiment dictionary, the internet text corpus must be suitably cleaned before lexical relations can be extracted. The preprocessing steps, including data cleaning, are as follows:

Step 1.1: remove special words from the text corpus. Special words include website links, user-name marks, special characters and the like.
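A minimal sketch of this cleaning step, assuming hypothetical regular expressions for links, user-name marks and special characters (the patent does not specify the exact patterns). Emoticons such as ":-)" are deliberately left intact, since such marks may later serve as seeds.

```python
import re

# Hypothetical patterns chosen for illustration; not taken from the patent.
URL_RE = re.compile(r"https?://\S+")            # website links
MENTION_RE = re.compile(r"@\w+")                # user-name marks such as @alice
SPECIAL_RE = re.compile(r"[#\[\]{}<>|\\^~*]+")  # an illustrative set of special characters

def clean_text(text: str) -> str:
    """Remove links, user-name marks and some special characters, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    return " ".join(text.split())

print(clean_text("so happy :-) @bob http://t.cn/abc"))
```

The character class intentionally omits ":", "-", "(" and ")" so that emoticons survive cleaning; a real corpus would likely need a broader, language-specific rule set.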

Step 1.2: segment the text corpus into words, then generate n-tuples (n<4) from the segmentation result so as to compensate for wrongly segmented words. In this way, three classes of tuple sets are extracted from the text corpus: 1-tuples, 2-tuples and 3-tuples. For example, a Chinese text corpus is segmented with the ICTCLAS tool (Institute of Computing Technology, Chinese Lexical Analysis System).
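The tuple generation in this step can be sketched as follows. The sketch assumes the sentences have already been segmented into token lists (the patent uses ICTCLAS for Chinese; no segmenter is invoked here), and the function names `ngrams` and `count_tuples` are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-tuples of a segmented sentence (empty if the sentence is too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_tuples(sentences: Iterable[List[str]], max_n: int = 3) -> Dict[int, Counter]:
    """Count 1-tuples, 2-tuples and 3-tuples (n < 4) in three separate sets."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts

corpus = [["I", "study", "science"], ["I", "like", "study"]]
tuple_counts = count_tuples(corpus)
print(tuple_counts[2].most_common(2))
```

Keeping one counter per tuple length matches the later step, which filters each of the three tuple sets separately.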

Step 1.3: taking the linguistic characteristics of words into account, remove from each of the three tuple sets the high-frequency tuples (high-frequency words) ranked within a preset number of top positions (for example the top 50) and the low-frequency tuples (low-frequency words) that occur in the text corpus fewer than a preset number of times (for example 3). High-frequency tuples are usually stop words; they have a high chance of co-occurring with all kinds of words, so they express little about emotional character. Low-frequency tuples are usually non-words, user names and the like; they carry no linguistic meaning and therefore need to be removed. The mid-frequency tuples that remain are taken as one part of the candidate words.
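A minimal sketch of the frequency filter, using the example values from the text (top 50 removed, fewer than 3 occurrences removed) as defaults. The function name `mid_frequency` is hypothetical, and how ties at the cut-off rank are broken is an assumption left to `Counter.most_common`.

```python
from collections import Counter

def mid_frequency(counts: Counter, top_k: int = 50, min_count: int = 3) -> set:
    """Keep tuples that are neither among the top_k most frequent nor rarer than min_count."""
    high = {t for t, _ in counts.most_common(top_k)}
    return {t for t, c in counts.items() if t not in high and c >= min_count}

toy_counts = Counter({"the": 100, "good": 5, "bad": 4, "xq": 1})
print(sorted(mid_frequency(toy_counts, top_k=1, min_count=3)))
```

The filter would be applied once to each of the three tuple sets (1-tuples, 2-tuples, 3-tuples).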

The microblog corpus used here comes from Tencent Weibo and consists of 69,715 microblog posts (152,716 sentences). User names, website links and the like were removed, the text was segmented, and the high-frequency and low-frequency tuples were counted and removed. Typical high-frequency tuples are shown in Fig. 4.

Step S12: the new word extraction module 102 extracts new words from the obtained text corpus to obtain candidate words of the sentiment dictionary.

Besides using n-tuples as candidate words, new words discovered with context entropy and mutual information are also used as candidate words of the sentiment dictionary. Because tuples are generated from the word segmentation result, segmentation boundary errors may lead to boundary errors in the generated candidates (a generated candidate "word" may not be an actual word). If, on the other hand, tuples were generated with single characters as the unit, a large amount of non-word noise would be introduced. Therefore, in addition to the tuples generated from the segmentation result, some new words need to be discovered and used as candidate words of the sentiment dictionary. The present invention integrates two new word discovery methods into the construction of the sentiment dictionary for sentiment analysis: a context entropy method and a mutual information method.

(1) The principle of the context entropy new word discovery method is as follows:

The context entropy of a tuple (word) determines whether its boundary is extended to form a new word.

Take the left context entropy of a word w as an example. The left context entropy LCE(w) is defined as:

LCE(w) = -\frac{1}{N(w)} \sum_{i=1}^{s} C(a_i, w) \ln \frac{C(a_i, w)}{N(w)}

where N(w) is the number of occurrences of the word w in the text corpus and C(a_i, w) is the number of co-occurrences of w and another word a_i in the text corpus. When computing LCE(w), a_i is a single candidate word appearing to the left of w, the new word after expansion is \overline{a_i w}, and s is the number of candidate words a_i, i.e. the number of distinct words that appear to the left of w. A low LCE(w) indicates that the text to the left of w (the preceding context) is fairly uniform, so the left boundary needs to be extended further. Substituting \overline{a_i w} for w in the formula above gives the left context entropy of the new word, LCE(\overline{a_i w}), and the entropy increment is computed as:

\Delta LCE(\overline{a_i w}) = LCE(\overline{a_i w}) - LCE(w)

If this increment is large, the left side of the old word w is unlikely to be a boundary, whereas the left side of the new word \overline{a_i w} is more likely to be one. In that case \overline{a_i w} replaces w as the new candidate word. Similarly, the right context entropy RCE(w) of a word and its entropy increment \Delta RCE(\overline{w b_i}) are computed as:

RCE(w) = -\frac{1}{N(w)} \sum_{i=1}^{s} C(w, b_i) \ln \frac{C(w, b_i)}{N(w)}

\Delta RCE(\overline{w b_i}) = RCE(\overline{w b_i}) - RCE(w)

Here b_i is a candidate word for expansion on the right of w, and the new word after expansion is \overline{w b_i}. N(w) is the number of occurrences of the word w in the text corpus, and C(w, b_i) is the number of co-occurrences of w and another word b_i in the text corpus. When computing RCE(w), s is the number of candidate words b_i, i.e. the number of distinct words that appear to the right of w.
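The left context entropy and its increment can be sketched as follows. This is an illustrative reading of the formulas, under these assumptions: a word is represented as a tuple of segmented tokens, C(a_i, w) counts occurrences of a_i immediately to the left of w, and occurrences of w at the start of a sentence count toward N(w) but contribute no left neighbour. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Word = Tuple[str, ...]  # a candidate word as a tuple of segmented tokens

def _positions(tokens: Sequence[str], word: Word) -> List[int]:
    """Start indices at which `word` occurs in a segmented sentence."""
    n = len(word)
    return [i for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == word]

def left_context_entropy(sentences: List[List[str]], word: Word) -> float:
    """LCE(w) = -(1/N(w)) * sum_i C(a_i, w) * ln(C(a_i, w) / N(w))."""
    left = Counter()  # C(a_i, w) for each left neighbour a_i
    n_w = 0           # N(w)
    for tokens in sentences:
        for i in _positions(tokens, word):
            n_w += 1
            if i > 0:
                left[tokens[i - 1]] += 1
    if n_w == 0:
        return 0.0
    return -sum(c * math.log(c / n_w) for c in left.values()) / n_w

def lce_increment(sentences: List[List[str]], a: str, word: Word) -> float:
    """Delta LCE for the expanded word (a, *word), relative to LCE(word)."""
    return left_context_entropy(sentences, (a,) + word) - left_context_entropy(sentences, word)

toy = [["x", "b", "c"], ["x", "b", "c"], ["y", "b", "c"], ["z", "b", "c"]]
print(left_context_entropy(toy, ("c",)), left_context_entropy(toy, ("b", "c")))
```

In the toy corpus, "c" is always preceded by "b", so its left entropy is zero, while the expanded word ("b", "c") has varied left neighbours; the positive increment suggests that the left boundary belongs before "b". The right-hand quantities RCE and its increment would follow by symmetry.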

(2) The principle of the mutual information new word discovery method is as follows:

Mutual information determines whether a tuple should be kept as a new word.

Denote the preceding (left) word of the word w by a_i and the following (right) word by b_i. The mutual information on the two sides, namely the left mutual information LPMI(\overline{a_i w}) and the right mutual information RPMI(\overline{w b_i}), is defined as:

LPMI(\overline{a_i w}) = \frac{C(a_i, w)}{N(a_i) N(w)}

RPMI(\overline{w b_i}) = \frac{C(w, b_i)}{N(w) N(b_i)}

where N(w), N(a_i) and N(b_i) are the occurrence counts of the words w, a_i and b_i respectively, and C(a_i, w) and C(w, b_i) are the one-sided co-occurrence counts of w with a_i and with b_i respectively.

When the mutual information on one side exceeds a set threshold (tuned to the text corpus), the word is extended outward on that side; extension stops once the mutual information falls below the threshold, and the current word is taken as a new word.
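A minimal sketch of threshold-driven expansion on the left side, using the LPMI ratio defined above. The helper names, the `max_len` cap and the choice to extend with the highest-scoring left neighbour are assumptions made for illustration; the patent only states that extension continues while the mutual information exceeds the threshold.

```python
from typing import List, Sequence, Tuple

Word = Tuple[str, ...]  # a candidate word as a tuple of segmented tokens

def _positions(tokens: Sequence[str], word: Word) -> List[int]:
    n = len(word)
    return [i for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == word]

def _count(sentences: List[List[str]], word: Word) -> int:
    """Number of occurrences of `word` in the corpus."""
    return sum(len(_positions(tokens, word)) for tokens in sentences)

def left_pmi(sentences: List[List[str]], a: str, word: Word) -> float:
    """LPMI = C(a, w) / (N(a) * N(w)), as defined in the text."""
    n_a, n_w = _count(sentences, (a,)), _count(sentences, word)
    if n_a == 0 or n_w == 0:
        return 0.0
    return _count(sentences, (a,) + word) / (n_a * n_w)

def grow_left(sentences: List[List[str]], word: Word, threshold: float, max_len: int = 4) -> Word:
    """Extend `word` leftwards while the best left neighbour's LPMI exceeds the threshold."""
    while len(word) < max_len:  # max_len is an illustrative safety cap
        cands = sorted({t[i - 1] for t in sentences for i in _positions(t, word) if i > 0})
        if not cands:
            break
        best = max(cands, key=lambda a: left_pmi(sentences, a, word))
        if left_pmi(sentences, best, word) <= threshold:
            break
        word = (best,) + word
    return word

toy = [["hunger", "marketing", "works"], ["hunger", "marketing", "again"], ["we", "like", "tea"]]
print(grow_left(toy, ("marketing",), threshold=0.3))
```

Expansion on the right side with RPMI would mirror this function. Because the ratio is not log-scaled or normalised by corpus size, a usable threshold depends strongly on the corpus, which is consistent with the text's remark that it is tuned to the corpus.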

For example, when the new word discovery algorithm is run on top of the segmentation described above, the new words found include personal names not covered by the segmentation dictionary (such as "Tim Cook" in its transliterated variants, or "Liu Zhiwei"), new terms (such as "Tencent Weibo", "Sina Weibo" and "hunger marketing") and idioms (such as "free shipping" and "suffering divine punishment").

Step S13: the graph model building module 103 builds an undirected graph model from the resulting candidate words of the sentiment dictionary.

After the candidate words of the sentiment dictionary have been obtained, the graph model building module 103 counts how often each pair of candidate words occurs together in the sentences of the text corpus and uses this count as the mutual relationship (mutual information value) between the two candidate words. With each candidate word as a node and the mutual relationship (mutual information value) as the edge weight, an undirected graph model G is constructed. In a large text corpus, words that occur together are more likely to carry similar emotion, so the two nodes joined by a heavily weighted edge in the undirected graph tend to have similar emotional tendencies.

The constructed undirected graph model is denoted G=(V, E). G represents the connections between candidate words, where V is the set of candidate words and E is the set of edges. Each node v in G corresponds to a candidate word (v ∈ V), and an edge (v_i, v_j) corresponds to the co-occurrence relation of the two candidate words v_i and v_j ((v_i, v_j) ∈ E). The weight w_ij of the edge (v_i, v_j) is the number of times the two nodes v_i and v_j co-occur in the text corpus.

The co-occurrence relations between the nodes in V are represented by a co-occurrence matrix W (the adjacency matrix of G). W is symmetric. The element w_ij of W is the weight of the edge (v_i, v_j), i.e. the number of times the two nodes v_i and v_j co-occur in the text corpus; the diagonal element w_ii of W is the number of times v_i occurs in the text corpus. This co-occurrence matrix is used in the subsequent step S14 to compute the sentiment dictionary iteratively.

For example, Fig. 5 shows the undirected graph model constructed from a text corpus of three sentences, "I/study/science", "science/very/profound" and "I/like/study", where the slashes mark the result of word segmentation. All words are retained as candidate words of the sentiment dictionary, and each candidate word corresponds to one node; a single line denotes a weight of 1 and a double line a weight of 2. The co-occurrence matrix obtained from the undirected graph model of Fig. 5 is shown in Fig. 6. For instance, the nodes "I" and "study" co-occur twice in the text corpus, so the element w_12 of the co-occurrence matrix W is 2; the diagonal element w_33 of W is the number of times the node "study" occurs in the text corpus, which is 2.
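The three-sentence example can be reproduced with a short sketch. The assumptions here are that each distinct pair of words is counted once per sentence, that the diagonal holds raw occurrence counts, and that nodes are numbered in order of first appearance; the node ordering used in Fig. 6 is not recoverable from the text and may differ.

```python
from itertools import combinations
from typing import List, Tuple

def cooccurrence_matrix(sentences: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    """Build the symmetric co-occurrence matrix W (adjacency matrix of G)."""
    vocab: List[str] = []
    for tokens in sentences:
        for t in tokens:
            if t not in vocab:
                vocab.append(t)  # node order = order of first appearance (an assumption)
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    W = [[0] * n for _ in range(n)]
    for tokens in sentences:
        for t in tokens:
            W[index[t]][index[t]] += 1  # diagonal: occurrences of the word
        for a, b in combinations(sorted(set(tokens)), 2):
            i, j = index[a], index[b]
            W[i][j] += 1                # off-diagonal: sentences containing both words
            W[j][i] += 1
    return vocab, W

corpus = [["I", "study", "science"], ["science", "very", "profound"], ["I", "like", "study"]]
vocab, W = cooccurrence_matrix(corpus)
print(vocab)
for row in W:
    print(row)
```

With this ordering, "I" and "study" are nodes 1 and 2, and their co-occurrence weight is 2, matching the double line described for Fig. 5. For a real corpus a sparse representation would be preferable to a dense list of lists.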

Step S14: the iterative computation module 104 iteratively computes, using the undirected graph model and a label propagation algorithm, the multiple emotion scores of each node in the undirected graph to build the sentiment dictionary.

The iterative computation module 104 chooses a small number of sentiment seed words (seed words) among the nodes of the undirected graph model and assigns them emotion scores. The emotion scores comprise a mood score, a commendatory score, a derogatory score and an absolute score; the mood score in turn comprises a happiness score, an anger score, a sadness score, a fear score and a surprise score. The label propagation algorithm then propagates these emotion scores, under the effect of the edge weights, to all connected nodes, so that each node obtains the corresponding multiple emotion scores. After the iteration converges (the scores are stable), every connected node has been assigned multiple emotion scores. The emotion scores of a node represent the emotional tendency of the candidate word corresponding to that node, and the candidate words together with their multiple emotion scores form the sentiment dictionary.

In this step, the seed words may be definite sentiment words, chosen separately for each language; they may also be emotion marks that are independent of language, such as the smiley ":-)". This ensures that the label propagation algorithm remains valid across languages. The seed words are selected, according to a word set, from the candidate words obtained in steps S11 and S12. In the present embodiment, the word sets comprise a commendatory and derogatory word set and a mood word set. The commendatory and derogatory word set uses the 728 commendatory words and 933 derogatory words collected in the "Student Dictionary of Commendatory and Derogatory Words" compiled by Zhang Wei et al. The mood word set uses the five mood word sets compiled by Ge Xu, Xinfan Meng, Houfeng Wang et al., covering happiness, anger, sadness, fear and surprise, with 91, 112, 89, 103 and 92 seed words respectively (see Xu G, Meng X, Wang H. Build Chinese emotion lexicons using a graph-based algorithm and multiple resources. Proceedings of the 23rd International Conference on Computational Linguistics, Stroudsburg, PA, USA: Association for Computational Linguistics, 2010. 1209-1217.). Although these seed words are standard words, iterative propagation also assigns emotion scores to the other candidate words in the undirected graph (such as new words and icons).

The iterative computation module 104 runs the iteration separately with the seed words of each emotion (commendatory, derogatory, and moods such as happiness, anger and sadness), so that the different emotion scores of each node are computed separately. Using the seed words of a given emotion yields the sentiment dictionary for that emotion; for example, iterating with mood seed words yields a sentiment dictionary formed by the mood scores of the candidate word corresponding to each node.

Using label propagation, the emotion scores of the chosen seed words are propagated to all connected nodes of the undirected graph. The iteration is described by the following formula:

x^{(k+1)} = W \cdot x^{(k)} + b

where x^{(k)} is the vector of node emotion scores after the k-th iteration. By this formula, the result x^{(k+1)} of a new round of iteration is obtained by applying the co-occurrence matrix W and the bias vector b to the vector of the previous round. After each round of iteration the result is normalized, and the iteration eventually converges. In the present invention b is set to the seed vector x^{(0)} in order to strengthen the effect of the seeds. Once the seeds have been selected, the dimensions of x^{(0)} that correspond to seed words have the value 1 and all other dimensions have the value 0.
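The iteration x^{(k+1)} = W·x^{(k)} + b with b = x^{(0)} can be sketched in plain Python as follows. The text says only that each round is normalized; the L1 normalization, the stopping tolerance and the iteration cap used here are assumptions, and the names `propagate` and `build_lexicon` are hypothetical.

```python
from typing import Dict, Iterable, List, Set

def propagate(W: List[List[float]], seeds: Set[int],
              max_iter: int = 100, tol: float = 1e-9) -> List[float]:
    """Iterate x <- normalize(W x + b), with b = x0 (1 for seed nodes, 0 elsewhere)."""
    n = len(W)
    b = [1.0 if i in seeds else 0.0 for i in range(n)]
    x = b[:]
    for _ in range(max_iter):
        y = [sum(W[i][j] * x[j] for j in range(n)) + b[i] for i in range(n)]
        norm = sum(abs(v) for v in y) or 1.0  # L1 normalization (an assumption)
        y = [v / norm for v in y]
        if max(abs(y[i] - x[i]) for i in range(n)) < tol:
            return y                          # scores are stable
        x = y
    return x

def build_lexicon(vocab: List[str], W: List[List[float]],
                  seed_sets: Dict[str, Iterable[str]]) -> Dict[str, Dict[str, float]]:
    """Run one propagation per emotion and collect the scores per candidate word."""
    index = {w: i for i, w in enumerate(vocab)}
    lexicon: Dict[str, Dict[str, float]] = {w: {} for w in vocab}
    for emotion, seeds in seed_sets.items():
        scores = propagate(W, {index[s] for s in seeds if s in index})
        for w, s in zip(vocab, scores):
            lexicon[w][emotion] = s
    return lexicon

toy_W = [[2, 1, 0], [1, 1, 0], [0, 0, 1]]  # "good" and "nice" co-occur; "bad" is isolated
toy_lexicon = build_lexicon(["good", "nice", "bad"], toy_W, {"commendatory": ["good"]})
print(toy_lexicon["nice"], toy_lexicon["bad"])
```

In the toy graph the unlabeled word "nice" receives a positive commendatory score through its edge to the seed "good", while the disconnected word "bad" stays at zero. Passing several seed sets (for example one per mood) yields one score per emotion for every candidate word, which corresponds to the multi-emotion dictionary described in step S14.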

As shown in Figure 7,728 commendatory terms that adopt that < < student that above-mentioned Zhang Wei etc. writes passes judgement on that adopted dictionary > > arranges and 933 derogatory terms are as the word set of seed word, carry out successively iteration and propagate and to calculate the adopted emotion score of passing judgement on of each node (candidate word), some of them word pass judgement on adopted emotion score example as shown in Figure 7; As shown in Figure 8, adopt five kinds of mood word sets of the arrangements such as above-mentioned Ge Xu, Xinfan Meng, Houfeng Wang as the word set of seed word, calculate five kinds of mood degree scores of each node, these scores embody the present invention in the dirigibility of the multiple emotion degree of identification.

The multiple sentiment dictionary construction method in internet of the present invention and system, for the deficiency of sentiment dictionary in the existing sentiment analysis algorithm of internet text, propose to build the method for the identification multiple emotion of internet text sentiment dictionary used.The cooccurrence relation of the elementary cell (as word, symbol etc.) that the method is utilized some text representation emotions in internet text, in conjunction with the method for new word discovery, constructs sentiment dictionary automatically by iteration circulation way.Compare with classic method, the present invention utilizes emotion mark, network neologisms, emotion icons, the mistake of more distinctive informal texts in internet text to write the structure dictionaries such as word, is not limited to traditional emotion word in single language or field.Adopt different seed words can construct the different moods sentiment dictionary of (as happy, angry, grieved, frightened, surprised etc.), make the result of emotion recognition abundanter, and then identify the emotion that whole section of text representation goes out.And experimental result shows, the present invention also has required language material scale without excessive, is not subject to the advantages such as time restriction, so this invention has suitable application area, language is wide, the variation of identification affective style.

Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention may be modified or equivalently replaced without departing from its spirit and scope.

Claims

1. A method for constructing a multi-emotion dictionary for the internet, characterized in that the method comprises:

an obtaining step of obtaining an internet text corpus from the internet;

a data preprocessing step of preprocessing the obtained text corpus to obtain candidate words for the emotion dictionary;

a new-word extraction step of extracting new words from the obtained text corpus to obtain candidate words for the emotion dictionary;

a graph model construction step of building an undirected graph model from the obtained candidate words of the emotion dictionary;

an iterative computation step of using the undirected graph model and a label propagation algorithm to iteratively compute multiple emotion scores for each node in the undirected graph, so as to construct the emotion dictionary.

2. The method for constructing a multi-emotion dictionary for the internet as claimed in claim 1, characterized in that the data preprocessing step comprises:

a removal step of removing special words from the text corpus;

a word segmentation and extraction step of segmenting the text corpus into words, generating n-tuples from the segmentation result, and extracting from the text corpus three classes of tuple sets, namely 1-tuples, 2-tuples and 3-tuples, where n<4;

a filtering step of removing, in each of the three classes of tuple sets, the high-frequency tuples whose occurrence counts rank within a preset number of top positions and the low-frequency tuples whose occurrence counts in the text corpus are below a preset number, and taking the remaining mid-frequency tuples as part of the candidate words of the emotion dictionary.
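
By way of illustration only, the tuple extraction and frequency filtering of claim 2 might be sketched as follows. The function names and the thresholds `top_k` and `min_count` are hypothetical stand-ins for the "preset number of top positions" and the "preset number" of occurrences; the claim does not fix their values, and the corpus is assumed to be already segmented into word lists.

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Return all contiguous n-tuples from one segmented sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mid_frequency_candidates(segmented_sentences, top_k=1, min_count=2):
    """Keep mid-frequency 1-, 2- and 3-tuples as candidate words.

    top_k and min_count are assumed thresholds, not values from the patent.
    """
    candidates = {}
    for n in (1, 2, 3):  # n < 4
        counts = Counter()
        for tokens in segmented_sentences:
            counts.update(extract_ngrams(tokens, n))
        # The top_k most frequent tuples are treated as high-frequency and dropped.
        high_freq = {gram for gram, _ in counts.most_common(top_k)}
        # Tuples seen fewer than min_count times are treated as low-frequency.
        candidates[n] = {
            gram: count
            for gram, count in counts.items()
            if gram not in high_freq and count >= min_count
        }
    return candidates
```

In this sketch each tuple class is filtered separately, mirroring the wording "in each of the three classes of tuple sets".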

3. The method for constructing a multi-emotion dictionary for the internet as claimed in claim 1, characterized in that, in the new-word extraction step, the methods for extracting new words from the obtained text corpus comprise a context-entropy new-word discovery method and a mutual-information new-word discovery method.
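
By way of illustration only, the two statistics named in claim 3 might be computed as follows. The claim does not give formulas, so the sketch assumes the textbook definitions: the Shannon entropy of the tokens that follow a candidate, and the pointwise mutual information of two adjacent tokens. The function names and the use of base-2 logarithms are assumptions.

```python
import math
from collections import Counter


def right_context_entropy(tokens, candidate):
    """Entropy of the tokens immediately following each occurrence of candidate.

    A candidate followed by many different tokens has high entropy, which is
    commonly taken as evidence that it forms a complete word.
    """
    n = len(candidate)
    neighbors = Counter(
        tokens[i + n]
        for i in range(len(tokens) - n)
        if tuple(tokens[i:i + n]) == candidate
    )
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


def pointwise_mutual_information(tokens, x, y):
    """PMI of the adjacent pair (x, y); high values suggest a cohesive unit."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if bigrams[(x, y)] == 0:
        return float("-inf")
    p_xy = bigrams[(x, y)] / (len(tokens) - 1)
    p_x = unigrams[x] / len(tokens)
    p_y = unigrams[y] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))
```

A left-context entropy can be obtained in the same way by applying `right_context_entropy` to the reversed token list and the reversed candidate.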

4. The method for constructing a multi-emotion dictionary for the internet as claimed in claim 3, characterized in that the graph model construction step comprises:

a calculation step of calculating the number of times the candidate words of the emotion dictionary occur together in the sentences of the text corpus, as the relationship between any two candidate words;

an undirected graph construction step of building the undirected graph model with each candidate word as a node and the relationship as the edge weight.

5. The method for constructing a multi-emotion dictionary for the internet as claimed in claim 4, characterized in that, in the undirected graph construction step, the constructed undirected graph model is represented by the matrix $G=(V,E)$, which represents the connections between candidate words, where $V$ denotes the set of candidate words and $E$ denotes the set of edges;

each node $v$ in $G$ corresponds to one candidate word, where $v \in V$, and an edge $(v_i, v_j)$ corresponds to the co-occurrence relation between the two candidate words $v_i$ and $v_j$, where $(v_i, v_j) \in E$;

the co-occurrence relations between the nodes in $V$ are represented by a co-occurrence matrix $W$, which is the adjacency matrix of $G$ and is symmetric; the element $w_{ij}$ of $W$ represents the weight of the edge $(v_i, v_j)$, namely the number of times the two nodes $v_i$ and $v_j$ co-occur in the text corpus, and the diagonal element $w_{ii}$ of $W$ corresponds to the number of times $v_i$ occurs in the text corpus.
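
By way of illustration only, the co-occurrence matrix $W$ of claim 5 might be built as follows. The sketch assumes that the corpus is a list of segmented sentences, that a pair of distinct candidate words counts once per sentence in which both appear, and that the diagonal counts every occurrence of a word; the function name and these counting conventions are assumptions, not details taken from the claims.

```python
from itertools import combinations


def build_cooccurrence_matrix(sentences, vocabulary):
    """Return the symmetric co-occurrence matrix W as a list of lists.

    W[i][j] (i != j) counts sentences containing both word i and word j;
    W[i][i] counts all occurrences of word i in the corpus.
    """
    index = {word: i for i, word in enumerate(vocabulary)}
    size = len(vocabulary)
    W = [[0] * size for _ in range(size)]
    for sentence in sentences:
        ids = [index[w] for w in sentence if w in index]
        for i in ids:
            W[i][i] += 1  # diagonal: occurrences of the word itself
        for i, j in combinations(sorted(set(ids)), 2):
            W[i][j] += 1  # off-diagonal: sentence-level co-occurrence
            W[j][i] += 1  # keep W symmetric, since the graph is undirected
    return W
```

For a large vocabulary a sparse representation would be preferable; a dense list of lists is used here only to keep the sketch short.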

6. The method for constructing a multi-emotion dictionary for the internet as claimed in claim 5, characterized in that the iterative computation step comprises:

a selection step of selecting seed words among the nodes of the undirected graph model and assigning them emotion scores;

a propagation step of propagating, by the label propagation algorithm and under the effect of the edge weights, the emotion scores from the selected seed words to all connected nodes in the undirected graph, so that each node obtains corresponding multiple emotion scores;

an emotion dictionary construction step in which, after the iteration converges, each connected node has been assigned multiple emotion scores, the emotion scores of each node represent the emotion tendency of the candidate word corresponding to that node, and the candidate words corresponding to these nodes together with their multiple emotion scores constitute the emotion dictionary.

7. The method for constructing a multi-emotion dictionary for the internet as claimed in claim 2, characterized in that the high-frequency tuples are stop words, which have a high chance of co-occurring with all kinds of words, and the low-frequency tuples are non-words and user names.

8. The method for constructing a multi-emotion dictionary for the internet as claimed in claim 1, characterized in that the emotion scores comprise a mood score, a commendatory score, a derogatory score and an absolute score,

the mood score comprising a happiness score, an anger score, a sadness score, a fear score and a surprise score.

9. The method for constructing a multi-emotion dictionary for the internet as claimed in claim 6, characterized in that the seed words are selected from the obtained candidate words of the emotion dictionary according to a word set, and comprise words with definite emotion and language-independent emotion symbols.

10. The method for constructing a multi-emotion dictionary for the internet as claimed in claim 6, characterized in that the iterative process in the propagation step is described by the following formula:

$$x^{(k+1)} = W \cdot x^{(k)} + b$$

where $x^{(k)}$ denotes the emotion score vector of the nodes after the $k$-th iteration; according to this formula, the result $x^{(k+1)}$ of a new round of iteration is obtained by applying the co-occurrence matrix $W$ and the bias vector $b$ to the vector of the previous round; after each round of iterative computation the result is normalized, and the iterative process eventually converges.

11. The method for constructing a multi-emotion dictionary for the internet as claimed in claim 10, characterized in that $b$ is taken to be the seed vector $x^{(0)}$ so as to strengthen the effect of the seeds; after the seeds are selected, the dimensions of $x^{(0)}$ corresponding to seed words have the value 1 and the other dimensions have the value 0.
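
By way of illustration only, the iteration of claims 10 and 11 might be sketched as follows for a single emotion. The claims state that the result is normalized after each round but not how; division by the largest absolute component is assumed here. The function name `propagate`, the iteration cap `max_iter` and the tolerance `tol` are likewise hypothetical.

```python
def propagate(W, seed_indices, max_iter=100, tol=1e-9):
    """Iterate x <- normalize(W.x + b) with b = x0, the seed indicator vector."""
    n = len(W)
    # b = x0: 1 in the dimensions of seed words, 0 elsewhere.
    b = [1.0 if i in seed_indices else 0.0 for i in range(n)]
    x = b[:]
    for _ in range(max_iter):
        new_x = [sum(W[i][j] * x[j] for j in range(n)) + b[i] for i in range(n)]
        # Assumed normalization: scale so the largest component equals 1.
        scale = max(abs(v) for v in new_x)
        if scale > 0:
            new_x = [v / scale for v in new_x]
        # Stop once successive rounds no longer change the scores.
        if max(abs(a - c) for a, c in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    return x
```

Running the function once per seed word set (for example one set per mood) gives each node one score per emotion, and nodes not connected to any seed keep a score of 0.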

12. A system for constructing a multi-emotion dictionary for the internet, characterized in that the system comprises:

an acquisition module for obtaining an internet text corpus from the internet;

a data preprocessing module for preprocessing the obtained text corpus to obtain candidate words for the emotion dictionary;

a new-word extraction module for extracting new words from the obtained text corpus to obtain candidate words for the emotion dictionary;

a graph model construction module for building an undirected graph model from the obtained candidate words of the emotion dictionary;

an iterative computation module for using the undirected graph model and a label propagation algorithm to iteratively compute multiple emotion scores for each node in the undirected graph, so as to construct the emotion dictionary.

13. The system for constructing a multi-emotion dictionary for the internet as claimed in claim 12, characterized in that the processing performed by the data preprocessing module comprises:

removing special words from the text corpus;

segmenting the text corpus into words, generating n-tuples from the segmentation result, and extracting from the text corpus three classes of tuple sets, namely 1-tuples, 2-tuples and 3-tuples, where n<4;

removing, in each of the three classes of tuple sets, the high-frequency tuples whose occurrence counts rank within a preset number of top positions and the low-frequency tuples whose occurrence counts in the text corpus are below a preset number, and taking the remaining mid-frequency tuples as part of the candidate words of the emotion dictionary.

14. The system for constructing a multi-emotion dictionary for the internet as claimed in claim 12, characterized in that, in the new-word extraction module, the methods for extracting new words from the obtained text corpus comprise a context-entropy new-word discovery method and a mutual-information new-word discovery method.

15. The system for constructing a multi-emotion dictionary for the internet as claimed in claim 14, characterized in that the construction performed by the graph model construction module comprises:

calculating the number of times the candidate words of the emotion dictionary occur together in the sentences of the text corpus, as the relationship between any two candidate words;

building the undirected graph model with each candidate word as a node and the relationship as the edge weight.

16. The system for constructing a multi-emotion dictionary for the internet as claimed in claim 15, characterized in that, in constructing the undirected graph model, the constructed undirected graph model is represented by the matrix $G=(V,E)$, which represents the connections between candidate words, where $V$ denotes the set of candidate words and $E$ denotes the set of edges;

17. The system for constructing a multi-emotion dictionary for the internet as claimed in claim 16, characterized in that the computation performed by the iterative computation module comprises:

selecting seed words among the nodes of the undirected graph model and assigning them emotion scores;

propagating, by the label propagation algorithm and under the effect of the edge weights, the emotion scores from the selected seed words to all connected nodes in the undirected graph, so that each node obtains corresponding multiple emotion scores;

after the iteration converges, each connected node has been assigned multiple emotion scores, the emotion scores of each node represent the emotion tendency of the candidate word corresponding to that node, and the candidate words corresponding to these nodes together with their multiple emotion scores constitute the emotion dictionary.

18. The system for constructing a multi-emotion dictionary for the internet as claimed in claim 13, characterized in that the high-frequency tuples are stop words, which have a high chance of co-occurring with all kinds of words, and the low-frequency tuples are non-words and user names.

19. The system for constructing a multi-emotion dictionary for the internet as claimed in claim 12, characterized in that the emotion scores comprise a mood score, a commendatory score, a derogatory score and an absolute score, the mood score comprising a happiness score, an anger score, a sadness score, a fear score and a surprise score.

20. The system for constructing a multi-emotion dictionary for the internet as claimed in claim 17, characterized in that the seed words are selected from the obtained candidate words of the emotion dictionary according to a word set, and comprise words with definite emotion and language-independent emotion symbols.

21. The system for constructing a multi-emotion dictionary for the internet as claimed in claim 17, characterized in that the iterative process of the label propagation algorithm is described by the following formula:

$$x^{(k+1)} = W \cdot x^{(k)} + b$$

22. The system for constructing a multi-emotion dictionary for the internet as claimed in claim 21, characterized in that $b$ is taken to be the seed vector $x^{(0)}$ so as to strengthen the effect of the seeds; after the seeds are selected, the dimensions of $x^{(0)}$ corresponding to seed words have the value 1 and the other dimensions have the value 0.