CN102880600B

CN102880600B - Method for predicting the semantic orientation of words based on a world knowledge network

Info

Publication number: CN102880600B
Application number: CN201210316850.3A
Authority: CN
Inventors: 刘瑞; 安翼; 陈君龙; 宋浪
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2012-08-30
Filing date: 2012-08-30
Publication date: 2015-10-28
Anticipated expiration: 2032-08-30
Also published as: CN102880600A

Abstract

The invention discloses a method for predicting the semantic orientation of words based on a world knowledge network, comprising: (1) determining whether an unknown word is present in an emotion word dictionary and, if so, returning the polarity of the unknown word, otherwise proceeding to step (2); (2) choosing a commendatory benchmark word set and a derogatory benchmark word set; (3) calculating the tightness degree between the unknown word and the commendatory benchmark word set; (4) calculating the tightness degree between the unknown word and the derogatory benchmark word set; (5) calculating the difference between the two tightness degrees; (6) according to the difference from step (5), choosing a threshold interval and judging the polarity of the unknown word. The invention considers the relevance between words in addition to their semantic similarity and judges polarity with a region threshold, which avoids wrongly assigning an emotional orientation to words and improves the accuracy of semantic orientation judgment.

Description

Method for predicting the semantic orientation of words based on a world knowledge network

Technical field

The present invention relates to a method for predicting the semantic orientation of words, in particular to such a method based on a world knowledge network, and belongs to the technical field of computer information and data processing.

Background art

The rapid development and wide adoption of the Internet have changed people's way of life to a great extent. People no longer merely receive information passively; they can also interact with the outside world. The Internet is gradually becoming an interactive medium, and people can post comments on all kinds of things through online media such as BBS and blogs. According to the "Statistical Report on Internet Development in China" issued by the China Internet Network Information Center in July 2010, the usage rates of blog applications and forums/BBS rank among the highest of all network applications. The rapid growth of this opinion information provides researchers with a wide range of applications and research subjects, and has attracted broad attention from both industry and academia.

These subjective online comments contain a large amount of information with emotional orientation, which is valuable both to ordinary network users and to manufacturers and other organizations. Words are the most basic elements of all sentences and texts, and the emotional orientation (polarity) of a word is a good indicator of the semantic orientation of a sentence or even a whole text. Determining the emotional orientation or polarity of words therefore plays a core role in analyzing the semantic orientation of subjective comments and is the basis of sentiment classification.

Psychological research has found that the relation between words and human emotion is measurable. For sentiment classification, words and phrases are the most important and most basic features. Human language contains a class of words that people use directly to express their own feelings: liking or loathing, agreeing or opposing, praising or belittling. Especially when evaluating the quality of something, people often use such words to state their views clearly. A word carrying an opinion or emotional orientation is called an emotion word (Sentiment Word). The polarity of an emotion word is usually divided into three classes: positive, negative and neutral. Because the emotional characteristics of neutral words are not obvious and contribute little to distinguishing text polarity, most studies analyze only the two classes with obvious polarity, commendatory and derogatory.

Some researchers have proposed setting the polarity value of an emotion word to a continuous real value between -1 and 1, in order to express the differences in polarity between emotion words in finer detail. In real life, however, people do not agree on how commendatory or derogatory each emotion word is, so no authoritative quantified polarity values can be given. Most researchers therefore treat the polarity value of an emotion word as a discrete value. This also simplifies the calculation and processing of word polarity, improves computational efficiency and, to some extent, makes the polarity of emotion words more distinct.

There are two main approaches to judging the polarity of the words in these subjective online comments. The first is based on corpus statistics: the similarity of words is derived by analyzing the distribution of words in a large-scale corpus. Turney is representative of this approach; he calculates word similarity purely from word co-occurrence statistics. The approach starts from the assumption that emotion words of the same polarity tend to occur together, and many experiments have confirmed the validity of this assumption. However, it needs a large amount of text as a training set and has high computational complexity. The second approach is based on dictionaries, such as the English dictionary WordNet and the Chinese dictionary HowNet. Methods of this kind usually study the semantic structure of the dictionary to find the semantic relations between words and calculate a semantic "distance". This semantic "distance" is usually taken as the similarity between words and used as a means of predicting word orientation.

When classifying the sentiment of text, both kinds of methods depend on a polarity dictionary, so the quality of the polarity dictionary directly affects the correctness of the orientation judgment. At present, polarity dictionaries are built manually, which is labor-intensive and leaves them incomplete. Because a polarity dictionary has limited coverage and is difficult to update in time, it is suitable only for analyzing the emotional orientation of standard, commonly used words that it already contains. It cannot handle newly emerging words, special words or new senses, and so cannot keep up with the rapid change of information and the widespread demand for word analysis.

Chinese invention patent application No. 201010229011.9 discloses a method for analyzing the emotional orientation of subjective text, comprising the following steps: building in advance an extensible polarity dictionary with quantified orientation degrees; preprocessing the text to be analyzed; labeling the semantic roles of the preprocessed text with a semantic role labeling tool; resolving referring expressions such as pronouns to their object entities by coreference resolution; building a domain feature library; using the polarity dictionary and the feature library to identify emotion words and feature words respectively, calculating the emotional orientation value of each feature, then aggregating the emotional orientation values of the related features, and finally obtaining the overall emotional orientation value of each feature.

Summary of the invention

In view of the deficiencies of the prior art, the technical problem to be solved by the present invention is to provide a method for predicting the semantic orientation of words based on a world knowledge network. The method can effectively improve the accuracy of semantic orientation analysis.

To achieve the above object, the present invention adopts the following technical solution:

A method for predicting the semantic orientation of words based on a world knowledge network, characterized by comprising the following steps:

(1) determining whether an unknown word is present in an emotion word dictionary; if it is, returning the polarity of the unknown word; if it is not, proceeding to step (2);

(2) choosing a commendatory benchmark word set and a derogatory benchmark word set;

(3) calculating the tightness degree between the unknown word and the commendatory benchmark word set;

(4) calculating the tightness degree between the unknown word and the derogatory benchmark word set;

(5) calculating the difference between the tightness degree of the unknown word with the commendatory benchmark word set and its tightness degree with the derogatory benchmark word set;

(6) according to the difference from step (5), choosing a threshold interval and judging the polarity of the unknown word.

Preferably, the emotion word dictionary is obtained by traversing the sememe relations of the world knowledge network.

Preferably, the commendatory benchmark word set is the group of commendatory terms with the highest word frequency in the emotion word dictionary;

the derogatory benchmark word set is the group of derogatory terms with the highest word frequency in the emotion word dictionary.

Preferably, the tightness degree between the unknown word and a word p in the commendatory or derogatory benchmark word set is denoted com(p, word) and is calculated by the following formula:

com(p, word) = sim(p, word) + rel(p, word)

where word denotes the unknown word, p denotes a commendatory or derogatory benchmark word, P_set denotes the commendatory or derogatory benchmark word set, p ∈ P_set, sim(p, word) denotes the semantic similarity between p and the unknown word word, and rel(p, word) denotes the word relevance between p and the unknown word word.

Preferably, the word relevance rel(p, word) between a commendatory or derogatory benchmark word p and the unknown word word is calculated by the following formula:

rel(p, word) = \frac{|conRel(p) \cap conRel(word)|}{|conRel(p) \cup conRel(word)|}

where |conRel(p) ∩ conRel(word)| is the number of elements in the intersection of the related fields of the benchmark word p and the unknown word word, and |conRel(p) ∪ conRel(word)| is the number of elements in the union of the two related fields.
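The relevance formula above is a ratio of intersection size to union size (a Jaccard coefficient). A minimal Python sketch follows. It assumes that the related field conRel of each word is already available as a set of words; the function name `rel` simply mirrors the formula, and the sample fields are made-up data rather than HowNet output. Returning 0 for two empty fields is also an assumption, since the text does not cover that case.

```python
from typing import Set


def rel(con_rel_p: Set[str], con_rel_word: Set[str]) -> float:
    """Size of the intersection of two related fields divided by the size of their union."""
    union = con_rel_p | con_rel_word
    if not union:
        # Assumption: the source does not define this case; return 0 to avoid dividing by zero.
        return 0.0
    return len(con_rel_p & con_rel_word) / len(union)


# Made-up related fields for a benchmark word p and an unknown word.
field_p = {"wise", "intelligent", "sensible", "bright"}
field_word = {"wise", "sensible", "shrewd"}

score = rel(field_p, field_word)  # 2 shared words out of 5 distinct words
print(score)
```

Because the value is a ratio of set sizes, it always lies between 0 and 1, the same range the description later gives for semantic similarity values.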

Preferably, the difference between the tightness degree of the unknown word with the commendatory benchmark word set and its tightness degree with the derogatory benchmark word set is denoted senti(word) and is calculated by the following formula:

senti(word) = \sum_{p \in P_{set}} com(p, word) - \sum_{n \in N_{set}} com(n, word)

where word denotes the unknown word; p denotes a commendatory benchmark word, P_set denotes the commendatory benchmark word set, p ∈ P_set, com(p, word) denotes the tightness degree between the unknown word word and a word p in P_set, and \sum_{p \in P_{set}} com(p, word) denotes the sum of the tightness degrees between word and all words in P_set; n denotes a derogatory benchmark word, N_set denotes the derogatory benchmark word set, n ∈ N_set, com(n, word) denotes the tightness degree between the unknown word word and a word n in N_set, and \sum_{n \in N_{set}} com(n, word) denotes the sum of the tightness degrees between word and all words in N_set.
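The difference senti(word) can be sketched as two sums over the benchmark sets. The Python below assumes that a tightness function `com` is supplied by the caller; the lookup table standing in for it, and every word and score in it, are illustrative only.

```python
from typing import Callable, Iterable


def senti(
    word: str,
    p_set: Iterable[str],
    n_set: Iterable[str],
    com: Callable[[str, str], float],
) -> float:
    """Total tightness with the commendatory set minus total tightness with the derogatory set."""
    positive_total = sum(com(p, word) for p in p_set)
    negative_total = sum(com(n, word) for n in n_set)
    return positive_total - negative_total


# Illustrative tightness values com(benchmark, word); not real HowNet scores.
toy_com = {
    ("good", "splendid"): 1.2,
    ("happy", "splendid"): 0.9,
    ("bad", "splendid"): 0.3,
    ("ugly", "splendid"): 0.1,
}


def com(benchmark: str, word: str) -> float:
    """Stand-in for sim + rel: look the pair up in the toy table, defaulting to 0."""
    return toy_com.get((benchmark, word), 0.0)


value = senti("splendid", ["good", "happy"], ["bad", "ugly"], com)
print(round(value, 3))
```

A positive result means the word is closer overall to the commendatory benchmarks, a negative result that it is closer to the derogatory ones.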

Preferably, the step of choosing a suitable threshold interval and judging the polarity of the unknown word judges the word polarity by the following rule:

Polarity(word) = \begin{cases} 1, & senti(word) > b \\ 0, & a \leq senti(word) \leq b \\ -1, & senti(word) < a \end{cases}

where word denotes the unknown word, senti(word) denotes the difference between the tightness degree of the unknown word word with the commendatory benchmark word set and its tightness degree with the derogatory benchmark word set, a denotes the first threshold, and b denotes the second threshold;

if the polarity value obtained for the unknown word is 1, the unknown word is a commendatory term;

if the polarity value obtained for the unknown word is 0, the unknown word is a neutral word;

if the polarity value obtained for the unknown word is -1, the unknown word is a derogatory term.

Preferably, the first threshold and the second threshold are determined from the value of the single-point threshold in the optimal case by the following formula:

[a, b] = [δ - 0.5, δ + 0.5]

where a denotes the first threshold, b denotes the second threshold, and δ denotes the value of the single-point threshold in the optimal case.
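The region-threshold rule can be illustrated with a few lines of Python. This is only a sketch: the function names are hypothetical, and δ = 0 is an arbitrary example value, since the text does not state how the optimal single-point threshold is obtained.

```python
from typing import Tuple


def region_threshold(delta: float, half_width: float = 0.5) -> Tuple[float, float]:
    """Return the interval [a, b], which is [delta - 0.5, delta + 0.5] by default."""
    return delta - half_width, delta + half_width


def polarity(senti_value: float, a: float, b: float) -> int:
    """Map a senti value to 1 (commendatory), 0 (neutral) or -1 (derogatory)."""
    if senti_value > b:
        return 1
    if senti_value < a:
        return -1
    return 0  # a <= senti_value <= b


a, b = region_threshold(delta=0.0)  # delta = 0.0 is an arbitrary example value
labels = [polarity(s, a, b) for s in (1.0, 0.5, 0.0, -0.5, -0.8)]
print(labels)  # [1, 0, 0, 0, -1]
```

Values exactly on either boundary are classed as neutral, matching the non-strict inequalities in the middle case of the formula.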

The method provided by the present invention considers the relevance between words in addition to their semantic similarity and judges polarity with a region threshold, which avoids wrongly assigning an emotional orientation to words and improves the accuracy of semantic orientation judgment.

Brief description of the drawings

Fig. 1 is a schematic example of the tree-shaped semantic hierarchy of HowNet;

Fig. 2 is an example of the "good" and "bad" categories in HowNet;

Fig. 3 is a schematic flow chart of the word orientation prediction method;

Fig. 4 is a schematic diagram of the word orientation prediction results when a single-point threshold is used;

Fig. 5 is a schematic comparison of word orientation prediction results;

Fig. 6 is a schematic comparison of word orientation prediction using a single-point threshold and a region threshold.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the drawings and specific embodiments.

In the method provided by the present invention, it is first determined whether the unknown word is present in the emotion word dictionary. If it is, its polarity is returned; if not, its polarity is judged by calculating the similarity and related-field information between the unknown word and a set of benchmark seed emotion words. Specifically, a commendatory benchmark word set and a derogatory benchmark word set containing the same number of benchmark words are chosen; the tightness degree between the unknown word and the commendatory benchmark word set is calculated; the tightness degree between the unknown word and the derogatory benchmark word set is calculated; the difference between the two tightness degrees is calculated; and, according to the difference, a suitable threshold interval is chosen and the polarity of the unknown word is judged. In the present invention, the emotion word dictionary is obtained by traversing the sememe relations of words in the world knowledge network, and the inventors therefore call the method a word semantic orientation prediction method based on a world knowledge network. It is described in detail below.

First, how the emotion word dictionary is obtained by traversing the sememe relations of the world knowledge network is introduced. The most common way to identify emotion words is to look them up in an emotion word dictionary. Building such a dictionary means collecting a basic set of emotion words, so that a computer can query it to determine whether a word is a polarity word and obtain its polarity value. To explain the traversal of sememe relations more clearly, HowNet is used below as an example of a world knowledge network, and the word semantic orientation prediction method is described on the basis of HowNet.

Semantic analysis is an important research area of natural language processing, and a semantic dictionary that can express conceptual relations is an indispensable basic resource for it. HowNet is a knowledge base that takes the concepts represented by Chinese and English words as its objects of description and whose basic content is the relations between concepts and between the attributes of concepts. It is a net-like knowledge system containing rich lexical semantic knowledge and world knowledge.

HowNet has two main notions: "concept" and "sememe". A "concept" is a description of the meaning of a word, and each word can be expressed as several concepts. Concepts are described in a "knowledge representation language", and the vocabulary items of this language are called "sememes". A "sememe" is the smallest unit used to describe a "concept". HowNet defines 1617 sememes in total.

A sememe is, on the one hand, the most basic unit for describing concepts; on the other hand, complex relations exist between sememes. HowNet describes eight kinds of relations between sememes, including the hypernym-hyponym, synonymy, antonymy and converse relations. As shown in Fig. 1, the most important of these is the hypernym-hyponym relation; according to it, all "basic sememes" form a sememe hierarchy.

HowNet describes the semantics of words through a knowledge description language. Among the entries collected in HowNet, some are labeled "good" or "bad", where "good" denotes a commendatory sense and "bad" a derogatory sense. Table 1 lists two adjectives, two nouns and two verbs collected in HowNet under each label.

Table 1: Examples of HowNet entries labeled "good" and "bad"

In addition, some sememe categories are themselves labeled "good" or "bad" while the entries they contain are not, as shown in Fig. 2. For example, "FeelingByGood" (good feeling) is labeled "good", but the entries it contains carry no label.

In the present invention, polarity words are taken to be all vocabulary under sememe categories labeled "desired|良" (good) or "undesired|莠" (bad), together with all concept vocabulary whose sememe definition contains the attribute "desired|良" or "undesired|莠". By traversing HowNet, the present invention obtains 16624 qualifying entries in total, of which 8119 are commendatory and 8505 are derogatory. Of these entries, 77.6% are adjectives and 20.8% are nouns. The remainder are a small number of verbs and adverbs, and words of other parts of speech rarely serve as commendatory or derogatory words.

Because one word can correspond to several concept definitions, the same word may show different polarities under different concepts. In actual use, however, only the word itself and its part of speech can be obtained from the text; which HowNet concept definition the word corresponds to in context is unknown. The polarity dictionary built by the present invention is therefore word-based and contains 6566 entries in total, of which 3208 are commendatory terms and 3358 are derogatory terms. Each entry records not only the emotion word itself but also its part of speech and polarity value. In practice, the polarity of a word is preferentially determined by querying the emotion word dictionary, and the word is given the polarity value recorded there.
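A lookup against such a dictionary might be sketched as follows. Keying entries by (word, part of speech) is an assumption made for illustration; the text only states that the word, its part of speech and its polarity value are recorded. The entries shown are invented examples, not taken from the actual dictionary.

```python
from typing import Dict, Optional, Tuple

# (word, part of speech) -> polarity value: 1 commendatory, -1 derogatory.
SentimentDict = Dict[Tuple[str, str], int]

sentiment_dict: SentimentDict = {
    ("excellent", "ADJ"): 1,
    ("hero", "N"): 1,
    ("corrupt", "ADJ"): -1,
    ("waste", "V"): -1,
}


def lookup_polarity(dictionary: SentimentDict, word: str, pos: str) -> Optional[int]:
    """Return the recorded polarity, or None when the word is not in the dictionary."""
    return dictionary.get((word, pos))


known = lookup_polarity(sentiment_dict, "corrupt", "ADJ")     # found in the dictionary
unknown = lookup_polarity(sentiment_dict, "splendid", "ADJ")  # not found: needs prediction
print(known, unknown)
```

A result of None marks the word as a candidate for the prediction procedure described next.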

The following describes how the word semantic orientation prediction method is implemented on the basis of the emotion word dictionary derived from the world knowledge network. The emotion words included in any emotion word dictionary are limited, so in practical applications many words are encountered that are not in the dictionary but may themselves carry emotional orientation. The inventors call these potential emotion words (Latent Sentiment Bearing Words). Drawing on the semantic similarity calculation provided by HowNet, the present invention predicts the orientation of potential emotion words that are not in the emotion word dictionary. Using HowNet-based semantic similarity and related fields, it calculates the semantic similarity and relevance between a potential emotion word and a group of commendatory and derogatory benchmark words and judges the orientation of the potential emotion word from them.

Semantic similarity calculation is important basic work in fields such as natural language processing, information retrieval and information extraction; its purpose is to measure the degree of similarity between words. The more similar two words are, the shorter their concept distance, so a simple correspondence can be established between the two. Usually, the similarity is defined as a real number between 0 and 1; the larger the value, the higher the similarity.

In the paper "Word semantic similarity calculation based on HowNet", presented at the 3rd Chinese Lexical Semantics Workshop in 2002, Liu Qun and Li Sujian of the Chinese Academy of Sciences proposed a HowNet-based method for calculating semantic similarity that takes full account of the distance between words in the HowNet hierarchy, whole-part relations and the relations of feature structures. The HowNet-based semantic similarity between words word_1 and word_2 is written sim(word_1, word_2).

Unlike semantic similarity, word relevance reflects how closely two words are related to each other, that is, how likely they are to occur in the same context. For example, for the word "clever", when its sememe is "clever|灵", a group of semantically related words can be obtained, such as "wise", "intelligent" and "sensible".

The present invention uses the semantic related-field function provided by HowNet to calculate the relevance between words. The word semantic relevance rel(word_1, word_2) is calculated as follows:

rel(word_1, word_2) = \frac{|conRel(word_1) \cap conRel(word_2)|}{|conRel(word_1) \cup conRel(word_2)|}    (1)

where |conRel(word_1) ∩ conRel(word_2)| is the number of elements in the intersection of the related fields of word_1 and word_2, and |conRel(word_1) ∪ conRel(word_2)| is the number of elements in the union of the two related fields.

Since the existing emotion word dictionary contains more than 3000 commendatory terms and more than 3000 derogatory terms, using all of them as the seed word set would make the computation excessive when predicting the polarity of a potential emotion word. For this reason, the present invention chooses some representative words from the commendatory dictionary P_dict and the derogatory dictionary N_dict respectively as seed word sets, so that the commendatory benchmark word set P_set is a subset of P_dict and the derogatory benchmark word set N_set is a subset of N_dict.

Obviously, the selected benchmark words must have a strong emotional orientation and be representative. According to the paper "Lexical semantic orientation computation based on HowNet" by Zhu Yanlan et al., published in the Journal of Chinese Information Processing, vol. 20, no. 1, 2006, the usage frequency of an emotion word can serve as an important indicator of whether it is representative. A group of words with the highest word frequency in the emotion word dictionary can therefore be chosen as the benchmark word set, while also taking into account the semantic distribution of the benchmark words in HowNet so that they are spread as evenly as possible over the HowNet semantic tree. Based on the research of Zhu Yanlan et al., the present invention chooses 40 pairs of commendatory and derogatory terms as benchmark words, as shown in Table 2 and Table 3:

Good

Happy

Healthy

Beautiful

Ripe

Insurance

Health

Perfect

Hero

Selected

Safety

Authority

Stable

Outstanding

Senior

Elite

Best

Happiness

Easily

Master-hand

Civilization

Actively

Famous

Beautiful

Perfect

Simply

Peace

Enlightened

Truly

Advanced

Cheaply

High-quality

Happy

Fine

Well

Remarkably

Super

Angel

Table 2: Commendatory benchmark words

Bad

Mistake

Mad

Accident

Disagreeable

Illegally

Failure

Behind

Trouble

Ugly

Patient

Maliciously

Pornographic

Violence

Yellow

Waste

Fall behind

Leak

Harmful

Hacker

Think highly of oneself

Uneasy

Devil

Style

Barbarous

Trap

Improper

Corrupt

Merciless

Error

Obscene

Rogue

False

Cruel

Abnormal

Fragile

Defective

Unwise

Badly

Demon

Table 3: Derogatory benchmark words

The part-of-speech distribution of the benchmark words in the seed word sets affects how well the method predicts the polarity of unknown words of different parts of speech. Among the benchmark words selected by the present invention, adjectives account for the overwhelming majority, which matches the overall distribution of actual polarity words.
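Choosing benchmark words by frequency could be sketched as below. The frequency table and the helper `top_k_by_frequency` are hypothetical, and the sketch ranks by frequency only; it does not model the additional requirement that benchmark words be spread evenly over the HowNet semantic tree.

```python
from typing import Dict, Iterable, List


def top_k_by_frequency(candidates: Iterable[str], freq: Dict[str, int], k: int) -> List[str]:
    """Return the k most frequent candidates; ties are broken alphabetically."""
    ranked = sorted(candidates, key=lambda w: (-freq.get(w, 0), w))
    return ranked[:k]


# Made-up corpus frequencies for a few commendatory dictionary entries.
freq = {"good": 950, "happy": 720, "healthy": 610, "exquisite": 12, "sagacious": 3}
p_dict = ["sagacious", "good", "exquisite", "healthy", "happy"]

p_set = top_k_by_frequency(p_dict, freq, k=3)
print(p_set)  # ['good', 'happy', 'healthy']
```

The same helper would be applied separately to the derogatory dictionary with the same k, so that both benchmark sets contain the same number of words.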

As shown in Fig. 3, the orientation of an unknown word word is predicted as follows:

1) First, determine whether the unknown word word is already in the emotion word dictionary. If it is, return the polarity corresponding to word; if it is not, go to step 2).

2) Choose a commendatory benchmark word set P_set and a derogatory benchmark word set N_set from the emotion word dictionary, the two sets containing the same number of benchmark words, where P_set is a subset of the commendatory dictionary P_dict and N_set is a subset of the derogatory dictionary N_dict.

3) Calculate the semantic similarity sim(p, word) between a commendatory benchmark word p and the unknown word word, and calculate the word relevance rel(p, word) between p and word according to formula (1), specifically:

rel(p, word) = \frac{|conRel(p) \cap conRel(word)|}{|conRel(p) \cup conRel(word)|}    (2)

where |conRel(p) ∩ conRel(word)| is the number of elements in the intersection of the related fields of p and word, and |conRel(p) ∪ conRel(word)| is the number of elements in the union of the two related fields.

The tightness degree com(p, word) between p and word is calculated by the following formula:

com(p, word) = sim(p, word) + rel(p, word)    (3)

where P_set denotes the commendatory benchmark word set, p ∈ P_set, sim(p, word) denotes the semantic similarity between p and word, and rel(p, word) denotes the word relevance between p and word.

4) Calculate the tightness degree com(n, word) between a derogatory benchmark word n and the unknown word word in the same way as in step 3). The semantic similarity between n and word is sim(n, word), and the word relevance rel(n, word) between n and word is:

rel(n, word) = \frac{|conRel(n) \cap conRel(word)|}{|conRel(n) \cup conRel(word)|}    (4)

where |conRel(n) ∩ conRel(word)| is the number of elements in the intersection of the related fields of n and word, and |conRel(n) ∪ conRel(word)| is the number of elements in the union of the two related fields.

The tightness degree com(n, word) between n and word is:

com(n, word) = sim(n, word) + rel(n, word)    (5)

where N_set denotes the derogatory benchmark word set, n ∈ N_set, sim(n, word) denotes the semantic similarity between n and word, and rel(n, word) denotes the word relevance between n and word.

5) Calculate the difference senti(word) between the tightness degrees of the unknown word word with the commendatory and derogatory benchmark word sets:

senti(word) = \sum_{p \in P_{set}} com(p, word) - \sum_{n \in N_{set}} com(n, word)    (6)

where com(p, word) denotes the tightness degree between the unknown word word and a word p in the commendatory benchmark word set P_set, \sum_{p \in P_{set}} com(p, word) denotes the sum of the tightness degrees between word and all words in P_set, com(n, word) denotes the tightness degree between the unknown word word and a word n in the derogatory benchmark word set N_set, and \sum_{n \in N_{set}} com(n, word) denotes the sum of the tightness degrees between word and all words in N_set.

6) For the unknown word word, choose a suitable threshold interval according to the difference senti(word) between its tightness degrees with the two benchmark word sets, and judge the word polarity Polarity(word) by the following rule:

Polarity(word) = \begin{cases} 1, & senti(word) > b \\ 0, & a \leq senti(word) \leq b \\ -1, & senti(word) < a \end{cases}    (7)

where senti(word) denotes the difference between the tightness degree of the unknown word word with the commendatory benchmark word set and its tightness degree with the derogatory benchmark word set, a denotes the first threshold, and b denotes the second threshold. If senti(word) is greater than the second threshold, the polarity value of word is 1 and the unknown word is a commendatory term. If senti(word) is not greater than the second threshold and not less than the first threshold, the polarity value is 0 and the unknown word is a neutral word. If senti(word) is less than the first threshold, the polarity value is -1 and the unknown word is a derogatory term.

In practical applications, a potential emotion word may have a commendatory or derogatory orientation, but it may also have no emotional orientation at all, in which case its polarity value should be 0. If a single-point threshold were used for orientation prediction, every predicted word would be given a commendatory or derogatory polarity value regardless of whether it has an emotional orientation, which clearly does not match reality.

Therefore, the present invention determines the region threshold from the value δ of the single-point threshold under optimal cases, as shown in formula (8):

[a, b] = [δ - 0.5, δ + 0.5]    (8)
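As a rough illustration of formula (8), the sketch below derives the interval from a single-point threshold and contrasts the two decision rules. The helper names, the value of `delta`, and the sample scores are hypothetical. The single-point rule is also an assumption: it treats senti > δ as commendatory and everything else as derogatory.

```python
from typing import Tuple


def region_threshold(delta: float, half_width: float = 0.5) -> Tuple[float, float]:
    """Formula (8): an interval of span 1 centred on the single-point threshold."""
    return delta - half_width, delta + half_width


def single_point_polarity(senti: float, delta: float) -> int:
    # Assumed rule: every word is forced to be commendatory or derogatory.
    return 1 if senti > delta else -1


def region_polarity(senti: float, delta: float) -> int:
    a, b = region_threshold(delta)
    if senti > b:
        return 1
    if senti < a:
        return -1
    return 0  # words whose senti lies inside [a, b] stay neutral


delta = 0.25                      # hypothetical best single-point threshold
scores = [1.5, 0.3, 0.1, -1.0]    # hypothetical senti values
single = [single_point_polarity(s, delta) for s in scores]
region = [region_polarity(s, delta) for s in scores]
```

In this example the two scores near δ receive a forced polarity under the single-point rule but are left neutral under the region threshold.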

Compared with commendatory and derogatory terms, neutral words make up a larger proportion of natural language. The tightness degree between such a word and either the commendation benchmark word set or the derogatory benchmark word set is very small, and its tightness degrees with the two word sets are also fairly balanced, so the difference generally falls within a certain interval; the present invention sets the span of this interval to 1. By using an interval threshold, the present invention can therefore distinguish polar words from non-polar words more accurately and predict the tendency of polar words. This completes the whole process of phrase semantic tendency prediction based on the world knowledge network.
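The following sketch illustrates how the difference senti(word) can be computed and why a word with balanced tightness to both benchmark sets ends up near 0. It assumes that tightness is the sum of a semantic similarity sim and a word degree of correlation rel, with rel computed as the ratio of the intersection to the union of two dependent fields. The similarity lookup table, the dependent fields, and all words are hypothetical stand-ins, not data from the world knowledge network.

```python
from typing import Callable, Dict, Iterable, Set


def rel(field_p: Set[str], field_w: Set[str]) -> float:
    """Word degree of correlation: |intersection| / |union| of dependent fields."""
    union = field_p | field_w
    if not union:
        return 0.0  # assumption: two empty fields are treated as unrelated
    return len(field_p & field_w) / len(union)


def com(p: str, word: str, sim: Callable[[str, str], float],
        fields: Dict[str, Set[str]]) -> float:
    """Tightness degree between a benchmark word p and the unknown word."""
    return sim(p, word) + rel(fields[p], fields[word])


def senti(word: str, pset: Iterable[str], nset: Iterable[str],
          sim: Callable[[str, str], float], fields: Dict[str, Set[str]]) -> float:
    """Tightness with the commendation set minus tightness with the derogatory set."""
    pos = sum(com(p, word, sim, fields) for p in pset)
    neg = sum(com(n, word, sim, fields) for n in nset)
    return pos - neg


# Hypothetical dependent fields and similarity values, for illustration only.
fields = {
    "good": {"quality", "praise", "virtue"},
    "bad": {"quality", "blame", "harm"},
    "excellent": {"quality", "praise", "merit"},
    "table": {"furniture", "wood"},
}
sim_table = {("good", "excellent"): 0.9, ("bad", "excellent"): 0.25,
             ("good", "table"): 0.125, ("bad", "table"): 0.125}


def sim(p: str, w: str) -> float:
    return sim_table.get((p, w), 0.0)


polar_score = senti("excellent", ["good"], ["bad"], sim, fields)
neutral_score = senti("table", ["good"], ["bad"], sim, fields)
```

With these made-up inputs, the polar word obtains a clearly positive difference, while the neutral word has equal, small tightness with both sets and its difference is 0, which an interval of span 1 centred near 0 would classify as neutral.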

Below, using different commendatory and derogatory benchmark word sets for words of different parts of speech, the word tendency prediction method based on the world knowledge network proposed by the present invention is used to judge the polarity of unknown words, in order to verify its judgment effect. The test data source and evaluation criteria of the experiments are introduced first. The commendatory and derogatory terms in the test set are unbalanced, both overall and within each part of speech, and manually selecting a balanced test set could itself introduce bias, so the experiments use the entire set of known words. The emotion word dictionary built by the present invention contains a large number of commendatory and derogatory words and is suitable as test material for word tendency prediction. The distribution of emotion words in the dictionary is shown in table 4:

Part of speech	Commendatory terms (count)	Derogatory terms (count)	Total (count)
Adjective (ADJ)	2561	2107	4668
Adverb (ADV)	34	18	52
Noun (N)	560	1124	1684
Verb (V)	12	67	79
Common saying (EXPR)	1	2	3

Table 4 Word distribution of the emotion word dictionary

The experiments examine prediction for words of different parts of speech and different tendencies. To evaluate word tendency prediction, the experiments measure the accuracy (precision) of the prediction, as shown in formula (9).

precision = \frac{|Set_{correct}|}{|Set_{total}|} \qquad (9)

Wherein, |Set_correct| is the number of correctly predicted words, and |Set_total| is the total number of words tested.
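Formula (9) amounts to a simple ratio and can be sketched as follows. The function name and the toy labels are assumptions for illustration; the sketch counts a word that received no prediction as incorrect.

```python
from typing import Dict


def precision(predicted: Dict[str, int], gold: Dict[str, int]) -> float:
    """Formula (9): number of correctly predicted words / number of tested words."""
    if not gold:
        raise ValueError("the test set must not be empty")
    # A word missing from `predicted` is counted as an incorrect prediction.
    correct = sum(1 for w, label in gold.items() if predicted.get(w) == label)
    return correct / len(gold)


# Hypothetical test words with dictionary polarity and predicted polarity.
gold = {"w1": 1, "w2": 1, "w3": -1, "w4": -1}
predicted = {"w1": 1, "w2": 0, "w3": -1, "w4": -1}
score = precision(predicted, gold)
```

Here three of the four test words are predicted correctly, so the resulting precision is 0.75.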

The experiments test the influence of different seed word set sizes on the polarity prediction method, as well as the accuracy of the method in each case. The present invention selects the first 1, 5, 10, 20, and 40 words in table 5 and table 6 as the test benchmark word sets and, for each of these cases, determines the single-point threshold at which the accuracy is best. The test results are shown in table 5.
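One plausible way to locate the single-point threshold with the best accuracy is a sweep over candidate cut points, sketched below. The source does not state how the threshold was searched, so the procedure, the function name, and the toy scores are all assumptions. The sketch treats senti > δ as commendatory and everything else as derogatory.

```python
from typing import List, Sequence, Tuple


def best_single_point_threshold(scores: Sequence[float],
                                labels: Sequence[int]) -> Tuple[float, float]:
    """Return (delta, accuracy) for the candidate cut point with the highest accuracy."""
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and of equal length")
    values = sorted(set(scores))
    # Candidates: one point below all scores, midpoints between neighbours,
    # and one point above all scores.
    candidates: List[float] = [values[0] - 1.0]
    candidates += [(x + y) / 2 for x, y in zip(values, values[1:])]
    candidates.append(values[-1] + 1.0)

    best_delta, best_acc = candidates[0], -1.0
    for delta in candidates:
        correct = sum(1 for s, y in zip(scores, labels)
                      if (1 if s > delta else -1) == y)
        acc = correct / len(scores)
        if acc > best_acc:  # ties keep the lowest candidate
            best_delta, best_acc = delta, acc
    return best_delta, best_acc


# Hypothetical senti values and dictionary polarities (1 or -1).
scores = [2.0, 1.0, 0.5, -0.5, -1.0, -2.0]
labels = [1, 1, -1, 1, -1, -1]
delta, acc = best_single_point_threshold(scores, labels)
```

With this toy data two cut points reach the same accuracy of 5/6, and the sketch returns the lower one, -0.75.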

Table 5 Tendency prediction results for words of different parts of speech when the single-point threshold is adopted

The above experimental results are plotted in Fig. 4. The figure clearly shows that the single-point accuracy is very high for both adverbs and adjectives, while the accuracy for nouns and verbs is relatively low. In fact, because the adverb word set is very small, it is easy to find a cut point for the threshold. Verbs are few in number and unevenly distributed, so their accuracy is harder to measure reliably. Even when the number of seed words is only one pair, the present invention can still reflect the difference in semantic similarity between commendatory and derogatory terms reasonably well.

The words in the test set were also predicted using semantic similarity based on the Chinese thesaurus, and the comparison of the two methods is shown in Fig. 5. As Fig. 5 clearly shows, the average accuracy of the semantic similarity computing method based on the Chinese thesaurus (labeled cilin) is more than 12% lower than that of the present invention. The main reason is that the thesaurus-based method uses the synonym sets in the Chinese thesaurus and computes the path distance between words in the thesaurus; that is, it considers only semantic similarity. The method of the present invention considers semantic similarity and also incorporates the degree of association between words, so its accuracy is higher.

However, judging the polarity of unknown words with a single-point threshold is clearly inappropriate, so the accuracy of the region threshold also needs to be examined. To compare the effect on accuracy of word tendency prediction with a region threshold and with a single-point threshold, a corresponding comparison was carried out, and the results are shown in Fig. 6. As Fig. 6 shows, the single-point threshold performs slightly better than the region threshold. The main reason is that all test data used in the present invention come from the emotion word dictionary and therefore have an obvious emotion tendency. In practical applications, however, neutral words are often encountered, and under the single-point threshold these words are assigned a tendency they do not have, so the region threshold has more practical significance than the single-point threshold. To avoid assigning a wrong emotion tendency to such words, the present invention adopts the region threshold for word tendency prediction.

The phrase semantic tendency prediction method based on the world knowledge network provided by the present invention has been described in detail above. For a person of ordinary skill in the art, any obvious change made to it without departing from the substance of the present invention will constitute an infringement of the patent right of the present invention and will incur the corresponding legal liability.

Claims

1. A phrase semantic tendency prediction method based on the world knowledge network, characterized by comprising the following steps:

(2) choosing a commendation word set and a derogatory word set, the commendation word set and the derogatory word set containing the same number of benchmark words;

(3) calculating the tightness degree between the unknown word and the commendation word set and the tightness degree between the unknown word and the derogatory word set respectively, the tightness degree being the sum of two indexes: the semantic similarity between words and the word degree of correlation;

(4) calculating the difference between the tightness degree of the unknown word with the commendation word set and the tightness degree of the unknown word with the derogatory word set;

(5) according to the difference of step (4), selecting a threshold space to judge the polarity of the unknown word, the threshold space [a, b] being determined by the following formula:

[a, b] = [δ - 0.5, δ + 0.5]

2. The phrase semantic tendency prediction method based on the world knowledge network as claimed in claim 1, characterized in that:

The emotion word dictionary is obtained by traversing the sememe relations of the world knowledge network.

3. The phrase semantic tendency prediction method based on the world knowledge network as claimed in claim 1, characterized in that:

The commendation word set is the group of commendatory terms with the highest word frequency in the emotion word dictionary, and the commendation word set is evenly distributed over the semantic tree of the world knowledge network;

The derogatory word set is the group of derogatory terms with the highest word frequency in the emotion word dictionary, and the derogatory word set is evenly distributed over the semantic tree of the world knowledge network.

4. The phrase semantic tendency prediction method based on the world knowledge network as claimed in claim 1, characterized in that:

The tightness degree between the unknown word and a word p in the commendation word set or the derogatory word set is represented by com(p, word) and is calculated by the following formula:

com(p, word) = sim(p, word) + rel(p, word)

Wherein, word represents the unknown word; p represents a commendation benchmark word or a derogatory benchmark word; Pset represents the commendation word set or the derogatory word set, p ∈ Pset; sim(p, word) represents the semantic similarity between p and word; and rel(p, word) represents the word degree of correlation between p and word.

5. The phrase semantic tendency prediction method based on the world knowledge network as claimed in claim 4, characterized in that:

The word degree of correlation rel(p, word) between the commendation benchmark word or derogatory benchmark word p and the unknown word word is calculated by the following formula:

rel(p, word) = \frac{|conRel(p) \cap conRel(word)|}{|conRel(p) \cup conRel(word)|}

Wherein, |conRel(p) ∩ conRel(word)| is the number of elements in the intersection of the dependent fields of p and word, and |conRel(p) ∪ conRel(word)| is the number of elements in the union of the dependent fields of p and word.

6. The phrase semantic tendency prediction method based on the world knowledge network as claimed in claim 1, characterized in that:

The difference between the tightness degree of the unknown word with the commendation word set and the tightness degree of the unknown word with the derogatory word set is represented by senti(word) and is calculated by the following formula:

senti(word) = \sum_{p \in P_{set}} com(p, word) - \sum_{n \in N_{set}} com(n, word)

Wherein, word represents the unknown word; p represents a commendation benchmark word; Pset represents the commendation word set, p ∈ Pset; com(p, word) represents the tightness degree between the unknown word word and a word p in the commendation word set Pset; \sum_{p \in P_{set}} com(p, word) represents the sum of the tightness degrees between the unknown word word and all words of the commendation word set Pset; n represents a derogatory benchmark word; Nset represents the derogatory word set, n ∈ Nset; com(n, word) represents the tightness degree between the unknown word word and a word n in the derogatory word set Nset; and \sum_{n \in N_{set}} com(n, word) represents the sum of the tightness degrees between the unknown word word and all words of the derogatory word set Nset.

7. The phrase semantic tendency prediction method based on the world knowledge network as claimed in claim 1, characterized in that:

The step of choosing a suitable threshold space and judging the polarity of the unknown word judges the word polarity by the following formula:

Polarity(word) = \begin{cases} 1, & senti(word) > b \\ 0, & a \leq senti(word) \leq b \\ -1, & senti(word) < a \end{cases}

Wherein, word represents unknown word, and senti (word) represents the difference of the tightness degree between unknown word word and described commendation word set and the tightness degree between unknown word word and derogatory sense word set;