CN103544242B - Microblog-oriented emotion entity searching system

Info

Publication number: CN103544242B
Application number: CN201310461443.6A
Authority: CN
Inventors: 郝志峰; 温雯; 蔡瑞初; 杜慎芝; 陆印章; 程杰
Original assignee: Guangdong University of Technology
Current assignee: Beiming Software Co ltd; Guangdong University of Technology; Foshan University
Priority date: 2013-09-29
Filing date: 2013-09-29
Publication date: 2017-02-15
Anticipated expiration: 2033-09-29
Also published as: CN103544242A; DE112013004082T5; WO2015043075A1

Abstract

The invention relates to a microblog-oriented emotion entity searching system. The emotion entity searching system comprises a user interface (1), a query expansion module (2), a query processing module (3), an emotive information mining module (4), an emotive information judging and index building module (5) and a reverse index building module (6). The user interface (1) is used for interaction between a user and the system, and the user can submit a query request through the user interface and obtain a feedback result; the query expansion module (2) is used for carrying out word relation mining on microblog corpus data and building a weighting word relation graph in combination with a WordNet ontology base; the query processing module (3) is used for converting the query request of the user into query key words or query statements and for carrying out query expansion on the basis of the word relation graph built by the query expansion module (2), wherein the query key words or the query statements can be accepted by an index base; the emotive information mining module (4) is used for performing emotion mining on the microblog corpus base and generating a judging rule for emotion entities and emotion polarities; the emotive information judging and index building module (5) is used for judging the emotion entities and emotion polarities, building an emotive information index and storing the emotive information index; the reverse index building module (6) is used for building a reverse index for microblog text information and storing the reverse index. The microblog-oriented emotion entity searching system solves the problems that difficulty exists in microblog emotion entity extraction, emotion polarity analysis, emotion entity search and the like, and a novel intelligent searching product is provided for analyzing and monitoring social networking public opinions.

Description

Microblog-oriented emotion entity search system

Technical field

The present invention relates to the fields of text sentiment mining and information retrieval, and in particular to a microblog-oriented emotion entity search system. It is an innovative technology for microblog-oriented emotion entity search.

Background art

In recent years, with the development of the Internet and social networks, social network data, including microblog data, has been growing exponentially. The continuous growth of microblogs gives people access to ever richer information, but the sheer volume of microblog data also makes it hard to find the needed information quickly and accurately. At the same time, because microblog writing is informal, extracting emotion information from it is harder than from traditional text. In microblog emotion information retrieval, a field of great significance to public opinion monitoring and product market research, no mature technology or system has yet emerged.

A microblog-oriented emotion entity search method and system mainly involves three classes of key background technology: query expansion, emotion entity extraction, and emotion polarity discrimination. Each of the three is described and analyzed below.

1 Query expansion technology

A conventional retrieval system or search engine that queries directly by keyword can return some relevant results, but simple matching is mechanical and cannot truly grasp the user's query intention, so the returned results are often unsatisfactory. Finding a method that understands the user's query intention well and improves retrieval precision and recall has therefore become the focus of work on this problem. Query expansion is one such method: it captures the user's information need more accurately and helps the user obtain the needed information faster. Classical query expansion methods fall into four kinds: those based on global analysis, on local analysis, on user query logs, and on association rules. In recent years, scholars have also proposed query expansion methods based on ontologies (or domain ontologies) and semantic networks.

Query expansion based on global analysis expands queries using word relatedness mined from the documents of the whole data set or database. Its advantage is that the whole data set is analyzed thoroughly, so every aspect of the documents is captured. Its drawback is that typical data sets are very large, so the analysis demands so much time and hardware that it can hardly be completed online. Existing retrieval systems all perform the global word analysis offline, and the approach is even harder to apply in search engines that must respond in real time.

Methods based on local analysis include relevance feedback and pseudo-relevance feedback. In relevance feedback, the user's initial query first yields a set of results; the user then manually judges each result document as relevant or irrelevant, and the documents are placed in two separate document sets. This yields labeled relevant documents, and only these need word analysis before the query is expanded. The advantage is that only the relevant documents are processed, so fewer documents are involved and relatedness improves. The drawback is that the method requires a great deal of manual feedback, and therefore labor, as well as extensive experimentation for tuning. Existing retrieval systems and search engines therefore rarely use it.

Pseudo-relevance feedback analyzes the top n results of the user's first query. It assumes that the documents relevant to the query words appear at the top of the result list, that is, that these documents are the most relevant ones; expansion words are obtained by analyzing them and are then used to expand the query. Patent application CN20091032193.5, entitled "enquiry expanding method and query expansion system", is an example of a patent using pseudo-relevance feedback. Its main idea is to cluster the top-ranked documents returned by the user's first query, rank the clusters, extract expansion words from a fixed number of top-ranked clusters, add them to the original query to form an expanded word combination, and then run a second retrieval. The drawback is that the top-ranked documents of the first query are not guaranteed to be relevant. If they are not, the derived expansion words may make the results of the second retrieval even less relevant, and retrieval performance drops.

Methods based on user query logs are a common expansion approach in current search engines: the users' query logs are analyzed at the word level, and co-occurring words are used as expansion words. Patent application CN200710097501.6, entitled "enquiry expanding method and device and coordinate indexing dictionary", and patent application CN200810115470.7, entitled "a kind of method of expanding query, device and search engine system", analyze the query words entered by users to obtain related words, which are then used as expansion words. This approach first requires a large volume of query logs, which takes time to accumulate.

Association rule mining is a classical data mining method, commonly used to mine correlations between transactions. In query expansion it can be applied to various kinds of resources, for example to mine correlations between words in document sets, query logs and similar resources. Patent application CN201010605956.6, entitled "expanding user search results Method and server", is an example of query expansion using association rules. That patent stores established rules in an association rule database; the rules can be created manually or mined from particular documents with a support-confidence framework, and the resulting rules are saved in the database. When the user enters a query word, the words related to it are first fetched from the rule database; the original query word, the related words and combinations of the two then form the new query, and a second retrieval is run against the database. The drawback is that the method does not understand a word at the level of meaning. It stays at the level of word frequency, so this kind of expansion still cannot capture the user's query intention well.

Query expansion based on an ontology or semantic network expands words by using or building a term network. The semantic network can be an existing one, such as WordNet or HowNet, or one built for the purpose, such as domain knowledge or a domain ontology. A semantic network or ontology library organizes multi-level relations between words, such as coordinate terms, hypernyms and hyponyms, notional words and whole-part terms, so that the words form a network. Patent application CN200810116729.X, entitled "a kind of semantic query expansion method based on domain knowledge", first builds a domain knowledge base from an analysis of domain knowledge and the features of user sentences, then uses the knowledge base to analyze the original query words semantically and obtain a list of semantic items, derives expandable items by semantic computation, and finally adds the expansions to the query set and runs a second retrieval against the database. Patent application CN20101084725.2, entitled "a kind of image Text based query expansion and sort method in retrieval", uses WordNet and HowNet to analyze words semantically and obtain semantically expanded words for an image retrieval system based on text analysis, and also proposes an algorithm that re-ranks the returned results. Semantic expansion captures the user's query intention well, but the expansion words of this approach are not derived from an analysis of the database being queried, so retrieval performance is usually quite limited; building a domain ontology library is also laborious and time-consuming.

2 Emotion entity extraction technology

An emotion object is the target of an emotional expression, usually a noun or noun phrase. In general, analyzing sentiment orientation without knowing the emotion object is of little value. Emotion object extraction is an important and challenging task in sentiment analysis and opinion mining, and it has attracted much attention from researchers. Although there has been much research on emotional expressions and emotion objects, most of it analyzes product reviews or news.

Microblog text differs from traditional text. Because of the platform's length limit and the informality of online writing, microblog data contains many abbreviations, misspellings, special symbols (such as emoticons and links) and other expressions that depart from conventional written norms, all of which make the data harder to analyze. Because sentiment analysis and opinion mining started late in China, because Chinese and English differ, and because the related technology is still immature, there is so far little research on emotion object recognition for microblogs.

One existing emotion object recognition technique is patent No. CN201210317183.0, applied for by Beijing University of Aeronautics and Astronautics and entitled "the viewpoint abstracting method based on word dependence relationship". The method extracts evaluation objects with a matching algorithm based on word dependency chains. First, it does not use other available auxiliary information to improve accuracy; second, it is not necessarily suited to the special text of microblogs.

The emotion object extraction described in the existing literature is mostly carried out on product reviews. Because the product and domain are fixed, the problem is specific and well defined, so extraction from topic-related text usually achieves fairly good results. Results are poor on topic-independent text, mainly because the objects commented on in such text are highly varied and the emotion words are diverse as well. Few emotion object recognition techniques currently target topic-independent microblogs. Most existing methods run a syntactic dependency analysis on the microblog and combine it with a sentiment dictionary to obtain paired <emotion word, emotion object> relations, from which the emotion object is extracted. The recognition quality of this approach is unsatisfactory, with the following weaknesses: (1) The extraction depends too heavily on the sentiment dictionary and on a few specific syntactic dependency relations. On the one hand, dictionary-based judgment is limited and strongly affected by domain knowledge, so many misjudgments occur; on the other hand, given the peculiarities of microblog writing, emotion words and emotion objects are not necessarily confined to those few dependency relations. (2) In microblogs, an emotion word and its emotion object often do not appear as an explicit pair: only the emotion word expresses the sentiment orientation, while the emotion object does not appear explicitly in the sentence. Such an extraction process therefore cannot extract emotion objects that do not appear directly in the sentence text.

3 Emotion polarity discrimination technology

In terms of granularity, existing sentiment analysis systems and techniques focus mainly on document-level and sentence-level analysis, and the few entity-level techniques treat entity recognition and sentiment analysis as two independent tasks. In terms of the object of analysis, current systems and techniques mainly target commentary such as news and microblogs and focus on the analysis of public opinion.

Existing document-level and sentence-level sentiment analysis techniques mainly include: patent application No. CN200910219161.9 of Northwestern Polytechnical University, entitled "the WEB text emotion subject identifying method based on mixed model"; patent application No. CN200910083522.1 of the Institute of Computing Technology, Chinese Academy of Sciences, entitled "emotion tendentiousness of text analysis method"; patent application No. CN201210088366.X of the Institute of Automation, Chinese Academy of Sciences, entitled "a kind of sentiment analysis method towards microblogging short text"; and patent application No. CN201010157784.0 of Fujitsu Ltd., entitled "emotional orientation analytical method and device".

These sentiment analysis techniques mainly comprise two stages, training and sentiment judgment. The main steps are introduced below using Northwestern Polytechnical University's "the WEB text emotion subject identifying method based on mixed model" as an example; the other techniques are broadly similar. The method comprises the following steps: 1. Manually annotate the texts in the training set and estimate two emotion models, a "positive" model and a "negative" model; at the same time, estimate a topic language model for each topic according to the way texts on that topic are expressed. 2. Use maximum likelihood estimation (MLE) to estimate the parameters of the emotion models and topic models built in step 1. 3. For a text to be processed, compute the distance between its language model and the two emotion models, and thereby judge the sentiment orientation and topic of the text.

Current sentiment orientation techniques focus mainly on the document and sentence levels. Machine-learning-based methods are very common, while sentiment analysis techniques based on emotion drop points are rare.

Existing sentiment analysis techniques based on emotion words have three main deficiencies: A) Emotion phrase extraction does not take adverbial modification into account, although adverbs generally restrict the degree of emotion words such as adjectives; ignoring them easily distorts the estimated emotion intensity. B) Negation words are hard to identify and handle: the usual approach is to search for negation words with some search strategy, which makes it difficult to determine what is being negated. C) Some automatically generated emotion word intensity dictionaries are unreliable, because intensity is a basic attribute of an emotion word and is determined mainly by its original meaning.

Summary of the invention

The object of the invention is to overcome the above shortcomings of existing emotion entity search technology by providing a microblog-oriented emotion entity search system that improves the accuracy of emotion polarity judgment.

The invention is achieved through the following technical solution. The microblog-oriented emotion entity search system of the invention comprises the following six modules:

1) A user interface, for interaction between the system and the user; through this module the user can submit a query request and obtain the feedback result;

2) A query expansion module, for mining word relations from the microblog corpus data and building a weighted word relation graph in combination with the WordNet ontology library;

3) A query processing module, for converting the user's query request into query keywords and query statements acceptable to the index store, and for performing query expansion based on the word relation graph built by module 2);

4) An emotion information mining module, for performing emotion mining on the microblog corpus and generating judgment rules for emotion entities and emotion polarity;

5) An emotion information judgment and index building module, for judging emotion entities and emotion polarity in the microblog data, building an emotion information index and storing it;

6) An inverted index building module, for building an inverted index over the microblog text and storing it.

Module 2) above implements query expansion with the following steps:

11) Mine association rules from the data in the microblog corpus and output the related word set obtained by the mining;

12) Combine the frequent itemsets obtained in 11) with the WordNet ontology library to build the weighted word relation graph.

Step 11) above uses the Eclat algorithm to mine the frequent itemsets of the microblog corpus and generate the related word set; the related word set and the WordNet ontology graph are then combined, by mapping or insertion, into the weighted word relation graph.
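The frequent itemset mining in step 11) can be illustrated with a minimal Eclat sketch. It assumes each microblog post has already been segmented into a list of words; the function name `eclat` and the absolute `min_support` count are illustrative choices, not taken from the patent.

```python
from collections import defaultdict


def eclat(transactions, min_support):
    """Return {frozenset(words): support_count} for all frequent itemsets.

    transactions: list of word lists (one per post, assumed pre-segmented).
    min_support: minimum number of posts an itemset must occur in.
    """
    # Vertical layout: each word maps to the ids of the posts containing it.
    tidlists = defaultdict(set)
    for tid, words in enumerate(transactions):
        for word in set(words):
            tidlists[word].add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: (word, tidset) pairs that are frequent together with prefix
        for i, (word, tids) in enumerate(candidates):
            itemset = prefix | {word}
            frequent[frozenset(itemset)] = len(tids)
            # Depth-first step: intersect tid sets with the remaining words.
            deeper = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids
                if len(common) >= min_support:
                    deeper.append((other, common))
            if deeper:
                extend(itemset, deeper)

    start = sorted(
        ((w, t) for w, t in tidlists.items() if len(t) >= min_support),
        key=lambda pair: pair[0],
    )
    extend(set(), start)
    return frequent


posts = [
    ["phone", "battery", "bad"],
    ["phone", "battery"],
    ["phone", "screen"],
    ["battery", "bad"],
]
itemsets = eclat(posts, min_support=2)
```

Because support is obtained by intersecting post-id sets, the corpus can be mined in blocks and the results merged, which matches the blockwise use described later in the specification.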

When the weighted word relation graph is built, the node weight is computed as:

$f(d) = \deg(d) = \deg^{+}(d) + \deg^{-}(d)$,

where $\deg(d)$, $\deg^{+}(d)$ and $\deg^{-}(d)$ denote the degree, out-degree and in-degree of the node, respectively. The edge weight is computed as:

Module 3) above implements query processing with the following steps:

31) Receive the query word or sentence entered by the user;

32) Segment the user's input into words, remove stop words and determine the center words, obtaining one or more center words;

33) For each center word, choose suitable expansion words from the weighted word relation graph store built from the ontology and the rule words, and compute a weight for each expansion word;

34) Add the p highest-weighted words to the query word set and pass the expanded word set to the query interface.
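Steps 31) to 34) can be sketched as a small pipeline. This is only an illustration under stated assumptions: the input is already segmented, the graph is a plain dictionary mapping each word to its distance-1 neighbors and edge weights, and the scoring function is passed in by the caller because the relatedness formula itself is not shown here. The names `expand_query`, `graph` and `score` are hypothetical.

```python
def expand_query(query_tokens, stop_words, graph, score, p=3):
    """Return the center words followed by the p best expansion words.

    graph: {word: {neighbor: edge_weight}}, distance-1 neighbors only (assumed layout).
    score: callable (center_word, neighbor) -> float, supplied by the caller.
    """
    # Step 32: drop stop words; whatever remains is treated as a center word.
    center_words = [t for t in query_tokens if t not in stop_words]

    # Step 33: collect nearest neighbors of every center word and weight them.
    candidates = {}
    for q in center_words:
        for d in graph.get(q, {}):
            if d in center_words:
                continue  # already part of the query
            weight = score(q, d)
            candidates[d] = max(weight, candidates.get(d, 0.0))

    # Step 34: keep the p highest-weighted words (ties broken alphabetically).
    top = sorted(candidates, key=lambda d: (-candidates[d], d))[:p]
    return center_words + top


graph = {"phone": {"battery": 2.0, "screen": 1.5, "charger": 0.5}}
expanded = expand_query(
    ["the", "phone"],
    stop_words={"the"},
    graph=graph,
    score=lambda q, d: graph[q][d],  # placeholder: uses the raw edge weight
    p=2,
)
```

Passing `score` as a parameter keeps the sketch independent of the exact weighting formula, so any relatedness measure defined over the graph can be plugged in.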

Step 33) above computes the weights of the expansion words as follows:

Assume the original query is $q = (q_1, q_2, \ldots, q_m)$, where term $q_i$ has $n_i$ nearest-neighbor words $d_{ij}$ ($j = 1, \ldots, n_i$). The relatedness between the original query term $q_i$ and its nearest-neighbor term $d_{ij}$ is computed as

where $W(q_i, d_{ij})$ is the relatedness between word $q_i$ and word $d_{ij}$, $g(q_i, d_{ij})$ is the weight between the two words, and $f(d_{ij})$ is the degree of word $d_{ij}$. The weights of all nearest-neighbor words are computed as

Module 4) above implements the recognition and judgment of emotion entities with the following steps:

41) Collect representative microblog data;

42) Preprocess the collected microblog data, including cleaning, conversion, sentence splitting, word segmentation, part-of-speech tagging and syntactic parsing;

43) Extract features from the microblog data and represent them as feature vectors;

44) Train the emotion entity recognition model to obtain the model parameters;

45) Output the emotion entity judgment model and store it.

Step 43) above implements feature extraction as follows: taking word context into account, a custom dictionary that includes global features is designed; features are extracted from the microblog data according to this dictionary, converting the data into an input format that the emotion entity recognition model can process.
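A rough sketch of what such a feature extraction step might look like is given below. The concrete features are assumptions made for illustration: the patent does not list the contents of its custom dictionary, so the context window, the `is_sentiment_word` flag and the post-level `post_has_sentiment` global feature are all hypothetical.

```python
def token_features(tokens, sentiment_words):
    """Turn one tagged post into a list of per-token feature dictionaries.

    tokens: list of (word, pos_tag) pairs for a single post.
    sentiment_words: set of known emotion words (a stand-in for the custom dictionary).
    """
    # Hypothetical global feature: the same value is attached to every token.
    has_sentiment = any(word in sentiment_words for word, _ in tokens)

    rows = []
    for i, (word, pos) in enumerate(tokens):
        rows.append({
            "word": word,
            "pos": pos,
            # Local context: previous and next word, with boundary markers.
            "prev_word": tokens[i - 1][0] if i > 0 else "<BOS>",
            "next_word": tokens[i + 1][0] if i < len(tokens) - 1 else "<EOS>",
            "is_sentiment_word": word in sentiment_words,
            "post_has_sentiment": has_sentiment,
        })
    return rows


features = token_features([("battery", "n"), ("terrible", "a")], {"terrible"})
```

Feature dictionaries of this shape are a common input format for sequence labelers; the actual format expected by the patent's model may differ.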

Step 44) above implements the emotion entity recognition model as follows: global feature nodes are introduced into a conditional random field (CRF) model to build a GLCRF model that incorporates global features, and the model parameters are obtained by training with the L-BFGS algorithm.

Module 5) above implements the judgment of microblog emotion polarity with the following steps:

51) Remove noise from the microblog data and convert its semantic form;

52) Perform word segmentation, part-of-speech tagging and Chinese syntactic parsing;

53) Extract emotion phrases in combination with the sentiment dictionary;

54) Filter the emotion phrases;

55) Judge the emotion polarity and output the result.

Step 53) above uses the sentiPY method to extract emotion phrases. Emotion phrases are represented uniformly in the form phrase: Modifier*sentiment, that is, a phrase consists of one central emotion word, possibly accompanied by several modifying adverbs.
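The Modifier*sentiment pattern can be illustrated with a simple scan over part-of-speech-tagged tokens. This is not the sentiPY method itself, whose details are not given here; it is a minimal stand-in that assumes adverbs carry the tag "d" and that modifiers immediately precede the emotion word.

```python
def extract_emotion_phrases(tagged, sentiment_words, modifier_tag="d"):
    """Collect (modifiers, emotion_word) pairs matching Modifier*sentiment.

    tagged: list of (word, pos_tag) pairs for one sentence.
    modifier_tag: tag assumed to mark adverbs ("d" is an assumption).
    """
    phrases = []
    modifiers = []
    for word, pos in tagged:
        if word in sentiment_words:
            # Central emotion word found: close the phrase with pending adverbs.
            phrases.append((tuple(modifiers), word))
            modifiers = []
        elif pos == modifier_tag:
            modifiers.append(word)
        else:
            # Any other word breaks the run of modifiers.
            modifiers = []
    return phrases


sentence = [
    ("screen", "n"), ("not", "d"), ("very", "d"), ("good", "a"),
    ("but", "c"), ("slow", "a"),
]
phrases = extract_emotion_phrases(sentence, {"good", "slow"})
```

Keeping the adverbs attached to the emotion word preserves degree and negation information that would be lost if emotion words were extracted alone.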

Step 55) above judges microblog emotion polarity with a composite decision algorithm based on emotion drop points. The decision process comprises the following steps:

551) Check whether the sentence contains a summary word. If not, go to step 552). If so, take the text following the summary word as the emotion drop point and output the polarity of that drop point as the microblog's emotion polarity;

552) Take the beginning and the end of the microblog as emotion drop points and compare their emotion polarities. If the two polarities cancel each other out, go to step 553); otherwise, output the stronger of the two as the microblog's emotion polarity;

553) Compute the emotion word intensities over the whole microblog, sum and average them, and output the mean intensity as the microblog's emotion polarity.
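The three-stage decision in steps 551) to 553) can be sketched as follows. Several points are assumptions made for illustration: a post is given as a list of clauses (each a list of words), emotion intensity comes from a signed lexicon, a clause's polarity is the sum of its word intensities, and "cancel each other out" is read as the two scores summing to zero.

```python
def clause_score(words, lexicon):
    """Signed emotion score of a word sequence (sum of lexicon intensities)."""
    return sum(lexicon.get(w, 0.0) for w in words)


def post_polarity(clauses, lexicon, summary_words):
    """Return a signed polarity score for one post (positive > 0, negative < 0)."""
    if not clauses:
        return 0.0

    # Step 551: if a summary word occurs, everything after it is the drop point.
    for i, clause in enumerate(clauses):
        for j, word in enumerate(clause):
            if word in summary_words:
                tail = clause[j + 1:] + [w for c in clauses[i + 1:] for w in c]
                return clause_score(tail, lexicon)

    # Step 552: compare the first and last clauses as drop points.
    first = clause_score(clauses[0], lexicon)
    last = clause_score(clauses[-1], lexicon)
    if first + last != 0:  # assumed meaning of "do not cancel out"
        return first if abs(first) > abs(last) else last

    # Step 553: fall back to the mean intensity of all emotion words.
    scores = [lexicon[w] for c in clauses for w in c if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


lexicon = {"good": 1.0, "great": 2.0, "bad": -1.0}
summary = {"overall"}
by_summary = post_polarity([["screen", "good"], ["overall", "battery", "bad"]], lexicon, summary)
by_ends = post_polarity([["great"], ["meh"], ["bad"]], lexicon, summary)
by_mean = post_polarity([["good"], ["great"], ["bad"]], lexicon, summary)
```

The ordering reflects the idea that a concluding statement outweighs position, and position outweighs a plain average, so the average is used only when the stronger cues are absent or inconclusive.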

The query expansion scheme of the invention for microblog emotion entity search is characterized by mining word relations from the microblog corpus data, building a weighted word relation graph in combination with the WordNet ontology library, and expanding queries according to that graph so as to better understand the user's query intention. In query expansion, the invention solves the problem of effectively combining an ontology with the word relations of a corpus, so the user's query intention is better understood and the query statement is converted into more suitable expansion words. In emotion entity extraction and emotion polarity analysis, it solves the problems of extracting emotion objects and judging emotion polarity in text as informal as microblogs, including entity extraction when the emotion object is implicit; it improves the extraction of emotion entities and, at the same time, the accuracy of emotion polarity judgment. It thus provides a good solution for network public opinion monitoring and product opinion analysis. The invention addresses the difficulties of microblog emotion entity extraction, emotion polarity analysis and emotion entity search, and provides an intelligent search product for the analysis and monitoring of public opinion on social networks.

Brief description of the drawings

Fig. 1 is the overall structure diagram of the invention;

Fig. 2 is the flow chart of the implementation and use of the invention;

Fig. 3 is the system architecture diagram of the invention;

Fig. 4 is the flow chart of the emotion polarity analysis method of the invention;

Fig. 5 is an example of the graph structure based on adjacency relations used in emotion intensity optimization;

Fig. 6 is the flow chart of the emotion drop point calculation method;

Fig. 7 is the workflow diagram of microblog emotion object extraction;

Fig. 8 is the data preprocessing flow chart;

Fig. 9 is a schematic diagram of the training of the emotion object model;

Fig. 10 is the graph structure of the GLCRF model;

Fig. 11 is the model graph structure after the GLCRF model is extended with multiple global nodes.

Detailed description of the embodiments

Embodiments of the invention are further described below with reference to the drawings, but the implementation of the invention is not limited to them.

Fig. 1 shows the overall structure of the invention. A microblog-oriented emotion entity search system comprises: a user interface module, through which the user can submit a query request and obtain the feedback result; a query expansion module, which mines word relations from the microblog corpus data and builds a weighted word relation graph in combination with the WordNet ontology library; a query processing module, which converts the user's query request into query keywords and query statements acceptable to the index store and performs query expansion based on the word relation graph built by the query expansion module; an emotion information mining module, which performs emotion mining on the microblog corpus and generates judgment rules for emotion entities and emotion polarity; an emotion information judgment and index building module, which judges emotion entities and emotion polarity in the microblog data, builds an emotion information index and stores it; and an inverted index building module, which builds an inverted index over the microblog text and stores it.

Fig. 2 shows the workflow of the query processing module of the invention.

With reference to Fig. 2, the flow comprises the following steps: 1. The query interface receives the query word or sentence entered by the user. 2. The query process segments the user's input into words, removes stop words and determines the center words, obtaining one or more center words; a center word may be a keyword or another type of word such as a modifier. 3. For each center word, suitable expansion word sources are chosen from the weighted word relation graph store built from the ontology and the rule words; the selected words are at distance 1, that is, they are the nearest neighbors of the center word. 4. Because step 3 may yield many expansion words, a weight is computed for each word to measure its importance, and the p highest-weighted words are added to the query word set. 5. Step 4 already yields the required expansion words, but a mechanism is introduced that lets the user inspect these words and operate on them, that is, modify the expanded query word set so that every expansion word matches the user's query intention. 6. The expanded word set is returned to the query entry point, and an expanded retrieval is run against the rich media database. 7. The retrieval results are returned and displayed to the user.

Fig. 3 shows the query processing of the present invention and the integration details of query expansion module.

With reference to Fig. 3, the query processing and query expansion of the invention comprise two major parts, background information processing and retrieval, which are further divided into five sub-modules: the microblog information extraction module, the index building module, the word relation building module, the user retrieval module, and the administrator and user operation module.

The microblog information extraction module organizes the raw microblog data and performs suitable cleaning, sentence splitting, word segmentation and syntactic analysis on it. The index building module builds an index over the microblog data set for fast retrieval. We use Lucene to build the inverted index. Lucene is an open-source full-text search engine framework that provides a complete query engine and index engine and supports Boolean operations, fuzzy queries, grouped queries and similar operations. The inverted index is built with it and saved.
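To make the idea of an inverted index concrete, the following is a minimal pure-Python sketch. It does not use or imitate the Lucene API; it only illustrates the underlying data structure (each term maps to the posts containing it) and a Boolean AND query. The function names and the dictionary-based layout are illustrative assumptions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of post ids that contain it.

    docs: {post_id: list of segmented words} (assumed pre-segmented).
    """
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def search_all(index, terms):
    """Boolean AND query: ids of posts containing every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


posts = {
    1: ["phone", "battery"],
    2: ["phone", "screen"],
    3: ["battery"],
}
index = build_inverted_index(posts)
hits = search_all(index, ["phone", "battery"])
```

A production index such as the one Lucene builds additionally stores positions, term frequencies and compressed postings, none of which are modeled here.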

The word relation library building module is the core of this work and also its innovative part. It is divided into the word segmentation process, the Eclat association rule mining process, the association rule word generation process, and the process of generating the weighted word relation graph in combination with WordNet. Word segmentation splits the text into individual words. We use the ICTCLAS software, a system developed by the Chinese Academy of Sciences specifically for Chinese word segmentation, which achieves high accuracy on Chinese. We first segment the documents of the data set one by one and then merge the various types of documents into one document set for use in association rule mining. For mining the association rules we use the efficient Eclat algorithm, a depth-first algorithm that can mine related words from a large document block by block and merge the results at the end. The invention uses a support-interest association rule framework, which applies two evaluation formulas:

(1) Support formula: supp(X ∪ Y) = |X ∪ Y| / |D|

(2) Interest degree (lift) formula: lift(X → Y) = supp(X ∪ Y) / (supp(X) · supp(Y))

where |X ∪ Y| is the number of transactions containing both X and Y, and |D| is the total number of transactions in the database; supp(X ∪ Y) is the percentage of transactions in the database that contain both X and Y, and supp(X) and supp(Y) are the percentages of transactions that contain X and that contain Y, respectively.
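The two measures can be checked on a small example. The sketch below assumes the standard definitions, supp(X ∪ Y) = |X ∪ Y| / |D| and lift(X → Y) = supp(X ∪ Y) / (supp(X) · supp(Y)); the function names and the toy transactions are illustrative only.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)

def lift(transactions, x, y):
    """Interest degree (lift) of the rule x -> y.

    Assumes both x and y occur in at least one transaction.
    """
    return support(transactions, {x, y}) / (
        support(transactions, {x}) * support(transactions, {y})
    )

# Each transaction is the set of words of one segmented document.
transactions = [
    {"phone", "screen"},
    {"phone", "screen", "clear"},
    {"phone", "battery"},
    {"weather"},
]
```

Here "phone" and "screen" co-occur in 2 of 4 transactions, so their support is 0.5, and the lift is 0.5 / (0.75 × 0.5) ≈ 1.33, which is above 1 and therefore counts as a positive correlation.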

During mining, different support thresholds are set for different document sets, and a mined frequent itemset produces an association rule item only when its interest degree is greater than 1, because the present invention considers two words to be positively correlated only when their interest degree is greater than 1. The concept of compound words is also added to the mining process: when the interest degree of two words is greater than 4, the two words of the rule item are merged into a compound word, and this compound word forms new rules with the antecedent and the consequent of the rule respectively. The interest degree of each new rule is identical to that of the original rule, so compound words can also be selected as expansion words. After mining, the association rule words are generated and saved in the form "X Y". This completes the mining and analysis of association rule words.
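The thresholds described above can be sketched as follows. The code uses the vertical (word to transaction-id set) layout that Eclat is built on, but it is restricted to word pairs for brevity; full Eclat extends itemsets depth-first by repeated tidset intersection. The function names, the default thresholds as parameters, and joining a compound word with a space are assumptions of this sketch.

```python
from itertools import combinations

def vertical_tidsets(transactions):
    """Eclat-style vertical layout: word -> set of transaction ids."""
    tidsets = {}
    for tid, words in enumerate(transactions):
        for word in words:
            tidsets.setdefault(word, set()).add(tid)
    return tidsets

def mine_rules(transactions, min_support, rule_lift=1.0, compound_lift=4.0):
    """Return (rules, compounds) mined from frequent word pairs.

    rules:     (x, y, lift) for pairs whose lift exceeds rule_lift
    compounds: (merged word, lift) for pairs whose lift exceeds compound_lift
    """
    n = len(transactions)
    tidsets = vertical_tidsets(transactions)
    rules, compounds = [], []
    for x, y in combinations(sorted(tidsets), 2):
        supp_xy = len(tidsets[x] & tidsets[y]) / n  # tidset intersection
        if supp_xy < min_support:
            continue
        value = supp_xy / ((len(tidsets[x]) / n) * (len(tidsets[y]) / n))
        if value > rule_lift:
            rules.append((x, y, value))
        if value > compound_lift:
            compounds.append((x + " " + y, value))  # assumption: space-joined
    return rules, compounds

# Two documents mention "intellectual property"; eight mention only "phone".
transactions = [{"intellectual", "property"}] * 2 + [{"phone"}] * 8
rules, compounds = mine_rules(transactions, min_support=0.1)
```

In this toy data the pair has support 0.2 and lift 0.2 / (0.2 × 0.2) = 5, so it yields both a rule and a compound word. The sketch does not generate the additional rules between the compound word and the antecedent or consequent.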

The remaining step is to combine these rule words with the WordNet ontology library into a weighted word relation graph. WordNet is a vocabulary-based semantic network. WordNet not only organizes vocabulary into concepts but also defines multiple semantic relations between concepts and words (such as apposition, hypernymy/hyponymy, antonymy, whole-part relations and entailment), so that the relations between words form a directed graph (see the example of Fig. 3). In this process the rule word items are mapped onto or added to the WordNet ontology library in sequence. The construction principle we set for the weighted word relation graph is: add a directed edge from the antecedent to the consequent between the nodes of two rule words. The addition of rule words is fully automated and falls into two cases: first, if the word already exists in the original WordNet ontology graph, the word only needs to be mapped onto the graph and the node data updated; second, if the word does not exist in the original WordNet ontology graph, the word is added first, then the edge is added and the data updated. All node data are computed one by one after the graph is complete. The resulting relation graph can be represented by a four-tuple G=<V,E,f,g>, where V is the set of nodes, E is the set of edges, f is a function from V to the non-negative real numbers, set to the degree of the node, and g is a function from E to the non-negative real numbers, set to the weight of the edge between two nodes. Let d, d_i, d_j ∈ V, let deg(d) denote the degree of node d (the sum of its out-degree and in-degree), and let lift(d_i→d_j) denote the interest degree of the node words d_i and d_j; then:

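(1) f(d) = deg(d)

(2) g(d_i,d_j) = 1, if the edge between d_i and d_j exists only in the original WordNet graph; g(d_i,d_j) = lift(d_i→d_j), if the edge is inserted only by a rule; g(d_i,d_j) = lift(d_i→d_j) + 1, if the two words are both WordNet-related words and rule words.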

In the weighted word relation graph (see the example of Fig. 4), the importance of a word in the whole graph is measured by the degree of the node where the word is located, i.e. the sum of the node's out-degree and in-degree (the integer value beside each node in Fig. 4). The value of an edge is its weight: the weight between words of the original WordNet graph is set to 1 (blue edges in Fig. 4), and the weight between words inserted by a rule is set to the interest degree of the two words (blue edges in Fig. 4); if two words are both WordNet-related words and rule words, the weight is the interest degree plus 1. The words pointed to by black edges in Fig. 4 are compound words (such as "intellectual property"), whose weights are identical to those of the two rule words. This completes the construction of the weighted word relation graph.
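A minimal sketch of the graph G=<V,E,f,g> follows, using the weighting rules stated in the text: an edge from the original WordNet graph weighs 1, an edge inserted by a rule weighs its interest degree, and an edge that is both weighs the interest degree plus 1. The class name `WordGraph`, its method names and the dictionary-based storage are assumptions; no WordNet library is loaded here.

```python
class WordGraph:
    """Directed word relation graph with degree-based node weights."""

    def __init__(self):
        # (source, target) -> {"wordnet": bool, "lift": float or None}
        self.edges = {}

    def _edge(self, src, dst):
        return self.edges.setdefault((src, dst), {"wordnet": False, "lift": None})

    def add_wordnet_edge(self, src, dst):
        """Record a relation that exists in the original WordNet graph."""
        self._edge(src, dst)["wordnet"] = True

    def add_rule_edge(self, antecedent, consequent, lift):
        """Record a mined rule as an edge from antecedent to consequent."""
        self._edge(antecedent, consequent)["lift"] = lift

    def g(self, src, dst):
        """Edge weight: 1 (WordNet), lift (rule) or lift + 1 (both)."""
        edge = self.edges[(src, dst)]
        weight = edge["lift"] if edge["lift"] is not None else 0.0
        return weight + (1.0 if edge["wordnet"] else 0.0)

    def f(self, node):
        """Node weight: out-degree plus in-degree."""
        return sum((s == node) + (d == node) for s, d in self.edges)

graph = WordGraph()
graph.add_wordnet_edge("phone", "device")
graph.add_rule_edge("phone", "screen", 1.5)
graph.add_wordnet_edge("phone", "screen")
```

A word that is missing from the graph is created implicitly when its first edge is added, which mirrors the second insertion case described above.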

The user retrieval module covers query input, query analysis, expansion term matching, generation of the expanded query word set, index lookup, and result processing and display to the user. Query input receives the query words or sentence entered by the user in the query interface. Query analysis segments the user's input, removes stop words and determines the centre words, yielding one or more centre words. Expansion term matching feeds the centre words from the previous step into the weighted word relation graph library to choose a suitable source of expansion words, i.e. the words nearest to the original query word in the graph (words at distance 1) are selected as candidate expansion words. Generation of the expanded query word set computes the weight of each word according to its degree of association with the original query words and then selects the top p words as the final expansion words. The present invention creates a formula for computing the weight of each word. From the structure of the weighted word relation graph it follows that the larger the weight between two nodes, the greater their degree of association, and the larger the degree of a node, the greater its importance.

Assume the original query is q=(q₁,q₂,…,q_m), where the term q_i has n_i nearest words d_i=(d_i1,d_i2,…,d_in_i).

Then the degree of association between the original query term q_i and its nearest term d_ij is computed as

where W(d_k) is the weight of word d_k and m is the number of original query words. After the weight of each candidate expansion word has been computed, the candidates are sorted by weight in descending order and the top p words are added to the original query to form the expanded word set; the weight of each original query term is 1.

The previous step yields the expanded word set, in the following form:

Q=(q₁,q₂,...,q_m,d₁,d₂,...,d_p) (4)
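The selection of the top p expansion words can be sketched as below. The relevance formula itself is not defined in this sketch: `score` is a hypothetical callable supplied by the caller, and summing a candidate's scores over all query words is likewise an assumption. Only the candidate choice (words at distance 1), the descending sort and the top-p cut follow the description.

```python
def expand_query(query_words, neighbors, score, p):
    """Append the p best-scoring neighbor words to the original query.

    neighbors: word -> iterable of words at distance 1 in the relation graph
    score:     callable (query_word, candidate) -> float; a hypothetical
               stand-in for the relevance formula, not defined here
    """
    weights = {}
    for q in query_words:
        for d in neighbors.get(q, ()):
            if d in query_words:
                continue  # original query words keep their fixed weight of 1
            # assumption: accumulate a candidate's score over all query words
            weights[d] = weights.get(d, 0.0) + score(q, d)
    # Sort by descending weight; ties are broken alphabetically.
    ranked = sorted(weights, key=lambda d: (-weights[d], d))
    return list(query_words) + ranked[:p]

# Toy graph neighborhood and toy scores, for illustration only.
neighbors = {"phone": ["screen", "battery"], "price": ["cheap", "screen"]}
toy_scores = {
    ("phone", "screen"): 2.0,
    ("phone", "battery"): 1.0,
    ("price", "cheap"): 2.5,
    ("price", "screen"): 1.0,
}
expanded = expand_query(
    ["phone", "price"], neighbors, lambda q, d: toy_scores[(q, d)], p=2
)
```

With these toy scores "screen" accumulates 3.0 and "cheap" 2.5, so the expanded set is the two original words followed by "screen" and "cheap".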

The retrieval process returns the expanded word set to the query entry and performs expanded retrieval on the rich-media database. The result processing and display process returns the sorted retrieval results and displays them to the user.

Fig. 4 is the flow chart of the emotion polarity analysis method proposed by the present invention.

With reference to Fig. 4, the method comprises the following steps:

(1) Noise removal and semantic form conversion of the comment corpus:

Noise removal from the comment corpus mainly removes interfering clauses such as subjunctive clauses. These interfering sentences are not genuine objective evaluations and would disturb the analysis in later stages. Emoticons are replaced with the corresponding words, so that the semantic form is converted into one that is easy to process.

(2) Natural language processing: the Stanford NLP software is mainly used to perform word segmentation, part-of-speech tagging and Chinese syntactic parsing on the comment corpus.

(3) Extracting emotion phrases in combination with a sentiment dictionary:

Because the POS tagger labels of emotion words in the comment corpus are concentrated mainly in a few labels, we combine these part-of-speech labels with the sentiment dictionary to extract emotion phrases. Emotion phrases are extracted with the SentiPY method that we developed; in this system the unified form of an emotion phrase is:

phrase:modifier*sentiment

that is, a phrase consists of one central emotion word, to which multiple modifying adverbs may be attached.

(4) Emotion phrase filtering: the coarse-grained emotion phrases extracted in step (3) are filtered so that the emotion phrases become purer in form, which improves the accuracy of the final polarity classification.

(5) Outputting the emotion analysis result:

We designed a composite decision algorithm based on the emotion drop point, which can effectively analyze comment corpora from different domains.
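The phrase form `modifier* sentiment` from step (3) can be matched with a single pass over the segmented tokens. The sketch below is not the SentiPY method, whose details are not given here; it is a dictionary-lookup stand-in, and the function name and the toy word lists are assumptions.

```python
def extract_emotion_phrases(tokens, sentiment_words, modifiers):
    """Return phrases matching `modifier* sentiment` as (modifiers, sentiment)."""
    phrases, pending = [], []
    for tok in tokens:
        if tok in modifiers:
            pending.append(tok)  # collect adverbs that precede an emotion word
        elif tok in sentiment_words:
            phrases.append((tuple(pending), tok))
            pending = []
        else:
            pending = []  # the run of modifiers was broken by another word
    return phrases

# Toy dictionaries and a toy segmented comment, for illustration only.
sentiment_words = {"clear", "weak"}
modifiers = {"very", "not"}
tokens = ["screen", "very", "very", "clear", "but", "battery", "weak", "not"]
phrases = extract_emotion_phrases(tokens, sentiment_words, modifiers)
```

Modifiers that are not followed by an emotion word, such as the trailing "not" in the example, are discarded, which is one simple way to keep the extracted phrases in the stated form.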

Fig. 5 is an example of the graph structure used in emotion intensity optimization based on adjacency relations. With reference to Fig. 5, the emotion words in the comment corpus are regarded as nodes of a graph, and the emotion intensity of the context can be computed with a propagation-based algorithm. Based on the sentiment dictionary, the adjacency relations of emotion words are extracted and the weight between two emotion word nodes is computed by NGD, thereby forming a directed graph. The figure shows the graph structure of one comment.

Fig. 6 is the flow chart of the emotion drop point calculation method. With reference to Fig. 4, the goal of this step is to find the emotion drop point of a comment. The emotion drop point is the emotional part that the author mainly wants to express in a comment. We rely mainly on summarizing words (such as "overall") and compare the emotion intensity at the beginning and the end with the strongest emotion phrase in the sentence, in order to find the emotion drop point of a comment.
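One way to sketch such a decision procedure is the three-stage order given in claim 10: a summary word first, then the comparison of beginning and end, then the average over the whole microblog. The signed-intensity lexicon, the function names, and treating "cancel out" as "the two scores sum to zero" are assumptions of this sketch.

```python
def clause_score(tokens, lexicon):
    """Signed emotion intensity of a token list (positive > 0, negative < 0)."""
    return sum(lexicon.get(tok, 0.0) for tok in tokens)

def microblog_polarity(clauses, lexicon, summary_words=("overall",)):
    """Return a signed score whose sign is the polarity of the microblog."""
    tokens = [tok for clause in clauses for tok in clause]
    if not tokens:
        return 0.0
    # Stage 1: the text after a summary word is the emotion drop point.
    for i, tok in enumerate(tokens):
        if tok in summary_words:
            return clause_score(tokens[i + 1:], lexicon)
    # Stage 2: compare the first and last clauses; output the stronger one.
    first = clause_score(clauses[0], lexicon)
    last = clause_score(clauses[-1], lexicon)
    if first + last != 0:  # assumption: "cancel out" means summing to zero
        return first if abs(first) > abs(last) else last
    # Stage 3: average the intensities of all emotion words in the microblog.
    scores = [lexicon[tok] for tok in tokens if tok in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

# Toy signed lexicon, for illustration only.
lexicon = {"clear": 1.0, "great": 2.0, "weak": -1.0, "slow": -2.0}
with_summary = [["screen", "clear"], ["battery", "weak"], ["overall", "great"]]
begin_end = [["great", "screen"], ["battery", "weak"]]
cancelled = [["screen", "clear"], ["great", "camera"], ["battery", "weak"]]
```

In the third example the first and last clauses score +1 and -1 and cancel, so the result falls back to the mean of all emotion word intensities, (1 + 2 - 1) / 3.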

Fig. 7 shows the workflow of microblog emotion entity extraction according to the present invention.

With reference to Fig. 1, emotion entity extraction in the present invention comprises the steps of microblog data collection, data preprocessing, feature extraction, dictionary loading, labeling and correction, model training, and emotion object extraction. The microblog data crawled from the Internet by microblog data collection are saved as files; the emotion object extraction model obtained by model training is also saved for use in object extraction; and the results of emotion object extraction are saved as files so that the user can inspect and correct the predictions.

Microblog data collection crawls microblog data from microblog systems on the Internet (such as Sina Weibo, Twitter and Tencent Weibo) and saves the collected raw microblog data as files in a defined organizational form, providing data support for later processing by the system.

Data preprocessing applies preliminary processing to the raw microblog data to facilitate later feature extraction. This module includes data cleaning, data conversion, sentence splitting, word segmentation, part-of-speech tagging and syntactic parsing. Details are shown in Figure 2.

Dictionary loading loads the dictionaries required by the data preprocessing and feature extraction steps, including the sentiment dictionary, the stop word dictionary and the classic network term dictionary.

Feature extraction uses the dictionary data loaded by the dictionary loading module to extract predefined features from the preprocessed data, converting the text into a vectorized form that the object extraction module can process.
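As an illustration of what dictionary-assisted feature extraction might look like, the sketch below turns each token of a segmented, POS-tagged sentence into a feature dictionary of the kind commonly fed to a linear-chain CRF toolkit. The feature set, the function names and the `<BOS>`/`<EOS>` boundary markers are all hypothetical; the predefined features of the system are not specified in this text.

```python
def token_features(tokens, pos_tags, i, sentiment_words, stop_words):
    """Feature dictionary for the token at position i (hypothetical features)."""
    word = tokens[i]
    return {
        "word": word,
        "pos": pos_tags[i],
        "is_sentiment": word in sentiment_words,  # sentiment dictionary lookup
        "is_stop": word in stop_words,            # stop word dictionary lookup
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }

def sentence_features(tokens, pos_tags, sentiment_words, stop_words):
    """One feature dictionary per token, in sentence order."""
    return [
        token_features(tokens, pos_tags, i, sentiment_words, stop_words)
        for i in range(len(tokens))
    ]

# Toy sentence with toy POS tags, for illustration only.
tokens = ["phone", "screen", "very", "clear"]
pos_tags = ["NN", "NN", "AD", "VA"]
feats = sentence_features(tokens, pos_tags, {"clear"}, {"very"})
```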

Emotion object model training trains the emotion object extraction model at the core of the system. The training data, converted into the required format, are obtained from the labeling and correction module, and the CRF model built from the training data is trained with the L-BFGS algorithm. The CRF model used in the present invention is developed on the basis of the linear CRF (linear conditional random field) model, and this is the first application of the CRF (conditional random field) model to the field of emotion object recognition. By adding global variables to the traditional CRF model, it can recognize cases in which the emotion object does not appear explicitly in the labeled sequence.

Emotion object extraction extracts emotion objects from microblog data; this step mainly uses the model trained by the model training module to make predictions and thereby extract the objects.

Labeling and correction: the CRF model used in the present invention is a supervised statistical learning method, so the data must be labeled. A feedback mechanism is also introduced to learn from error analysis information. Existing methods generally do nothing with misclassified results, yet this feedback contains a great deal of useful information, and making full use of it is the key to the system's self-learning. Introducing the feedback mechanism lets the model learn again from the results of error analysis, so that the system becomes more accurate the more it is used.

Fig. 8 is a schematic diagram of the implementation of the data preprocessing step of the present invention, which comprises the following steps:

(1) Data cleaning step: data are read from the raw microblog data collected by the data collection module and cleaned as part of data preprocessing, filtering out empty and invalid dirty microblog data.

(2) Data conversion step: this step processes the data passed on from step (1) and converts certain content in the microblog data to facilitate the processing in steps (3), (4), (5) and (6). The common cases are: (a) microblogs often contain information that is of no use to the task, which is removed; (b) links that are useless to the task (such as image links and web page links) and special strings are removed; (c) topics marked with "#" and contacts marked with "@", which microblogs often contain, are also processed: topics and contacts that appear at the head or tail of a microblog are deleted outright, while within a microblog sentence only the "#" and "@" symbols are deleted; (d) microblogs often contain emoticons, which carry strong emotional tendencies and are useful information for the task, but which can reduce the precision of word segmentation, part-of-speech (POS) tagging and syntactic parsing, so they are extracted in this step; (e) some network slang in microblogs is converted, for example the network expression "V5" is converted into the standard expression "powerful", which likewise helps to improve the precision of word segmentation, part-of-speech (POS) tagging and syntactic parsing.

(3) Microblog text sentence splitting step: the conditional random field model of the emotion object recognition method of the present invention is built on sequence labeling at the sentence level for information extraction, but a microblog may contain more than one sentence, so it must be split into sentences. Sentence splitting is mainly based on punctuation marks. Owing to the particular nature of microblogs, however, splitting on punctuation alone is not enough: for convenience, many people habitually use spaces or special symbols (such as "～") to separate sentences in microblogs, so these cases are also handled in this step.

(4) Sentence word segmentation step: the conditional random field model of the emotion object recognition method of the present invention labels each word in a sentence-level sequence, so word segmentation is required. The segmentation process uses a dictionary of common network terms (such as "going mad" and "surrounding and watching") to improve segmentation accuracy.

(5) Part-of-speech tagging step: this step tags each segmented word with its part of speech, providing part-of-speech features of words for the feature extraction model of the present invention.

(6) Syntactic parsing step: this step uses a syntactic parsing tool to parse the syntactic dependency relations between the words in a sentence, in order to provide dependency features of words for the feature extraction model of the present invention.
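Steps (2) and (3) lend themselves to a short regular-expression sketch. The patterns below are assumptions: topics are taken to have the `#topic#` form, a contact is taken to run from "@" to the next whitespace, and the slang table contains only the "V5" example (mapped to 威武, the Chinese word rendered above as "powerful"). Emoticon extraction, word segmentation, tagging and parsing are not covered.

```python
import re

URL_RE = re.compile(r"https?://\S+")
TAG = r"(?:#[^#\s]+#|@\S+)"  # assumed forms: #topic# or @contact
LEADING_RE = re.compile(rf"^(?:\s*{TAG})+")
TRAILING_RE = re.compile(rf"(?:{TAG}\s*)+$")
SLANG = {"V5": "威武"}  # "V5" -> "powerful"
SPLIT_RE = re.compile(r"[。！？!?；;～~\s]+")  # punctuation, tildes, spaces

def convert(text):
    """Apply conversion cases (b), (c) and (e) to one microblog."""
    text = URL_RE.sub("", text).strip()            # (b) drop links
    text = LEADING_RE.sub("", text)                # (c) drop tags at the head
    text = TRAILING_RE.sub("", text)               # (c) drop tags at the tail
    text = text.replace("#", "").replace("@", "")  # (c) keep in-sentence words
    for slang, standard in SLANG.items():          # (e) normalize slang
        text = text.replace(slang, standard)
    return text.strip()

def split_sentences(text):
    """Split on punctuation, tildes and whitespace; drop empty pieces."""
    return [piece for piece in SPLIT_RE.split(text) if piece]

# "#phone# @XiaoMing this #screen# is very clear V5 <link>"
cleaned = convert("#手机# @小明 这个#屏幕#很清晰 V5 http://t.cn/abc")
# "screen is very clear ～ battery is poor <space> price is acceptable 。"
sentences = split_sentences("屏幕很清晰～电池不行 价格还可以。")
```

Splitting on whitespace suits Chinese text, where spaces do not separate words; for space-delimited languages that part of the pattern would have to be removed.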

Fig. 9 is a schematic diagram of the implementation of the emotion object recognition model training step of the present invention. With reference to Fig. 9, in this step the labeled training data set comes from microblog data crawled from the Internet by the data collection module and processed by the data preprocessing module. Because the present invention uses a conditional random field (CRF) model for emotion object extraction, and the CRF model is a supervised learning method, the training data set must also be labeled manually. To train the model, the dictionary loading module is first used to load the user dictionaries, including the emotion word dictionary and the stop word dictionary; next, with the dictionaries loaded in the previous step, the feature extraction module extracts features from the training data set and normalizes the data; finally, the model training module trains the model parameters on the normalized data from the previous step, using the L-BFGS algorithm to learn the parameters of the model.

The form of the conditional random field model used in the present invention is shown in Figure 10; emotion object recognition is treated as a sequence labeling problem. X in the first layer of the model denotes the input microblog sentence, and x_i denotes the word at the i-th position of the sentence. y_i in the second layer and g₁, g₂ in the third layer are output states. The possible labels of these states are L={'N-B', 'N-I', 'P-B', 'P-I', 'O'}; these five labels form the value space of the label at each position of the sequence during sequence labeling. The N-B label marks the starting position of a negative emotion object; the N-I label marks a subsequent position of a negative emotion object (its previous label must be N-B or N-I); the P-B label marks the starting position of a positive emotion object; the P-I label marks a subsequent position of a positive emotion object (likewise, its previous label must be P-B or P-I); and the O label denotes all other positions, so that y_i∈L. For example, for the sequence {"mobile phone", "screen", "very", "clear"}, "mobile phone screen" is a positive emotion object, and the labeling result is {"P-B", "P-I", "O", "O"}.
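The label scheme can be made concrete with a small validity check and decoder. The sketch only handles the per-position labels y_i; the function names are assumptions, and no CRF inference is performed.

```python
LABELS = {"N-B", "N-I", "P-B", "P-I", "O"}

def is_valid(labels):
    """Check that each N-I / P-I follows a B or I label of the same polarity."""
    prev = "O"
    for lab in labels:
        if lab not in LABELS:
            return False
        if lab.endswith("-I") and prev[0] != lab[0]:
            return False
        prev = lab
    return True

def decode_objects(tokens, labels):
    """Group labeled tokens into (polarity, words) emotion objects."""
    objects, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.endswith("-B"):
            current = (lab[0], [tok])  # start a new emotion object
            objects.append(current)
        elif lab.endswith("-I") and current is not None:
            current[1].append(tok)     # continue the current object
        else:
            current = None             # an O label closes any open object
    return [
        ("positive" if polarity == "P" else "negative", words)
        for polarity, words in objects
    ]

# The example sequence from the text.
tokens = ["mobile phone", "screen", "very", "clear"]
labels = ["P-B", "P-I", "O", "O"]
objects = decode_objects(tokens, labels)
```

Applied to the example sequence, the decoder returns a single positive emotion object made of "mobile phone" and "screen".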

With two global node g in model₁And g₂Represent two independent single emotional objects, therefore value be only ' N-B', ' P-B', ' these three labels of O'}, or being P-B label for positive emotion object, or for negative sense emotion object being For N-B label, or not being that emotion object is O label, and can not possibly be follow-up label N-I and P-I of emotion object.

In order to improve motility and the expansibility of emotion Object identifying, the conditional random field models not office that the present invention adopts It is limited to the figure result shown in Fig. 9, represent non-and dominant be also not limited to two hiding node g₁And g₂, can be extended to as Figure 11 Shown g₁…g_n（n>=1）.

The specific embodiments described above further explain the purpose, technical solution and beneficial effects of the present invention in detail. It should be understood that the foregoing are only specific embodiments of the present invention and do not limit the present invention; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A microblog-oriented emotion entity searching system, characterized in that it comprises the following six modules:

1) a user interface module for interaction between the system and the user, through which the user submits a query request and obtains the feedback result;

2) a query expansion module for performing word relation mining on microblog corpus data and building a weighted word relation graph in combination with the WordNet ontology library;

3) a query processing module for converting the user's query request into query keywords and query statements acceptable to the index library, and for performing query expansion based on the word relation graph built by module 2);

4) an emotion information mining module for performing emotion mining on the microblog corpus and generating judgment rules for emotion entities and emotion polarities;

5) an emotion information judging and index building module for judging the emotion entities and emotion polarities of microblog data, building an emotion information index, and storing it;

6) an inverted index building module for building an inverted index for microblog text information and storing it.

2. The microblog-oriented emotion entity searching system according to claim 1, characterized in that module 1) realizes query expansion by the following steps:

11) performing association rule mining on the data in the microblog corpus and outputting the related word set obtained by association rule mining;

12) combining the frequent itemsets obtained in step 11) with the WordNet ontology library to build the weighted word relation graph.

3. The microblog-oriented emotion entity searching system according to claim 2, characterized in that in step 11) the Eclat algorithm is used to mine the frequent itemsets of the microblog corpus and generate the related word set, and the related word set and the WordNet ontology graph are combined into the weighted word relation graph by mapping or insertion;

f(d) = deg(d) = deg⁺(d) + deg⁻(d),

where deg(d), deg⁺(d) and deg⁻(d) denote the degree, out-degree and in-degree of node d, respectively; the edge weight is computed as:

g(d_i,d_j) = 1, if the edge exists only in the WordNet ontology graph; g(d_i,d_j) = lift(d_i→d_j), if the edge is inserted only by a rule; g(d_i,d_j) = lift(d_i→d_j) + 1, if both hold,

where lift(d_i→d_j) is the degree of association of d_i and d_j obtained by the Eclat algorithm.

4. The microblog-oriented emotion entity searching system according to claim 1, characterized in that module 3) realizes query processing by the following steps:

31) receiving the query words or sentence entered by the user;

32) segmenting the user's input, removing stop words and determining the centre words, to obtain one or more centre words;

33) selecting suitable expansion words for the centre words from the weighted word relation graph library constructed from the ontology and the rule words, performing weight calculation on the expansion words, and then adding the p words with the largest weights to the query word set;

34) inputting the expanded word set to the query interface.

5. The microblog-oriented emotion entity searching system according to claim 4, characterized in that step 33) performs weight calculation on the expansion words by the following method:

Assume the original query is q=(q₁,q₂,…,q_m), where the term q_i has n_i nearest words d_i=(d_i1,d_i2,…,d_in_i). Then the degree of association between the original query term q_i and its nearest term d_ij is computed as

where W(q_i,d_ij) is the degree of association between word q_i and word d_ij, g(q_i,d_ij) is the sum of the weights between word q_i and word d_ij, and f(d_ij) is the degree of word d_ij; the weight of all nearest words is computed as

6. The microblog-oriented emotion entity searching system according to claim 1, characterized in that module 4) realizes the identification and judgment of emotion entities by the following steps:

41) collecting representative microblog data;

42) preprocessing the collected microblog data, including data cleaning, conversion, sentence splitting, word segmentation, part-of-speech tagging and syntactic parsing;

43) performing feature extraction on the microblog data and representing it as feature vectors;

44) training the emotion entity recognition model to obtain the model parameters;

45) outputting and storing the emotion entity judgment model.

7. The microblog-oriented emotion entity searching system according to claim 6, characterized in that step 43) realizes feature extraction by the following method: in combination with word context, a custom dictionary including global features is designed; feature extraction is performed on the microblog data according to the custom dictionary, and the microblog data are converted into an input data format that the emotion entity recognition model can process.

8. The microblog-oriented emotion entity searching system according to claim 6, characterized in that step 44) realizes the emotion entity recognition model by the following method: global feature nodes are introduced into the conditional random field (CRF) model to build a GLCRF model (global conditional random field model) incorporating global features, and the model parameters are obtained by training with the L-BFGS algorithm.

9. The microblog-oriented emotion entity searching system according to claim 1, characterized in that module 5) realizes the judgment of microblog emotion polarity by the following steps:

51) microblog data noise removal and semantic form conversion;

52) word segmentation, part-of-speech tagging and Chinese syntactic parsing;

53) extracting emotion phrases in combination with a sentiment dictionary;

54) emotion phrase filtering;

55) emotion polarity judgment and result output.

10. The microblog-oriented emotion entity searching system according to claim 9, characterized in that in step 53) emotion phrases are extracted by the SentiPY method, and the unified form of an emotion phrase is expressed as phrase:modifier*sentiment, that is, a phrase consists of one central emotion word (sentiment), to which multiple modifying adverbs (modifier) may be attached;

in step 55) the microblog emotion polarity is judged by a composite decision algorithm based on the emotion drop point, and the judgment process comprises the following steps:

551) judging whether the sentence contains a summary word; if not, going to step 552); if so, taking the sentence after the summary word as the emotion drop point and outputting the polarity of the emotion drop point as the microblog emotion polarity;

552) taking the beginning and the end of the microblog as emotion drop points and comparing the emotion polarities of the beginning and the end; if the two polarities cancel each other out, going to step 553); otherwise, outputting the stronger emotion polarity as the microblog emotion polarity;

553) computing the emotion word intensities of the whole microblog, summing and averaging them, and outputting the mean intensity as the microblog emotion polarity.