Background art
In recent years, with the development of the Internet and social networks, social network data, including microblogs, has been growing exponentially. The continuous growth of microblogs makes ever more information available to people, but the massive volume of microblog data also makes it difficult to find the required information quickly and accurately. At the same time, because microblog writing is informal, extracting emotion information is harder than for traditional text. In the field of microblog emotion information retrieval, which is significant for public opinion monitoring and for product and industry research, no mature technology or system has yet appeared.
The microblog-oriented emotion entity search method and system mainly involves three kinds of related key background technology: first, query expansion; second, emotion entity extraction; third, emotion polarity discrimination. These three kinds of background technology are described and analyzed in turn below.
1 Query expansion technology
A conventional retrieval system or search engine that queries directly by keyword can return some relevant results, but results found by this simple, mechanical matching cannot truly capture the user's query intention, so the returned results are often unsatisfactory. Finding a method that understands the user's query intention well and improves the precision and recall of retrieval has therefore become the key to solving this problem. Query expansion is exactly such a method. Through query expansion, the user's query need can be understood more accurately, helping the user obtain the required information faster and more accurately. Classical query expansion methods fall mainly into four kinds: those based on global analysis, on local analysis, on user query logs, and on association rules. In recent years, scholars have also proposed query expansion methods based on ontologies (or domain ontologies) and semantic networks.
The query expansion method based on global analysis expands a query by mining word relevance over the documents of the whole data set or database. Its advantage is that the whole data set can be analyzed thoroughly, so every aspect of the documents is captured. Its shortcoming is that, because typical data sets are very large, the demands on analysis time and equipment are high, and the analysis is unlikely to be completed online. Existing retrieval systems all complete the global word analysis offline, and the method is even harder to apply in search engines that require real-time search.
Methods based on local analysis include relevance feedback and pseudo-relevance feedback. In relevance feedback, the user first issues an initial query and obtains retrieval results; the user then manually judges each result document as relevant or irrelevant, and the documents are placed into two different document sets. This yields a labelled set of relevant documents, and only these documents need to be analyzed for words before query expansion. The advantage is that only the relevant documents are processed, so the number of documents is reduced and relevance is improved. The shortcoming is that a large amount of manual feedback is needed, which requires considerable manpower, and extensive experiments are still needed for tuning. Existing retrieval systems and search engines therefore rarely use this method.
The pseudo-relevance feedback method analyzes the top n results returned by the user's first query. Its underlying assumption is that the documents relevant to the query words appear at the top of the results, that is, these documents are taken to be the most relevant ones; expansion words are obtained by analyzing these documents and are used for query expansion. Patent application No. CN20091032193.5, entitled "Query expansion method and query expansion system", is an example that uses pseudo-relevance feedback. Its main idea is to perform cluster analysis on the top-ranked documents returned by the user's first query to generate clusters, rank the clusters, extract expansion words from a fixed number of top-ranked clusters, and add the resulting expansion words to the original query to form an expanded word combination for a second retrieval. The shortcoming of this method is that the top-ranked documents of the first query cannot be guaranteed to be relevant; if they are not, the derived expansion words may make the results of the second retrieval even less relevant, and retrieval performance will decline.
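The general idea of pseudo-relevance feedback can be illustrated with a minimal Python sketch. It is not the clustering procedure of the cited application: it simply treats the top n results as relevant and adds the k most frequent new terms to the query. The function name `prf_expand` and the parameters `n` and `k` are assumptions made for illustration.

```python
from collections import Counter


def prf_expand(query_terms, ranked_docs, n=3, k=2):
    """Pseudo-relevance feedback sketch (illustrative, not the cited patent).

    query_terms: list of query words.
    ranked_docs: first-pass results, best first, each a list of tokens.
    The top-n documents are assumed relevant; the k most frequent terms
    not already in the query are appended as expansion words.
    """
    counts = Counter()
    for doc in ranked_docs[:n]:
        counts.update(t for t in doc if t not in query_terms)
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    expansion = [term for term, _ in ranked[:k]]
    return list(query_terms) + expansion
```

If the top-ranked documents are in fact irrelevant, the terms counted here are the wrong ones, which is exactly the weakness described above.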
The method based on user query logs is an expansion method commonly used by current search engines. It performs word analysis on users' query logs and takes co-occurring words as expansion words. Patent application No. CN200710097501.6, entitled "Query expansion method and device and coordinate indexing dictionary", and patent application No. CN200810115470.7, entitled "Method, device and search engine system for expanding queries", analyze the query words entered by users to obtain related words, and then use these words as expansion words. This expansion method first requires a large volume of query logs, which takes time to accumulate.
The method based on association rules is a classical data mining method that is often used to mine correlations between transactions. In query expansion it can be applied to mine various kinds of resources, for example the correlations between words in document data sets, query logs and other resources. Patent application No. CN201010605956.6, entitled "Method and server for expanding user search results", is an example of query expansion using association rule technology. That patent uses an association rule database to store established rules; the rules may be created manually, or may be mined from particular documents using the support-confidence framework, and the generated rules are saved in the association rule database. When a user enters a query word, the words related to it are first retrieved from the rule database; the original query word, the retrieved related words and the combinations of the two then form a new query, and a second retrieval is performed on the database. The shortcoming of this method is that it does not understand a word at the level of meaning but stays at the level of word frequency, so such expansion cannot understand the user's query intention well either.
The query expansion method based on an ontology or semantic network is a technique that expands words by using or building a word network. The semantic network may be one that already exists, such as WordNet or HowNet, or one that is built specifically, such as a domain knowledge base or domain ontology. A semantic network or ontology library organizes multi-level relations between words, such as coordinate terms, hypernyms and hyponyms, concept words and whole-part words, so that the words form a network. Patent application No. CN200810116729.X, entitled "A semantic query expansion method based on domain knowledge", first builds a domain knowledge base by analyzing domain knowledge and the features of user sentences; it then uses the content of the domain knowledge base to analyze the original query words semantically and obtain a list of semantic items, obtains expandable items through semantic computation, and finally returns the expansion to the query set and performs a second retrieval on the database. Patent application No. CN20101084725.2, entitled "Text-based query expansion and sorting method in image retrieval", uses WordNet and HowNet to analyze words semantically and obtain semantically expanded words for use in an image retrieval system based on text analysis, and also provides an algorithm for optimizing the ranking of returned results. Semantic expansion can capture the user's query intention well, but the expansion words of this method are not derived from an analysis of the database being queried, so retrieval performance is usually quite limited; moreover, building a domain ontology library is laborious and time-consuming.
2 Emotion entity extraction technology
An emotion object is the object at which an emotional expression is directed, usually a noun or noun phrase. In general, sentiment orientation analysis alone is of little value if the emotion object is unknown. The extraction of emotion objects is an important and challenging task in sentiment analysis and opinion mining, and it has attracted much attention from researchers. Although there has been much research on emotional expressions and emotion objects, most of it analyzes product review information or news information.
Unlike traditional text, microblog data, because of the system's limit on length and the informality of online writing, contains many abbreviations, misspelled words, special symbols (such as emoticons and links) and other expressions that differ from standard written language, all of which undoubtedly increase the difficulty of data analysis. Because domestic research on sentiment analysis and opinion mining started late, because Chinese and English differ, and because the related techniques are still immature, there is as yet little research on emotion object recognition for microblogs.
Existing emotion object recognition technology includes the patent filed by Beijing University of Aeronautics and Astronautics, No. CN201210317183.0, entitled "Viewpoint extraction method based on word dependency relations". That method uses a matching algorithm based on word dependency chains to extract evaluation objects. First, it does not use other available auxiliary information to improve accuracy; second, it is not necessarily suitable for a special kind of text such as microblogs.
Emotion object extraction in the existing literature is carried out mainly on product reviews. Because the product information is specified and the domain is restricted, the problem is specific and clear, so extraction from topic-related text usually achieves fairly good results. The results on other, topic-independent text are poor, mainly because the objects commented on in such text are very diverse and the emotion words are also varied. There is currently little emotion object recognition technology for topic-independent microblogs. Most existing methods perform syntactic dependency analysis on the microblog and combine it with a sentiment dictionary to obtain paired (emotion word, emotion object) relations, from which the emotion objects are extracted. The recognition results of this approach are not ideal, and it has the following shortcomings:
(1) The extraction process relies too heavily on the sentiment dictionary and on a few specific syntactic dependency relations. On the one hand, dictionary-based judgment is limited and strongly affected by domain knowledge, so many misjudgments can occur; on the other hand, because of the particular way microblogs are written, emotion words and emotion objects are not necessarily confined to a few specific dependency relations.
(2) In microblogs, an emotion word and its emotion object often do not appear directly as a pair in the text: only the emotion word expresses the sentiment orientation, while the emotion object does not appear explicitly in the sentence. This extraction process therefore cannot extract emotion objects that do not appear directly in the sentence text.
3 Emotion polarity discrimination technology
In terms of analysis granularity, existing sentiment analysis systems and techniques focus mainly on document-level and sentence-level sentiment analysis, and the very few entity-level sentiment analysis techniques treat entity recognition and sentiment analysis as two independent tasks. In terms of the objects analyzed, current systems and techniques are directed mainly at commentary such as news and microblogs, and are concerned with the analysis of public opinion.
Existing document-level and sentence-level sentiment analysis techniques mainly include: the patent of Northwestern Polytechnical University, application No. CN200910219161.9, entitled "WEB text emotion topic recognition method based on a mixed model"; the patent application of the Institute of Computing Technology, Chinese Academy of Sciences, application No. CN200910083522.1, entitled "Text sentiment orientation analysis method"; the patent application of the Institute of Automation, Chinese Academy of Sciences, application No. CN201210088366.X, entitled "A sentiment analysis method for microblog short text"; and the patent application of Fujitsu Ltd., application No. CN201010157784.0, entitled "Sentiment orientation analysis method and device".
The above sentiment analysis techniques mainly comprise two steps, training and emotion judgment. The main steps are introduced below, taking Northwestern Polytechnical University's "WEB text emotion topic recognition method based on a mixed model" as an example; the other related techniques are broadly similar. The method mainly comprises the following steps:
1. Manually annotate the texts in the training set and estimate two emotion models, a "commendatory" model and a "derogatory" model; at the same time, estimate a topic language model for each topic according to the language expression of texts on different topics.
2. Use maximum likelihood estimation (MLE) to estimate the parameters of the emotion models and topic models established in step 1.
3. For a text to be processed, compute the distance between its language model and the two emotion models, and thereby judge the sentiment orientation and topic of the text.
Current sentiment orientation techniques focus mainly on the document level and sentence level. Methods based on machine learning are very common, while sentiment analysis techniques based on the emotion landing point are rare.
Existing sentiment analysis techniques based on emotion words have three main deficiencies:
A) The extraction of emotion phrases does not consider modification by adverbs, yet adverbs usually restrict the degree of emotion words such as adjectives. If this is not taken into account, the emotion intensity is easily distorted.
B) The identification and handling of negation words: the usual approach is to adopt some search strategy to look for negation words, which makes it difficult to determine what is being negated.
C) Some automatically generated emotion word intensity dictionaries are unreliable, because the intensity of an emotion word is a basic attribute of the word and is determined mainly by its original meaning.
Summary of the invention
The object of the present invention is to overcome the above shortcomings of existing emotion entity search technology and to provide a microblog-oriented emotion entity search system that improves the accuracy of emotion polarity judgment.
The present invention is achieved through the following technical solution. The microblog-oriented emotion entity search system of the present invention comprises the following six modules:
1) A user interface, for interaction between the system and the user; through this module the user can submit query requests and obtain the returned results;
2) A query expansion module, for mining word relations from the microblog corpus data and building a weighted word relation graph in combination with the WordNet ontology library;
3) A query processing module, for converting the user's query request into query keywords and query statements acceptable to the index, and for performing query expansion based on the word relation graph built by module 2);
4) An emotion information mining module, for mining emotion from the microblog corpus and generating the decision rules for emotion entities and emotion polarity;
5) An emotion information judgment and index building module, for judging the emotion entities and emotion polarity of microblog data, building the emotion information index, and storing it;
6) An inverted index building module, for building an inverted index over the microblog text and storing it.
In module 2) above, query expansion is realized by the following steps:
11) Mine association rules from the data in the microblog corpus, and output the related word set obtained by the association rule mining;
12) Combine the frequent items obtained in 11) with the WordNet ontology library to build the weighted word relation graph.
In step 11) above, the Eclat algorithm is used to mine the frequent itemsets of the microblog corpus and generate the related word set, and the related word set and the WordNet ontology graph are combined, by mapping or insertion, into the weighted word relation graph.
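The frequent itemset mining of step 11) can be sketched in Python as follows. This is a minimal, unoptimized illustration of the Eclat idea (a vertical layout of transaction-id sets, intersected depth-first), not the implementation used by the invention. Each microblog is assumed to have already been segmented into a list of words, and the name `eclat` and the absolute-count threshold `min_support` are illustrative choices.

```python
def eclat(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    transactions: list of token lists (one per microblog).
    min_support: minimum number of transactions an itemset must occur in.
    """
    # Vertical layout: item -> set of ids of the transactions containing it.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in set(items):
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: list of (item, tidset of prefix-plus-item).
        for i, (item, tids) in enumerate(candidates):
            if len(tids) < min_support:
                continue
            itemset = prefix | {item}
            frequent[frozenset(itemset)] = len(tids)
            # Depth-first: intersect with the tidsets of the later items.
            suffix = [(other, tids & other_tids)
                      for other, other_tids in candidates[i + 1:]]
            extend(itemset, suffix)

    extend(set(), sorted(tidsets.items()))
    return frequent
```

Frequent pairs returned by such a routine are the candidates from which related word pairs are later derived.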
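When the weighted word relation graph is built, the node weight is computed as

$$f(d) = \deg(d) = \deg^{+}(d) + \deg^{-}(d),$$

where $\deg(d)$, $\deg^{+}(d)$ and $\deg^{-}(d)$ denote the degree, out-degree and in-degree of the node, respectively. The edge weight is computed as

$$g(d_i, d_j) = \begin{cases} 1, & \text{the edge comes only from the original WordNet graph} \\ \mathrm{lift}(d_i \rightarrow d_j), & \text{the edge is inserted only by a rule} \\ \mathrm{lift}(d_i \rightarrow d_j) + 1, & \text{the two words are both WordNet-related words and rule words} \end{cases}$$

where $\mathrm{lift}(d_i \rightarrow d_j)$ is the interest value of the rule formed by the words $d_i$ and $d_j$.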
In module 3) above, query processing is realized by the following steps:
31) Receive the query word or sentence entered by the user;
32) Segment the user's input into words, remove stop words and determine the centre words, obtaining one or more centre words;
33) For each centre word, select suitable expansion words from the weighted word relation graph library built from the ontology and the rule words, and compute a weight for each expansion word;
34) Select the p words with the largest weights and add them to the query word set, and pass the expanded word set to the query interface.
In step 33) above, the weights of the expansion words are computed as follows:
Suppose the original query is $q = (q_1, q_2, \ldots, q_m)$, where the term $q_i$ has $n_i$ closest words $d_{ij}$ ($j = 1, \ldots, n_i$). The relevance between the original query term $q_i$ and a closest word $d_{ij}$ is computed as
[formula not reproduced in this text]
where $W(q_i, d_{ij})$ is the relevance between word $q_i$ and word $d_{ij}$, $g(q_i, d_{ij})$ is the weight between the two words, and $f(d_{ij})$ is the degree of word $d_{ij}$. The weights of all closest words are computed as
[formula not reproduced in this text]
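Steps 31) to 34) can be sketched in Python as follows. The sketch assumes the input has already been segmented into tokens, treats every non-stop word as a centre word, and represents the graph as a plain dictionary from a word to its distance-1 neighbours. Because the weight formulas are not reproduced in this text, the scoring function `score` is supplied by the caller and is purely a stand-in; `expand_query` and its parameters are hypothetical names.

```python
def expand_query(tokens, stop_words, neighbors, score, p=3):
    """Expand a segmented query with the p best-scoring neighbour words.

    tokens: segmented user input (list of words).
    stop_words: set of words to discard.
    neighbors: dict mapping a word to its closest (distance-1) graph words.
    score: callable (centre_word, candidate) -> float; a stand-in for the
           relevance weight, whose exact formula is not given here.
    """
    # Step 32: remove stop words; the remainder are taken as centre words.
    centre_words = [t for t in tokens if t not in stop_words]

    # Step 33: collect candidates and keep the best score seen for each.
    candidates = {}
    for q in centre_words:
        for d in neighbors.get(q, []):
            if d in centre_words:
                continue
            candidates[d] = max(candidates.get(d, float("-inf")), score(q, d))

    # Step 34: keep the p highest-scoring words (ties broken alphabetically).
    ranked = sorted(candidates, key=lambda d: (-candidates[d], d))
    return centre_words + ranked[:p]
```

Passing the scoring function in as an argument keeps the selection logic independent of whichever relevance formula is ultimately used.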
In module 4) above, the identification and judgment of emotion entities is realized by the following steps:
41) Collect representative microblog data;
42) Preprocess the collected microblog data, including cleaning, conversion, sentence splitting, word segmentation, part-of-speech tagging and syntactic parsing;
43) Extract features from the microblog data and represent them as feature vectors;
44) Train the emotion entity recognition model to obtain the model parameters;
45) Output and store the emotion entity decision model.
In step 43) above, feature extraction is realized as follows: taking word context into account, a custom dictionary that includes global features is designed; features are extracted from the microblog data according to the custom dictionary, and the microblog data is converted into an input format that the emotion entity recognition model can process.
In step 44) above, the emotion entity recognition model is realized as follows: global feature nodes are introduced into the conditional random field (CRF) model to build a GLCRF model that incorporates global features, and the model parameters are obtained by training with the L-BFGS algorithm.
In module 5) above, the judgment of microblog emotion polarity is realized by the following steps:
51) Remove noise from the microblog data and convert its semantic form;
52) Perform word segmentation, part-of-speech tagging and Chinese syntactic parsing;
53) Extract emotion phrases in combination with the sentiment dictionary;
54) Filter the emotion phrases;
55) Judge the emotion polarity and output the result.
In step 53) above, the sentiPY method is used to extract emotion phrases, and the form of an emotion phrase is uniformly expressed as phrase: Modifier*sentiment, that is, a phrase consists of one central emotion word, possibly accompanied by several modifying adverbs.
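The phrase form Modifier*sentiment can be illustrated with a small Python sketch. It is not the sentiPY method itself, whose details are not given here: it only shows one way to represent a phrase and to attach the adverbs that immediately precede an emotion word. The class `EmotionPhrase`, the function `extract_phrases` and the adverb tag `"d"` are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EmotionPhrase:
    sentiment: str                                  # central emotion word
    modifiers: list = field(default_factory=list)   # zero or more adverbs


def extract_phrases(tagged_tokens, sentiment_words, adverb_tag="d"):
    """Collect emotion words together with directly preceding adverbs.

    tagged_tokens: list of (word, part_of_speech) pairs.
    sentiment_words: set of words found in a sentiment dictionary.
    """
    phrases, pending = [], []
    for word, tag in tagged_tokens:
        if word in sentiment_words:
            phrases.append(EmotionPhrase(word, pending))
            pending = []
        elif tag == adverb_tag:
            pending.append(word)   # possible modifier of a later emotion word
        else:
            pending = []           # adverb run interrupted by another word
    return phrases
```

Keeping the modifiers with the emotion word lets later steps account for degree adverbs and negation when the intensity of the phrase is evaluated.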
In step 55) above, a composite decision algorithm based on the emotion landing point is used to judge the emotion polarity of the microblog. The decision process comprises the following steps:
551) Determine whether the sentence contains a summary word. If not, go to step 552); if so, take the sentence following the summary word as the emotion landing point and output the polarity of the landing point as the emotion polarity of the microblog;
552) Take the beginning and the end of the microblog as emotion landing points and compare their emotion polarities. If the two polarities cancel each other out, go to step 553); otherwise, output the stronger of the two as the emotion polarity of the microblog;
553) Compute the emotion word intensities of the whole microblog, sum and average them, and output the mean intensity as the emotion polarity of the microblog.
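The decision steps 551) to 553) can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not stated in the text: a microblog is a non-empty list of clauses, each clause is a pair of its tokens and the signed intensities of its emotion words, the landing point after a summary word is taken to be the clause containing that word plus all later clauses, and two polarities "cancel out" when their scores sum to zero. All names are hypothetical.

```python
def polarity_label(value):
    """Map a signed intensity to a polarity label."""
    if value > 0:
        return "positive"
    if value < 0:
        return "negative"
    return "neutral"


def judge_polarity(clauses, summary_words):
    """clauses: non-empty list of (tokens, signed_intensities) pairs."""
    # Step 551: a summary word makes the text from it onward the landing point.
    for i, (tokens, _) in enumerate(clauses):
        if any(t in summary_words for t in tokens):
            landing = sum(sum(scores) for _, scores in clauses[i:])
            return polarity_label(landing)

    # Step 552: compare the first and last clauses.
    head = sum(clauses[0][1])
    tail = sum(clauses[-1][1])
    if head + tail != 0:  # assumed meaning of "do not cancel out"
        stronger = head if abs(head) >= abs(tail) else tail
        return polarity_label(stronger)

    # Step 553: fall back to the mean intensity over the whole microblog.
    all_scores = [s for _, scores in clauses for s in scores]
    mean = sum(all_scores) / len(all_scores) if all_scores else 0.0
    return polarity_label(mean)
```

For example, a microblog that opens positively but ends with "overall ... bad" is judged by the clause after the summary word rather than by an average over all emotion words.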
The query expansion scheme of the present invention for microblog emotion entity search is characterized by mining word relations from the microblog corpus data, building a weighted word relation graph in combination with the WordNet ontology library, and performing query expansion according to the constructed graph, so as to better understand the user's query intention. In terms of query expansion, the present invention solves the problem of effectively combining ontology knowledge with corpus word relations; it can better understand the purpose of the user's query and thus convert the query statement into more suitable expanded query words. In terms of emotion entity extraction and emotional colour analysis, it solves the problems of extracting emotion objects and judging emotion polarity for text written as freely as microblogs, solves entity extraction when the emotion object is implicit, improves the extraction of emotion entities, and at the same time improves the accuracy of emotion polarity judgment. It provides an excellent solution for online public opinion monitoring and product opinion analysis. The present invention solves difficult problems such as microblog emotion entity extraction, emotion polarity analysis and emotion entity search, and provides an intelligent search product for social network public opinion analysis and monitoring.
Specific embodiments
Embodiments of the present invention are further described below with reference to the accompanying drawings, but the implementation of the present invention is not limited thereto.
Fig. 1 shows the overall structure of the present invention. A microblog-oriented emotion entity search system comprises: a user interface module, through which the user can submit query requests and obtain the returned results; a query expansion module, which mines word relations from the microblog corpus data and builds a weighted word relation graph in combination with the WordNet ontology library; a query processing module, which converts the user's query request into query keywords and query statements acceptable to the index, and performs query expansion based on the word relation graph built by the query expansion module; an emotion information mining module, which mines emotion from the microblog corpus and generates the decision rules for emotion entities and emotion polarity; an emotion information judgment and index building module, which judges the emotion entities and emotion polarity of microblog data, builds the emotion information index, and stores it; and an inverted index building module, which builds an inverted index over the microblog text and stores it.
Fig. 2 shows the workflow of the query processing module of the present invention.
With reference to Fig. 2, the flow comprises the following steps:
1. The query interface receives the query word or sentence entered by the user.
2. A query script segments the user's input into words, removes stop words and determines the centre words, obtaining one or more centre words; a centre word may be a keyword or a modifier, among other types.
3. For each centre word, suitable expansion word candidates are selected from the weighted word relation graph library built from the ontology and the rule words; the selected words are at distance 1, that is, they are the closest words of the centre word.
4. Because step 3 may yield many expansion words, a weight is computed for each word to measure its importance, and the p words with the largest weights are added to the query word set.
5. Step 4 yields the required expansion words, but a mechanism is also introduced that lets the user see these expansion words and operate on them, that is, modify the expanded query word set so that all expansion words match the user's query intention.
6. The expanded word set is returned to the query entry, and an expanded retrieval is performed on the rich media database.
7. The retrieval results are returned and displayed to the user.
Fig. 3 shows the integration details of the query processing and query expansion modules of the present invention.
With reference to Fig. 3, the query processing and query expansion of the present invention comprise two major parts, the background information processing procedure and the retrieval procedure, which are further divided into five submodules: the microblog information extraction module, the index building module, the word relation building module, the user retrieval module, and the administrator operation and user operation module.
The microblog information extraction module organizes the original microblog data and performs suitable cleaning, sentence splitting, word segmentation and syntactic analysis on it. The index building module mainly builds an index over the microblog data set for fast retrieval. We use Lucene to build the inverted index. Lucene is an open-source full-text search engine framework that provides a complete query engine and indexing engine and supports operations such as Boolean operations, fuzzy queries and grouped queries. The inverted index is built with it and saved.
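The principle of an inverted index can be shown with a few lines of Python. This toy sketch does not use Lucene or imitate its API; it only illustrates the data structure, a mapping from each word to the set of microblogs that contain it, together with a Boolean AND query. The names `build_inverted_index` and `search_and` are illustrative.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: dict mapping a document id to its list of tokens.

    Returns a mapping from each token to the set of ids containing it.
    """
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def search_and(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()
```

A production engine such as Lucene adds scoring, compression and persistence on top of this basic structure.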
Build the core that word relation library module is this paper, be also the part of innovation.This part is divided into participle mistake
Journey, Eclat dependency rule mining process, dependency rule word generating process and combine WordNet generate weighting word graph of a relation
Process.Participle process is exactly that the literal resource of text is divided into word one by one.We are using higher to Chinese word segmentation accurate rate
ICTCLAS software carry out participle, this be the Chinese Academy of Sciences research and development the system being specifically designed for Chinese word segmentation.Our first one by one logarithms
Document according to collection carries out participle, then more various types of document is combined one document sets of formation, so that dependency rule digs
Pick uses.In the mining process of dependency rule, we adopt the higher Eclat mining algorithm of digging efficiency, this be one deep
Spend preferential algorithm, big document finally can be remerged with the excavation related term of piecemeal.The present invention uses support-interest
The dependency rule framework of degree, this framework adopts two judge formula:
(1) Support:

$$\mathrm{supp}(X \cup Y) = \frac{|X \cup Y|}{|D|}$$

(2) Interest:

$$\mathrm{lift}(X \rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)\,\mathrm{supp}(Y)}$$

where $|X \cup Y|$ is the number of transactions that contain both X and Y, and $|D|$ is the total number of transactions in the database; $\mathrm{supp}(X \cup Y)$ is the percentage of transactions in the database that contain both X and Y, and $\mathrm{supp}(X)$ and $\mathrm{supp}(Y)$ are the percentages of transactions that contain X and that contain Y, respectively.
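The two measures can be computed directly from segmented microblogs, as in the following minimal Python sketch. It assumes the standard definitions of support and lift (interest) and treats each microblog as one transaction; the function names are illustrative.

```python
def support(transactions, items):
    """Fraction of transactions that contain every word in `items`."""
    wanted = set(items)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)


def lift(transactions, x, y):
    """Interest (lift) of the rule x -> y; values above 1 indicate
    positive correlation between the two words."""
    supp_x = support(transactions, [x])
    supp_y = support(transactions, [y])
    if supp_x == 0 or supp_y == 0:
        return 0.0  # a word that never occurs cannot form a rule
    return support(transactions, [x, y]) / (supp_x * supp_y)
```

For instance, if "a" occurs in three of four microblogs, "b" in two, and both together in two, the lift is 0.5 / (0.75 × 0.5) ≈ 1.33, so the pair counts as positively correlated.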
Different support thresholds are set for different document sets during mining, and a mined frequent itemset yields an association rule item only when its interest value is greater than 1, because the present invention considers two words to be positively correlated only when their interest value is greater than 1. The concept of compound words is also added to the mining process: when the interest value of two words is greater than 4, the antecedent and consequent words of the rule item are merged to generate a compound word, and this word forms new rules with the antecedent and with the consequent of the rule, respectively; the interest value of each new rule is the same as that of the original rule, so that compound words can also be selected as expansion words. After mining, the association rule words are generated and saved in the form "X Y". This completes the mining and analysis of the association rule words.
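The rule generation just described can be sketched in Python as follows. The thresholds 1 and 4 come from the text; the direction of the new compound-word rules (from each original word to the compound word) is an assumption based on the later remark that edges point to compound words, and the function name `build_rules` and the `joiner` parameter are hypothetical.

```python
def build_rules(pair_lifts, rule_threshold=1.0, compound_threshold=4.0, joiner=""):
    """Turn word pairs with known interest values into rule triples.

    pair_lifts: dict mapping (antecedent, consequent) to its interest value.
    Returns a list of (antecedent, consequent, interest) triples.
    """
    rules = []
    for (x, y), value in pair_lifts.items():
        if value <= rule_threshold:
            continue  # not positively correlated: no rule
        rules.append((x, y, value))
        if value > compound_threshold:
            # Strongly associated words are merged into a compound word,
            # which inherits the interest value of the original rule.
            compound = x + joiner + y
            rules.append((x, compound, value))
            rules.append((y, compound, value))
    return rules
```

With `joiner=" "`, the pair ("intellectual", "property") with an interest value of 5 yields the original rule plus two rules that point to the compound word "intellectual property".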
The remaining step is to combine these rule words with the WordNet ontology library into a weighted word relation graph. WordNet is a semantic network based on vocabulary. It not only organizes vocabulary into concepts but also defines multiple semantic relations between concepts and words (such as synonymy, hypernymy/hyponymy, antonymy, holonymy/meronymy, and entailment), so that the relations between words form a directed graph (see the example in Fig. 3). In this step, the rule words are mapped or added to the WordNet ontology library one by one. The construction principle of the weighted word relation graph is to add, between the nodes of two rule words, a directed edge pointing from the antecedent to the consequent. Rule words are added fully automatically, in one of two cases: first, if the word already exists in the original WordNet ontology graph, it only needs to be mapped onto the graph and its node data updated; second, if the word does not exist in the original WordNet ontology graph, the word is added first, and then the edge is added and the data updated. All node data are computed one by one after the graph is complete. The resulting relation graph can be expressed as a 4-tuple G=<V,E,f,g>, where V is the set of nodes, E is the set of edges, f is a function from V to the non-negative real numbers giving the degree of a node, and g is a function from E to the non-negative real numbers giving the weight of the edge between two nodes. Let d, di, dj ∈ V, let deg(d) denote the degree of node d (the sum of its out-degree and in-degree), and let lift(di→dj) denote the interest level of the node words di and dj; then:
f(d)=deg(d) (1)
In the weighted word relation graph (see the example in Fig. 4), the importance of a word in the whole graph is measured by the degree of its node, i.e., the sum of the node's out-degree and in-degree (the integer next to each node in Fig. 4). The value on an edge is its weight: the weight between words that belong to the original WordNet graph is set to 1 (blue edges in Fig. 4), and the weight between words inserted by a rule is set to the interest level of the two words (blue edges in Fig. 4); if two words are both WordNet-related words and rule words, the weight is the interest level plus 1. The words pointed to by black edges in Fig. 4 are compound words (such as "intellectual property"), which take the same weight as the two rule words. This completes the construction of the weighted word relation graph.
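The construction rules above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `build_weighted_graph` and the input shapes (WordNet relations as pairs, mined rules as antecedent/consequent/lift triples) are hypothetical, and compound words are treated as ordinary rule words.

```python
from collections import defaultdict

def build_weighted_graph(wordnet_edges, rule_edges):
    """Build the parts of G = <V, E, f, g> from two edge sources.

    wordnet_edges: iterable of (source, target) pairs; each gets weight 1.
    rule_edges: iterable of (antecedent, consequent, lift) triples.
    Returns (V, g, f): the node set, edge -> weight, and node -> degree.
    """
    wn = set(wordnet_edges)
    g = {edge: 1.0 for edge in wn}
    for u, v, lift in rule_edges:
        # An edge that is both a WordNet relation and a rule gets lift + 1.
        g[(u, v)] = (float(lift) + 1.0) if (u, v) in wn else float(lift)
    f = defaultdict(int)
    for u, v in g:
        f[u] += 1  # counts toward the out-degree of u
        f[v] += 1  # counts toward the in-degree of v
    return set(f), g, dict(f)

# Hypothetical toy data: two WordNet relations and two mined rules.
V, g, f = build_weighted_graph(
    [("car", "vehicle"), ("car", "wheel")],
    [("car", "wheel", 2.5), ("car", "engine", 3.0)],
)
```

Degrees are counted once, after all edges are in place, which mirrors the statement that node data are computed after the graph is complete.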
The user retrieval module comprises query input, query analysis, matching of expansion terms, generation of the expanded query word set, index search, and result processing and display to the user. Query input receives the query word or sentence entered by the user in the query interface. Query analysis segments the user's input, removes stop words, and determines the center words, yielding one or more center words. Matching of expansion terms feeds the center words from the previous step into the weighted word relation graph library and selects a suitable source of expansion words, namely the words in the graph nearest to the original query word (those at distance 1), as candidate expansion words. Generation of the expanded query word set computes the weight of each word from its degree of association with the original query word and selects the top p words as the final expansion words. The invention provides formulas for computing the weight of each term. From the structure of the weighted word relation graph it follows that the larger the weight between two nodes, the greater their degree of association, and the larger the degree of a node, the greater its importance.
Assume that the original query is q=(q1,q2,…,qm), where term qi has ni nearest words di=(di1,di2,…,dini). The degree of association between the original query term qi and its nearest term dij is computed as
where W(qi,dij) is the degree of association between word qi and word dij, g(qi,dij) is the weight between the two words, and f(dij) is the degree of word dij. The weight of each nearest word is computed as
where W(dk) is the weight of word dk and m is the number of original query words. After the weight of each candidate expansion word has been computed, the candidates are sorted by weight in descending order and the top p words are added to the original query to form the expansion word set; the weight of each original query term is 1.
The previous step yields the expanded word set in the following form:
Q=(q1,q2,...,qm,d1,d2,...,dp) (4)
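The selection of the top p expansion words can be illustrated with a short sketch. The scoring used here is an assumption, not the formula of the invention: the association of q and d is taken as g(q,d) multiplied by f(d), and a candidate's weight is the sum of its associations over all query terms divided by m. Treating neighbours in either edge direction as being at distance 1 is likewise an assumption, and `expand_query` is a hypothetical name.

```python
def expand_query(query, g, f, p):
    """Return (Q, weights): the query terms plus the top-p candidates.

    query: list of center words; g: dict mapping (u, v) edges to weights;
    f: dict mapping nodes to degrees; p: number of expansion words to keep.
    Scoring is an assumed stand-in: association = g(q, d) * f(d),
    candidate weight = sum of associations over query terms / m.
    """
    m = len(query)
    scores = {}
    for q in query:
        for (u, v), w in g.items():
            # Candidates are the distance-1 neighbours of q (either direction).
            if u == q:
                d = v
            elif v == q:
                d = u
            else:
                continue
            if d in query:
                continue
            scores[d] = scores.get(d, 0.0) + w * f[d] / m
    # Sort by descending weight; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda d: (-scores[d], d))[:p]
    weights = {q: 1.0 for q in query}  # original terms keep weight 1
    weights.update({d: scores[d] for d in ranked})
    return list(query) + ranked, weights

# Hypothetical toy graph.
g = {("car", "vehicle"): 1.0, ("car", "wheel"): 3.5, ("car", "engine"): 3.0}
f = {"car": 3, "vehicle": 1, "wheel": 1, "engine": 1}
Q, weights = expand_query(["car"], g, f, p=2)
```

With this toy graph, "wheel" and "engine" outrank "vehicle" because their edge weights are larger.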
Retrieval means that the expanded word set is returned to the query entry point and an expanded search is performed against the rich media database. Result processing and display means that the retrieval results are sorted, returned, and displayed to the user.
Fig. 4 is the flow chart of the sentiment polarity analysis method proposed by the present invention.
With reference to Fig. 4, the method comprises the following steps:
(1) Noise removal and semantic form conversion of the review corpus:
Noise removal mainly removes interfering clauses, such as those in the subjunctive mood. Such sentences are not genuine, objective evaluations and would disturb the analysis in later stages. Emoticons are replaced with the corresponding words, so that the semantic form is converted into one that is easy to process.
(2) Natural language processing: the Stanford NLP software is used to perform word segmentation, part-of-speech tagging, and Chinese syntactic parsing on the review corpus.
(3) Extraction of emotion phrases with the sentiment dictionary:
Because the POS tagger labels of emotion words in the review corpus are concentrated mainly on the few labels above, these part-of-speech labels are combined with the sentiment dictionary to extract emotion phrases. Emotion phrases are extracted with the SentiPY method developed by us, and their unified form in this system is:
phrase:modifier*sentiment
That is, a phrase contains one central emotion word, to which multiple modifying adverbs may be attached.
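The phrase form modifier*sentiment can be illustrated with a small extractor over POS-tagged tokens. This is a sketch only and does not reproduce the SentiPY method: the function name, the input format, and the choice of "AD" (the adverb tag of the Penn Chinese Treebank tag set) as the modifier tag are assumptions.

```python
def extract_emotion_phrases(tagged, sentiment_words, modifier_tags=("AD",)):
    """Collect phrases of the form modifier* sentiment from one sentence.

    tagged: list of (word, pos_tag) pairs.
    sentiment_words: set of words taken from a sentiment dictionary.
    modifier_tags: POS tags treated as modifying adverbs (assumed "AD").
    Returns a list of (modifiers, emotion_word) pairs.
    """
    phrases = []
    modifiers = []
    for word, tag in tagged:
        if word in sentiment_words:
            # The pending adverb run attaches to this central emotion word.
            phrases.append((tuple(modifiers), word))
            modifiers = []
        elif tag in modifier_tags:
            modifiers.append(word)
        else:
            modifiers = []  # any other word breaks the modifier run
    return phrases

# Hypothetical tagged sentence and dictionary.
tagged = [("screen", "NN"), ("not", "AD"), ("very", "AD"), ("clear", "VA"),
          ("but", "CC"), ("expensive", "VA")]
phrases = extract_emotion_phrases(tagged, {"clear", "expensive"})
```

Keeping the modifiers as a separate tuple leaves later steps free to handle negation and intensity adverbs differently.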
(4) Emotion phrase filtering: the coarse-grained emotion phrases extracted in step (3) are filtered so that their form is cleaner, which improves the accuracy of the final polarity classification.
(5) Output of the sentiment analysis result:
We designed a composite decision algorithm based on the emotion drop point, which can effectively analyze review corpora from different domains.
Fig. 5 is an example of the graph structure used in emotion strength optimization based on adjacency relations. With reference to Fig. 5, the emotion words in the review corpus are regarded as nodes of a graph, and the emotion intensity of the context can be computed with a propagation-based algorithm. Using the sentiment dictionary, the adjacency relations between emotion words are extracted and the weight between two emotion word nodes is computed by NGD, thereby forming a directed graph. Figure 3 is the graph structure of one review.
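Assuming that NGD here denotes the Normalized Google Distance, the distance between two emotion words can be computed from occurrence counts as sketched below. Where the counts come from (a search engine or the review corpus itself) and how the distance is turned into an edge weight are not stated in the text, so the sketch stops at the distance itself.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from occurrence counts.

    fx, fy: number of documents containing word x and word y respectively;
    fxy: number of documents containing both; n: total number of documents.
    Smaller values mean the two words are more closely related.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf  # words that never co-occur are maximally distant
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    denom = math.log(n) - min(lx, ly)
    if denom == 0:
        return 0.0  # both words occur in every document
    return (max(lx, ly) - lxy) / denom

# Hypothetical counts: x in 100 documents, y in 10, both in 10, out of 1000.
distance = ngd(100, 10, 10, 1000)
```

Because the distance is symmetric, the direction of each edge in the directed graph would have to come from the adjacency order of the words rather than from NGD.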
Fig. 6 is the flow chart of the emotion drop point calculation method. With reference to Fig. 4, the goal of this step is to find the emotion drop point of a review. The emotion drop point is the emotion that the author mainly intends to express in a review. We rely mainly on summarizing words (such as "overall") and compare the emotion intensity at the beginning and end of the review with the strongest emotion phrase in the sentence, thereby locating the emotion drop point of the review.
Fig. 7 shows the workflow of microblog emotion entity extraction according to the present invention.
With reference to Fig. 1, emotion entity extraction in the present invention comprises the steps of microblog data collection, data preprocessing, feature extraction, dictionary loading, labeling and correction, model training, and emotion object extraction. The microblog data crawled from the Internet by the data collection step are saved as files; the emotion object extraction model obtained by model training is also saved for use in object extraction; and the results of emotion object extraction are saved as files so that the user can review and correct the predictions.
Microblog data collection crawls microblog data from microblog systems on the Internet (such as Sina Weibo, Twitter, and Tencent Weibo) and saves the collected raw microblog data as files in a defined organizational form, providing data support for later processing by the system.
Data preprocessing applies preliminary processing to the raw microblog data to facilitate later feature extraction. This module includes data cleaning, data conversion, sentence splitting, word segmentation, part-of-speech tagging, and syntactic parsing. Details are shown in Figure 2.
Dictionary loading loads the dictionaries required by the data preprocessing and feature extraction steps, including a sentiment dictionary, a stop-word dictionary, and a classic network-term dictionary.
Feature extraction uses the dictionary data loaded by the dictionary loading module to extract predefined features from the preprocessed data, converting the text into vectors in a form that the object extraction module can process.
Emotion object model training trains the emotion object extraction model at the core of the system. Training data in the required format are obtained from the labeling and correction module, and the CRF model built from the training data is trained with the L-BFGS algorithm. The CRF model used in the present invention is developed from the linear CRF (linear conditional random field) model, and this is the first application of a CRF (conditional random field) model to emotion object recognition. Global variables are added to the traditional CRF model so that emotion objects that do not appear explicitly in the labeled sequence can also be recognized.
Emotion object extraction extracts emotion objects from microblog data. This step mainly uses the model trained by the model training module to make predictions and thereby extract the objects.
Labeling and correction: the CRF model used in the present invention is a supervised statistical learning method, so the data must be labeled. A feedback mechanism is also introduced to learn from misclassification information. Existing methods generally do nothing with misclassified results, yet this feedback contains a great deal of useful information, and making full use of it is the key to a self-learning system. The feedback mechanism lets the model learn again from misclassified results, so that the system becomes more accurate the more it is used.
Fig. 8 is a schematic diagram of the implementation of the data preprocessing step of the present invention, which comprises the following steps:
(1) Data cleaning: data are read from the raw microblog data collected by the data collection module and cleaned, filtering out empty and invalid dirty microblog data.
(2) Data conversion: this step processes the data passed on from step (1) and converts certain content in the microblog data to facilitate the processing in steps (3), (4), (5), and (6). The common cases are: (a) microblogs often contain information that is irrelevant to this work, which must be removed; (b) links that are useless for this work (such as image links and web page links) and special character strings must be removed; (c) microblogs often contain topics marked with "#" and contacts marked with "@", which are also processed: topics and contacts at the head or tail of a microblog are deleted outright, whereas inside a microblog sentence only the "#" and "@" symbols are deleted; (d) microblogs often contain emoticons, which carry a strong emotional tendency and are useful for this work, but they reduce the accuracy of word segmentation, part-of-speech (POS) tagging, and syntactic parsing, so they are extracted during this step; (e) network slang in microblogs must be converted, for example the network expression "V5" is changed to the standard expression "powerful", which likewise helps improve the accuracy of word segmentation, part-of-speech (POS) tagging, and syntactic parsing.
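Cases (b) through (e) above can be sketched with regular expressions. The patterns are assumptions for illustration: topics are taken to be written as #topic#, contacts as @name, and emoticons as bracketed names such as [happy]; the slang table contains only the single example given in the text.

```python
import re

URL_RE = re.compile(r"https?://\S+")
TAG = r"(?:#[^#\s]+#|@\w+)"  # assumed #topic# and @name forms
EDGE_TAGS_RE = re.compile(rf"^(?:\s*{TAG})+|(?:{TAG}\s*)+$")
EMOTICON_RE = re.compile(r"\[(\w+)\]")  # assumed [name] emoticon markup
SLANG = {"V5": "powerful"}  # the one example given in the text

def convert(text):
    """Return (cleaned_text, emoticons) for one raw microblog."""
    text = URL_RE.sub("", text).strip()              # (b) drop links
    text = EDGE_TAGS_RE.sub("", text).strip()        # (c) head/tail tags go entirely
    text = text.replace("#", "").replace("@", "")    # (c) inner tags lose the symbol
    emoticons = EMOTICON_RE.findall(text)            # (d) keep emoticons aside
    text = EMOTICON_RE.sub("", text)
    for slang, standard in SLANG.items():            # (e) normalize slang
        text = text.replace(slang, standard)
    return " ".join(text.split()), emoticons

# Hypothetical raw microblog.
cleaned, emoticons = convert(
    "#phone# the @Tom screen is V5 [happy] http://t.cn/abc @Amy"
)
```

Returning the emoticons separately keeps their emotional signal available for later steps while leaving the text clean for segmentation and parsing.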
(3) Sentence splitting of microblog text: the conditional random field model of the emotion object recognition method of the present invention is built on sentence-level sequence labeling for information extraction, but a microblog may contain more than one sentence, so it must be split into sentences. Splitting is done mainly on punctuation marks. Because of the particular nature of microblogs, however, splitting on punctuation alone is not enough: for convenience, many users habitually separate sentences with spaces or special symbols (such as "~"), so these cases are also handled during splitting.
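A minimal splitter along these lines is sketched below. The exact set of delimiters is an assumption, and splitting on whitespace only makes sense for text such as Chinese, where words are not separated by spaces.

```python
import re

# Sentence-ending punctuation (full-width and ASCII), plus "~" and runs of
# whitespace, which microblog users often type instead of punctuation.
SPLIT_RE = re.compile(r"[。！？!?；;~～\s]+")

def split_sentences(text):
    """Split one microblog into non-empty sentence strings."""
    return [part for part in SPLIT_RE.split(text) if part]

# Hypothetical input, with underscores standing in for unspaced text.
sentences = split_sentences("screen_is_clear~battery_is_weak price_ok!")
```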
(4) Word segmentation: the conditional random field model of the emotion object recognition method of the present invention labels each word in a sentence-level sequence, so word segmentation is required. Segmentation uses a dictionary of common network terms (such as "going mad" and "surrounding and watching") to improve its accuracy.
(5) Part-of-speech tagging: this step tags the part of speech of each segmented word, providing part-of-speech features for the feature extraction model of the present invention.
(6) Syntactic parsing: this step uses a syntactic parsing tool to parse the dependency relations between the words of a sentence, providing dependency features for the feature extraction model of the present invention.
Fig. 9 is a schematic diagram of the implementation of the emotion object recognition model training step of the present invention. With reference to Fig. 9, in this step the labeled training dataset comes from the microblog data crawled from the Internet by the data collection module and processed by the data preprocessing module. Because the present invention uses a conditional random field (CRF) model for emotion object extraction, and the CRF model is a supervised learning method, the training dataset must also be labeled manually. To train the model, the dictionary loading module first loads the user dictionaries, including the emotion word dictionary and the stop-word dictionary; next, using the dictionaries just loaded, the feature extraction module extracts features from the training dataset and normalizes the data; finally, the model training module trains the model parameters on the normalized data, using the L-BFGS algorithm to learn the parameters of the model.
The form of the conditional random field model used in the present invention is shown in Figure 10; emotion object recognition is treated as a sequence labeling problem. X in the first layer of the model represents the input microblog sentence, xi represents the word at the i-th position in the sentence, and yi in the second layer and g1, g2 in the third layer are the output states. The labels these states may take are L={'N-B', 'N-I', 'P-B', 'P-I', 'O'}, which is the value space for the label at each position during sequence labeling. N-B marks the start of a negative emotion object; N-I marks the continuation of a negative emotion object (the preceding label must be N-B or N-I); P-B marks the start of a positive emotion object; P-I marks the continuation of a positive emotion object (likewise, the preceding label must be P-B or P-I); and O marks everything else; that is, yi∈L. For example, for the sequence {"mobile phone", "screen", "very", "clear"}, "mobile phone screen" is a positive emotion object, and the labeling result is {"P-B", "P-I", "O", "O"}.
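The label constraints and the worked example above can be checked with a short sketch. The helper names `is_valid` and `emotion_objects` are hypothetical; the code only illustrates the label scheme and is not part of the CRF model itself.

```python
LABELS = {"N-B", "N-I", "P-B", "P-I", "O"}

def is_valid(labels):
    """Check that every continuation label follows a matching B or I label."""
    prev = "O"
    for lab in labels:
        if lab not in LABELS:
            return False
        # "P-I" may only follow "P-B" or "P-I"; likewise for "N-I".
        if lab.endswith("-I") and prev not in (lab[0] + "-B", lab):
            return False
        prev = lab
    return True

def emotion_objects(words, labels):
    """Group words into (polarity, phrase) pairs; labels must be valid."""
    objects = []
    for word, lab in zip(words, labels):
        if lab.endswith("-B"):
            objects.append([lab[0], [word]])  # start a new object
        elif lab.endswith("-I"):
            objects[-1][1].append(word)       # extend the current object
    return [(polarity, " ".join(ws)) for polarity, ws in objects]

words = ["mobile phone", "screen", "very", "clear"]
labels = ["P-B", "P-I", "O", "O"]
objects = emotion_objects(words, labels)
```

Here "P" denotes a positive and "N" a negative emotion object, following the first letter of the labels.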
The two global nodes g1 and g2 in the model represent two independent single emotion objects, so their values are restricted to the three labels {'N-B', 'P-B', 'O'}: P-B for a positive emotion object, N-B for a negative emotion object, or O if there is no emotion object. They cannot take the continuation labels N-I and P-I.
To improve the flexibility and extensibility of emotion object recognition, the conditional random field model adopted by the present invention is not limited to the graph structure shown in Fig. 9, and the hidden nodes representing non-explicit emotion objects are not limited to the two nodes g1 and g2; they can be extended to g1…gn (n>=1) as shown in Figure 11.
The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the foregoing are only specific embodiments of the present invention and do not limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.