CN100570611C - Method for scoring information retrieval documents based on opinion search - Google Patents

Method for scoring information retrieval documents based on opinion search

Publication number
CN100570611C
Authority
CN
China
Prior art keywords
document
user
word
query word
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CNB2008101186680A
Other languages
Chinese (zh)
Other versions
CN101344890A (en
Inventor
张敏
马少平
茹立云
佟子健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Beijing Sogou Technology Development Co Ltd
Original Assignee
Tsinghua University
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and Beijing Sogou Technology Development Co Ltd
Priority to CNB2008101186680A
Publication of CN101344890A
Application granted
Publication of CN100570611C
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for scoring information retrieval documents based on opinion search, belonging to the field of information processing. The method first builds a sentiment word list that specifies every sentiment-bearing word the retrieval system will use, and then generates a candidate result set for the query entered by the user. Next, the system computes the relevance between each document and the user query to obtain a relevance score for every document. It then computes a subjectivity score for every document from the number of times sentiment words co-occur with query words within a given distance in the document. The relevance score and the subjectivity score of a document are merged with a quadratic function (that is, multiplied) to give the document's final score. Finally, the retrieval system sorts all candidate documents by final score and presents them to the user in descending order of score. The technique runs automatically on a computer and returns results that are both highly relevant and strongly opinionated.

Description

Method for scoring information retrieval documents based on opinion search
Technical field
The invention belongs to the field of information processing and relates in particular to information retrieval systems. Specifically, it is a method for scoring documents in an information retrieval system so that the final results are both relevant to the user query and carry subjective opinions.
Background
An information retrieval system is a computer system that gathers information according to some strategy (for example, web documents on the internet or digital documents in a digital library), organizes and processes that information, and then provides a retrieval service to users. It consists of two parts: the computer hardware and the software that runs on it. Its main purpose is to help users quickly and efficiently obtain useful information that meets their needs.
An information retrieval system interacts with users through a query server. On the one hand, the query server gives the user a page on which to enter or select a query, which usually consists of one or more keywords. On the other hand, the query server searches all documents in the system for the submitted query, returns a sorted list of relevant results, and displays them in order on the results page. Documents are sorted by their relevance score with respect to the user query: the higher the score, the higher the rank. A key technique in the query server is therefore how to compute document relevance and obtain a relevance score for ranking, so that the pages the user wants appear near the top of the results and relevant information is easier to reach.
As information retrieval systems have developed, users have begun to pose more complex query needs, one important example being opinion search. In opinion search, the system must not only find information relevant to the user query; that information must also express some sentiment. In other words, the retrieval system must return information that is relevant and carries subjective opinions, and relevant information that is purely objective description is considered not to meet the user's need. For example, a user who wants to buy a mobile phone enters the query "N95 mobile phone" for opinion search and hopes to get back other users' opinions or comments on the performance and features of the N95: whether the price is low, whether it looks good, whether the battery lasts, whether the overall evaluation is good, and so on. Under this requirement, the query server must compute not only a relevance score for each document but also a subjectivity score, merge the two into a final score for the document, and return the sorted results to the user.
Since the mid-1960s, many models have been proposed for computing the relevance between a document and a user query. Their main idea is "TF*IDF": on the one hand, consider how many times a query word occurs in a document, since more occurrences suggest greater relevance; on the other hand, consider how common the query word is across all documents, since the more common it is, the weaker its discriminating power. Widely used models today include the Boolean model, the statistical model, and the linguistic and knowledge-based model.
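The TF*IDF idea described above can be sketched in a few lines of Python. This is a minimal illustration of one common weighting variant, not the formula used by the invention; the function name `tf_idf` and the toy documents are assumptions made for the example.

```python
import math

def tf_idf(term, doc_tokens, all_docs):
    """Term frequency in one document times log inverse document frequency."""
    tf = doc_tokens.count(term)                      # occurrences in this document
    df = sum(1 for doc in all_docs if term in doc)   # documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(all_docs) / df)

# Toy collection (hypothetical data)
docs = [
    ["n95", "phone", "n95", "review"],
    ["phone", "case"],
    ["weather", "report"],
]
rare_term_weight = tf_idf("n95", docs[0], docs)      # frequent here, rare elsewhere
common_term_weight = tf_idf("phone", docs[0], docs)  # appears in more documents
```

A term that is frequent in the document and rare in the collection ("n95") receives a larger weight than a term that appears in many documents ("phone").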
For computing the subjectivity score of a document, one common method is to count how many words with subjective sentiment (usually called sentiment words) occur in the document: the more sentiment words, the higher the subjectivity score. The sentiment words usually come from a sentiment word list built in advance, either manually or automatically. Another method uses text classification to assign a document to a subjective or an objective class and derives the subjectivity score from the degree to which the document belongs to a class.
How the subjectivity score and the relevance score are merged is a key factor in the performance of an opinion search system. For lack of in-depth study, existing techniques do not model the connection between the two scores; the common approach is simply to add them in some proportion (for example, multiply the subjectivity score by A, multiply the relevance score by B, and add the two, where A and B are values fixed in advance). Many experimental results show that this approach does not satisfy users' opinion search needs well. In many cases the merged result even performs worse than ranking by relevance score alone. An effective method for merging the relevance and subjectivity scores is therefore needed to improve the performance of information retrieval systems and meet users' opinion search needs.
Summary of the invention
An opinion search system must return information that is relevant to the user query and carries subjective opinions. The invention proposes a scoring method that considers both the subjectivity and the relevance of a document. Unlike the traditional weighted linear merge (in which each score is multiplied by a constant weight and the two are added), the proposed merge accounts for the influence of relevance on subjectivity and combines the relevance score and the subjectivity score through a quadratic relation (that is, by multiplication). Because the two scores take values on different scales, the subjectivity score is normalized by taking a logarithm (log). With this merge, the top-ranked documents in the final results are both highly relevant and strongly subjective, which effectively improves the performance of the opinion search system, lets users reach the information on the earliest results pages, and so better meets users' opinion search needs.
The method is as follows:
1. First build a sentiment word list that specifies every sentiment-bearing word the retrieval system will use; then, for a query the user submits, the system automatically finds the candidate document set;
2. In the retrieval system, compute the degree of relevance between each document and the query entered by the user to obtain the document's relevance score;
3. Compute the document's subjectivity score from the number of sentiment-bearing words (sentiment words) that co-occur with query words within a set distance in the document;
4. Merge the document's relevance score and subjectivity score with a quadratic function (that is, multiply them) to obtain the document's final score;
5. The retrieval system sorts all candidate documents by final score and presents them to the user in descending order of score.
The invention is characterized in that:
The method is carried out in a computer-based information retrieval system through the following steps in order:
Step (1). Build Chinese and English sentiment word lists in the opinion search server of the system, where each sentiment word has at least one of the following attributes: good (好), desire (良), beautiful (美), great (大), bad (坏), undesired (莠), and fake (伪);
Step (2). The user enters at least one query word w_i. Through information retrieval, the system automatically selects every document d that contains any query word w_i of the user query to form the candidate document set (a candidate document is referred to simply as document d), builds a document index, counts the total number N of indexed documents, and computes the average number of words per document, avdl. At the same time, for each query word w_i, it counts the number of documents d in the candidate document set in which w_i occurs, denoted df(w_i);
Step (3). Compute the relevance score ScoreI_rel(d, w_i) of each document d with respect to each query word w_i in the user query as follows:
$$\mathrm{ScoreI}_{rel}(d, w_i) = \ln\left(\frac{N - df(w_i) + 0.5}{df(w_i) + 0.5}\right) \times \frac{(k_1 + 1) \times c(w_i, d)}{k_1 (1 - b) + b \frac{l_d}{avdl} + c(w_i, d)} \times \frac{(k_3 + 1) \times c(w_i, q)}{k_3 + c(w_i, q)},$$
where: k_1 is a constant between 1.0 and 2.0,
c(w_i, d) is the total number of times word w_i occurs in document d,
c(w_i, q) is the total number of times word w_i occurs in the user query Q,
b is a constant between 0.0 and 1.0,
l_d is the length of document d, expressed as the total number of words in d,
k_3 is an integer constant between 0 and 1000;
Step (4). Sum the relevance scores of document d over all query words in the user query according to the following formula to obtain the relevance score of document d with respect to user query Q, where q is the set of the user's query words:
$$\mathrm{ScoreI}_{rel}(d, q) = \sum_{w_i \in q} \mathrm{ScoreI}_{rel}(d, w_i);$$
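The formulas of steps (3) and (4) can be sketched in Python as follows. This is a minimal sketch that follows the denominator as printed in the text, k_1(1 - b) + b·l_d/avdl + c(w_i, d); with k_1 = 1.0 this coincides with the common Okapi BM25 length normalization k_1((1 - b) + b·l_d/avdl). The function names, the token-list representation of documents, and the default parameter values (taken from the experiment settings reported later in the text) are assumptions for illustration.

```python
import math

def term_relevance(n_docs, df, tf_doc, tf_query, doc_len, avdl,
                   k1=1.0, b=0.75, k3=100):
    """Relevance of one document to one query word, as printed in step (3)."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    doc_part = ((k1 + 1) * tf_doc) / (k1 * (1 - b) + b * doc_len / avdl + tf_doc)
    query_part = ((k3 + 1) * tf_query) / (k3 + tf_query)
    return idf * doc_part * query_part

def relevance(doc_tokens, query_tokens, n_docs, df, avdl, **params):
    """Step (4): sum the per-word scores over the set of query words."""
    score = 0.0
    for w in set(query_tokens):
        score += term_relevance(
            n_docs,
            df.get(w, 0),               # document frequency of w
            doc_tokens.count(w),        # c(w, d)
            query_tokens.count(w),      # c(w, q)
            len(doc_tokens),            # l_d
            avdl,
            **params,
        )
    return score
```

A query word that does not occur in the document contributes zero, because c(w_i, d) appears as a factor in the numerator.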
Step (5). Compute the subjectivity score ScoreI_op(d, q) of each document d as follows, where q is the set of the user's query words:
Figure C20081011866800072
if λ ≠ 0,
where: λ is a constant between 0.0 and 1.0,
s_i ∈ d is each sentiment word in document d,
co(s_i, w_i | W) is the total number of times, over all positions, that sentiment word s_i occurs in document d together with any query word w_i of user query Q at a positional distance of less than W from that query word, with W = 25 measured in words, and c(w_i, d) is the total number of times query word w_i occurs in document d;
Step (6). Compute the final score of each document as ScoreI_rel(d, q) × ScoreI_op(d, q);
Step (7). In the system, sort all candidate documents in descending order of final score to obtain the opinion search result list, and return it to the user as the final retrieval result.
The invention takes full account of the connection between a document's subjectivity score and its relevance score and merges the two through a quadratic function (that is, by multiplication). It also accounts for the difference in scale between the two scores by taking the logarithm of the subjectivity score, which yields an optimized result list for the opinion search system. For a user's opinion search need, documents that both express sentiment and are closely related to the query content are therefore ranked first and returned to the user. The method is simple to process and has low algorithmic complexity. On test data it improved opinion search performance by more than 8% over the commonly used linear weighting method (Table 1). This indicates that the invention generalizes and adapts well, can effectively improve a search engine's opinion search results, and has good application prospects.
Brief description of the drawings
Fig. 1. Basic workflow of opinion search in an information retrieval system.
Fig. 2. Flow of the opinion search scoring method proposed by the invention.
Detailed description of the embodiments
The method is carried out automatically on a computer and comprises the following steps in order:
Step 1: Generate the sentiment word list and the candidate document set
The sentiment word list contains every sentiment-bearing word the system will process, such as the Chinese words for "good", "bad", and "disappointing", and the English words "good" and "bad". Here the words in HowNet are screened automatically by attribute: if a word's attribute definition in HowNet contains at least one of "good|好", "desire|良", "beautiful|美", "great|大", "bad|坏", "undesired|莠", and "fake|伪", the word and its corresponding English description are selected and added to the Chinese and English sentiment word lists respectively.
For a query entered by the user (which may contain several query words), the retrieval system automatically selects every document that contains any of the query words as the candidate document set. All later operations are carried out within this candidate set; other documents are not considered for the current query.
Step 2: Compute the relevance score of a document with respect to the user query
A relevance computation commonly used in existing information retrieval systems can be used to obtain the relevance score of each candidate document with respect to the query. The formula used is:
$$\mathrm{ScoreI}_{rel}(d, q) = \sum_{w_i \in q} \left[ \ln\left(\frac{N - df(w_i) + 0.5}{df(w_i) + 0.5}\right) \times \frac{(k_1 + 1) \times c(w_i, d)}{k_1 (1 - b) + b \frac{l_d}{avdl} + c(w_i, d)} \times \frac{(k_3 + 1) \times c(w_i, q)}{k_3 + c(w_i, q)} \right]$$
(formula 1)
where: d is a document after step 1; q is the user query after step 1; ScoreI_rel(d, q) is the relevance score of document d with respect to query q; w_i ∈ q is each word in the user query; ln() takes the natural logarithm of its argument; N is the total number of documents in the system; df(w_i) is the total number of documents in the system that contain word w_i; k_1 is a constant between 1.0 and 2.0; c(w_i, d) is the total number of times word w_i occurs in document d; b is a constant between 0.0 and 1.0; l_d is the length of document d (the total number of words in d); avdl is the average length of all documents in the system; k_3 is an integer constant between 0 and 1000; c(w_i, q) is the number of times the word occurs in query q. Of these parameters, k_1, b, and k_3 are constants set manually; all other values can be obtained automatically from statistics over the user query and the document collection.
Step 3: Compute the subjectivity score of a document
Compute the subjectivity score of every candidate document in the system with the following formula:
Figure C20081011866800091
if λ ≠ 0, (formula 2)
where: d is a document after step 1; q is the user query after step 1; ScoreI_op(d, q) is the subjectivity score of document d with respect to query q; λ is a constant between 0.0 and 1.0; s_i ∈ d is each sentiment word in document d (the set of sentiment words is given by the sentiment word list from step 1); log() takes the logarithm of its argument; co(s_i, w_i | W) is the total number of times, over all positions, that sentiment word s_i occurs in document d together with any query word w_i of user query Q at a positional distance of less than W from that query word; W is the distance (also called the window) and is a positive integer; c(w_i, d) is the total number of times query word w_i occurs in document d. Of these parameters, λ and W are constants set manually; all other values can be obtained automatically from statistics over the user query, the sentiment word list, and the document collection.
Step 4: Compute the final score of a document
Multiply the relevance score from step 2 by the subjectivity score from step 3, that is:
$$\mathrm{ScoreI}_{rel}(d, q) \times \mathrm{ScoreI}_{op}(d, q),$$
(formula 3)
This gives the final opinion search score of candidate document d with respect to user query Q.
Step 5: Obtain the final opinion search result list
In the retrieval system, sort all candidate documents in descending order of final score to obtain the opinion search result list, and return it to the user as the final retrieval result.
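Steps 4 and 5 amount to a multiplication followed by a descending sort, which can be sketched as below. The function name `rank_by_final_score`, the dictionary layout, and the sample scores are assumptions for illustration; the two input scores are taken as already computed.

```python
def rank_by_final_score(scores):
    """scores maps doc_id -> (relevance score, subjectivity score).

    Returns document ids ordered by the product of the two scores,
    largest first, as in formula 3 and step 5.
    """
    final = {doc_id: rel * op for doc_id, (rel, op) in scores.items()}
    return sorted(final, key=final.get, reverse=True)

# Hypothetical scores for three candidate documents
example = {"a": (2.0, 1.0), "b": (1.5, 1.5), "c": (1.0, 1.2)}
ranking = rank_by_final_score(example)
```

In this toy example document "b" has a lower relevance score than "a" but a higher subjectivity score, and its product (2.25) places it ahead of "a" (2.0).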
To verify the validity, reliability, and applicability of the invention, we designed and ran verification experiments.
As the data source we used the standard test data provided by the Text REtrieval Conference (TREC), organized by the US National Institute of Standards and Technology (NIST): a collection of English blog pages from the internet, 100 user queries, and the answer set for each query (obtained through manual annotation organized by NIST).
The experiments evaluate performance with mean average precision (MAP), a measure commonly used in information retrieval.
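For reference, mean average precision can be computed as sketched below. The sketch uses the standard textbook definition (precision at each rank where a relevant document appears, summed and divided by the number of relevant documents, then averaged over queries). The function names and toy data are assumptions, and the official TREC evaluation tooling is not reproduced here.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank       # precision at this relevant rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs is a list of (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two hypothetical queries
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
map_value = mean_average_precision(runs)
```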
Opinion search experiments were run on this data set following the steps of the invention. Table 1 lists the improvement in retrieval performance obtained with the scoring method of the invention compared with the linear weighting method in common use. That linear weighting method is:
$$\lambda\, \mathrm{ScoreI}_{op}(d, q) + (1 - \lambda) \frac{\sum_{w_i \in q} co(s_i, w_i | W)}{\sum_{w_i \in q} c(w_i, d) \times W},$$
The meaning and computation of each parameter are the same as for the corresponding parameters in formula 1 and formula 2 of the invention. The manually set parameters used are k_1 = 1.0, b = 0.75, k_3 = 100, and W = 25. The values of λ are shown in Table 1. The performance improvement is substantial: above 8% in every case and as high as 18.60%.
Table 1. Retrieval performance of the scoring method of the invention compared with the commonly used linear weighting method

| Value of λ | Retrieval performance of the common linear weighting method | Retrieval performance of the scoring method of the invention | Performance improvement from the invention |
| --- | --- | --- | --- |
| 0.01 | 0.1969 | 0.2253 | 14.42% |
| 0.1 | 0.2041 | 0.2255 | 10.49% |
| 0.2 | 0.2071 | 0.2256 | 8.93% |
| 0.3 | 0.2081 | 0.2257 | 8.46% |
| 0.4 | 0.2087 | 0.2257 | 8.15% |
| 0.5 | 0.2067 | 0.2259 | 9.29% |
| 0.6 | 0.2038 | 0.2266 | 11.19% |
| 0.7 | 0.1993 | 0.2267 | 13.75% |
| 0.8 | 0.1938 | 0.2255 | 16.36% |
| 0.9 | 0.1866 | 0.2213 | 18.60% |
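The improvement column of Table 1 is the relative gain over the linear weighting baseline, which can be rechecked with a few lines of Python. The helper name `relative_gain` is an assumption; the numbers are copied from the table.

```python
# (lambda, baseline MAP, MAP of the invention), copied from Table 1
TABLE_1 = [
    (0.01, 0.1969, 0.2253),
    (0.1, 0.2041, 0.2255),
    (0.2, 0.2071, 0.2256),
    (0.3, 0.2081, 0.2257),
    (0.4, 0.2087, 0.2257),
    (0.5, 0.2067, 0.2259),
    (0.6, 0.2038, 0.2266),
    (0.7, 0.1993, 0.2267),
    (0.8, 0.1938, 0.2255),
    (0.9, 0.1866, 0.2213),
]

def relative_gain(baseline, proposed):
    """Relative improvement of proposed over baseline, in percent."""
    return (proposed - baseline) / baseline * 100

gains = [round(relative_gain(base, new), 2) for _, base, new in TABLE_1]
```

The recomputed values agree with the percentages listed in the table to two decimal places.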
For example, for the user query "Oprah" (Oprah Winfrey hosts a US television talk show), the user hopes to find comments and opinions about her program. The traditional method, however, finds a large number of program listings and objective introductions and promotional content, so the MAP for this query is only 0.0687 and only 2 of the top 10 results returned carry opinions. With the scoring method proposed by the invention, MAP rises to 0.2721 and 8 of the top 10 results are the opinions and comments the user needs, about 4 times the performance of the traditional method.
As another example, the user query "tivo" (a digital video recorder) seeks people's evaluations of this brand. The traditional method finds a large number of product introductions for the brand, including objective numeric information such as models and dimensions, but no opinions on whether the brand is good or bad, so only 1 of the top 10 results for this query is what the user needs. The method proposed by the invention ranks documents containing other users' experiences, impressions, and product evaluations first, so that 9 of the top 10 results are relevant, opinion-bearing information the user needs, which greatly improves user satisfaction.
Figure 1 shows the basic workflow of opinion search in an information retrieval system:
1. First, preprocess the collected raw data. Remove words that are too common in the documents (called stop words); the stop word list can be defined freely as needed. For Chinese content, segment the documents with a word segmentation technique (the forward maximum matching segmentation algorithm), taking the word (including single-character words) as the basic unit of a document; all characters within a word are treated as one unit.
2. Then build an index over the document content using inverted index techniques.
3. Preprocess the query submitted by the user with the same preprocessing applied to documents (stop word removal, word segmentation).
4. Finally, according to the proposed method, use the sentiment word list already built to match the processed query against the indexed documents in the opinion search server, obtain an opinion search score for every document, sort the documents to form the results page, and return it to the user.
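The forward maximum matching segmentation and stop word removal named in step 1 of this workflow can be sketched as follows. The vocabulary, the stop word set, the `max_len` limit, and both function names are assumptions for illustration; a real system would load a full dictionary.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

def preprocess(text, vocab, stop_words):
    """Segment the text, then drop stop words."""
    return [t for t in forward_max_match(text, vocab) if t not in stop_words]

# Hypothetical dictionary and stop word list
vocab = {"手机", "价格", "便宜"}
stop_words = {"的"}
tokens = preprocess("手机的价格便宜", vocab, stop_words)
```

The same `preprocess` call would be applied to both documents and queries, as step 3 requires.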
The invention addresses step 4 of the above workflow, namely how to build the sentiment word list and how to score documents in the opinion search server. Figure 2 shows the flow of the invention. A detailed description of how to implement the invention in an opinion search system follows.
1. The system finds the candidate document set for a query submitted by the user
Preprocess the documents in the system, including removing stop words and segmenting Chinese documents, and build the document index. Apply exactly the same preprocessing to the query the user submits to the system.
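A positional inverted index over the preprocessed documents can be sketched as below. The text does not specify the index layout, so the structure here (term to document id to list of positions) and the function name are assumptions; positions are kept because the later co-occurrence counting needs them.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps doc_id -> list of tokens.

    Returns term -> {doc_id: [positions of the term in that document]}.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            index[tok].setdefault(doc_id, []).append(pos)
    return index

# Hypothetical preprocessed documents
docs = {
    1: ["tivo", "is", "great", "tivo"],
    2: ["great", "phone"],
}
index = build_inverted_index(docs)
```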
First count the total number of indexed documents in the system; this is the value of N in formula (1). Compute the average number of words per document in the system (the total number of words in all documents divided by the total number of documents); this is the value of avdl in formula (1).
After the user submits a query Q (containing one or more words), select from all indexed documents (each also containing one or more words) those in which at least one query word of Q occurs; these form the candidate document set. Documents that contain no query word are not considered further in this retrieval. For each word w_i in query Q, count the number of documents in the candidate set in which the word occurs; this is the value of df(w_i) in formula (1).
Set the values of the constants used in the system, for example: k_1 = 1.0, b = 0.75, k_3 = 100, W = 25, λ = 0.8.
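The statistics described in this procedure (N, avdl, the candidate set, and df(w_i)) can be gathered as in the sketch below. It works directly on token lists to stay self-contained; the function name and sample documents are assumptions for illustration.

```python
def collection_stats(docs, query):
    """docs maps doc_id -> list of tokens; query is a list of query words.

    Returns (N, avdl, candidate ids, df per query word).
    """
    n_docs = len(docs)                                        # N
    avdl = sum(len(toks) for toks in docs.values()) / n_docs  # average length
    query_terms = set(query)
    # Candidate set: documents containing at least one query word
    candidates = {d for d, toks in docs.items() if query_terms & set(toks)}
    # df(w): number of candidate documents in which w occurs
    df = {w: sum(1 for d in candidates if w in docs[d]) for w in query_terms}
    return n_docs, avdl, candidates, df

# Hypothetical preprocessed collection
docs = {
    "d1": ["n95", "phone", "good"],
    "d2": ["phone", "cheap"],
    "d3": ["weather", "today", "sunny", "warm"],
}
n_docs, avdl, candidates, df = collection_stats(docs, ["n95", "phone"])
```

Counting df over the candidate set gives the same value as counting over the whole collection, since every document that contains a query word is a candidate.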
2. Generate the sentiment word list with the following procedure.
Initialize the Chinese sentiment lexicon S_CN = { };
Initialize the English sentiment lexicon S_EN = { };
For each word in HowNet:
    If the value of its attribute (DEF) contains at least one of "good|好", "desire|良", "beautiful|美", "great|大":
        add the word's Chinese description (the value of W_C) to the Chinese sentiment lexicon S_CN as a record;
        add the word's English description (the value of W_X) to the English sentiment lexicon S_EN as a record;
    If the value of its attribute (DEF) contains at least one of "bad|坏", "undesired|莠", "fake|伪":
        add the word's Chinese description (the value of W_C) to the Chinese sentiment lexicon S_CN as a record;
        add the word's English description (the value of W_X) to the English sentiment lexicon S_EN as a record;
For each record in the English sentiment lexicon S_EN:
    If the record consists of several words:
        add each word in the record to S_EN as a separate record;
Remove duplicate records from the English sentiment lexicon S_EN.
In the lists obtained with this procedure, the Chinese list has [number missing] words and the English list has 4621 sentiment words.
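The lexicon procedure above can be sketched in Python. The record layout is an assumption: each entry is taken to be a dictionary with the fields W_C, W_X, and DEF named in the text, and DEF is assumed to be a comma-separated list of "english|chinese" items. This is not a reader for the actual HowNet file format.

```python
# English halves of the attribute labels listed in the text
SENTIMENT_LABELS = {"good", "desire", "beautiful", "great",
                    "bad", "undesired", "fake"}

def has_sentiment(definition):
    """True if any item of DEF carries one of the sentiment labels."""
    labels = {item.split("|")[0].strip() for item in definition.split(",")}
    return bool(labels & SENTIMENT_LABELS)

def build_lexicons(entries):
    """Return (Chinese lexicon, English lexicon) as sets, so duplicates vanish."""
    chinese, english = set(), set()
    for entry in entries:
        if has_sentiment(entry["DEF"]):
            chinese.add(entry["W_C"])
            english.add(entry["W_X"])
    # Multi-word English records also contribute each of their words
    for phrase in list(english):
        parts = phrase.split()
        if len(parts) > 1:
            english.update(parts)
    return chinese, english

# Hypothetical entries
entries = [
    {"W_C": "好", "W_X": "good", "DEF": "aValue|属性值,good|好"},
    {"W_C": "失望", "W_X": "let down", "DEF": "bad|坏"},
    {"W_C": "桌子", "W_X": "table", "DEF": "furniture|家具"},
]
s_cn, s_en = build_lexicons(entries)
```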
3. Compute the relevance score of every document
Perform the following operations on every document d in the candidate document set:
(1) Count the total number of words in document d; this is the value of l_d in formula (1). For each word w_i in query Q, count the number of times the word occurs in document d, which is the value of c(w_i, d) in formula (1), and the number of times it occurs in query Q, which is the value of c(w_i, q) in formula (1);
(2) Substitute the values of N, avdl, df(w_i), k_1, b, k_3, l_d, c(w_i, d), and c(w_i, q) obtained in the steps above into formula (1) and compute:
$$\ln\left(\frac{N - df(w_i) + 0.5}{df(w_i) + 0.5}\right) \times \frac{(k_1 + 1) \times c(w_i, d)}{k_1 (1 - b) + b \frac{l_d}{avdl} + c(w_i, d)} \times \frac{(k_3 + 1) \times c(w_i, q)}{k_3 + c(w_i, q)},$$
This gives the relevance score of document d with respect to one query word w_i.
(3) Sum the relevance scores of document d over all query words in Q to obtain the relevance score of document d with respect to query Q.
4. calculate the subjective and objective property scoring of every piece of document
If the constant λ in the system is set at 0, the subjective and objective property scoring that then directly obtains document is 1.Otherwise according to the value (being designated as the value of W in the formula (2)) of the constant W that sets in the system, the subjective and objective property scoring of calculating every piece of document.Here be example with W value 25, illustrate that its flow process is as follows:
(1) total degree that occurs in document of all query words among the statistical query Q at first, note is done in the formula (2)
Figure C20081011866800132
Value.Write down the position l that these query words occur simultaneously in document j
(2) For each sentiment word $s_i$ in the sentiment word list, perform the following:
(2.1) For each position $l_j$ in turn, count the number of times the sentiment word occurs within the 25 words before and the 25 words after that position, then add up the counts from all positions. This gives the total number of times $s_i$ co-occurs with the query Q in document d, recorded as the value of $\sum_{w_i \in q} co(s_i, w_i \mid W)$ in formula (2).
(2.2) Substitute the two sums obtained in the preceding steps, together with W, into the following expression from formula (2) to obtain the score of document d with respect to each sentiment word $s_i$:
$$\log\left(\frac{\sum_{w_i \in q} co(s_i, w_i \mid W)}{\sum_{w_i \in q} c(w_i, d) \times W} + 1\right);$$
(3) Sum the scores over all sentiment words, multiply the sum by the coefficient from formula (2) (rendered in the source only as image C20081011866800136), and finally add 1 to obtain the subjectivity/objectivity score of document d.
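The window-based counting of flow 4 can be sketched as follows. The sketch rests on several assumptions: the multiplying coefficient survives in this text only as an image, so it is passed in as the hypothetical parameter `coeff` (taken to be 0 when λ is 0, which yields the stated score of 1); the logarithm is taken as the natural logarithm; and the function name and token-list inputs are illustrative only.

```python
import math


def subjectivity_score(doc_tokens, query_words, sentiment_words, coeff, W=25):
    """1 + coeff * sum over s_i of log(co(s_i) / (query_hits * W) + 1).

    coeff is a stand-in for the coefficient of formula (2), which is not
    recoverable from the text; W is the window size in words.
    """
    query_words = set(query_words)
    # Positions l_j at which any query word occurs in the document.
    positions = [j for j, tok in enumerate(doc_tokens) if tok in query_words]
    if coeff == 0 or not positions:
        return 1.0
    total = 0.0
    for s in sentiment_words:
        co = 0
        for j in positions:
            before = doc_tokens[max(0, j - W):j]   # up to W words before l_j
            after = doc_tokens[j + 1:j + 1 + W]    # up to W words after l_j
            co += before.count(s) + after.count(s)
        total += math.log(co / (len(positions) * W) + 1)
    return 1.0 + coeff * total
```

A sentiment word that never falls inside any window contributes log(1) = 0, so only sentiment words near query words raise the score.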
5. Calculate the final score of each document
Multiply the relevance score of each document obtained in flow 3 by its subjectivity/objectivity score obtained in flow 4 to obtain the final score of the document in the opinion retrieval system.
6. Sort all documents in the system and return the final result to the user
Sort all documents in the candidate set by final score and generate the result list in descending order of score, so that the higher a document's final score, the nearer the top of the result list it appears. Finally, return this list to the user, which completes one full retrieval for the query the user entered.
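Flows 5 and 6 amount to a product followed by a descending sort. A minimal sketch, assuming the two scores are held in dictionaries keyed by a document identifier (the function name and this data layout are illustrative, not taken from the text):

```python
def rank_documents(rel_scores, op_scores):
    """Return (doc_id, final_score) pairs, highest final score first.

    rel_scores and op_scores map each candidate document id to its
    relevance score and its subjectivity/objectivity score.
    """
    final = {d: rel_scores[d] * op_scores[d] for d in rel_scores}
    return sorted(final.items(), key=lambda item: item[1], reverse=True)
```

Because the subjectivity/objectivity score is 1 when λ is 0, the ranking in that case reduces to a ranking by relevance alone.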
Following the steps above, an effective opinion retrieval system that a computer executes automatically can be built. The system scores each document by jointly considering its relevance and its subjectivity/objectivity, so that documents that are both relevant to the user's query and strongly opinionated are ranked near the top of the returned result list and are more likely to be seen first by the user. This improves the results returned by the opinion retrieval system and therefore its retrieval performance.

Claims (2)

1. A document scoring method for information retrieval based on opinion retrieval, characterized in that the method is implemented in a computer information retrieval system by the following steps in order:
Step (1). Build a Chinese-English bilingual sentiment word list in the opinion retrieval server of the system, where each sentiment word has at least one of the following attributes: good, good, beautiful, big, bad, green bristlegrass and puppet;
Step (2). The user enters at least one query word $w_i$. Through information retrieval, the system automatically selects all documents d that contain any query word $w_i$ of the user query as the candidate document set (each candidate document is referred to simply as document d), builds a document index, counts the total number $N$ of indexed documents, and calculates the average number of words per document, $avdl$. At the same time, for each query word $w_i$ of the user, it counts the number of documents d in the candidate document set in which $w_i$ occurs, denoted by the variable $df(w_i)$;
Step (3). Calculate the relevance score $ScoreI_{rel}(d, w_i)$ of each document d with respect to each query word $w_i$ in the user query as follows:
$$ScoreI_{rel}(d, w_i) = \ln\left(\frac{N - df(w_i) + 0.5}{df(w_i) + 0.5}\right) \times \frac{(k_1 + 1) \times c(w_i, d)}{k_1 (1 - b) + b \frac{l_d}{avdl} + c(w_i, d)} \times \frac{(k_3 + 1) \times c(w_i, q)}{k_3 + c(w_i, q)},$$
where: $k_1$ is a constant between 1.0 and 2.0,
$c(w_i, d)$ is the total number of times word $w_i$ occurs in document d,
$c(w_i, q)$ is the total number of times word $w_i$ occurs in the user query Q,
$b$ is a constant between 0.0 and 1.0,
$l_d$ is the length of document d, expressed as the total number of words in document d,
$k_3$ is an integer constant between 0 and 1000;
Step (4). Add up the relevance scores of each document d with respect to all query words in the user query according to the following formula to obtain the relevance score of document d with respect to the user query Q, where q is the user's set of query words:
$$ScoreI_{rel}(d, q) = \sum_{w_i \in q} ScoreI_{rel}(d, w_i);$$
Step (5). Calculate the subjectivity/objectivity score $ScoreI_{op}(d, q)$ of each document d as follows, where q is the user's set of query words:
Figure C2008101186680002C3
where: λ is a constant between 0.0 and 1.0,
$s_i \in d$ is each sentiment word in document d,
$co(s_i, w_i \mid W)$ is the total number of occurrences, over all positions, of the sentiment word $s_i$ that appears in document d together with any query word $w_i$ of the user query Q at a distance of less than W from that query word, with W = 25 measured in words,
$c(w_i, d)$ is the total number of times query word $w_i$ occurs in document d;
Step (6). Calculate the final score of each document, expressed as $ScoreI_{rel}(d, q) \times ScoreI_{op}(d, q)$;
Step (7). In the system, sort all candidate documents in descending order of their final scores to obtain the opinion retrieval result list, and return it to the user as the final retrieval result.
2. The document scoring method for information retrieval based on opinion retrieval according to claim 1, characterized in that, before step (1), there is a preprocessing step applied to the collected raw information that carries sentiment, comprising: removing stop words, segmenting the documents into words, and building an index over the document content using inverted-index technology; the query submitted by the user is preprocessed with the same stop-word removal and word segmentation methods as the documents.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2008101186680A CN100570611C (en) 2008-08-22 2008-08-22 A kind of methods of marking of the information retrieval document based on viewpoint searching

Publications (2)

Publication Number Publication Date
CN101344890A CN101344890A (en) 2009-01-14
CN100570611C true CN100570611C (en) 2009-12-16

Family

ID=40246893

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2008101186680A Active CN100570611C (en) 2008-08-22 2008-08-22 A kind of methods of marking of the information retrieval document based on viewpoint searching

Country Status (1)

Country Link
CN (1) CN100570611C (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20090626

Address after: P.O. Box 100084-82, Beijing; postal code: 100084

Applicant after: Tsinghua University

Co-applicant after: Beijing Sogou Technology Development Co Ltd

Address before: P.O. Box 100084-82, Beijing; postal code: 100084

Applicant before: Tsinghua University

C14 Grant of patent or utility model
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Zhang Min

Inventor after: Sun Maosong

Inventor after: Ma Shaoping

Inventor after: Hong Richang

Inventor after: Ru Liyun

Inventor after: Tong Zijian

Inventor before: Zhang Min

Inventor before: Ma Shaoping

Inventor before: Ru Liyun

Inventor before: Tong Zijian

COR Change of bibliographic data