CN100570611C

CN100570611C - Method for scoring information retrieval documents based on opinion retrieval

Info

Publication number: CN100570611C
Application number: CNB2008101186680A
Authority: CN
Inventors: 张敏; 马少平; 茹立云; 佟子健
Original assignee: Tsinghua University; Beijing Sogou Technology Development Co Ltd
Current assignee: Tsinghua University; Beijing Sogou Technology Development Co Ltd
Priority date: 2008-08-22
Filing date: 2008-08-22
Publication date: 2009-12-16
Anticipated expiration: 2028-08-22
Also published as: CN101344890A

Abstract

A method for scoring information retrieval documents based on opinion retrieval, belonging to the field of information processing. The method first builds a sentiment word list that specifies every sentiment-bearing word to be used in the retrieval system, and then generates a candidate result set from the query the user enters. Next, the system computes the relevance between each document and the user query to obtain a relevance score for every document. It then computes a subjectivity score for every document from the number of times sentiment words co-occur with query words within a given distance in the document. The relevance score and the subjectivity score of a document are merged by a quadratic combination (that is, by multiplication) to give the document's final score. Finally, the retrieval system sorts all candidate documents by final score and presents them to the user in descending order of score. The technique is carried out automatically by computer and returns results that are both highly relevant and strongly opinionated.

Description

Method for scoring information retrieval documents based on opinion retrieval

Technical field

The invention belongs to the field of information processing and relates in particular to information retrieval systems. Specifically, it is a method for scoring documents in an information retrieval system so that the final results are both relevant to the user query and carry subjective opinion.

Background art

An information retrieval system is a computer system that acquires information according to some strategy (for example, web documents on the internet or digital documents in a digital library), organizes and processes that information, and then provides retrieval services to users. It consists of two parts: computer hardware and the software programs that run on it. Its main purpose is to help users obtain, quickly and efficiently, the useful information that meets their needs.

An information retrieval system interacts with users through a query server. On one hand, the query server gives the user a page on which to enter or select a query, which usually consists of one or several keywords. On the other hand, for each submitted query the query server searches all documents in the system, returns a ranked list of relevant results, and displays them in order on a results page. Documents are ranked by the relevance score between the document and the user query: the higher the score, the higher the rank. A key technique in the query server is therefore how to compute document relevance and obtain a relevance score for ranking, so that the pages the user wants appear near the top of the results and relevant information is easier to reach.

As information retrieval systems have developed, users have begun to pose more complex information needs, and one important kind is opinion retrieval (opinion search). In opinion retrieval the system must not only find information relevant to the user query; that information must also carry sentiment. In other words, the retrieval system must return information that is both relevant and subjective, while relevant information that is purely objective description is considered not to meet the user's need. For example, a user who wants to buy a mobile phone enters the query "N95 mobile phone" for opinion retrieval and hopes to get other users' opinions or reviews on the performance and functions of the N95: whether the price is low, whether it looks good, whether the battery lasts, whether the overall assessment is good, and so on. Under this requirement the query server must compute not only a relevance score for each document but also a subjectivity score, then merge the two scores into a final score for the document, and return the documents to the user after ranking.

Since the mid-1960s, many models have been proposed for computing the relevance between documents and user queries. The main idea is "TF*IDF": on one hand, consider how many times a query word occurs in a document (the more occurrences, the more likely the document is relevant); on the other hand, consider how common the query word is across all documents (the more common it is, the weaker its discriminating power). Widely used models today include the Boolean model, statistical models, and linguistic and knowledge-based models.

To compute the subjectivity score of a document, one common approach is to count how many words carrying subjective sentiment (usually called sentiment words) occur in the document: the more sentiment words occur, the higher the subjectivity score. The sentiment words are usually supplied by a sentiment word list built in advance, either manually or automatically. Another approach uses text classification: a document is classified as subjective or objective, and its subjectivity score is derived from the degree to which it belongs to a class.

In opinion retrieval, how the subjectivity score and the relevance score are merged is a key factor in system performance. For lack of in-depth study, existing techniques do not model the connection between the two scores. The common method is simply to add the two scores in a fixed proportion (for example, the subjectivity score is multiplied by A, the relevance score is multiplied by B, and the two products are added, where A and B are values given in advance). Many experiments show that this method does not satisfy users' opinion retrieval needs well. In many cases the merged ranking even performs worse than the ranking obtained from the relevance score alone. It is therefore necessary to propose a method that merges the relevance and subjectivity scores effectively, in order to improve retrieval performance and meet the needs of users who carry out opinion retrieval.
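The conventional weighted linear merge described in this paragraph can be sketched as follows. This is a minimal illustration rather than code from the patent: the function name `linear_merge`, the example scores, and the example weights are all hypothetical.

```python
def linear_merge(rel_score: float, subj_score: float, a: float, b: float) -> float:
    """Conventional merge: A * subjectivity score + B * relevance score."""
    return a * subj_score + b * rel_score


# Hypothetical scores for one document and hypothetical preset weights A and B.
merged = linear_merge(rel_score=4.0, subj_score=0.5, a=0.3, b=0.7)
print(merged)
```

Because A and B are fixed in advance, the contribution of the subjectivity score does not depend on how relevant the document is, which is the weakness the paragraph points out.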

Summary of the invention

An opinion retrieval system must return information that is relevant to the user query and carries subjective opinion. The invention proposes a scoring method that takes both the subjectivity and the relevance of a document into account. Unlike the traditional weighted linear merge (each score is multiplied by a constant weight and the products are added), the proposed merge takes into account the influence of relevance on subjectivity and combines the relevance score and the subjectivity score quadratically (that is, by multiplication). Because the two scores have different value ranges, the subjectivity score is normalized by taking a logarithm (log). With this merge, the top-ranked documents in the final result list are both highly relevant and strongly subjective. This effectively improves the performance of the opinion retrieval system and lets the user reach the needed information on the earliest result pages, so it better meets the needs of opinion retrieval.

The method consists of the following:

1. First build a sentiment word list that specifies every sentiment-bearing word to be used in the retrieval system; then, for the query the user enters, the system automatically finds the candidate document set;

2. In the retrieval system, compute the degree of relevance between every document and the query entered by the user to obtain the relevance score of the document;

3. Compute the subjectivity score of a document from the number of sentiment-bearing words (sentiment words) that co-occur with query words within a set distance in the document;

4. Merge the relevance score and the subjectivity score of a document by a quadratic combination (that is, by multiplication) to obtain the final score of the document;

5. The retrieval system sorts all candidate documents by final score and presents them to the user in descending order of score.

The invention is characterized in that the method is carried out in a computer information retrieval system in the following steps, in order:

Step (1). Build a Chinese-English sentiment word list in the opinion retrieval server of the system. Each sentiment word has at least one of the following attributes: good (好), desired (良), beautiful (美), great (大), bad (坏), undesired (莠), and fake (伪);

Step (2). The user enters at least one query word w_i. Through information retrieval, the system automatically picks out every document d that contains any query word w_i of the user query; these documents form the candidate document set, and a candidate document is called document d for short. The system builds a document index, counts the total number N of indexed documents, and computes the average number of words per document, avdl. In addition, for each query word w_i, it counts the number of documents d in the candidate document set in which w_i occurs, denoted df(w_i);

Step (3). Compute the relevance score ScoreI_{rel}(d, w_i) of each document d for each query word w_i of the user query as follows:

ScoreI_{rel}(d, w_i) = \ln\left(\frac{N - df(w_i) + 0.5}{df(w_i) + 0.5}\right) \times \frac{(k_1 + 1) \times c(w_i, d)}{k_1 (1 - b) + b \frac{l_d}{avdl} + c(w_i, d)} \times \frac{(k_3 + 1) \times c(w_i, q)}{k_3 + c(w_i, q)},

where: k₁ is a constant between 1.0 and 2.0,

c(w_i, d) is the total number of times word w_i occurs in document d,

c(w_i, q) is the total number of times word w_i occurs in user query Q,

b is a constant between 0.0 and 1.0,

l_d is the length of document d, expressed as the total number of words in d,

k₃ is an integer constant between 0 and 1000;

Step (4). Add up the relevance scores of document d for all query words of the user query, according to the following formula, to obtain the relevance score of document d for user query Q, where q is the set of the user's query words:

ScoreI_{rel}(d, q) = \sum_{w_i \in q} ScoreI_{rel}(d, w_i);

Step (5). Compute the subjectivity score ScoreI_{op}(d, q) of every document d, where q is the set of the user's query words. The formula itself is missing from this text. According to the detailed flow in the embodiment, the score is 1 if λ = 0; if λ ≠ 0, it is 1 plus a coefficient (whose expression is also missing here) multiplied by

\sum_{s_i \in d} \log\left(\frac{\sum_{w_i \in q} co(s_i, w_i | W)}{\sum_{w_i \in q} c(w_i, d) \times W} + 1\right),

where: λ is a constant between 0.0 and 1.0,

s_i ∈ d is each sentiment word in document d,

co(s_i, w_i | W) is the total number of times, over all positions, that sentiment word s_i occurs in document d together with any query word w_i of user query Q at a position distance of less than W from that query word, with W = 25 measured in words, and c(w_i, d) is the total number of times query word w_i occurs in document d;

Step (6). Compute the final score of each document as ScoreI_{rel}(d, q) × ScoreI_{op}(d, q);

Step (7). In the system, sort all candidate documents in descending order of their final scores to obtain the opinion retrieval result list, and return it to the user as the final retrieval result.

The invention takes full account of the connection between the subjectivity score and the relevance score of a document and merges the two scores quadratically (that is, by multiplication). It also accounts for the difference between the two scores by taking the logarithm of the subjectivity score, which yields an optimized result list for the opinion retrieval system. For a user's opinion retrieval need, documents that both carry sentiment and are closely related to the query content are therefore ranked first and returned to the user. The method is simple to apply and has low algorithmic complexity. It achieved good results on test data and considerably improved the opinion retrieval performance of a search engine. This indicates that the invention generalizes and adapts well, can effectively improve the opinion retrieval results of search engines, and has good prospects for application.

Description of drawings

Fig. 1. Basic workflow of opinion retrieval in an information retrieval system.

Fig. 2. Flow of the opinion retrieval scoring method proposed by the invention.

Embodiment

The method is carried out automatically on a computer and comprises the following steps, in order:

Step 1: Generate the sentiment word list and the candidate document set

The sentiment word list contains every sentiment-bearing word that the system handles, such as the Chinese words for "good", "bad", and "disappointing", and the English words "good" and "bad". Here the words in HowNet are screened automatically by attribute: if the attribute definition of a word in HowNet contains at least one of "good|好", "desired|良", "beautiful|美", "great|大", "bad|坏", "undesired|莠", and "fake|伪", the word and its corresponding English description are picked out and added to the Chinese and the English sentiment word list, respectively.

For one query entered by the user (which may contain several query words), the retrieval system automatically picks out every document that contains any query word of the query and uses these documents as the candidate document set. All later operations are carried out within this candidate document set; other documents are not considered for the current query.

Step 2: Compute the relevance score of a document for the user query

The relevance score of each candidate document for the query can be obtained with a relevance computation method commonly used in existing information retrieval systems. The formula used is:

ScoreI_{rel}(d, q) = \sum_{w_i \in q} \left[ \ln\left(\frac{N - df(w_i) + 0.5}{df(w_i) + 0.5}\right) \times \frac{(k_1 + 1) \times c(w_i, d)}{k_1 (1 - b) + b \frac{l_d}{avdl} + c(w_i, d)} \times \frac{(k_3 + 1) \times c(w_i, q)}{k_3 + c(w_i, q)} \right]

(formula 1)

where: d is a document after the processing of step 1; q is the user query after the processing of step 1; ScoreI_{rel}(d, q) is the relevance score of document d for query q; w_i ∈ q is each word of the user query; ln() takes the natural logarithm of its argument; N is the total number of documents in the system; df(w_i) is the number of documents in the system that contain word w_i; k₁ is a constant between 1.0 and 2.0; c(w_i, d) is the total number of times word w_i occurs in document d; b is a constant between 0.0 and 1.0; l_d is the length of document d (the total number of words in d); avdl is the average length of all documents in the system; k₃ is an integer constant between 0 and 1000; c(w_i, q) is the number of times word w_i occurs in query q. Apart from k₁, b, and k₃, which are set manually, all of these values can be obtained automatically from statistics over the user query and the document collection.
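As an illustration of formula 1, the following Python sketch computes the per-word score and sums it over the query words. It is a sketch under stated assumptions, not the patent's implementation: the function names, argument names, and dictionary-based inputs are hypothetical. The denominator is coded exactly as the formula is printed in this text; for k₁ = 1.0 (the value used in the experiments) it coincides with the usual Okapi BM25 form k₁((1 − b) + b · l_d / avdl) + c(w_i, d).

```python
import math


def term_relevance(n_docs, df, tf_doc, tf_query, doc_len, avdl,
                   k1=1.0, b=0.75, k3=100):
    """Relevance of one query word for one document, following formula 1 as printed."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    doc_part = ((k1 + 1) * tf_doc) / (k1 * (1 - b) + b * doc_len / avdl + tf_doc)
    query_part = ((k3 + 1) * tf_query) / (k3 + tf_query)
    return idf * doc_part * query_part


def relevance_score(query_counts, doc_counts, doc_len, n_docs, df, avdl):
    """Sum the per-word scores over all query words.

    query_counts: {word: c(w, q)}, doc_counts: {word: c(w, d)}, df: {word: df(w)}
    (all three are hypothetical input shapes chosen for this sketch).
    """
    return sum(
        term_relevance(n_docs, df[w], doc_counts.get(w, 0), c_q, doc_len, avdl)
        for w, c_q in query_counts.items()
    )


# Hypothetical statistics: 1000 documents, the word "n95" occurs in 10 of them
# and 3 times in a document of average length.
print(relevance_score({"n95": 1}, {"n95": 3}, 100, 1000, {"n95": 10}, 100))
```

A query word that does not occur in the document contributes 0, because c(w_i, d) appears as a factor in the numerator.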

Step 3: Compute the subjectivity score of a document

The subjectivity score of every candidate document in the system is computed with formula 2. The formula itself is missing from this text. According to the detailed flow later in this description, the score is 1 if λ = 0; if λ ≠ 0, it is 1 plus a coefficient (whose expression is also missing here) multiplied by

\sum_{s_i \in d} \log\left(\frac{\sum_{w_i \in q} co(s_i, w_i | W)}{\sum_{w_i \in q} c(w_i, d) \times W} + 1\right)

(formula 2)

where: d is a document after the processing of step 1; q is the user query after the processing of step 1; ScoreI_{op}(d, q) is the subjectivity score of document d with respect to query q; λ is a constant between 0.0 and 1.0; s_i ∈ d is each sentiment word in document d (the set of sentiment words is given by the sentiment word list from step 1); log() takes the logarithm of its argument; co(s_i, w_i | W) is the total number of times, over all positions, that sentiment word s_i occurs in document d together with any query word w_i of user query Q at a position distance of less than W from that query word; W is the size of the distance (also called the window) and is a positive integer; c(w_i, d) is the total number of times query word w_i occurs in document d. Apart from λ and W, which are constants set manually, all of these values can be obtained automatically from statistics over the user query, the sentiment word list, and the document collection.
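The quantities defined above enter the score through one logarithmic term per sentiment word. The sketch below sums those terms for a single document, given counts that have already been collected. It makes several assumptions that are not in the source: the function name and input shapes are hypothetical, the logarithm is taken to be the natural logarithm (the text does not state the base), and the guard for a document without query words is added for safety. The coefficient and the final "+ 1" are left out because the coefficient is not given in this text.

```python
import math


def sentiment_log_sum(co_counts, query_occurrences, window=25):
    """Sum over sentiment words of log(co / (c * W) + 1).

    co_counts: {sentiment word: total windowed co-occurrences with query words}
    query_occurrences: total occurrences of all query words in the document
    window: the constant W
    """
    if query_occurrences == 0:
        # Assumption: a document without query words gets no sentiment evidence.
        return 0.0
    denom = query_occurrences * window
    return sum(math.log(co / denom + 1) for co in co_counts.values())


# Hypothetical counts: "good" co-occurs 5 times, "bad" never; 2 query-word occurrences.
print(sentiment_log_sum({"good": 5, "bad": 0}, query_occurrences=2))
```

A sentiment word that never co-occurs with a query word contributes log(1) = 0, so only sentiment expressed near the query words raises the score.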

Step 4: Compute the final score of a document

Multiply the relevance score obtained in step 2 by the subjectivity score obtained in step 3, that is:

ScoreI_{rel}(d, q) × ScoreI_{op}(d, q), (formula 3)

This gives the final opinion retrieval score of a candidate document d in the system with respect to user query Q.

Step 5: Obtain the final opinion retrieval result list

In the retrieval system, all candidate documents are sorted in descending order of their final scores. This gives the opinion retrieval result list, which is returned to the user as the final retrieval result.
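Steps 4 and 5 can be sketched together: multiply the two scores per document, then sort in descending order. The function name `rank_documents` and all scores below are hypothetical values chosen for illustration.

```python
def rank_documents(rel_scores, subj_scores):
    """Multiply the two scores per document and sort by final score, highest first."""
    final = {d: rel_scores[d] * subj_scores[d] for d in rel_scores}
    return sorted(final.items(), key=lambda item: item[1], reverse=True)


rel = {"d1": 4.0, "d2": 5.0, "d3": 1.0}    # hypothetical relevance scores
subj = {"d1": 1.5, "d2": 1.0, "d3": 2.0}   # hypothetical subjectivity scores
ranking = rank_documents(rel, subj)
print([doc_id for doc_id, _ in ranking])  # ['d1', 'd2', 'd3']
```

In this example d2 is the most relevant document but has the neutral subjectivity score 1, so the slightly less relevant but opinionated d1 moves ahead of it, while the barely relevant d3 stays last despite its high subjectivity score.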

To verify the validity, reliability, and applicability of the invention, we designed and ran verification experiments.

For data, we used standard test data from the Text REtrieval Conference (TREC), organized by the U.S. National Institute of Standards and Technology (NIST): a collection of English blog pages from the internet, 100 user queries, and the answer set for each query (obtained by manual annotation organized by NIST).

The experiments evaluate performance with mean average precision (MAP), a measure commonly used in information retrieval.
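For reference, MAP in its standard textbook definition can be computed as in the sketch below. This is not the evaluation tool used in the experiments; the function names and the toy data are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at the ranks where relevant documents appear."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries with hand-made rankings and answer sets.
runs = [(["a", "b", "c", "d"], {"a", "c"}), (["x", "y"], {"y"})]
print(mean_average_precision(runs))
```

Relevant documents that are never retrieved still count in the denominator, so a system is rewarded both for finding the answers and for ranking them early.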

The opinion retrieval experiments were carried out on the above data collection following the steps of the invention. Table 1 lists the improvement in retrieval performance obtained with the scoring method of the invention compared with the linear weighting method in common use. The commonly used linear weighting method is:

\lambda \, ScoreI_{op}(d, q) + (1 - \lambda) \frac{\sum_{w_i \in q} co(s_i, w_i | W)}{\sum_{w_i \in q} c(w_i, d) \times W},

The meaning and computation of each of its parameters are the same as for the corresponding parameters in formula 1 and formula 2 of the invention. The manually set parameters used were: k₁ = 1.0, b = 0.75, k₃ = 100, W = 25. The values of λ are shown in Table 1. The performance gain is considerable: above 8% in every case and up to 18.6%.

Table 1. Retrieval performance improvement of the scoring method of the invention over the commonly used linear weighting method

Value of λ	Retrieval performance of the common linear weighting method	Retrieval performance of the scoring method of the invention	Improvement brought by the invention
0.01	0.1969	0.2253	14.42%
0.1	0.2041	0.2255	10.49%
0.2	0.2071	0.2256	8.93%
0.3	0.2081	0.2257	8.46%
0.4	0.2087	0.2257	8.15%
0.5	0.2067	0.2259	9.29%
0.6	0.2038	0.2266	11.19%
0.7	0.1993	0.2267	13.75%
0.8	0.1938	0.2255	16.36%
0.9	0.1866	0.2213	18.60%
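The improvement column appears to be the relative gain of the invention's score over the linear weighting score. The short check below recomputes it for the rows λ = 0.01 and λ = 0.9; the function name is hypothetical.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Relative improvement of `proposed` over `baseline`, in percent."""
    return (proposed - baseline) / baseline * 100


print(round(relative_improvement(0.1969, 0.2253), 2))  # 14.42
print(round(relative_improvement(0.1866, 0.2213), 2))  # 18.6
```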

For example, for the user query "Oprah" (Oprah Winfrey hosts a talk show on U.S. television), the user hopes to find comments and opinions on her program. The traditional method finds a large number of program listings and objective content introductions and promotional material, so the MAP for this query is only 0.0687, and only two of the top 10 results returned by the retrieval system carry opinion. With the scoring method of the invention, MAP rises to 0.2721, and 8 of the top 10 results are the opinions and reviews the user needs; system performance is about 4 times that of the traditional method.

As another example, for the user query "tivo" (a digital video recorder), the user hopes to find people's evaluations of this brand. The traditional method finds a large number of product introductions for the brand, including objective numerical information such as models and dimensions, but no opinions on whether the brand is good or bad, so only 1 of the top 10 results for this query is what the user needs. The method of the invention ranks first the documents containing other users' experiences, impressions, and product evaluations, so that 9 of the top 10 results returned are relevant, opinion-bearing information that the user needs, which greatly improves user satisfaction.

Fig. 1 shows the basic workflow of opinion retrieval in an information retrieval system:

1. First, preprocess the collected raw data. This includes removing words that are too common in the documents (called stop words), for example very frequent Chinese function words such as "得"; the stop word list can be defined freely as needed. For Chinese content, a word segmentation technique (the forward maximum matching algorithm) is used to segment the documents, with the word (including single-character words) as the basic unit of a document; all characters within a word are treated as one unit.

2. Then build an index of the document content with an inverted index.

3. Preprocess the query submitted by the user with the same preprocessing as the documents (stop word removal and word segmentation).

4. Finally, following the method proposed here and using the sentiment word list already built, match the processed query against the indexed documents in the opinion retrieval server, obtain an opinion retrieval score for every document, sort the documents, form the results page, and return it to the user.
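The preprocessing in item 1 above can be sketched as follows: greedy forward maximum matching against a dictionary, followed by stop word removal. This is a minimal sketch with assumptions: the function names, the maximum word length of 4 characters, the toy dictionary, and the toy stop word list are all hypothetical, and characters not covered by the dictionary are emitted as single-character words.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text greedily, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def preprocess(text, dictionary, stop_words):
    """Segment the text, then drop stop words."""
    return [w for w in forward_max_match(text, dictionary) if w not in stop_words]


dictionary = {"手机", "电池", "耐用"}   # hypothetical toy dictionary
stop_words = {"的"}                     # hypothetical toy stop word list
print(preprocess("手机的电池很耐用", dictionary, stop_words))
# ['手机', '电池', '很', '耐用']
```

Applying the same `preprocess` function to documents and queries keeps their word units identical, which is what item 3 of the workflow requires.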

The invention concerns item 4 of the above workflow, namely how to build the sentiment word list and how to score documents in the opinion retrieval server. Fig. 2 shows the flow of the invention. How to implement the invention in an opinion retrieval system is described in detail below.

1. For a query submitted by the user, the system finds the candidate document set

The documents in the system are preprocessed: stop words are removed, Chinese documents are segmented into words, and a document index is built. The query the user submits to the system goes through exactly the same preprocessing.

First count the total number of indexed documents in the system; this is the value of the variable N in formula (1). Then compute the average number of words per document in the system (the total number of words in all documents divided by the total number of documents); this is the value of the variable avdl in formula (1).

After the user has submitted a query Q (containing one or more words), the documents that contain at least one query word of Q are picked out from all indexed documents (each of which also contains one or more words) and form the candidate document set. The remaining documents, which contain no query word, are not considered further in this retrieval. For each word w_i of query Q, count the number of documents in the candidate document set in which the word occurs; this is the value of the variable df(w_i) in formula (1).
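The statistics described here (N, avdl, the candidate set, and df(w_i)) can all be read off a positional inverted index. The sketch below is one possible layout, not the patent's data structure; the function names and the `{word: {doc_id: [positions]}}` shape are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to {doc_id: [positions]} and compute N and avdl.

    docs: {doc_id: list of words after preprocessing} (hypothetical input shape).
    """
    index = defaultdict(dict)
    for doc_id, words in docs.items():
        for pos, word in enumerate(words):
            index[word].setdefault(doc_id, []).append(pos)
    n_docs = len(docs)
    avdl = sum(len(words) for words in docs.values()) / n_docs
    return index, n_docs, avdl


def candidates_and_df(index, query_words):
    """Candidate set = documents with at least one query word; df per query word."""
    df = {w: len(index.get(w, {})) for w in query_words}
    candidates = set()
    for w in query_words:
        candidates.update(index.get(w, {}))
    return candidates, df


docs = {
    "d1": ["n95", "battery", "good", "n95"],
    "d2": ["weather"],
    "d3": ["phone", "price", "high"],
}
index, n_docs, avdl = build_index(docs)
print(candidates_and_df(index, ["n95", "phone"]))
```

Every document that contains a query word is by construction in the candidate set, so counting df(w_i) over the candidate set gives the same number as counting it over the whole index. Storing positions also prepares the windowed co-occurrence counting needed for the subjectivity score.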

Set the value of each constant used in the system, for example: k₁ = 1.0, b = 0.75, k₃ = 100, W = 25, λ = 0.8.

2. Generate the sentiment word lists according to the following flow.

Initialize the Chinese sentiment word list S_CN = { }; initialize the English sentiment word list S_EN = { }.
For each word in HowNet:
    If the value of its attribute (DEF) contains at least one of "good|好", "desired|良", "beautiful|美", "great|大":
        add the Chinese description of the word (the value of W_C) to S_CN as a record;
        add the English description of the word (the value of W_X) to S_EN as a record.
    If the value of its attribute (DEF) contains at least one of "bad|坏", "undesired|莠", "fake|伪":
        add the Chinese description of the word (the value of W_C) to S_CN as a record;
        add the English description of the word (the value of W_X) to S_EN as a record.
For each record in S_EN:
    If the record consists of several words, also add each of those words to S_EN as a separate record.
Remove duplicate records from S_EN.
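The flow above can be sketched in Python as follows. Several points are assumptions rather than facts from the source: HowNet entries are represented here as plain dictionaries with the keys `W_C`, `W_X`, and `DEF` named in the text, the attribute test is a simple substring check, and the sample entries (including their DEF strings) are invented for illustration. No real HowNet API is used.

```python
# Attribute markers; positive and negative markers lead to the same action.
SENTIMENT_MARKERS = ("good|", "desired|", "beautiful|", "great|",
                     "bad|", "undesired|", "fake|")


def build_sentiment_lists(entries):
    """Return (Chinese sentiment words, English sentiment words) as sets."""
    chinese, english = set(), set()
    for entry in entries:
        if any(marker in entry["DEF"] for marker in SENTIMENT_MARKERS):
            chinese.add(entry["W_C"])
            english.add(entry["W_X"])
            # A multi-word English record also contributes each of its words.
            parts = entry["W_X"].split()
            if len(parts) > 1:
                english.update(parts)
    return chinese, english


# Invented sample entries in the assumed record shape.
entries = [
    {"W_C": "好", "W_X": "good", "DEF": "aValue|属性值,good|好"},
    {"W_C": "桌子", "W_X": "table", "DEF": "furniture|家具"},
    {"W_C": "令人失望", "W_X": "let down", "DEF": "aValue|属性值,undesired|莠"},
]
s_cn, s_en = build_sentiment_lists(entries)
print(sorted(s_en))  # ['down', 'good', 'let', 'let down']
```

Using sets makes the final duplicate-removal step of the flow implicit.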

In the lists obtained by this flow, the number of Chinese sentiment words is not given in this text; the English list contains 4621 sentiment words.

3. Compute the relevance score of every document

Carry out the following operations on every document d in the candidate document set:

(1) Count the total number of words in document d, recorded as the value of the variable l_d in formula (1). For each word w_i of query Q, count the number of times the word occurs in document d, recorded as the value of the variable c(w_i, d) in formula (1), and the number of times it occurs in query Q, recorded as the value of the variable c(w_i, q) in formula (1);

(2) Substitute the values of N, avdl, df(w_i), k₁, b, k₃, l_d, c(w_i, d), and c(w_i, q) obtained in the steps above into formula (1) and compute:

\ln\left(\frac{N - df(w_i) + 0.5}{df(w_i) + 0.5}\right) \times \frac{(k_1 + 1) \times c(w_i, d)}{k_1 (1 - b) + b \frac{l_d}{avdl} + c(w_i, d)} \times \frac{(k_3 + 1) \times c(w_i, q)}{k_3 + c(w_i, q)},

This gives the relevance score of document d for one query word w_i.

(3) Add up the relevance scores of document d for all query words of Q to obtain the relevance score of document d for query Q.

4. Compute the subjectivity score of every document

If the constant λ in the system is set to 0, the subjectivity score of every document is simply 1. Otherwise, the subjectivity score of each document is computed according to the constant W set in the system (the value of W in formula (2)). Taking W = 25 as an example, the flow is as follows:

(1) First count the total number of times all query words of query Q occur in the document, recorded as the value of \sum_{w_i \in q} c(w_i, d) in formula (2). At the same time, record the positions l_j at which these query words occur in the document.

(2) For each sentiment word s_i in the sentiment word list, carry out the following:

(2.1) For each position l_j in turn, count how many times the word occurs within 25 words before and after that position, then add up the counts from all positions. This gives the total number of co-occurrences of s_i with query Q in document d, recorded as the value of \sum_{w_i \in q} co(s_i, w_i | W) in formula (2).

(2.2) Following formula (2), substitute the two sums obtained above and W into the following expression to obtain the score of document d for sentiment word s_i:

\log\left(\frac{\sum_{w_i \in q} co(s_i, w_i | W)}{\sum_{w_i \in q} c(w_i, d) \times W} + 1\right);

(3) Add up the scores of all sentiment words, multiply the sum by the coefficient given in formula (2) (its expression is missing from this text), and finally add 1 to the result. This gives the subjectivity score of document d.
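The counting in steps (1) and (2.1) can be sketched as follows. The function name and input shapes are hypothetical. The sketch follows the flow text and counts sentiment words up to and including `window` words before and after each query-word position; the definition under formula (2) speaks of a distance less than W, so for that stricter reading the range bounds would shrink by one on each side.

```python
def cooccurrence_counts(doc_words, query_words, sentiment_words, window=25):
    """Count, per sentiment word, its occurrences near each query-word position.

    Returns (counts, number of query-word occurrences in the document).
    """
    query_set, sentiment_set = set(query_words), set(sentiment_words)
    # Step (1): positions l_j of all query-word occurrences.
    query_positions = [i for i, w in enumerate(doc_words) if w in query_set]
    counts = {}
    # Step (2.1): scan the window around every position and add up the counts.
    for l_j in query_positions:
        lo = max(0, l_j - window)
        hi = min(len(doc_words), l_j + window + 1)
        for pos in range(lo, hi):
            word = doc_words[pos]
            if pos != l_j and word in sentiment_set:
                counts[word] = counts.get(word, 0) + 1
    return counts, len(query_positions)


doc = ["the", "n95", "is", "good", "but", "battery", "bad", "n95", "good"]
print(cooccurrence_counts(doc, ["n95"], ["good", "bad"], window=2))
# ({'good': 2, 'bad': 1}, 2)
```

A sentiment occurrence that lies within the window of two different query-word positions is counted once for each position, which matches the instruction to add up the counts from all positions.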

5. Compute the final score of every document

Multiply the relevance score of each document obtained in flow 3 by its subjectivity score obtained in flow 4. This gives the final score of each document in the opinion retrieval system.

6. Sort all documents in the system and return the final result to the user

Sort all documents in the candidate document set by final score and generate the result list in descending order of score, so that the higher the final score of a document, the higher its position in the result list. Finally, return this result to the user. This completes one full retrieval for the query entered by the user.

Following the steps above, an effective, automatically executed opinion retrieval system can be built. The system scores documents by taking both their relevance and their subjectivity into account, so that in the result list returned by the opinion retrieval system, documents that are both relevant to the query and strongly sentiment-bearing are ranked near the top and are more likely to be visited first by the user. This improves the results and the retrieval performance of the opinion retrieval system.

Claims

1. A method for scoring information retrieval documents based on opinion retrieval, characterized in that the method is carried out in a computer information retrieval system in the following steps, in order:

ScoreI_{rel}(d, w_i) = \ln\left(\frac{N - df(w_i) + 0.5}{df(w_i) + 0.5}\right) \times \frac{(k_1 + 1) \times c(w_i, d)}{k_1 (1 - b) + b \frac{l_d}{avdl} + c(w_i, d)} \times \frac{(k_3 + 1) \times c(w_i, q)}{k_3 + c(w_i, q)},

where: k₁ is a constant between 1.0 and 2.0,

c(w_i, d) is the total number of times word w_i occurs in document d,

b is a constant between 0.0 and 1.0,

k₃ is an integer constant between 0 and 1000;

ScoreI_{rel}(d, q) = \sum_{w_i \in q} ScoreI_{rel}(d, w_i);

where: λ is a constant between 0.0 and 1.0,

s_i ∈ d is each sentiment word in document d,

co(s_i, w_i | W) is the total number of times, over all positions, that sentiment word s_i occurs in document d together with any query word w_i of user query Q at a position distance of less than W from that query word, with W = 25 measured in words,

c(w_i, d) is the total number of times query word w_i occurs in document d;

2. The method for scoring information retrieval documents based on opinion retrieval according to claim 1, characterized in that before step (1) there is a preprocessing step for the collected raw information carrying sentiment, comprising: removing stop words, segmenting the documents into words, and building an index of the document content with an inverted index; the query submitted by the user is preprocessed with the same stop word removal and word segmentation as the documents.